Vocabulary size cannot be specified for word features #1452

@noe

When using word features, it is useful to be able to limit the vocabulary size, especially when using lemmas as word features, as the number of different lemmas can be very large.

However, unlike the Lua version, preprocess.py does not accept a list of integers giving one vocabulary size per feature.

I understand that it should be possible to get the same effect by extracting the feature vocabulary externally (pruning the vocabulary by frequency) and then supplying the vocabulary to preprocess.py with the parameter --features_vocabs_prefix.
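For reference, the external pruning I have in mind could look roughly like the sketch below. It is only an illustration under my own assumptions: that features are attached to each token with the `￨` separator, that the function name `build_feature_vocab` and its parameters are hypothetical, and that keeping the most frequent values is the desired pruning rule. It does not show the file format that preprocess.py would expect.

```python
from collections import Counter

# Assumed separator between a token and its features (U+FFE8).
FEATURE_SEP = "\uffe8"


def build_feature_vocab(lines, feature_index=0, max_size=1000):
    """Return the max_size most frequent values of one word feature.

    Hypothetical helper: lines is an iterable of whitespace-tokenized
    sentences whose tokens look like word<SEP>feat1<SEP>feat2...
    """
    counts = Counter()
    for line in lines:
        for token in line.split():
            parts = token.split(FEATURE_SEP)
            features = parts[1:]  # parts[0] is the word itself
            if feature_index < len(features):
                counts[features[feature_index]] += 1
    # most_common orders values by descending frequency.
    return [value for value, _ in counts.most_common(max_size)]


# Tiny example corpus with lemmas as the only word feature.
corpus = [
    f"running{FEATURE_SEP}run dogs{FEATURE_SEP}dog run{FEATURE_SEP}run",
    f"dog{FEATURE_SEP}dog ran{FEATURE_SEP}run",
]
vocab = build_feature_vocab(corpus, feature_index=0, max_size=2)
```

The resulting list could then be written out, one value per line, as the pruned feature vocabulary.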

However, I have just realized that --features_vocabs_prefix is completely ignored. Is it therefore correct that there is currently no way to control the feature vocabulary size?
