Fix featurizer & memory issues for char embeddings #220

mwu1993 · 2019-01-17T19:25:20Z

Summary:
This diff addresses a few issues for CharacterEmbeddings:

- `YodaFeaturizer`, `YodaFeaturizerLocal`, and `SimpleFeaturizer` did not create characters in a form that `CharFeatureField` can use. Fix this.
- `CharFeatureField` padded all tokens (in all sentences) to the max token length of each batch, which leads to OOMs (e.g. the hate speech data contains a token of length 60k). Instead, add a `max_word_length` flag to the feature (default: 20).
- There was no `min_freq` flag for characters, so every character appearing in the training data got an embedding ID. Add this flag.
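
The last two points can be illustrated with a minimal, library-free sketch. It is not the code from this diff: the function names `build_char_vocab` and `pad_chars`, the `<pad>`/`<unk>` symbols, and the choice to cap the padded width at the smaller of `max_word_length` and the longest token in the batch are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical special symbols; the real field may use different ones.
PAD, UNK = "<pad>", "<unk>"


def build_char_vocab(sentences: List[List[str]], min_freq: int = 1) -> Dict[str, int]:
    """Assign an ID only to characters seen at least min_freq times."""
    counts = Counter(ch for sent in sentences for tok in sent for ch in tok)
    vocab = {PAD: 0, UNK: 1}
    for ch, n in sorted(counts.items()):
        if n >= min_freq:
            vocab[ch] = len(vocab)
    return vocab


def pad_chars(
    batch: List[List[str]], vocab: Dict[str, int], max_word_length: int = 20
) -> List[List[List[int]]]:
    """Convert tokens to character IDs, truncating each token to max_word_length."""
    max_tokens = max((len(sent) for sent in batch), default=0)
    longest = max((len(tok) for sent in batch for tok in sent), default=0)
    # Assumption: never pad wider than the cap, even if one token is huge.
    width = min(max_word_length, longest)
    out = []
    for sent in batch:
        rows = []
        for tok in sent:
            ids = [vocab.get(ch, vocab[UNK]) for ch in tok[:width]]
            rows.append(ids + [vocab[PAD]] * (width - len(ids)))
        # Pad short sentences with all-PAD tokens so the batch is rectangular.
        rows.extend([[vocab[PAD]] * width for _ in range(max_tokens - len(sent))])
        out.append(rows)
    return out


batch = [["hi", "a" * 60000], ["yo"]]
vocab = build_char_vocab(batch, min_freq=2)
ids = pad_chars(batch, vocab, max_word_length=5)
```

With the cap in place, the 60k-character token contributes only 5 IDs, so the padded batch holds 2 x 2 x 5 entries instead of 2 x 2 x 60000. Characters below `min_freq` share the single unknown ID rather than each receiving an embedding row.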

Differential Revision: D13662817

Summary: Pull Request resolved: facebookresearch#220

This diff addresses a few issues for CharacterEmbeddings:
- `YodaFeaturizer`, `YodaFeaturizerLocal`, and `SimpleFeaturizer` did not create characters that can be used by `CharFeatureField` - fix this.
- `CharFeatureField` padded all tokens (in all sentences) to the max token length of each batch, which leads to OOMs (e.g. in hate speech there is a token of length 60k). Instead, add a `max_word_length` flag in the feature (default to 20).
- There was no `min_freq` flag for characters (so every character appearing in training data would get an embedding ID). Add this flag.

Reviewed By: jingfeidu
Differential Revision: D13662817
fbshipit-source-id: 9486c72ed8506186c6506765db80fd9f80386743

The same commit message appears a second time with fbshipit-source-id: bc1aff52d904e7ddaed30cb08ddcd821fe714cd4

facebook-github-bot added the CLA Signed label Jan 17, 2019

mwu1993 force-pushed the export-D13662817 branch from 569f9e1 to eeb98d1 January 23, 2019 19:51

mwu1993 force-pushed the export-D13662817 branch from eeb98d1 to edfd89b January 23, 2019 19:51

facebook-github-bot closed this in 4b77a98 Jan 23, 2019
