This repository was archived by the owner on Nov 22, 2022. It is now read-only.

Fix featurizer & memory issues for char embeddings #220

Closed
wants to merge 1 commit

Conversation

@mwu1993 (Contributor) commented Jan 17, 2019

Summary:
This diff addresses a few issues for CharacterEmbeddings:

  • `YodaFeaturizer`, `YodaFeaturizerLocal`, and `SimpleFeaturizer` did not create characters that can be used by `CharFeatureField`; this diff fixes that.
  • `CharFeatureField` padded every token (in every sentence) to the longest token in the batch, which leads to OOMs (e.g., the hate speech dataset contains a token of length 60k). Instead, add a `max_word_length` flag to the feature (default 20); see the sketch after this list.
  • There was no `min_freq` flag for characters, so every character appearing in the training data got an embedding ID. Add this flag (illustrated in the second sketch, after the commit notes below).
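
As a minimal illustration of the `max_word_length` fix (this sketch is hypothetical; `PAD_IDX`, `UNK_IDX`, `char_to_idx`, and `tokens_to_char_ids` are illustrative names, not PyText internals), truncating and padding every token to a fixed width keeps one pathological 60k-character token from forcing the whole batch to pad out to 60k columns:

```python
# Hypothetical sketch of fixed-width character padding; not PyText's code.
from typing import Dict, List

PAD_IDX = 0  # reserved index for the padding character (assumed layout)
UNK_IDX = 1  # reserved index for out-of-vocabulary characters (assumed)

def tokens_to_char_ids(
    tokens: List[str],
    char_to_idx: Dict[str, int],
    max_word_length: int = 20,  # default proposed in this diff
) -> List[List[int]]:
    """Map each token to a fixed-width row of character IDs."""
    rows = []
    for token in tokens:
        chars = token[:max_word_length]  # truncate overly long tokens
        ids = [char_to_idx.get(c, UNK_IDX) for c in chars]
        ids += [PAD_IDX] * (max_word_length - len(ids))  # pad short tokens
        rows.append(ids)
    return rows
```

With this scheme the character tensor for a batch is always `batch_size x seq_len x max_word_length`, independent of the longest token in the data.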

Differential Revision: D13662817

@facebook-github-bot added the CLA Signed label Jan 17, 2019
mwu1993 pushed a commit to mwu1993/pytext-1 that referenced this pull request Jan 23, 2019
Fix featurizer & memory issues for char embeddings (facebookresearch#220)

Summary:
Pull Request resolved: facebookresearch#220

This diff addresses a few issues for CharacterEmbeddings:

- `YodaFeaturizer`, `YodaFeaturizerLocal`, and `SimpleFeaturizer` did not create characters that can be used by `CharFeatureField`; this diff fixes that.
- `CharFeatureField` padded every token (in every sentence) to the longest token in the batch, which leads to OOMs (e.g., the hate speech dataset contains a token of length 60k). Instead, add a `max_word_length` flag to the feature (default 20).
- There was no `min_freq` flag for characters, so every character appearing in the training data got an embedding ID. Add this flag.

Reviewed By: jingfeidu

Differential Revision: D13662817

fbshipit-source-id: 9486c72ed8506186c6506765db80fd9f80386743
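
For the `min_freq` change, here is a minimal sketch of how such a cutoff could be applied when building the character vocabulary (the reserved `<pad>`/`<unk>` layout and the `build_char_vocab` name are assumptions, not PyText's actual implementation):

```python
# Hypothetical sketch of min_freq pruning for a character vocabulary.
from collections import Counter
from typing import Dict, Iterable

def build_char_vocab(tokens: Iterable[str], min_freq: int = 1) -> Dict[str, int]:
    """Assign embedding IDs only to characters seen at least min_freq times."""
    counts = Counter(ch for token in tokens for ch in token)
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved slots (assumed layout)
    for ch, freq in counts.most_common():  # sorted by frequency, descending
        if freq < min_freq:
            break  # everything after this is also below the cutoff
        vocab[ch] = len(vocab)
    return vocab
```

Rare characters then fall back to the `<unk>` embedding instead of each receiving their own ID.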