cd allie/features/text_features
python3 featurize.py [folder] [featuretype]
- bert_features - extract BERT-related features from sentences (shorter sentences run faster; long text can lead to long featurization times).
- fast_features
- glove_features
- grammar_features - 85k+ grammar features (memory intensive)
- nltk_features - standard text feature array (default)
- spacy_features
- textacy_features - a variety of document classification and topic modeling features
- text_features - many different types of features like emotional word counts, total word counts, Honore's statistic and others.
- w2v_features - note this is the largest model from Google and may crash your computer if you don't have enough memory. I'd recommend fast_features if you're looking for a pre-trained embedding.
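As a rough sketch of how the command above might be assembled from a script, the snippet below checks a feature type against the names in this list before building the argument list. The helper `build_featurize_command` and the folder name `my_text_folder` are hypothetical and not part of Allie; the fallback to `nltk_features` is an assumption based on that entry being marked as the default.

```python
# Feature types named in the list above.
FEATURE_TYPES = [
    "bert_features",
    "fast_features",
    "glove_features",
    "grammar_features",
    "nltk_features",
    "spacy_features",
    "textacy_features",
    "text_features",
    "w2v_features",
]


def build_featurize_command(folder, feature_type="nltk_features"):
    """Hypothetical helper: return the featurize.py call as an argument list."""
    if feature_type not in FEATURE_TYPES:
        raise ValueError(f"unknown feature type: {feature_type!r}")
    # Mirrors: python3 featurize.py [folder] [featuretype]
    return ["python3", "featurize.py", folder, feature_type]


# Example: build the call for a folder of text files (folder name is illustrative).
cmd = build_featurize_command("my_text_folder", "fast_features")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` from inside `allie/features/text_features` if you prefer to launch featurization from Python rather than the shell.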
Here are some default settings relevant to this section of Allie's API:
setting | description | default setting | all options |
---|---|---|---|
default_text_features | default set of text features used for featurization (list). | ["nltk_features"] | ["bert_features", "fast_features", "glove_features", "grammar_features", "nltk_features", "spacy_features", "text_features", "w2v_features"] |
default_text_transcriber | the default transcription techniques used to parse raw .TXT files during model training | ["raw_text"] | ["raw_text"] |
transcribe_text | whether to transcribe text files during featurization and model training, using the default_text_transcriber | True | True, False |
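To make the table concrete, here is a minimal sketch that merges user-supplied values with the defaults listed above and rejects options the table does not list. How Allie itself loads and validates its settings is not shown in this section, so the plain dictionary and the function `resolve_text_settings` are assumptions made for illustration only.

```python
# Defaults and allowed options copied from the settings table above.
DEFAULTS = {
    "default_text_features": ["nltk_features"],
    "default_text_transcriber": ["raw_text"],
    "transcribe_text": True,
}

ALLOWED = {
    "default_text_features": {
        "bert_features", "fast_features", "glove_features", "grammar_features",
        "nltk_features", "spacy_features", "text_features", "w2v_features",
    },
    "default_text_transcriber": {"raw_text"},
}


def resolve_text_settings(user_settings):
    """Hypothetical helper: overlay user settings on the defaults and validate them."""
    settings = {**DEFAULTS, **user_settings}
    for key, allowed in ALLOWED.items():
        unknown = [value for value in settings[key] if value not in allowed]
        if unknown:
            raise ValueError(f"{key}: unsupported option(s) {unknown}")
    if not isinstance(settings["transcribe_text"], bool):
        raise ValueError("transcribe_text must be True or False")
    return settings


# Example: request two feature sets and keep the other defaults.
resolved = resolve_text_settings(
    {"default_text_features": ["fast_features", "spacy_features"]}
)
print(resolved)
```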