text_features

How to use

cd allie/features/text_features
python3 featurize.py [folder] [featuretype]
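As a minimal sketch of scripting the command above, the helper below only builds the argument list and checks the feature type against the names listed in this README. The function name `build_featurize_command` and the folder name `my_text_folder` are hypothetical; they are not part of Allie.

```python
import sys
from pathlib import Path

# Feature type names copied from the list in this README.
FEATURE_TYPES = [
    "bert_features", "fast_features", "glove_features", "grammar_features",
    "nltk_features", "spacy_features", "textacy_features", "text_features",
    "w2v_features",
]


def build_featurize_command(folder, feature_type="nltk_features"):
    """Return the argument list for featurize.py without running it."""
    if feature_type not in FEATURE_TYPES:
        raise ValueError(f"unknown feature type: {feature_type!r}")
    # Mirrors: python3 featurize.py [folder] [featuretype]
    return [sys.executable, "featurize.py", str(Path(folder)), feature_type]


# "my_text_folder" is a placeholder folder name.
cmd = build_featurize_command("my_text_folder", "fast_features")
print(cmd[1:])
```

The resulting list could then be passed to `subprocess.run(cmd, cwd="allie/features/text_features")`, assuming the repository layout shown in the `cd` command above.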

Text

  • bert_features - extracts BERT-based features from sentences. Shorter sentences run faster; long text can lead to long featurization times.
  • fast_features
  • glove_features
  • grammar_features - 85k+ grammar features (memory intensive)
  • nltk_features - standard text feature array (default)
  • spacy_features
  • textacy_features - a variety of document classification and topic modeling features
  • text_features - many different types of features like emotional word counts, total word counts, Honore's statistic and others.
  • w2v_features - uses the largest model from Google, which may crash your computer if you don't have enough memory. I'd recommend fast_features if you're just looking for a pre-trained embedding.
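To make the kind of output produced by text_features more concrete, here is a small standard-library sketch that computes a total word count and Honoré's statistic, R = 100 · ln(N) / (1 − V1/V), where N is the number of words, V the number of distinct words, and V1 the number of words used exactly once. This is not Allie's implementation; the function name, the tokenizer, and the returned keys are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def simple_text_features(text):
    """Illustrative features only; not the featurizer shipped with Allie."""
    # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)                                  # N
    n_types = len(counts)                                   # V
    n_hapax = sum(1 for c in counts.values() if c == 1)     # V1
    if n_tokens == 0 or n_hapax == n_types:
        # Undefined when every word occurs once (or the text is empty).
        honore = float("nan")
    else:
        honore = 100.0 * math.log(n_tokens) / (1.0 - n_hapax / n_types)
    return {
        "total_words": n_tokens,
        "unique_words": n_types,
        "honore_statistic": honore,
    }


print(simple_text_features("the cat sat on the mat the end"))
```

Higher values of the statistic indicate a richer vocabulary; the guard returns NaN where the denominator would be zero.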

Settings

Here are some default settings relevant to this section of Allie's API:

| setting | description | default setting | all options |
|---|---|---|---|
| default_text_features | default set of text features used for featurization (list) | ["nltk_features"] | ["bert_features", "fast_features", "glove_features", "grammar_features", "nltk_features", "spacy_features", "text_features", "w2v_features"] |
| default_text_transcriber | the default transcription techniques used to parse raw .TXT files during model training | ["raw_text"] | ["raw_text"] |
| transcribe_text | whether or not to transcribe text files during featurization and model training via the default_text_transcriber | True | True, False |
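As a rough sketch of how these three settings could be sanity-checked before a run, the snippet below treats the settings as a plain Python dict and compares them with the options in the table above. How Allie actually stores and loads its settings is not shown in this section, so the dict layout and the `check_text_settings` helper are assumptions.

```python
# Allowed values copied from the "all options" column of the table above.
ALLOWED_TEXT_FEATURES = {
    "bert_features", "fast_features", "glove_features", "grammar_features",
    "nltk_features", "spacy_features", "text_features", "w2v_features",
}
ALLOWED_TEXT_TRANSCRIBERS = {"raw_text"}


def check_text_settings(settings):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    # Missing keys fall back to the defaults listed in the table.
    for name in settings.get("default_text_features", ["nltk_features"]):
        if name not in ALLOWED_TEXT_FEATURES:
            problems.append(f"unknown text feature: {name}")
    for name in settings.get("default_text_transcriber", ["raw_text"]):
        if name not in ALLOWED_TEXT_TRANSCRIBERS:
            problems.append(f"unknown text transcriber: {name}")
    if not isinstance(settings.get("transcribe_text", True), bool):
        problems.append("transcribe_text must be True or False")
    return problems


# Hypothetical settings dict using the documented defaults.
defaults = {
    "default_text_features": ["nltk_features"],
    "default_text_transcriber": ["raw_text"],
    "transcribe_text": True,
}
print(check_text_settings(defaults))
```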