allie/features/text_features at 3dd099641dd1df2c86177111dc15312549da54a1 · jim-schwoebel/allie

cd allie/features/text_features
python3 featurize.py [folder] [featuretype]
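If you would rather call the featurizer from a script than from the shell, the command above can be wrapped as in the sketch below. The helper names `build_featurize_command` and `run_featurizer` are hypothetical and not part of Allie; the sketch only assumes the `featurize.py [folder] [featuretype]` argument order shown above and that the script is run from `allie/features/text_features`.

```python
import subprocess


def build_featurize_command(folder, feature_type):
    """Build the argv list for: python3 featurize.py [folder] [featuretype].

    Hypothetical helper; it only mirrors the argument order from the README.
    """
    return ["python3", "featurize.py", str(folder), str(feature_type)]


def run_featurizer(folder, feature_type,
                   script_dir="allie/features/text_features"):
    """Run featurize.py from its own directory (the `cd` step in the README).

    Assumes `script_dir` is the path to the text_features folder relative to
    the current working directory; adjust it for your checkout.
    """
    command = build_featurize_command(folder, feature_type)
    return subprocess.run(command, cwd=script_dir, check=True)
```

For example, `run_featurizer("my_texts", "nltk_features")` would featurize a (hypothetical) `my_texts` folder with the default feature set.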

bert_features - extracts BERT-based features from sentences (shorter sentences featurize faster; long text can lead to long featurization times).
fast_features
glove_features
grammar_features - 85k+ grammar features (memory intensive)
nltk_features - standard text feature array (default)
spacy_features
textacy_features - a variety of document classification and topic modeling features
text_features - a mix of features such as emotional word counts, total word counts, Honoré's statistic, and others.
w2v_features - note that this uses the largest model from Google and may crash your computer if you don't have enough memory. I'd recommend fast_features if you're looking for a pre-trained embedding.
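To give a feel for the kind of values text_features describes, here is a minimal sketch that computes a total word count and Honoré's statistic, R = 100 * ln(N) / (1 - V1 / V), where N is the number of words, V the number of distinct words, and V1 the number of words that occur exactly once. This is not Allie's implementation; the function name, the simple regex tokenizer, and the choice to return 0.0 when the statistic is undefined are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def simple_text_features(text):
    """Return a few illustrative text features (not Allie's actual output)."""
    # Assumption: a crude lowercase tokenizer; Allie may tokenize differently.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)

    n_tokens = len(words)                                 # N: total words
    n_types = len(counts)                                 # V: distinct words
    n_hapax = sum(1 for c in counts.values() if c == 1)   # V1: words used once

    # Honoré's statistic is undefined for empty text or when every word is
    # unique (division by zero); return 0.0 in those cases as a placeholder.
    if n_tokens == 0 or n_hapax == n_types:
        honore = 0.0
    else:
        honore = 100.0 * math.log(n_tokens) / (1.0 - n_hapax / n_types)

    return {
        "total_words": n_tokens,
        "unique_words": n_types,
        "honore_statistic": honore,
    }
```

A higher Honoré's statistic indicates a richer vocabulary, since it grows as a larger share of distinct words appear only once.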

Here are some default settings relevant to this section of Allie's API:

setting	description	default setting	all options
default_text_features	default set of text features used for featurization (list).	["nltk_features"]	["bert_features", "fast_features", "glove_features", "grammar_features", "nltk_features", "spacy_features", "text_features", "w2v_features"]
default_text_transcriber	the default transcription techniques used to parse raw .TXT files during model training	["raw_text"]	["raw_text"]
transcribe_text	whether or not to transcribe text files during featurization and model training via the default_text_transcriber	True	True, False
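As a rough illustration of how these settings fit together, the sketch below merges user overrides into the defaults from the table and rejects values outside the "all options" column. The names `DEFAULT_TEXT_SETTINGS`, `ALLOWED_TEXT_SETTINGS`, and `resolve_text_settings` are hypothetical; how Allie itself stores and validates its settings is not shown here.

```python
# Defaults and allowed values copied from the settings table above.
DEFAULT_TEXT_SETTINGS = {
    "default_text_features": ["nltk_features"],
    "default_text_transcriber": ["raw_text"],
    "transcribe_text": True,
}

ALLOWED_TEXT_SETTINGS = {
    "default_text_features": [
        "bert_features", "fast_features", "glove_features",
        "grammar_features", "nltk_features", "spacy_features",
        "text_features", "w2v_features",
    ],
    "default_text_transcriber": ["raw_text"],
}


def resolve_text_settings(overrides=None):
    """Merge overrides into the defaults and validate them (hypothetical helper)."""
    settings = {**DEFAULT_TEXT_SETTINGS, **(overrides or {})}

    # List-valued settings: every entry must appear in the allowed options.
    for key, allowed in ALLOWED_TEXT_SETTINGS.items():
        unknown = [value for value in settings[key] if value not in allowed]
        if unknown:
            raise ValueError(f"{key}: unsupported option(s) {unknown}")

    # transcribe_text only accepts True or False.
    if not isinstance(settings["transcribe_text"], bool):
        raise ValueError("transcribe_text must be True or False")

    return settings
```

For instance, `resolve_text_settings({"default_text_features": ["fast_features", "spacy_features"]})` keeps the other two defaults, while an unlisted feature name raises a `ValueError`.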