To clean an entire folder of .TXT files, you can run the following, replacing the last path with the folder you want to clean:

    cd ~
    cd allie/cleaning/text_cleaning
    python3 cleaning.py /Users/jimschwoebel/allie/load_dir

- clean_summary - extracts a 100-word summary of a long piece of text and deletes the original file (uses TextRank summarization)
- clean_textacy - removes punctuation and applies a variety of other operations to clean a text (uses Textacy)
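
To make the folder-level behavior concrete, here is a minimal sketch of a punctuation-stripping cleaner applied to every .TXT file in a directory. It is not Allie's implementation: `strip_punctuation` and `clean_folder` are hypothetical names, and the standard library stands in for Textacy, so it covers only the punctuation step rather than the full set of operations.

```python
import string
from pathlib import Path


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation and collapse runs of whitespace."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def clean_folder(folder: str) -> int:
    """Rewrite each .txt/.TXT file in `folder` in place; return how many were cleaned.

    Hypothetical helper for illustration only (not part of Allie's API).
    """
    count = 0
    for path in sorted(Path(folder).iterdir()):
        # Match the extension case-insensitively so .TXT and .txt both qualify.
        if path.is_file() and path.suffix.lower() == ".txt":
            cleaned = strip_punctuation(path.read_text(encoding="utf-8"))
            path.write_text(cleaned, encoding="utf-8")
            count += 1
    return count
```

Calling `clean_folder("/path/to/load_dir")` overwrites the files in that folder, so try it on a copy of your data first.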
Here are some default settings relevant to this section of Allie's API:
| setting | description | default setting | all options |
|---|---|---|---|
| clean_data | whether or not to clean datasets during the model training process via default cleaning scripts. | False | True, False |
| default_text_cleaners | the default cleaning techniques used during model training on text data if clean_data == True | ["clean_textacy"] | ["clean_summary", "clean_textacy"] |
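
The interaction between the two settings can be sketched as follows. This is an illustrative reading of the table, not Allie's own code: the dictionary layout, `AVAILABLE_TEXT_CLEANERS`, and `active_text_cleaners` are assumptions introduced here.

```python
# Hypothetical in-memory copy of the two settings from the table above.
DEFAULT_SETTINGS = {
    "clean_data": False,
    "default_text_cleaners": ["clean_textacy"],
}

# Values taken from the table's "all options" column.
AVAILABLE_TEXT_CLEANERS = ("clean_summary", "clean_textacy")


def active_text_cleaners(settings: dict) -> list:
    """Return the cleaners to run, or an empty list when cleaning is disabled."""
    if not settings.get("clean_data", False):
        return []
    cleaners = settings.get("default_text_cleaners", [])
    unknown = [name for name in cleaners if name not in AVAILABLE_TEXT_CLEANERS]
    if unknown:
        raise ValueError(f"unknown text cleaners: {unknown}")
    return list(cleaners)


print(active_text_cleaners(DEFAULT_SETTINGS))                           # []
print(active_text_cleaners({**DEFAULT_SETTINGS, "clean_data": True}))  # ['clean_textacy']
```

With the defaults, no cleaner runs; the list in default_text_cleaners only takes effect once clean_data is set to True.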