English Text Filtering Script

This notebook cleans raw text datasets by removing non-English lines from the text files in a specified input directory. It uses the langdetect library to detect the language of each line and writes only the English lines to a new output folder for further processing.
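
The per-line filtering described above might look roughly like the sketch below. It is not the notebook's actual code: the function names `make_langdetect_detector` and `filter_english_lines` are hypothetical, and dropping blank lines and treating undetectable lines as non-English are assumptions made for illustration.

```python
from typing import Callable, Iterable, List


def make_langdetect_detector() -> Callable[[str], str]:
    """Return a detector backed by langdetect (imported lazily, hypothetical helper)."""
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect results can vary between runs unless a seed is fixed.
    DetectorFactory.seed = 0

    def detect_language(text: str) -> str:
        try:
            return detect(text)  # returns a language code such as "en" or "fr"
        except LangDetectException:
            # Raised when the line has no usable features (digits, punctuation only).
            return "unknown"

    return detect_language


def filter_english_lines(
    lines: Iterable[str],
    detect_language: Callable[[str], str],
    target: str = "en",
) -> List[str]:
    """Keep only the lines whose detected language equals `target`."""
    kept = []
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines are dropped
        if detect_language(line) == target:
            kept.append(line)
    return kept
```

With langdetect installed, `filter_english_lines(lines, make_langdetect_detector())` would apply the real detector. Passing the detector as an argument keeps the filter easy to test with a stub.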


  1. Install dependencies:

    pip install langdetect tqdm
  2. Set input and output folders inside the notebook:

    input_folder = "1data/task1_train_files_2025"
    output_folder = "1data/processed_train_langonly"

    For test data:

    input_folder = "1data/task1_test_files_2025"
    output_folder = "1data/processed_test_langonly"
  3. Run the preprocessing notebook. Open and execute:

    0preprocessing/preprocess.ipynb
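
As a rough illustration of how the input and output folders from step 2 could be used, the sketch below copies each file from one folder to the other and keeps only the lines accepted by a predicate. The function `filter_folder` and its `keep_line` parameter are hypothetical names, and the flat (non-recursive) folder layout and UTF-8 encoding are assumptions. The notebook itself may be organized differently.

```python
import os
from typing import Callable


def filter_folder(
    input_folder: str,
    output_folder: str,
    keep_line: Callable[[str], bool],
) -> int:
    """Write a filtered copy of every file in input_folder to output_folder.

    Returns the number of files processed.
    """
    os.makedirs(output_folder, exist_ok=True)
    processed = 0
    for name in sorted(os.listdir(input_folder)):
        src = os.path.join(input_folder, name)
        if not os.path.isfile(src):
            continue  # assumption: subfolders are ignored
        dst = os.path.join(output_folder, name)
        # errors="replace" avoids crashing on stray bytes in noisy input files.
        with open(src, encoding="utf-8", errors="replace") as fin, \
                open(dst, "w", encoding="utf-8") as fout:
            for line in fin:
                if keep_line(line):
                    fout.write(line)
        processed += 1
    return processed
```

Here `keep_line` would be a function that returns True for English lines, for example one built on langdetect's `detect`. The file loop could also be wrapped in `tqdm(...)` to show progress on large folders.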
    

Note: This step was necessary because, without English-only filtering, a step in the embedding generation pipeline that translates French text to English was very slow or hung on multilingual or noisy content. Since we observed that the English translation of the French content was already present in the text, we simply removed the French content.