This notebook cleans raw text datasets by removing all non-English lines from text files in a specified input directory.
It uses the langdetect library to identify the language of each line and writes only the English lines to a new output folder for further processing.
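The per-line filtering described above could look roughly like the sketch below. It is not the notebook's actual code: the function names (`detect_language`, `keep_english_lines`, `filter_folder`) are hypothetical, and it assumes UTF-8 `.txt` files, drops blank lines, and treats lines that langdetect cannot classify as non-English.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def detect_language(text: str) -> str:
    """Return langdetect's language code for `text`, or "unknown" on failure."""
    # Imported lazily so the helpers below also work with a custom detector.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect results vary between runs unless seeded
    try:
        return detect(text)
    except LangDetectException:
        # Raised for text with no usable features (e.g. digits or punctuation only).
        return "unknown"


def keep_english_lines(
    lines: Iterable[str],
    detector: Callable[[str], str] = detect_language,
) -> List[str]:
    """Keep only the lines that `detector` labels as English ("en")."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # assumption: blank lines are dropped
        if detector(stripped) == "en":
            kept.append(line.rstrip("\n"))
    return kept


def filter_folder(
    input_folder: str,
    output_folder: str,
    detector: Callable[[str], str] = detect_language,
) -> int:
    """Filter every .txt file in input_folder; return the number of files written."""
    in_dir, out_dir = Path(input_folder), Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for src in sorted(in_dir.glob("*.txt")):  # assumption: inputs are .txt files
        lines = src.read_text(encoding="utf-8", errors="ignore").splitlines()
        kept = keep_english_lines(lines, detector)
        (out_dir / src.name).write_text("\n".join(kept) + "\n", encoding="utf-8")
        written += 1
    return written
```

With langdetect installed, `filter_folder(input_folder, output_folder)` would process one directory. Passing the detector as a parameter is a design choice for this sketch only: it lets the filtering logic be checked with a stub, without relying on langdetect's probabilistic output on very short lines.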
-
Install dependencies:
pip install langdetect tqdm
-
Set input and output folders inside the notebook:
input_folder = "1data/task1_train_files_2025"
output_folder = "1data/processed_train_langonly"
For test data:
input_folder = "1data/task1_test_files_2025"
output_folder = "1data/processed_test_langonly"
-
Run the preprocessing notebook by opening it and executing all cells:
0preprocessing/preprocess.ipynb
Note: This step was necessary because, without English-only filtering, the step in the embedding generation pipeline that translates French text to English was very slow or hung on the multilingual, noisy content. We observed that the English translation of the French content was already present in the text, so we simply removed the French content.