English Text Filtering Script

This notebook cleans raw text datasets by removing every non-English line from the text files in a specified input directory. It uses the langdetect library to identify the language of each line and writes only the English lines to a new output folder for further processing.

Install dependencies:
```
pip install langdetect tqdm
```

Set input and output folders inside the notebook:

input_folder = "1data/task1_train_files_2025"
output_folder = "1data/processed_train_langonly"

For test data:

input_folder = "1data/task1_test_files_2025"
output_folder = "1data/processed_test_langonly"

Run the preprocessing notebook by opening it and executing all cells:
```
0preprocessing/preprocess.ipynb
```
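
For readers who want to see the idea without opening the notebook, the line-level filtering described above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the helper names (`make_langdetect_checker`, `filter_english_lines`, `process_folder`) are hypothetical, and the sketch assumes the inputs are UTF-8 `.txt` files and that blank lines and lines langdetect cannot classify are dropped.

```python
from pathlib import Path


def make_langdetect_checker():
    """Return a function that reports whether a line is English (hypothetical helper)."""
    # Imported lazily so the other helpers can be used without langdetect installed.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect results vary between runs unless seeded

    def is_english(line: str) -> bool:
        try:
            return detect(line) == "en"
        except LangDetectException:
            # Raised when a line has no usable features (e.g. only digits or punctuation).
            # Assumption: such lines are treated as non-English and dropped.
            return False

    return is_english


def filter_english_lines(lines, is_english):
    """Keep only non-blank lines that the is_english callable accepts."""
    return [line for line in lines if line.strip() and is_english(line)]


def process_folder(input_folder, output_folder, is_english):
    """Write an English-only copy of every .txt file found in input_folder."""
    in_dir, out_dir = Path(input_folder), Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(in_dir.glob("*.txt")):  # assumption: inputs are .txt files
        lines = src.read_text(encoding="utf-8", errors="ignore").splitlines()
        kept = filter_english_lines(lines, is_english)
        text = "\n".join(kept) + ("\n" if kept else "")
        (out_dir / src.name).write_text(text, encoding="utf-8")
```

With langdetect installed, a call such as `process_folder(input_folder, output_folder, make_langdetect_checker())` would process one folder. Passing the detector in as a callable keeps the file handling testable on its own, and the loop over files can be wrapped in `tqdm(...)` to show a progress bar.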

Note: This step was necessary because, without English-only filtering, the step in the embedding generation pipeline that translates French text to English was very slow or hung on multilingual or noisy content. Since we observed that the English translation of the French content was already present in the texts, we simply removed the French content.
