This project is an application designed for data processing and training a FastText model. It was developed as part of an academic project at a university and will not be further developed.
The application provides a complete set of tools for data preparation, cleaning, tokenization, lemmatization, and training a FastText model. The process also includes splitting the dataset into training and test sets, as well as exporting the trained model.
- Support for CSV and JSON files.
- Automatic data type conversion.
- Basic dataset statistics.
- Display of missing value information.
- Filling missing values using:
  - Forward fill,
  - Backward fill,
  - Manual input.
- Removal of duplicates and unnecessary columns.
- Text data cleaning:
  - Case normalization,
  - Removal of excessive spaces,
  - Removal of special characters and numbers.
- Text tokenization.
- Stop-word removal.
- Text lemmatization.
- Adding the `__label__` prefix to labels.
- Converting tokens into text format.
- Splitting data into training and test sets with configurable proportions.
- Preview of the resulting data split.
- Customizable training parameters:
  - Number of epochs,
  - Learning rate,
  - N-grams,
  - Embedding dimension,
  - Loss function.
- Training the model on user data.
- Display of training process information.
- Exporting the trained model.
- Testing the model on the test dataset.
- Calculating accuracy and classification performance.
- Ability to make predictions on new text data.
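The features above can be illustrated with a minimal sketch of the pipeline: cleaning, tokenization, label formatting, splitting, and training. It is not the application's actual code. All function names, the tiny stop-word list, the ASCII-only cleaning rule, and the default parameter values are assumptions made for illustration, and lemmatization is left out to keep the example dependency-free. Only the final function needs the `fasttext` Python package, which it imports lazily.

```python
import random
import re

# Illustrative stop-word list (assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "and", "of"}


def clean_text(text: str) -> str:
    """Lowercase, drop digits and special characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # ASCII letters only (assumption)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    """Split cleaned text on whitespace and remove stop words."""
    return [tok for tok in clean_text(text).split() if tok not in STOP_WORDS]


def to_fasttext_line(label: str, tokens: list[str]) -> str:
    """Build one training line in fastText's supervised format."""
    safe_label = label.strip().replace(" ", "_")  # labels must not contain spaces
    return f"__label__{safe_label} " + " ".join(tokens)


def split_lines(lines, test_ratio=0.2, seed=42):
    """Shuffle reproducibly and return (train, test) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def train_kwargs(epoch=5, lr=0.1, word_ngrams=1, dim=100, loss="softmax"):
    """Map the configurable parameters to fastText keyword arguments."""
    if loss not in {"softmax", "ns", "hs", "ova"}:
        raise ValueError(f"unsupported loss: {loss}")
    return {"epoch": epoch, "lr": lr, "wordNgrams": word_ngrams,
            "dim": dim, "loss": loss}


def train_and_evaluate(train_path, test_path, model_path, **params):
    """Train, test, and export a model; requires the fasttext package."""
    import fasttext  # imported lazily so the helpers above work without it

    model = fasttext.train_supervised(input=train_path, **train_kwargs(**params))
    n_samples, precision, recall = model.test(test_path)
    model.save_model(model_path)
    return model, n_samples, precision, recall
```

In this sketch, the training and test lines would be written to two text files, one example per line, before calling `train_and_evaluate`. Predictions on new text could then be made with `model.predict(...)` on input cleaned the same way as the training data.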
To run the application, install the required dependencies listed in `requirements.txt`:

    pip install -r requirements.txt
- Clone the repository:

      git clone https://github.com/baarteek/fastTextProject
      cd fastTextProject

- Install dependencies:

      pip install -r requirements.txt

- Run the application:

      python main.py
The application includes the following views:
- Data Loading
- Data Exploration
- Data Cleaning
- Text Processing
- Label Preparation
- Data Splitting
- Model Configuration
- Model Training
Screenshots are available in the `docs/` directory.