____..--' .-''-. ,---. .--..-./`) ,---------. .---. .---. ,---. .--. .---. .-------.
| | .'_ _ \ | \ | |\ .-.')\ \| | |_ _| | \ | | | ,_| \ _(`)_ \
| .-' ' / ( ` ) '| , \ | |/ `-' \ `--. ,---'| | ( ' ) | , \ | |,-./ ) | (_ o._)|
|.-'.' /. (_ o _) || |\_ \| | `-'`"` | \ | '-(_{;}_) | |\_ \| |\ '_ '`) | (_,_) /
/ _/ | (_,_)___|| _( )_\ | .---. :_ _: | (_,_) | _( )_\ | > (_) ) | '-.-'
.'._( )_ ' \ .---.| (_ o _) | | | (_I_) | _ _--. | | (_ o _) |( . .-' | |
.' (_'o._) \ `-' /| (_,_)\ | | | (_(=)_) |( ' ) | | | (_,_)\ | `-'`-'|___ | |
| (_,_)| \ / | | | | | | (_I_) (_{;}_)| | | | | | | \/ )
|_________| `'-..-' '--' '--' '---' '---' '(_,_) '---' '--' '--' `--------``---'
ZenithNLP is an advanced, from-scratch NLP framework built with PyTorch for training, fine-tuning, and deploying modern transformer-based models. It serves as a comprehensive toolkit for NLP practitioners and researchers, featuring a modular architecture and a full suite of MLOps capabilities.
- β¨ Features
- π Getting Started
- π Tutorial: Training a Text Classifier
- ποΈ Framework Architecture
- π€ Contributing
- π License
- State-of-the-Art Model Architectures: From-scratch implementations of:
BERT(Encoder-only) for tasks like classification and NER.GPT(Decoder-only) for causal language modeling and text generation.Seq2SeqTransformer(Encoder-Decoder) for translation and summarization.
- Advanced Training Techniques:
- Parameter-Efficient Fine-Tuning (PEFT): Integrated LoRA (Low-Rank Adaptation) for efficient fine-tuning of large models.
- Distributed Training: Support for multi-GPU training using PyTorch's
DistributedDataParallel. - Advanced Optimization: Includes learning rate scheduling with warm-up and gradient clipping.
- Full MLOps Pipeline:
- Configuration Management: Powered by Hydra, allowing for flexible and reproducible experiments through YAML files.
- Experiment Tracking: Integrated with MLflow to log parameters, metrics, and model artifacts automatically.
- Containerization: Fully containerized with Docker and Docker Compose for reproducible environments and easy deployment of the MLflow UI.
- Continuous Integration: Automated testing pipeline with GitHub Actions and
pytest.
- Flexible API for Deployment:
- A ready-to-use FastAPI server that can dynamically load and serve any model trained with the framework.
- Custom Core Components:
- A trainable Byte-Pair Encoding (BPE) Tokenizer built from scratch.
- Modular implementations of
MultiHeadAttention,PositionalEncoding, and other core transformer building blocks.
Note: Once published, you will be able to install the framework directly from PyPI.
pip install zenith-nlp-framework# 1. Clone the repository
git clone https://github.com/cattolatte/zenith-nlp-framework.git
cd zenith-nlp-framework
# 2. Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate
# 3. Install all dependencies
pip install -r requirements.txt
# 4. Install the project in editable mode
pip install -e .This framework is designed for flexibility. Hereβs how you can train your own text classification model.
Place your training data (e.g., my_data.csv) in a local data/ directory. Use the configs/ directory as a template. You can modify config.yaml or create a new one to point to your data file and adjust model/training parameters.
Run the text classification task script. All parameters are managed by the Hydra configuration files in the configs/ directory.
# Run with default settings from the config files
python3 -m my_nlp_framework.tasks.text_classificationYou can easily override any parameter from the command line:
# Train for more epochs with a different learning rate
python3 -m my_nlp_framework.tasks.text_classification training.epochs=10 training.learning_rate=0.0005
# Train with LoRA enabled
python3 -m my_nlp_framework.tasks.text_classification model.use_lora=True model.lora_rank=8Before training, launch the MLflow UI to track your experiments in real-time. The docker-compose.yml file is pre-configured for you.
# Start the MLflow server in the background
docker-compose up -dNavigate to http://localhost:5000 in your browser to view the MLflow dashboard.
Once you have a trained model (.pth file) and tokenizer (.json file), you can easily deploy it with the built-in FastAPI server.
python3 -m my_nlp_framework.inference.api \
--model-path /path/to/your/trained_model.pth \
--tokenizer-path /path/to/your/tokenizer.json \
--vocab-size 10000 \
--num-classes 2The API will be available at http://localhost:8000/docs for interactive testing.
You can also run the entire training process within a Docker container for perfect reproducibility.
# 1. Build the Docker image
docker build -t zenith-nlp-framework:latest .
# 2. Run a task (mounting your local data directory)
docker run --rm -v "$(pwd)/data":/app/data zenith-nlp-framework:latest \
python -m my_nlp_framework.tasks.text_classificationThis framework is organized into several key modules:
src/my_nlp_framework/core: Contains the fundamental building blocks like attention mechanisms, LoRA layers, and tokenizers.src/my_nlp_framework/models: Defines high-level model architectures like BERT and GPT.src/my_nlp_framework/data: Includes flexible data loaders.src/my_nlp_framework/training: A powerful, centralized training engine with advanced features.src/my_nlp_framework/tasks: Example scripts that show how to use the framework to solve end-to-end problems.src/my_nlp_framework/inference: Code for deploying and serving trained models.configs/: Centralized YAML configuration files for Hydra.tests/: Unit and integration tests for the framework.
Contributions are welcome! Please feel free to submit a pull request or open an issue.
This project is licensed under the MIT License. See the LICENSE file for details.