This repository provides a comprehensive guide to trans-tokenization and fine-tuning of Mistral-7B models. The primary objective is to transfer large language models (LLMs) to new target languages by remapping their tokenizers and then fine-tuning them.
- Overview
- Installation
- Trans-Tokenization
- Fine-Tuning
- Dataset
- Time Estimation
- Supported Tools
- Understanding Tokenization
- Contributing
- License
- Contact
This guide covers:
- Setting up the base model and tokenizer.
- Implementing trans-tokenization.
- Fine-tuning the Mistral-7B model.
- Understanding tokenization challenges with Latin and non-Latin scripts.
git clone https://github.com/FremyCompany/fast_align
cd fast_align
mkdir build
cd build
cmake ..
make
cd ../.. # return to the original folder
conda create -n llama-env python=3.10
conda activate llama-env
pip install -r requirements.txt
To perform trans-tokenization, you will need two key files:
Trans-tokenization process: take a model fine-tuned in the source language (e.g. mistralai/Mistral-7B-Instruct-v0.3, source language: English), trans-tokenize it using a tokenizer trained on the target language, and you obtain the resulting model (subhrokomol/hindi2, target language: Hindi).
transtokenization.py
run.py
Example Setup in run.py:
from transtokenizers import create_aligned_corpus

source_model = "meta-llama/Meta-Llama-3-8B"
target_tokenizer = "yhavinga/gpt-neo-1.3B-dutch"
export_dir = "en-nl-llama3-8b"
corpus = create_aligned_corpus(
source_language="en",
target_language="nl",
source_tokenizer=source_model,
target_tokenizer=target_tokenizer,
)
You can view the list of supported languages in the CCMATRIX_MAPPING section of transtokenizers.py.
Select the dataset by setting corpus_list; supported options include open_subtitles and allenai/nllb:
corpus_list = ["allenai/nllb"]
After setting up the configurations, run the run.py script in your conda environment:
python run.py
This script will:
- Import the necessary functions from transtokenizers.py.
- Automatically iterate through the data to create the aligned corpus.
- Align tokens using fast_align.
- Smooth the token mapping and remap the model.

The final output model will be saved in your specified export directory.
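For orientation, here is a minimal sketch of what run.py does after `create_aligned_corpus`. The function names (`align`, `map_tokens`, `smooth_mapping`, `remap_model`) follow the upstream trans-tokenizers project that transtokenizers.py is based on; check the file for the exact signatures used in this repository.

```python
import os
from transformers import AutoTokenizer
# Function names below are assumptions taken from the upstream trans-tokenizers
# project; check transtokenizers.py in this repo for the exact signatures.
from transtokenizers import align, map_tokens, smooth_mapping, remap_model

# `corpus`, `source_model`, `target_tokenizer` and `export_dir` are the
# variables defined in the run.py snippet above.

# Word-align the parallel corpus with the fast_align binary built earlier.
mapped_tokens_file = align(corpus, fast_align_path="fast_align/build/fast_align")

# Turn the word alignments into candidate token-to-token translations.
tokenized_translations, untokenized_translations = map_tokens(
    mapped_tokens_file, source_model, target_tokenizer
)

# Smooth the mapping and re-initialise the source model's embeddings with it.
smoothed_mapping = smooth_mapping(target_tokenizer, tokenized_translations)
model = remap_model(source_model, target_tokenizer, smoothed_mapping, source_model)

# Save the trans-tokenized model together with the target tokenizer.
os.makedirs(export_dir, exist_ok=True)
model.save_pretrained(export_dir)
AutoTokenizer.from_pretrained(target_tokenizer).save_pretrained(export_dir)
```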
If you are short on storage, you can stop the script after the dataset download begins. This will create a new folder with the partially downloaded dataset. Upon re-running the script, it will handle the edge case and continue from the next step.
If the token mapping reaches 100%, there may be an issue with your code. Check the Moses file and the TSV file generated after the process for potential errors.
We achieved a translation accuracy of 87% on the Hindi dataset. You can fine-tune this model further to achieve better results.
Using Unsloth AI for PEFT

xformers works with torch 2.3 and above and CUDA 12.1 and above. Refer to the Unsloth documentation for installation: https://docs.unsloth.ai/get-started/installation/conda-install
Head over to unsloth.ipynb to run the fine-tuning code.
You have to adapt the dataset-handling code to your specific needs; the rest of the code remains the same. The dataset used here for fine-tuning is https://huggingface.co/datasets/apurvagup/ultrachat_hindi_seamless
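As a purely hypothetical illustration of the kind of change meant here, assuming the dataset exposes chat-style records in a `messages` column and a `train` split (check the dataset card for its real schema), and using the Mistral instruct tokenizer's chat template:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical adaptation of the dataset-handling cell in unsloth.ipynb.
# The column name "messages" and the split name are assumptions; verify them
# on the dataset page before running.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
dataset = load_dataset("apurvagup/ultrachat_hindi_seamless", split="train")

def format_example(example):
    # Render each conversation into one training string using the chat template.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

dataset = dataset.map(format_example)
```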
If you want to use pure PyTorch with Hugging Face PEFT instead, use fame.py.

Setting Up
- Create a new conda environment:
conda create -n finetune-env python=3.10
conda activate finetune-env
pip install -r requirements.txt
- Login to Hugging Face and Weights & Biases:
huggingface-cli login
wandb login
BASE_MODEL_NAME = "subhrokomol/Meta-Llama-3-8B-Hindi"
DATASET_NAME = "zicsx/mC4-Hindi-Cleaned-3.0"
OUTPUT_DIR_NAME = "fine_tuned_llama"
HF_MODEL_NAME = "subhrokomol/fine-tuned-llama-3-hindi"
UPLOAD_TO_HF = True
SAVE_FULL_MODEL = True
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_NAME, device_map="auto", torch_dtype=torch.bfloat16
)
training_args = TrainingArguments(  # change these as you prefer
output_dir=OUTPUT_DIR_NAME,
num_train_epochs=1,
per_device_train_batch_size=16,
gradient_accumulation_steps=2,
warmup_steps=500,
learning_rate=5e-4,
bf16=True,
logging_steps=50,
save_steps=1000,
eval_steps=500,
report_to="wandb",
run_name="llama-3-8b-finetuning",
)
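To show how these settings come together, here is a minimal sketch of a finetune.py-style training loop. The `text` column name and the 3% slice are assumptions, and the real script may preprocess the data differently.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, Trainer

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# 3% slice of the Hindi corpus; see the time estimates below.
dataset = load_dataset(DATASET_NAME, split="train[:3%]")

def tokenize(batch):
    # The "text" column name is an assumption; check the dataset schema.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

if UPLOAD_TO_HF:
    model.push_to_hub(HF_MODEL_NAME)
    tokenizer.push_to_hub(HF_MODEL_NAME)
```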
- Start fine-tuning: python finetune.py
- The final model will be uploaded to your Hugging Face repository.
- Size: 17 GB
- Number of Rows: 4 Million
- Parameters: 8 Billion
| Dataset Size | Model Parameters | Estimated Time |
|---|---|---|
| 17 GB (3% of dataset) | 8B | 6 hours |
| 17 GB (7% of dataset) | 8B | 12 hours |
| 17 GB (50% of dataset) | 8B | 40 hours |
Note: The estimated times are based on training with the specified model parameters and may vary depending on hardware and other factors.
- 3% of dataset: ~6 hours for 1 epoch
- 7% of dataset: ~12 hours for 1 epoch
- 50% of dataset: ~40 hours (approx.)
| Tool | Link |
|---|---|
| Axolotl | GitHub |
| Hugging Face PEFT | GitHub |
| PyTorch Torchtune | GitHub |
Training a BPE SentencePiece tokenizer is straightforward. You can use the following helper to convert the trained SentencePiece model to the Hugging Face format:
import transformers

def convert_to_hf_format(output_path):
    transformers.LlamaTokenizerFast(vocab_file=output_path + '.model').save_pretrained(output_path + '/')
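For completeness, here is a minimal sketch of training the BPE SentencePiece model itself; the input file name, vocabulary size, and model prefix below are placeholders.

```python
import sentencepiece as spm

# Train a BPE SentencePiece model on a plain-text corpus (placeholder names/values).
spm.SentencePieceTrainer.train(
    input="mixed_corpus.txt",      # plain-text training file
    model_prefix="hindi_bpe",      # writes hindi_bpe.model and hindi_bpe.vocab
    vocab_size=32000,
    model_type="bpe",
    character_coverage=1.0,        # keep full character coverage for Devanagari
)

# Convert the trained model to Hugging Face format with the helper above.
convert_to_hf_format("hindi_bpe")
```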
Challenges

Tokenizer vocabularies contain a much higher percentage of Latin-script tokens than non-Latin-script tokens, so non-Latin text such as Devanagari is split into more tokens per word. This can affect translation accuracy.
(Image: Hindi and English tokenization with Llama 3 8B. See https://huggingface.co/spaces/yenniejun/tokenizers-languages#median-token-length-for-openai-gpt4 for median token lengths across languages.)
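To see the effect yourself, here is a quick comparison of token counts. The model name is just an example (meta-llama/Meta-Llama-3-8B is gated, so substitute any tokenizer you have access to), and the sentences are placeholders.

```python
from transformers import AutoTokenizer

# Compare how many tokens the same short sentence needs in English vs. Hindi.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

english = "How are you today?"
hindi = "आज आप कैसे हैं?"

print(len(tok.tokenize(english)), "tokens for English")
print(len(tok.tokenize(hindi)), "tokens for Hindi")  # usually noticeably more
```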
Use train_tokenizer/train_tokenizer.py to train directly with Hugging Face tokenizers (ByteLevelBPETokenizer) for a GPT-Neo model (GPTNeoForCausalLM). This code uses the wikimedia/wikipedia dataset from Hugging Face to train on Hindi; the resulting model is deployed at https://huggingface.co/subhrokomol/gpt-neo-1.3B-hindi
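Roughly, the script does something like the following; the Wikipedia snapshot name and hyperparameters are illustrative, so check train_tokenizer/train_tokenizer.py for the exact values.

```python
import os
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

# Load a Hindi Wikipedia dump (the snapshot name "20231101.hi" is an assumption;
# use whichever configuration the script actually targets).
dataset = load_dataset("wikimedia/wikipedia", "20231101.hi", split="train")

def text_iterator(batch_size=1000):
    # Yield batches of raw article text for training.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Train a byte-level BPE tokenizer sized for GPT-Neo (GPTNeoForCausalLM).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    text_iterator(),
    vocab_size=50257,                  # GPT-Neo's vocabulary size
    special_tokens=["<|endoftext|>"],
)

os.makedirs("gpt-neo-hindi-tokenizer", exist_ok=True)
tokenizer.save_model("gpt-neo-hindi-tokenizer")
```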
Use train_tokenizer/train_spm_tokenizer.py to train with the SentencePiece (spm) tokenizer. Head over to https://github.com/google/sentencepiece to build and install the SentencePiece command-line tools from the C++ source, then run the Python file. (You need a plain-text dataset file that is half target language and half source language; see the sketch below.)
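One simple way to build such a file, assuming you already have one plain-text file per language (the file names here are placeholders):

```python
# Concatenate equal amounts of target- and source-language text into one file.
# hindi.txt and english.txt are placeholder names for your own corpora.
with open("hindi.txt", encoding="utf-8") as f:
    hindi_lines = f.readlines()
with open("english.txt", encoding="utf-8") as f:
    english_lines = f.readlines()

n = min(len(hindi_lines), len(english_lines))
with open("mixed_corpus.txt", "w", encoding="utf-8") as out:
    out.writelines(hindi_lines[:n])
    out.writelines(english_lines[:n])
```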
I used https://github.com/shreyash321/Hindi_Benchmark_dataset to benchmark both subhrokomol/Mistral-7B-Instruct-v0.3-transtokenized and mistralai/Mistral-7B-Instruct-v0.3 by computing perplexity scores.
https://github.com/ray-project/llmperf is also a good alternative
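If you prefer to compute perplexity directly rather than through a benchmark harness, a minimal sketch looks like this; the example sentence is a placeholder for the benchmark texts.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal perplexity computation over a list of Hindi sentences.
model_name = "subhrokomol/Mistral-7B-Instruct-v0.3-transtokenized"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

sentences = ["यह एक उदाहरण वाक्य है।"]  # replace with the benchmark texts

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        # With labels equal to the inputs, the model returns the mean
        # cross-entropy over the predicted tokens.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
        n = inputs["input_ids"].size(1) - 1  # number of predicted tokens
        total_nll += loss.item() * n
        total_tokens += n

print("perplexity:", math.exp(total_nll / total_tokens))
```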
🌟 Contributing

We welcome contributions from the community! Please read the Contributing Guidelines for more information.

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.