This is the official code for the paper titled "How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?"
For reproduction, please refer to Reproduction.
For models developed in the paper, please refer to Adapted Models.
This code requires the following environment and packages:
- Python 3.12.4 or later
- CUDA 12.4
- torch
- transformers
- peft
- datasets
- evaluate
- bitsandbytes
- scikit-learn
- sentencepiece
- huggingface-hub
- lighteval
- openai
- tqdm
- pyarrow
- entmax
- fastdist
- rouge-score
- numba
- tiktoken
- BLEURT==0.0.2 (See below)
- fasttext==0.9.2 (See below)
After manually installing PyTorch and transformers, please run the following:
```bash
pip install -r requirements.txt

# fastText
git clone https://github.com/facebookresearch/fastText.git
cd fastText
pip install .
cd ..

# BLEURT
git clone https://github.com/google-research/bleurt.git
cd bleurt
pip install .
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
unzip BLEURT-20.zip
cd ..
```
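To sanity-check the two manual installs, a minimal Python snippet along these lines should work. The example sentences are placeholders of ours, and the `bleurt/BLEURT-20` path assumes you unzipped the checkpoint inside the cloned `bleurt` directory as above:

```python
import fasttext  # should import cleanly after installing from the fastText clone

from bleurt import score

# Path assumption: the BLEURT-20 checkpoint was unzipped into bleurt/ as above.
scorer = score.BleurtScorer("bleurt/BLEURT-20")
scores = scorer.score(
    references=["This is a reference sentence."],
    candidates=["This is a candidate sentence."],
)
print(scores)  # one float per candidate; higher means closer to the reference
```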
Please see the preprocessing directory for preprocessing, the initialization directory for target model initialization, the lapt directory for language adaptive pre-training, and the eval directory for evaluation.
All adapted models (168 for Llama2-7B and 48 each for Llama3-8B and Gemma2-9B) are available on the Hugging Face Model Hub. For practical use, we highly recommend the models adapted with Align + 2x2 LS + MTP + 512; the other models are provided for analysis purposes only and are not recommended for practical use. Please see the discussions and recommendations in the paper.
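Since the adapted models are standard Hugging Face checkpoints, loading one should look roughly like the sketch below. The repository ID is a hypothetical placeholder; substitute the actual ID of the model you pick from the tables that follow.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository ID -- replace it with the actual ID of an
# adapted model chosen from the tables below.
model_id = "your-org/llama2-7b-ja-align-2x2ls-mtp-512"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # needs accelerate

inputs = tokenizer("こんにちは、", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```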
Llama2-7B models, by embedding initialization approach. Not recommended for practical use.
Approach | Link |
---|---|
LAPT | ar / my / de / el / hi / ja / si / sw / te / th |
Random | ar / my / de / el / hi / ja / si / sw / te / th |
FOCUS | ar / my / de / el / hi / ja / si / sw / te / th |
Mean | ar / my / de / el / hi / ja / si / sw / te / th |
Merge | ar / my / de / el / hi / ja / si / sw / te / th |
Align | ar / my / de / el / hi / ja / si / sw / te / th |
Llama2-7B models, by training approach. Not recommended for practical use except for the 2x2 LS + MTP + 512 models. If you want to use one of the LoRA variants, see the loading sketch after the table.
Approach | Link |
---|---|
LoRA | |
CLM + 2048 | ar / my / el / hi / si / te |
MTP + 2048 | ar / my / el / hi / si / te |
CLM + 512 | ar / my / el / hi / si / te |
MTP + 512 | ar / my / el / hi / si / te |
2 stage | |
CLM + 2048 | ar / my / el / hi / si / te |
MTP + 2048 | ar / my / el / hi / si / te |
CLM + 512 | ar / my / el / hi / si / te |
MTP + 512 | ar / my / el / hi / si / te |
2x2 LS | |
CLM + 2048 | ar / my / el / hi / si / te |
MTP + 2048 | ar / my / el / hi / si / te |
CLM + 512 | ar / my / el / hi / si / te |
MTP + 512 | ar / my / el / hi / si / te |
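Assuming the LoRA variants are published as PEFT adapters rather than merged checkpoints (this is an assumption; check the individual model cards), loading one could look like the sketch below. The adapter ID is a hypothetical placeholder. Because the vocabulary is expanded, the base embeddings are resized to the adapter's tokenizer before the adapter is attached:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-2-7b-hf"
# Hypothetical adapter ID -- replace with an actual entry from the table above.
adapter_id = "your-org/llama2-7b-si-lora-mtp-512"

# The adapter repository is assumed to ship the expanded tokenizer.
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
base.resize_token_embeddings(len(tokenizer))  # make room for the new vocabulary

model = PeftModel.from_pretrained(base, adapter_id)
```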
Llama2-7B models, by new vocabulary size. Not recommended for practical use except for the models with $|\mathcal{V}_\text{new}|$ = 50 or 100. Please see the discussions and recommendations in the paper.
Approach | my | si | te |
---|---|---|---|
Random | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 |
Mean | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 |
Align | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 |
Llama3-8B models. These may be suitable for practical use.
Approach | Link |
---|---|
LAPT | my / si / te |
Random | my / si / te |
Mean | my / si / te |
Align | my / si / te |
Llama3-8B models, by new vocabulary size. Not recommended for practical use except for the models with $|\mathcal{V}_\text{new}|$ = 50 or 100. Please see the discussions and recommendations in the paper.
Approach | my | si | te |
---|---|---|---|
Random | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 |
Mean | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 |
Align | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 |
Gemma2-9B models. These may be suitable for practical use.
Approach | Link |
---|---|
LAPT | my / si / te |
Random | my / si / te |
Mean | my / si / te |
Align | my / si / te |
Gemma2-9B models, by new vocabulary size. Not recommended for practical use except for the models with $|\mathcal{V}_\text{new}|$ = 50 or 100. Please see the discussions and recommendations in the paper.
Approach | my | si | te |
---|---|---|---|
Random | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 |
Mean | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 |
Align | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 | 50 / 100 / 500 / 1000 / 5000 |
This code is licensed under the MIT License. The models inherit the licenses of their respective original models; please refer to the Hugging Face Model Hub for details.
If you use this code or the models, please cite the following paper:

```bibtex
@article{yamaguchi-etal-2024-effectively,
  title={How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?},
  author={Atsuki Yamaguchi and Aline Villavicencio and Nikolaos Aletras},
  journal={ArXiv},
  year={2024},
  volume={abs/2406.11477},
  url={https://arxiv.org/abs/2406.11477},
}
```