How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?

This is the official code for the paper titled "How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?"

For reproduction, please refer to the Reproduction section below.
For the models developed in the paper, please refer to the Adapted Models section.

Requirements

  • Python 3.12.4 or later
  • CUDA 12.4
  • torch
  • transformers
  • peft
  • datasets
  • evaluate
  • bitsandbytes
  • scikit-learn
  • sentencepiece
  • huggingface-hub
  • lighteval
  • openai
  • tqdm
  • pyarrow
  • entmax
  • fastdist
  • rouge-score
  • numba
  • tiktoken
  • BLEURT==0.0.2 (See below)
  • fasttext==0.9.2 (See below)

Installation

After manually installing PyTorch and transformers, please run the following.

# Install the Python dependencies
pip install -r requirements.txt

# fastText
git clone https://github.com/facebookresearch/fastText.git
cd fastText
pip install .
cd ..

# BLEURT
git clone https://github.com/google-research/bleurt.git
cd bleurt
pip install .
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
unzip BLEURT-20.zip
cd ..
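
To check that both packages were installed correctly, the following minimal Python sketch (assuming the BLEURT-20 checkpoint was unzipped inside the cloned bleurt directory, as in the commands above) loads the checkpoint and scores a trivial reference/candidate pair:

import fasttext  # import check only; fastText models are downloaded separately
from bleurt import score

# Path assumption: BLEURT-20 was unzipped inside the cloned bleurt/ directory.
scorer = score.BleurtScorer("bleurt/BLEURT-20")
print(scorer.score(references=["hello world"], candidates=["hello world"]))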

Reproduction

1. Preprocessing

Please see the preprocessing directory for preprocessing.

2. Target model initialization

Please see the initialization directory for target model initialization.

3. Language adaptive pre-training

Please see the lapt directory for language adaptive pre-training.

4. Evaluation

Please see the eval directory for evaluation.

Adapted Models

All adapted models (168 models for Llama2-7B and 48 models each for Llama3-8B and Gemma2-9B) are available on the Hugging Face Model Hub. For practical use, we highly recommend the models adapted with Align + 2x2 LS + MTP + 512. The other models are provided for analysis purposes only and are not recommended for practical use. Please see the discussions and recommendations in the paper.
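
As a quick illustration of how an adapted model could be loaded with transformers, here is a minimal sketch; the repository ID is a placeholder, so substitute the actual model name from the links below (some models may additionally require the peft library if they are distributed as LoRA adapters):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository ID -- replace with an actual adapted model name from the Hub.
model_id = "<username>/<adapted-model-name>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Example prompt in the target language", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))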

Llama2

Models used for target vocabulary initialization method analysis

Not recommended for practical use

Approach Link
LAPT ar / my / de / el / hi / ja / si / sw / te / th
Random ar / my / de / el / hi / ja / si / sw / te / th
FOCUS ar / my / de / el / hi / ja / si / sw / te / th
Mean ar / my / de / el / hi / ja / si / sw / te / th
Merge ar / my / de / el / hi / ja / si / sw / te / th
Align ar / my / de / el / hi / ja / si / sw / te / th

Models used for training strategy analysis

Not recommended for practical use except for 2x2 LS + MTP + 512 models

Approach Link
LoRA
  CLM + 2048  ar / my / el / hi / si / te
  MTP + 2048  ar / my / el / hi / si / te
  CLM + 512   ar / my / el / hi / si / te
  MTP + 512   ar / my / el / hi / si / te
2 stage
  CLM + 2048  ar / my / el / hi / si / te
  MTP + 2048  ar / my / el / hi / si / te
  CLM + 512   ar / my / el / hi / si / te
  MTP + 512   ar / my / el / hi / si / te
2x2 LS
  CLM + 2048  ar / my / el / hi / si / te
  MTP + 2048  ar / my / el / hi / si / te
  CLM + 512   ar / my / el / hi / si / te
  MTP + 512   ar / my / el / hi / si / te

Models used for target vocabulary size analysis

Not recommended for practical use, except for the models with $|\mathcal{V}_\text{new}| = 50$ or $100$. Please see the discussions and recommendations in the paper.

Approach my si te
Random 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000
Mean 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000
Align 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000

Llama3

Models adapted using 2x2 LS + MTP + 512

These models may be suitable for practical use.

Approach Link
LAPT my / si / te
Random my / si / te
Mean my / si / te
Align my / si / te

Models used for target vocabulary size analysis

Not recommended for practical use, except for the models with $|\mathcal{V}_\text{new}| = 50$ or $100$. Please see the discussions and recommendations in the paper.

Approach my si te
Random 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000
Mean 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000
Align 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000

Gemma2

Models adapted using 2x2 LS + MTP + 512

These models may be suitable for practical use.

Approach Link
LAPT my / si / te
Random my / si / te
Mean my / si / te
Align my / si / te

Models used for target vocabulary size analysis

Not recommended for practical use, except for the models with $|\mathcal{V}_\text{new}| = 50$ or $100$. Please see the discussions and recommendations in the paper.

Approach my si te
Random 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000
Mean 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000
Align 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000 50 / 100 / 500 / 1000 / 5000

License

This code is licensed under the MIT License. The models are licensed under the respective licenses of the original models. Please refer to the Hugging Face Model Hub for the licenses of the models.

Citation

If you use this code or models, please cite the following paper.

@article{yamaguchi-etal-2024-effectively,
    title={How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?},
    author={Atsuki Yamaguchi and Aline Villavicencio and Nikolaos Aletras},
    year={2024},
    journal={ArXiv},
    volume={abs/2406.11477},
    url={https://arxiv.org/abs/2406.11477},
}
