This is the official implementation of the paper Knowledge Distillation of Russian Language Models with Reduction of Vocabulary.
If you find this code or the results from the paper useful, we kindly ask you to cite the paper:
@misc{https://doi.org/10.48550/arxiv.2205.02340,
doi = {10.48550/ARXIV.2205.02340},
url = {https://arxiv.org/abs/2205.02340},
author = {Kolesnikova, Alina and Kuratov, Yuri and Konovalov, Vasily and Burtsev, Mikhail},
keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
title = {Knowledge Distillation of Russian Language Models with Reduction of Vocabulary},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
This code is based on the official DistilBERT implementation: https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation.
Install the requirements:
pip install -r requirements.txt
For training you'll need the following files and folders (a short sketch for inspecting the pickled files follows the list):
- processed_binarized folder (binarized shards)
- rubert_tiny_weights.pth (student weights for initialization)
- ru_convers (teacher with LM head and fixed configs)
- distilrubert_tiny_cased_convers (not yet trained student)
- teacher2student.pickle (mapping dict)
- t2s_padded.pickle, s2t_padded.pickle (padded matrices)
- teacher_counts.pickle, student_counts.pickle (token counts for sampling to generate masks)
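If you want to sanity-check the pickled artifacts before training, a minimal sketch is shown below. It only assumes that these are standard pickle files; their exact contents are produced by the preprocessing scripts.

```python
import pickle

# NOTE: file names come from the list above; the exact structure of each object
# (dict, matrix, count table) is defined by the preprocessing scripts.
with open("teacher2student.pickle", "rb") as f:
    teacher2student = pickle.load(f)  # teacher-to-student vocabulary mapping

with open("t2s_padded.pickle", "rb") as f:
    t2s_padded = pickle.load(f)       # padded teacher-to-student matrix

with open("teacher_counts.pickle", "rb") as f:
    teacher_counts = pickle.load(f)   # token counts used to sample masks

print(type(teacher2student), type(t2s_padded), type(teacher_counts))
```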
The ./preprocessing/ folder (not needed for training) contains:
- train_requced_tok.py -- vocabulary trainer
- teacher_tokens_split.py
- student2teacher.py
- encode_shards.py -- tokens to ids, outputs (teacher_ids, student_ids)
- prepare_lm_seqs.py -- filtering (split too long sequences, remove sequences with too many unk tokens, etc.)
- regroup_binarized.py -- split shards into equal sizes
- token_counts.py -- compute token id counts in the processed data
- init_weights.py -- student initialization
- find_matching_ids.py -- add a mask of matching teacher and student ids
- matched_vocab.py -- outputs a dict of the form {matched_vocab_token: [teacher_id, student_id]} (see the sketch after this list)
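For illustration only, a dict of the form {matched_vocab_token: [teacher_id, student_id]} can be sketched with Hugging Face tokenizers. The checkpoint names below are placeholders rather than the exact models used in the paper, and the repository's own scripts should be used to produce the real mapping.

```python
from transformers import AutoTokenizer

# Placeholder checkpoints: substitute the actual teacher and student tokenizers.
teacher_tok = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased-conversational")
student_tok = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny")

teacher_vocab = teacher_tok.get_vocab()  # token -> id in the teacher vocabulary
student_vocab = student_tok.get_vocab()  # token -> id in the student vocabulary

# Tokens present in both vocabularies, mapped to their [teacher_id, student_id] pair.
matched_vocab = {
    token: [teacher_vocab[token], student_vocab[token]]
    for token in teacher_vocab.keys() & student_vocab.keys()
}
print(f"{len(matched_vocab)} tokens are shared by the two vocabularies")
```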
To run training (you'll probably need to change the GPU count before running):
chmod +x {script_name}.sh
./{script_name}.sh
Required scripts:
- train.py -- wrapper to run training
- distiller.py -- trainer
- lm_seqs_dataset.py -- batch generator
- custom_step.py -- functions for one train/valid step with different losses
- my_index.py -- backward optimization
- grouped_batch_sampler.py -- groups batches by length to reduce padding (see the sketch after this list)
- utils.py -- auxiliary utils for training
- setup_logger.py -- initializes a file logger so that several separate scripts can write logs to the same file [optional]
- delta.py -- functions to precompute curvature
- hyptorch/ -- hyperbolic layers and related code
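As a rough, stand-alone illustration of the idea behind grouped_batch_sampler.py (the actual implementation in this repository differs), sequences can be bucketed by length so that each batch needs little padding:

```python
import random
from torch.utils.data import Sampler


class LengthGroupedBatchSampler(Sampler):
    """Simplified sketch: sort example indices by sequence length, cut them into
    batches, and shuffle the batches, so each batch holds sequences of similar
    length and per-batch padding stays small."""

    def __init__(self, lengths, batch_size, shuffle=True):
        self.lengths = lengths        # one length per example in the dataset
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batches = [order[i:i + self.batch_size]
                   for i in range(0, len(order), self.batch_size)]
        if self.shuffle:
            random.shuffle(batches)
        yield from batches

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size
```

Such a sampler can be passed to torch.utils.data.DataLoader through the batch_sampler argument, together with a collate function that pads each batch only up to its own maximum length.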
The distil-finetuned-en folder contains scripts for fine-tuning teachers on GLUE and for distillation.