This is the official implementation of the paper Knowledge Distillation of Russian Language Models with Reduction of Vocabulary.
If you find this code or the results from the paper useful, we kindly ask you to cite the paper:
@misc{https://doi.org/10.48550/arxiv.2205.02340,
doi = {10.48550/ARXIV.2205.02340},
url = {https://arxiv.org/abs/2205.02340},
author = {Kolesnikova, Alina and Kuratov, Yuri and Konovalov, Vasily and Burtsev, Mikhail},
keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
title = {Knowledge Distillation of Russian Language Models with Reduction of Vocabulary},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
This code is based on the official DistilBERT implementation: https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation.
Install the requirements:
pip install -r requirements.txt
For training you'll need the following files and folders (a short sketch for inspecting the pickled files follows the list):
- processed_binarized folder (binarized shards)
- rubert_tiny_weights.pth (student weights for initialization)
- ru_convers (teacher with LM head and fixed configs)
- distilrubert_tiny_cased_convers (not yet trained student)
- teacher2student.pickle (mapping dict)
- t2s_padded.pickle, s2t_padded.pickle (padded matrices)
- teacher_counts.pickle, student_counts.pickle (token counts for sampling to generate masks)
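If you want to sanity-check the pickled artifacts before training, a minimal sketch is shown below. It only assumes that these are standard pickle files; their exact contents are produced by the preprocessing scripts.

```python
import pickle

# NOTE: file names come from the list above; the exact structure of each object
# (dict, matrix, count table) is defined by the preprocessing scripts.
with open("teacher2student.pickle", "rb") as f:
    teacher2student = pickle.load(f)  # teacher-to-student vocabulary mapping

with open("t2s_padded.pickle", "rb") as f:
    t2s_padded = pickle.load(f)       # padded teacher-to-student matrix

with open("teacher_counts.pickle", "rb") as f:
    teacher_counts = pickle.load(f)   # token counts used to sample masks

print(type(teacher2student), type(t2s_padded), type(teacher_counts))
```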
The ./preprocessing/ folder (not needed for training) contains:
- train_requced_tok.py -- vocabulary trainer
- teacher_tokens_split.py
- student2teacher.py
- encode_shards.py -- tokens to ids, outputs (teacher_ids, student_ids)
- prepare_lm_seqs.py -- filtering (split too long sequences, remove sequences with too many unk tokens, etc.)
- regroup_binarized.py -- split shards into equal sizes
- token_counts.py -- compute token id counts in the processed data
- init_weights.py -- student initialization
- find_matching_ids.py -- add a mask of matching teacher and student ids
- matched_vocab.py -- outputs a dict of the form {matched_vocab_token: [teacher_id, student_id]} (see the sketch after this list)
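For illustration only, a dict of the form {matched_vocab_token: [teacher_id, student_id]} can be sketched with Hugging Face tokenizers. The checkpoint names below are placeholders rather than the exact models used in the paper, and the repository's own scripts should be used to produce the real mapping.

```python
from transformers import AutoTokenizer

# Placeholder checkpoints: substitute the actual teacher and student tokenizers.
teacher_tok = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased-conversational")
student_tok = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny")

teacher_vocab = teacher_tok.get_vocab()  # token -> id in the teacher vocabulary
student_vocab = student_tok.get_vocab()  # token -> id in the student vocabulary

# Tokens present in both vocabularies, mapped to their [teacher_id, student_id] pair.
matched_vocab = {
    token: [teacher_vocab[token], student_vocab[token]]
    for token in teacher_vocab.keys() & student_vocab.keys()
}
print(f"{len(matched_vocab)} tokens are shared by the two vocabularies")
```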
To run training (you'll probably need to change the GPU count before running):
chmod +x {script_name}.sh
./{script_name}.sh
Required scripts:
- train.py -- wrapper to run training
- distiller.py -- trainer
- lm_seqs_dataset.py -- batch generator
- custom_step.py -- functions for one train/valid step with different losses
- my_index.py -- backward optimization
- grouped_batch_sampler.py -- groups batches by length to reduce padding (see the sketch after this list)
- utils.py -- auxiliary utils for training
- setup_logger.py -- initializes a file logger so that several separate scripts can write logs to the same file [optional]
- delta.py -- functions to precompute curvature
- hyptorch/ -- hyperbolic layers and related code
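As a rough, stand-alone illustration of the idea behind grouped_batch_sampler.py (the actual implementation in this repository differs), sequences can be bucketed by length so that each batch needs little padding:

```python
import random
from torch.utils.data import Sampler


class LengthGroupedBatchSampler(Sampler):
    """Simplified sketch: sort example indices by sequence length, cut them into
    batches, and shuffle the batches, so each batch holds sequences of similar
    length and per-batch padding stays small."""

    def __init__(self, lengths, batch_size, shuffle=True):
        self.lengths = lengths        # one length per example in the dataset
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batches = [order[i:i + self.batch_size]
                   for i in range(0, len(order), self.batch_size)]
        if self.shuffle:
            random.shuffle(batches)
        yield from batches

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size
```

Such a sampler can be passed to torch.utils.data.DataLoader through the batch_sampler argument, together with a collate function that pads each batch only up to its own maximum length.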
The distil-finetuned-en folder contains scripts for fine-tuning teachers on GLUE and for distillation.