In this project created for EPFL's CS-433: Machine Learning, we explore the use of a transformer model for page-prefetching.
You can find the requirements in the `requirements.txt` file under `models/transformer/`. To install them, run the following command:

```sh
pip install -r requirements.txt
```

The project is structured as follows:

```
data/
├── prepro/
├── processed/
└── raw/
dataset_gathering/
models/
├── config.py
├── train.py
├── infer.py
├── model.py
├── data_parser.py
├── dataset.py
├── make_tokens.py
├── runs/
└── trainings/
```
- In `data` you can find our different raw data. The raw data directly collected can be found in `raw`, the preprocessed data in `prepro`, and the processed, fully-cleaned data which we use in training in `processed`.
- `dataset_gathering` contains the code used to collect the raw data.
- `models` contains the code used to train and evaluate the model, as well as the code of the transformer model itself:
  - `config.py` can be used to tweak its parameters, e.g. the number of layers, the number of heads, the number of epochs, etc.
  - `train.py` contains the code to train the model
  - `infer.py` contains the code to use the model for inference
  - `model.py` contains the code of the model itself
  - `data_parser.py` contains the code to parse the data
  - `dataset.py` contains the code to create the dataset, parsing the raw data into a structure usable by our model and tokenizing it
  - `make_tokens.py` contains the tokenizer code, both for the input and the output
  - `runs/` contains the TensorBoard logs, which you can use to visualize the training
  - `trainings/` contains the saved models, which you can use for inference and for further training
In `config.py` you can tweak many of the model's parameters, such as the number of layers, the number of heads, and the number of epochs, as well as parameters for the tokenizers. Here, we explain how these parameters affect the model and what values they can take.
- `DATA_PATH`: the folder where the data file is
- `OBJDUMP_PATH`: the folder where the `objdump` output for libraries used by traced programs is
- `GENERATOR_PREFIX`: prefix of the folder where the generator will be saved
- `SEED_FN`: the seed to use for the generator
- `STATE_FN`: the name of the state to use for generator state saving / loading
- `TRACE_TYPE`: the type of trace to use, either `fltrace` or `bpftrace`
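For illustration, these settings might be defined in `config.py` roughly as follows; the values below are placeholders, not the project's actual paths or names.

```python
# Illustrative values only -- the real config.py defines its own paths and names.
DATA_PATH = "data/processed/"      # folder where the data file is
OBJDUMP_PATH = "data/objdump/"     # objdump output for libraries used by traced programs
GENERATOR_PREFIX = "generator_"    # prefix of the folder where the generator is saved
SEED_FN = "generator_seed.txt"     # seed used by the generator
STATE_FN = "generator_state.pkl"   # generator state saving / loading
TRACE_TYPE = "fltrace"             # either "fltrace" or "bpftrace"
```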
`BPF_FEATURES` is a list of features used to train / infer with the model, collected with the BPF method. It contains:

- `prev_faults`: a list of hex addresses of the previous page faults
- `flags`: a bitmap containing the flags of the page fault, which includes `rW`
- `ip`: the instruction pointer of the CPU
- `regs`: a list of the values of the CPU registers
`FL_FEATURES` is a list of features used to train / infer with the model, collected with `fltrace`. It contains:

- `prev_faults`: a list of hex addresses of the previous page faults
- `rW`: whether the page was read from or written to
- `ips`: the stack trace of the program
`OUTPUT_FEATURES` contains the output features of the model, which by default has only one element: the hex addresses of the next pages to prefetch.
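As a rough sketch, the three feature lists could be plain lists of feature-name strings, as below; the actual definitions in `config.py` may be structured differently, and the output feature name shown here is made up for illustration.

```python
# Illustrative only -- the actual lists in config.py may be structured differently.
BPF_FEATURES = ["prev_faults", "flags", "ip", "regs"]  # features collected with the BPF method
FL_FEATURES = ["prev_faults", "rW", "ips"]             # features collected with fltrace
OUTPUT_FEATURES = ["next_faults"]                      # hypothetical name for the single output feature
```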
Transformer parameters are set in the `TransformerModelParams` class and Meta Transformer parameters in `MetaTransformerParams`, both in `config.py`. They are:
- `d_model`: the dimension of the model
- `T`: the number of transformer block layers
- `H`: the number of attention heads per transformer layer
- `dropout`: the dropout rate
- `d_ff`: the dimension of the feed-forward layer
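For illustration, such a parameter class could be a small dataclass like the sketch below; the field names follow the list above, but the default values are placeholders rather than the project's real settings.

```python
from dataclasses import dataclass

# Sketch only -- field names follow the list above, default values are placeholders.
@dataclass
class TransformerModelParams:
    d_model: int = 512     # dimension of the model
    T: int = 6             # number of transformer block layers
    H: int = 8             # attention heads per transformer layer
    dropout: float = 0.1   # dropout rate
    d_ff: int = 2048       # dimension of the feed-forward layer
```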
The configuration is created in the `get_config` function in `config.py`; you can modify most parameters there.
- `bpe_special_tokens`: the special tokens used by the tokenizers. Default: `[UNK]`
- `pad_token`: the padding token used by the tokenizers. Default: `[PAD]`
- `list_elem_separation_token`: the token used to separate elements in a list. Default: ` ` (space). For usage, see the comment in `TokenizerWrapper` of `special_tokenizers.py`
- `feature_separation_token`: the token used to separate features in a list. Default: `[FSP]`
- `start_stop_generating_tokens`: the tokens used to indicate the start and the end of a sentence. Default: `[GTR]` and `[GTP]`
- `batch_size`: the batch size to use for training. Default: 16
- `num_epochs`: the number of epochs to train for. Default: 10
- `lr`: the learning rate to use for training. Default: 1e-4
- `trace_type`: the type of trace to use, see `TRACE_TYPE` above. Default: `fltrace`; alternative: `bpftrace`
- `train_on_trace`: for `fltrace` only, whether we train on one trace or multiple
- `datasource`: name of the benchmark on which we train
- `subsample`: the subsampling rate to use for the data. Default: 0.2
- `objdump_path`: see `OBJDUMP_PATH` above
- `model_folder`: the folder where the model will be saved. Default: `models`
- `preload`: which version of the model to preload. Default: `latest` (takes the highest epoch number)
- `tokenizer_files`: format string path to the tokenizer files. Default: `trained_tokenizers/[TRACETYPE]/tokenizer_[src,tgt].json`
- `train_test_split`: the train / test split to use. Default: 0.75
- `attention_model`: the type of model to use for attention. Default: `transformer`; alternative: `retnet`
- `attention_model_params`: the parameters of the attention model. Default: `TransformerModelParams`; not needed for RetNet
- `decode_algorithm`: the algorithm to use for decoding. Default: `beam` (beam search); alternative: `greedy` (greedy decode)
- `beam_size`: if `decode_algorithm` is `beam`, the beam size to use. Default: 4
- `past_window`: the size of the past window of previous faults to use. Default: 10
- `k_predictions`: the number of predictions to make. Default: 10
- `code_window`: tuple of the number of instructions before and after the instruction pointer, i.e. the code window around the IP. Default: (1, 2)
- `input_features`: the features to use as input. Default: `BPF_FEATURES`; alternative: `FL_FEATURES`
- `output_features`: the features to use as output. Default: `OUTPUT_FEATURES`
- `base_tokenizer`: the base tokenizer to use. Default: `hextet`; alternatives: `bpe`, `text`. See the tokenizers section.
- `embedding_technique`: the embedding technique used on the tokens. See the embeddings section. Default: `tok_concat`; alternatives: `onetext`, `meta_transformer` and `embed_concat`
- `meta_transformer_params`: the parameters of the meta transformer. Default: `MetaTransformerParams`
- `page_masked`: for `bpftrace` only, map all address accesses to their page numbers
- `max_weight_save_history`: used when `mass_train == True` in training; defines how many epochs we should save at most. Default: 3
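For instance, assuming `get_config` returns a plain dict keyed by the names above (adapt this if it returns a different structure), tweaking a few values could look like the sketch below. In practice, you would typically make such changes inside `get_config` itself so that `train.py` picks them up.

```python
from config import get_config

# Illustrative tweak -- assumes get_config() returns a dict keyed by the parameter names above.
cfg = get_config()
cfg["batch_size"] = 32              # larger batches, if memory allows
cfg["num_epochs"] = 20              # train for longer
cfg["decode_algorithm"] = "greedy"  # greedy decoding instead of beam search
cfg["past_window"] = 16             # look further back in the fault history
```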
In `models/trained_tokenizers/special_tokenizers.py`, we define generic classes of tokenizers, which are then trained on a specific vocabulary.
We have three generic classes:
- `SimpleCustomVocabTokenizer`
- `TokenizerWrapper`
- `ConcatTokenizer`

Details can be found in the docstrings of the classes.
To train the model on our dataset, simply run the `train.py` script:

```sh
python train.py
```

Important! This assumes you have already copied the dataset into the right folders, as specified above.

You can tweak the parameters of the model in the `config.py` file.
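Training progress can be followed with TensorBoard, since the logs are written to the `runs/` folder mentioned above; for example, running `tensorboard --logdir models/runs` (adjust the path to wherever your `runs/` folder lives) serves the dashboards locally.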
To use the model for inference, simply run the `infer.py` script:

```sh
python infer.py
```

You can define your input string in the `infer.py` file (`data` parameter) and the maximum length of the output (`max_length` parameter).
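For example, the two values could be set like this inside `infer.py`; the input string below is a placeholder, since the exact format depends on the chosen input features and tokenizer.

```python
# Placeholders only -- edit the corresponding values in infer.py.
data = "<input string, formatted according to the chosen input features>"
max_length = 32  # maximum length of the generated output
```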
Victor Garvalov @vigarov, Alex Müller @ktotam1, Thibault Czarniak @t1b00.
Thank you to Professors Martin Jaggi, Nicolas Flammarion, Sasha, and our wonderful TAs.