Today, many studies apply neural networks to software engineering tasks such as comment generation, code search, and clone detection. Among them, the program translation task requires a model to translate source code into target code without changing its functionality. The model must therefore understand the semantics of the source code and generate code that follows the specifications of the target programming language.
This repository investigates the Transformer as a program translation baseline. The CodeTrans dataset is available in CodeXGLUE/CodeTrans.
In addition, our model has several features, such as:
- simple modification of parameters
- gradient accumulation
- `tf.function` acceleration
- multi-GPU training
- mixed precision (float16 and float32)
It should be noted that the gradient accumulation function is copied from OpenNMT-tf; a simplified sketch is shown below.
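For illustration only, here is a minimal, self-contained sketch of gradient accumulation in TensorFlow 2. This is not the repository's actual code (which follows OpenNMT-tf); the toy model, optimizer, and `accum_steps` value are placeholder choices:

```python
import tensorflow as tf

# Toy model; the repository uses a Transformer instead.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
optimizer = tf.keras.optimizers.Adam(5e-5)
loss_fn = tf.keras.losses.MeanSquaredError()
accum_steps = 4  # number of micro-batches summed into one update

# One non-trainable accumulator per trainable weight.
accum = [tf.Variable(tf.zeros_like(v), trainable=False)
         for v in model.trainable_variables]

@tf.function  # graph compilation, i.e. the tf.function acceleration above
def micro_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    for acc, g in zip(accum, grads):
        acc.assign_add(g)
    return loss

@tf.function
def apply_accumulated():
    # Average the summed gradients, take one optimizer step, then reset.
    optimizer.apply_gradients(
        [(acc / accum_steps, v) for acc, v in zip(accum, model.trainable_variables)])
    for acc in accum:
        acc.assign(tf.zeros_like(acc))
```

A training loop would call `micro_step` on every batch and `apply_accumulated` every `accum_steps` batches, which simulates a batch `accum_steps` times larger without extra memory.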
The following packages are required:

- tensorflow 2
- tokenizers
- numpy
- tree-sitter
Besides, evaluating the output requires PyCharm; see step 4 of the usage instructions below. (To be honest, my programming skills are limited.)
The `./data` folder stores datasets, vocabulary, references, model checkpoints, and predicted code. The `./evaluator` folder holds the evaluation metrics, which are taken from CodeTrans. The `./network` and `./util` folders store the model and preprocessing files.
The hyperparameters are collected in the `config` dict in `train.py`; to change a setting, edit the value of the corresponding key. Note that `"swap datasets by dictionary order": False` means the model translates from the programming language whose name comes first in dictionary order to the other one; setting it to `True` presumably reverses the direction. A hypothetical sketch of the dict is shown below.
1. Save the dataset as files named `keyword.file_name.language` (e.g., `train.code.java`, where `code` is an arbitrary file name) under `./data/dataset_name/source/`, where `keyword` is one of `[train, valid, test]` and `language` is a programming language that tree-sitter can parse.
2. Run `prepare_data.py` to preprocess the dataset.
3. Run `train.py` to create the Transformer model and generate output.
4. Run `metric_eval.py` to evaluate the output in terms of the BLEU, EM, and CodeBLEU metrics.
Step 4 needs to be run in PyCharm: select the `evaluator/CodeBLEU` folder and mark the directory as a sources root (a command-line alternative is sketched below).
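If PyCharm is not available, the same effect can presumably be achieved by putting the folder on the import path before importing the metric code, since marking a sources root essentially adds the directory to the interpreter's search path; a minimal sketch:

```python
import sys

# Hypothetical command-line alternative to PyCharm's "mark directory as
# sources root": make the CodeBLEU package importable directly.
sys.path.insert(0, "evaluator/CodeBLEU")
```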
Note that I did not set up learning-rate warmup, because of the high learning rate and the small number of training steps.
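For context, the warmup schedule from the original Transformer paper ("Attention Is All You Need"), which this repository skips, increases the learning rate linearly for `warmup_steps` steps and then decays it with the inverse square root of the step number; a reference implementation:

```python
import tensorflow as tf

class NoamSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Warmup schedule from "Attention Is All You Need" (NOT used in this repo)."""

    def __init__(self, d_model=768, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        # lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
        step = tf.cast(step, tf.float32)
        return tf.math.rsqrt(self.d_model) * tf.minimum(
            tf.math.rsqrt(step), step * self.warmup_steps ** -1.5)

# Usage sketch: optimizer = tf.keras.optimizers.Adam(NoamSchedule())
```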
Java to C#
| model | layers | hidden size | learning rate | BLEU | Exact Match | CodeBLEU |
|---|---|---|---|---|---|---|
| Transformer-baseline | 12 | 768 | - | 55.84 | 33.0 | 63.74 |
| Transformer | 12 | 768 | 1e-4 | 50.64 | 31.3 | 58.24 |
| Transformer | 12 | 768 | 5e-5 | 53.01 | 35.2 | 60.98 |
C# to Java
| model | layers | hidden size | learning rate | BLEU | Exact Match | CodeBLEU |
|---|---|---|---|---|---|---|
| Transformer-baseline | 12 | 768 | - | 50.47 | 37.9 | 61.59 |
| Transformer | 12 | 768 | 1e-4 | 45.01 | 31.4 | 53.06 |
| Transformer | 12 | 768 | 5e-5 | 45.91 | 33.0 | 53.89 |
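For reference, Exact Match (EM) above is the percentage of generated programs that are textually identical to the reference. A minimal sketch of that computation follows; the repository's `metric_eval.py` may differ in details such as whitespace handling:

```python
def exact_match(predictions, references):
    # Percentage of predictions identical to their reference after stripping
    # surrounding whitespace (a hypothetical simplification).
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)
```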
My research is in program translation, and I hope I can graduate successfully.