README.md: 6 additions & 8 deletions
@@ -52,7 +52,7 @@ After initial setup described at [Quickstart](#quickstart), our dataset will be
 5.[NCS_preprocessed_data](https://mega.nz/file/45BXRSSb#sj2bSC9AHxralmpAud6Uy1_g6HOFnZq0Dk4pfqiP-1M): This file contains the preprocessed data for neural code summarization networks.
-6.*** add bpe tokenized ncs preprocessed data here ***
+6.[BPE_Tokenized_NCS_preprocessed_data](https://drive.google.com/file/d/14nHVljNMb37-tpOW59NaDY26T6z2BcXD/view?usp=sharing): This file contains the preprocessed data for neural code summarization networks with BPE tokenization.
 
 ## Python to Java Translation
 
 We have created a forked repository of [Transcoder](https://github.com/code-desc/TransCoder.git) that facilitates parallel translation of source code and speeds up the process by a factor of 16. Instructions for using Transcoder can be found in the above-mentioned repository. The original work is published under the title ["Unsupervised Translation of Programming Languages"](https://arxiv.org/abs/2006.03511).
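The BPE-tokenized dataset added above is produced with byte-pair encoding. As a hedged, self-contained illustration of the idea only (the project's actual tokenizer is `Tokenizer/huggingface_bpe.py`, shown later), BPE starts from characters and repeatedly merges the most frequent adjacent symbol pair; the tiny corpus below is invented for the example:

```python
# Toy byte-pair encoding (BPE) sketch: learn merge rules by repeatedly
# merging the most frequent adjacent symbol pair in a small corpus.
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules; return (rules, encoded corpus)."""
    seqs = [list(w) for w in words]  # start from individual characters
    rules = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))  # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # most frequent adjacent pair
        rules.append(best)
        merged = best[0] + best[1]
        for i, seq in enumerate(seqs):       # apply the merge everywhere
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            seqs[i] = out
    return rules, seqs

# Invented corpus: the shared prefix "low" is merged back into one symbol.
rules, encoded = bpe_merges(["low", "lower", "lowest"], 2)
```

After two merges, `rules` is `[('l', 'o'), ('lo', 'w')]` and "low" is encoded as the single symbol `['low']`; real BPE tokenizers learn tens of thousands of such rules from the training corpus.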
@@ -81,16 +81,14 @@ The following command preprocesses CoDesc dataset for [NeuralCodeSum](https://ar
 ```
 
 # Tokenizer
-if huggingface tokenizer library is not installed run this command
-```bash
-pip install transformers
-```
-to train and create tokenized files using bpe use this command
+The tokenizers for source codes and natural language descriptions are given in the `Tokenizer/` directory. To use the tokenizers in Python, the `code_filter` and `nl_filter` functions must be imported from `Tokenizer/CodePreprocess_final.py` and `Tokenizer/NLPreprocess_final.py`. Moreover, two json files named `code_filter_flag.json` and `nl_filter_flag.json`, containing the options for preprocessing code and description data, must be present in the working directory. These two files must follow the formats given in the `Tokenizer/` folder. The flag options are also briefly described in the above-mentioned json files.
+
+To train and create tokenized files using BPE, use the following command:
 ```bash
 python Tokenizer/huggingface_bpe.py
 ```
 
-The tokenizers for source codes and natural language descriptions are given in the `Tokenizer/` directory. To use the tokenizers in python, `code_filter` and `nl_filter` functions will have to be imported from `Tokenizer/CodePreprocess_final.py` and `Tokenizer/NLPreprocess_final.py`. Moreover, two json files named `code_filter_flag.json` and `nl_filter_flag.json` containing the options to preprocess code and description data will have to be present in the working directory. These two files must follow the formats given the `Tokenizer/` folder. These flag options are also briefly described in the above mentioned json files.
+
 
 # Code Search
 During the initial setup described at [Quickstart](#quickstart), a forked version of [CodeSearchNet](https://github.com/code-desc/CodeSearchNet.git) is cloned into the working directory, and the preprocessed data of CoDesc will be copied to the `CodeSearchNet/resources/data/` directory. To use the preprocessed dataset of the balanced partition, clear the above-mentioned folder and copy the contents of `data/csn_preprocessed_data_balanced_partition/` into it.
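The working-directory setup required by the Tokenizer section above can be sketched as follows. The flag keys written here are placeholders, not the real options; the authoritative formats are the `code_filter_flag.json` and `nl_filter_flag.json` files shipped in the repository's `Tokenizer/` directory:

```python
# Hedged sketch: prepare the flag files the CoDesc tokenizers expect in the
# working directory. The "example_flag" key is hypothetical; copy the real
# formats from the Tokenizer/ directory of the repository instead.
import json

flag_files = {
    "code_filter_flag.json": {"example_flag": True},  # hypothetical contents
    "nl_filter_flag.json": {"example_flag": True},    # hypothetical contents
}
for name, flags in flag_files.items():
    with open(name, "w") as f:
        json.dump(flags, f, indent=2)

# With the flag files in place, the filters would then be imported from the
# cloned repository, e.g.:
#   from Tokenizer.CodePreprocess_final import code_filter
#   from Tokenizer.NLPreprocess_final import nl_filter
```

The imports are shown as comments because they require the cloned CoDesc repository on `sys.path`; only the flag-file preparation is runnable standalone.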
@@ -110,7 +108,7 @@ Then the following commands will train and test code search networks:
 ```
 
 # Code Summarization
-We used the original implementation of Code Summarization. [NeuralCodeSum](https://github.com/wasiahmad/NeuralCodeSum.git)
+We used the original implementation of code summarization from [NeuralCodeSum](https://github.com/wasiahmad/NeuralCodeSum.git). Please refer to [this guide](https://github.com/code-desc/CoDesc/blob/master/CodeSummarization/data/README.md) for instructions on how to train the code summarization network.