Commit 393327d

Update README.md
1 parent 3a871c3 commit 393327d

File tree

1 file changed (+6, -8)


README.md

Lines changed: 6 additions & 8 deletions
@@ -52,7 +52,7 @@ After initial setup described at [Quickstart](#quickstart), our dataset will be
 
 5. [NCS_preprocessed_data](https://mega.nz/file/45BXRSSb#sj2bSC9AHxralmpAud6Uy1_g6HOFnZq0Dk4pfqiP-1M): This file contains the preprocessed data for neural code summarization networks.
 
-6. *** add bpe tokenized ncs preprocessed data here ***
+6. [BPE_Tokenized_NCS_preprocessed_data](https://drive.google.com/file/d/14nHVljNMb37-tpOW59NaDY26T6z2BcXD/view?usp=sharing): This file contains the preprocessed data for neural code summarization networks with BPE tokenization. (A command-line download sketch is given after this hunk.)
 
 ## Python to Java Translation
 We have created a forked repository of [Transcoder](https://github.com/code-desc/TransCoder.git) that facilitates parallel translation of source code and speeds up the process by 16 times. Instructions for using Transcoder can be found in the above-mentioned repository. The original work is published under the title ["Unsupervised Translation of Programming Languages"](https://arxiv.org/abs/2006.03511).
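As a convenience, the Google Drive file in item 6 above can also be fetched from the command line with the third-party `gdown` tool. This tool is our suggestion, not something the README prescribes, and a browser download works just as well.

```bash
pip install gdown  # third-party Google Drive downloader (assumed convenience)
gdown "https://drive.google.com/uc?id=14nHVljNMb37-tpOW59NaDY26T6z2BcXD"
```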
@@ -81,16 +81,14 @@ The following command preprocesses CoDesc dataset for [NeuralCodeSum](https://ar
 ```
 
 # Tokenizer
-if huggingface tokenizer library is not installed run this command
-```bash
-pip install transformers
-```
-to train and create tokenized files using bpe use this command
+The tokenizers for source code and natural language descriptions are given in the `Tokenizer/` directory. To use the tokenizers in Python, import the `code_filter` and `nl_filter` functions from `Tokenizer/CodePreprocess_final.py` and `Tokenizer/NLPreprocess_final.py`. Moreover, two JSON files named `code_filter_flag.json` and `nl_filter_flag.json`, containing the options for preprocessing the code and description data, must be present in the working directory. These two files must follow the formats given in the `Tokenizer/` folder; the flag options are also briefly described in those JSON files.
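As a minimal illustration of the import pattern just described: the sketch below assumes `code_filter` and `nl_filter` each take a single string and return the filtered string. The README does not document their exact signatures, so treat this as a guess to adapt against the `Tokenizer/` sources.

```python
# Hedged sketch: import and apply the CoDesc filters described above.
# Assumption: each filter takes one string and returns the filtered string;
# the real signatures may differ (check the Tokenizer/ sources).
import sys

sys.path.append("Tokenizer")  # make Tokenizer/*.py importable from the repo root

from CodePreprocess_final import code_filter
from NLPreprocess_final import nl_filter

# code_filter_flag.json and nl_filter_flag.json must be present in the
# working directory, following the formats given in the Tokenizer/ folder.
code = "public int add(int a, int b) { return a + b; }"
desc = "Adds two integers and returns their sum."

print(code_filter(code))  # filtered/tokenized source code (assumed return value)
print(nl_filter(desc))    # filtered natural-language description (assumed)
```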
+
+To train and create tokenized files using BPE, use the following command.
 ```bash
 python Tokenizer/huggingface_bpe.py
 ```
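For orientation only: a typical BPE training script built on the HuggingFace `tokenizers` library looks like the sketch below. This is not the contents of `Tokenizer/huggingface_bpe.py`; the corpus path, special tokens, and vocabulary size are placeholder assumptions.

```python
# Generic HuggingFace-tokenizers BPE training sketch (not the repo's script).
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Placeholder corpus path and hyperparameters; adjust to the CoDesc files.
trainer = BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["data/train_code.txt"], trainer=trainer)

tokenizer.save("bpe_tokenizer.json")
print(tokenizer.encode("public static void main").tokens)
```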
 
-The tokenizers for source codes and natural language descriptions are given in the `Tokenizer/` directory. To use the tokenizers in python, `code_filter` and `nl_filter` functions will have to be imported from `Tokenizer/CodePreprocess_final.py` and `Tokenizer/NLPreprocess_final.py`. Moreover, two json files named `code_filter_flag.json` and `nl_filter_flag.json` containing the options to preprocess code and description data will have to be present in the working directory. These two files must follow the formats given the `Tokenizer/` folder. These flag options are also briefly described in the above mentioned json files.
+
 

 # Code Search
 During the initial setup described at [Quickstart](#quickstart), a forked version of [CodeSearchNet](https://github.com/code-desc/CodeSearchNet.git) is cloned into the working directory, and the preprocessed data of CoDesc is copied to the `CodeSearchNet/resources/data/` directory. To use the preprocessed dataset of the balanced partition, clear the above-mentioned folder and copy the contents of `data/csn_preprocessed_data_balanced_partition/` into it.
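Concretely, the swap just described amounts to shell commands along these lines. The paths are taken from the README; double-check them against your checkout before deleting anything.

```bash
# Clear the CodeSearchNet data directory, then copy in the balanced partition.
rm -rf CodeSearchNet/resources/data/*
cp -r data/csn_preprocessed_data_balanced_partition/* CodeSearchNet/resources/data/
```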
@@ -110,7 +108,7 @@ Then the following commands will train and test code search networks:
 ```
 
 # Code Summarization
-We used the original implementation of Code Summarization. [NeuralCodeSum](https://github.com/wasiahmad/NeuralCodeSum.git)
+We used the original implementation of code summarization from [NeuralCodeSum](https://github.com/wasiahmad/NeuralCodeSum.git). Please refer to [this guide](https://github.com/code-desc/CoDesc/blob/master/CodeSummarization/data/README.md) for instructions on how to train the code summarization network.
 
 
 # Licenses
