This is the repository for the paper Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation.
We would like to thank Frank F. Xu and Pengcheng Yin for their helpful discussions and for sharing their code. Some code has come from the TranX and External Knowledge Codegen repositories.
We would also like to acknowledge the work that inspired this one:
TRANX: A Transition-based Neural Abstract Syntax Parser for Semantic Parsing and Code Generation by Pengcheng Yin and Graham Neubig
Incorporating External Knowledge through Pre-training for Natural Language to Code Generation by Frank F. Xu, Zhengbao Jiang, Pengcheng Yin, Bogdan Vasilescu, and Graham Neubig
Run the Google Colab notebook found at Notebook Link for our best-performing model.
We also provide all of the generated samples from our test with the inputs here.
Note: Training and running on Google Colab takes 1-2 (maybe 3) hours.
You need Python 3.8. I would recommend using a virtual environment.
- Install the requirements from requirements.txt:
pip install -r requirements.txt
- To run the model, run the experiment.py script. You can use python experiment.py -h or the documentation in the file to understand the different options. To use our best model, run:
python experiment.py best "facebook/bart-base" bartBase -combine-mined
- Then, in the scratch directory, you will find the results in a JSON file.
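The exact name of the results file is not stated here, so one way to locate it is to pick the most recently modified JSON file in scratch. This is a minimal sketch: the helper name is hypothetical, it only looks at the top level of the directory, and it makes no assumption about the structure of the loaded results.

```python
import json
from pathlib import Path


def load_latest_results(scratch_dir="scratch"):
    """Load the most recently modified JSON file in the scratch directory."""
    # Sort top-level JSON files by modification time, oldest first.
    json_files = sorted(Path(scratch_dir).glob("*.json"),
                        key=lambda p: p.stat().st_mtime)
    if not json_files:
        raise FileNotFoundError(f"No JSON files found in {scratch_dir}")
    # The last entry is the newest file.
    with json_files[-1].open(encoding="utf-8") as f:
        return json.load(f)
```

Calling load_latest_results() after a run should return the parsed contents of that file.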
Here is the dataset that we used.
This dataset is the data cleaned using the process we describe further down. NOTE: For the time being, this only includes 10,000 mined examples. It will be updated to include all cleaned mined examples.
You can find a sample schema for this data here.
For the body key, there are unclosed HTML tags in the text. Eventually these will be taken out. For now, the easy but bad solution is to use the regex <\w+>. The good solution is to use the html tags file to remove them. Note that you must surround the tag text with < >.
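The two approaches above can be sketched as follows. This is only an illustration, assuming the html tags file holds bare tag names without angle brackets (for example p or code); the function names and the sample tag list are hypothetical.

```python
import re

# Easy but bad: removes anything that looks like an opening tag,
# including legitimate text such as "<T>" inside code.
NAIVE_TAG_RE = re.compile(r"<\w+>")


def build_tag_regex(tag_names):
    """Build a regex matching only the known tags, each wrapped in < >."""
    # Assumption: tag_names are bare names without angle brackets.
    escaped = sorted((re.escape(t.strip()) for t in tag_names if t.strip()),
                     key=len, reverse=True)
    return re.compile("<(?:" + "|".join(escaped) + ")>")


def strip_known_tags(body, tag_regex):
    """Remove every occurrence of a known opening tag from the body text."""
    return tag_regex.sub("", body)


# Hypothetical tag list; in practice, read the names from the html tags file.
tags = ["p", "code", "pre"]
tag_re = build_tag_regex(tags)
sample = "<p>Use <code>List<T> in Java"
cleaned = strip_known_tags(sample, tag_re)   # keeps "<T>"
naive = NAIVE_TAG_RE.sub("", sample)         # also drops "<T>"
```

The tag-list version leaves text such as List<T> intact, which is why it is the safer choice for bodies that contain code.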
Link to the parsed StackOverflow Questions
For actually working with this data:
- The JSON file has the structure:
{
  "question_id": {
    "question_id": "str",
    "tags": "List[str]",
    "title": "str",
    "accepted_answer_id": "int or null",
    "score": "int",
    "body": "str",
    "code_slots": "Ignore this, it is useless",
    "answers": {
      "answer_id": {
        "score": "int",
        "body": "str",
        "code_slots": "Ignore"
      }
    }
  }
}
- For the body key, there are unclosed HTML tags in the text. Eventually these will be taken out. For now, the easy but bad solution is to use the regex <\w+>. The good solution is to use the html tags file to remove them. Note that you must surround the tag text with < >.
- Finally, you must match the question ids from CoNaLa to the SO data.
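As a rough illustration of the matching step, the sketch below joins CoNaLa examples to the parsed StackOverflow data by question id and looks up the accepted answer. It assumes each CoNaLa example carries an integer question_id field along with intent and snippet, and that the SO JSON is keyed by the id as a string, following the structure listed above. The function name and the inline sample records are hypothetical stand-ins for the real files.

```python
import json


def attach_so_question(conala_examples, so_questions):
    """Pair each CoNaLa example with its StackOverflow question, if present."""
    matched = []
    for example in conala_examples:
        # JSON object keys are always strings, so cast the id before lookup.
        question = so_questions.get(str(example["question_id"]))
        if question is None:
            continue  # No parsed SO entry for this example.
        # accepted_answer_id may be null; answers are keyed by answer id.
        accepted_id = question.get("accepted_answer_id")
        accepted = None
        if accepted_id is not None:
            accepted = question["answers"].get(str(accepted_id))
        matched.append({
            "intent": example["intent"],
            "snippet": example["snippet"],
            "title": question["title"],
            "body": question["body"],
            "accepted_answer_body": accepted["body"] if accepted else None,
        })
    return matched


# Hypothetical miniature records standing in for the real JSON files.
so_questions = json.loads("""
{
  "123": {
    "question_id": "123", "tags": ["python"], "title": "Reverse a list",
    "accepted_answer_id": 456, "score": 10, "body": "How do I reverse a list?",
    "code_slots": [],
    "answers": {"456": {"score": 7, "body": "Use xs[::-1]", "code_slots": []}}
  }
}
""")
conala_examples = [
    {"question_id": 123, "intent": "reverse a list xs", "snippet": "xs[::-1]"},
    {"question_id": 999, "intent": "unmatched example", "snippet": "pass"},
]
matched = attach_so_question(conala_examples, so_questions)
```

Examples without a parsed StackOverflow entry are skipped here; depending on the experiment, it may be preferable to keep them with empty question text instead.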
If you use this dataset you MUST cite the original CoNaLa paper as well:
@misc{orlanski2021reading,
title={Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation},
author={Gabriel Orlanski and Alex Gittens},
year={2021},
eprint={2106.04447},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{yin2018mining,
author = {Yin, Pengcheng and Deng, Bowen and Chen, Edgar and Vasilescu, Bogdan and Neubig, Graham},
title = {Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow},
booktitle = {International Conference on Mining Software Repositories},
series = {MSR},
pages = {476--486},
year = {2018},
publisher = {ACM},
doi = {https://doi.org/10.1145/3196398.3196408},
}