Datahack 2019 Project
This project automatically generates descriptions for code and provides a code search engine.
It is based on the Stack Overflow dataset, restricted to the Data Science and Data Structures fields.
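To make the search side concrete: one common way to build a code search engine on top of sentence embeddings is to embed the query and every snippet as vectors, then rank snippets by cosine similarity to the query. The sketch below illustrates that general idea only. It is not taken from this project's source, the names `cosine` and `search` are hypothetical, and the small hand-written vectors stand in for real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=2):
    """Return the top_k snippet names ranked by similarity to the query."""
    ranked = sorted(
        index,
        key=lambda name: cosine(query_vec, index[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy 3-dimensional "embeddings" (illustrative values, not real model output).
index = {
    "sort_list": [0.9, 0.1, 0.0],
    "read_csv": [0.0, 0.8, 0.6],
    "plot_hist": [0.1, 0.2, 0.9],
}

print(search([1.0, 0.0, 0.1], index))
```

In a real setup the vectors would be produced by an embedding model and the index would hold one vector per code snippet.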
- `src`: Main source code directory.
- `resources`: External resources necessary for running this project, such as BERT's vocabulary `txt` file.
- `scripts`: Helper scripts, such as:
  - SQL query for fetching the Stack Overflow data from Google's BigQuery service.
  - `shell` scripts for running each step in the process.
- Make sure to have Python 3.6.
- Install `pipenv` by running `pip install pipenv`.
- In your terminal, create a new virtual environment inside a new shell using the command `pipenv shell` (run all commands inside this shell so that your global environment settings are not affected). This should create a `.venv` folder inside the project's root folder.
- Install all the requirements from the `Pipfile` and `Pipfile.lock` files by running `pipenv sync`. Note: if using a GPU machine (recommended), change `tensorflow` to `tensorflow-gpu`.
- Fetch the data from Google's BigQuery service using the script `scripts/bigquery_stackoverflow.sql`. Google offers a free trial with $300 of credit, which is more than enough for this task.
- Clone Google's BERT code into `src/bertcode/bert` (the folder is in `.gitignore`).
- Download BERT's base uncased model for English into `bert/models/uncased_L-12_H-768_A-12` (the folder is in `.gitignore`).
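
Before running the pipeline, it can help to confirm that the steps above left everything where the project expects it. The following is a minimal sketch and not part of the repository: the name `missing_paths` is hypothetical, the paths are copied from the steps above, and resolving all of them against the project root is an assumption (adjust the model path if it is meant to be relative to another folder). The file names `vocab.txt` and `bert_config.json` are the ones shipped in Google's BERT release archive.

```python
from pathlib import Path

# Paths taken from the setup steps; resolving all of them against the
# project root is an assumption of this sketch.
REQUIRED_PATHS = [
    "scripts/bigquery_stackoverflow.sql",
    "src/bertcode/bert",
    "bert/models/uncased_L-12_H-768_A-12/vocab.txt",
    "bert/models/uncased_L-12_H-768_A-12/bert_config.json",
]


def missing_paths(project_root, required=REQUIRED_PATHS):
    """Return the required paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in required if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected files and folders are in place.")
```

Run it from the project's root folder inside the `pipenv` shell; an empty result means the listed files and folders were found.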
Please follow the scripts in the `resources` folder for all running examples.
Acknowledgements
- Amenity Analytics for the credit and resources. Thanks!
- The main idea is derived from hamelsmu, with some modifications to fit the problem of generating comments from code, mainly in the data, pre-processing, cleaning, and sentence embedding mechanism.