StaQC: a systematically mined dataset containing around 148K Python and 120K SQL domain question-code pairs, as described in "StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow" (WWW'18)


StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow

1. StaQC dataset

1.1 Introduction

StaQC (Stack Overflow Question-Code pairs) is the largest dataset to date of around 148K Python and 120K SQL domain question-code pairs, which are automatically mined from Stack Overflow using a Bi-View Hierarchical Neural Network, as described in the paper "StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow" (WWW'18).

StaQC is collected from three sources: multi-code answer posts, single-code answer posts, and manual annotations on multi-code answer posts:

Number of question-code pairs by source:

Source                      Python      SQL
Multi-Code Answer Posts     60,083   41,826
Single-Code Answer Posts    85,294   75,637
Manual Annotation            2,169    2,056
Sum                        147,546  119,519

1.2 Multi-code answer posts & manual annotations

A multi-code answer post is an (accepted) answer post that contains multiple code snippets, some of which may not be standalone code solutions to the question (see Section 1 in the paper). For example, in this multi-code answer post, the third code snippet is not a code solution to the question "How to limit a number to be within a specified range? (Python)".

The question-code pairs automatically mined or manually annotated from multi-code answer posts can be found here: Python and SQL.
Format: Each line corresponds to one code snippet, which can be paired with its question. The code snippet is identified by (question id, code snippet index), where the code snippet index refers to the index (starting from 0) of the code snippet in the accepted answer post of this question. For example, (5996881, 0) refers to the first code snippet in the accepted answer post of the question with id "5996881", which can be paired with its question "How to limit a number to be within a specified range? (Python)".
Source data: Python pickle files. Open them in binary mode (required on Python 3), e.g. pickle.load(open(filename, "rb")).

  • Code snippets for Python and SQL: A dict of {(question id, code index): code snippet}.
  • Question titles for Python and SQL: A dict of {question id: question title}.
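
The following is a minimal sketch of how the two dictionaries above might be joined into question-code pairs. The helper names (`load_pickle`, `parse_pair_id`, `pair_snippets`) are hypothetical and not part of this repository. It assumes each line of the pair file is a Python tuple literal such as `(5996881, 0)`, which should be checked against the actual files. The `encoding="latin1"` argument is a common workaround for reading Python 2 pickles under Python 3 and may not be needed. The dictionaries at the bottom are toy placeholders, not real dataset entries.

```python
import ast
import pickle


def load_pickle(path):
    """Load one pickle file; binary mode is required on Python 3."""
    with open(path, "rb") as f:
        # Assumption: the pickles may have been written by Python 2.
        return pickle.load(f, encoding="latin1")


def parse_pair_id(line):
    """Parse one line assumed to look like '(5996881, 0)'."""
    qid, idx = ast.literal_eval(line.strip())
    return int(qid), int(idx)


def pair_snippets(pair_ids, code_snippets, titles):
    """Return (title, code) for every (question id, code index) given."""
    return [(titles[qid], code_snippets[(qid, idx)]) for qid, idx in pair_ids]


# Toy placeholders standing in for the real dictionaries.
titles = {5996881: "How to limit a number to be within a specified range? (Python)"}
code_snippets = {
    (5996881, 0): "# toy snippet 0",
    (5996881, 1): "# toy snippet 1",
}
pair_ids = [parse_pair_id("(5996881, 0)\n")]
pairs = pair_snippets(pair_ids, code_snippets, titles)
```

With the real data, `titles` and `code_snippets` would come from `load_pickle` on the corresponding files, and `pair_ids` from reading the pair file line by line.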

1.3 Single-code answer posts

A single-code answer post is an (accepted) answer post that contains only one code snippet. We pair that code snippet with the question title to form a question-code pair.

Source data: Python pickle files. Open them in binary mode (required on Python 3), e.g. pickle.load(open(filename, "rb")).

  • Code snippets for Python and SQL: A dict of {question id: accepted code snippet}.
  • Question titles for Python and SQL: A dict of {question id: question title}.
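
As a rough sketch, single-code pairs can be formed by looking up each question id in both dictionaries. The function name `single_code_pairs` is hypothetical, the data below is a toy placeholder, and the key type of the real dictionaries (integer or string ids) is an assumption to verify after loading.

```python
def single_code_pairs(code_by_qid, title_by_qid):
    """Yield (question id, title, code) for ids present in both dicts."""
    # Intersect the key sets so an id missing from either dict is skipped.
    for qid in sorted(code_by_qid.keys() & title_by_qid.keys()):
        yield qid, title_by_qid[qid], code_by_qid[qid]


# Toy placeholders, not real dataset entries.
toy_code = {1: "SELECT 1", 2: "print('hi')"}
toy_titles = {1: "toy title 1", 3: "toy title 3"}
toy_pairs = list(single_code_pairs(toy_code, toy_titles))
```

Skipping ids that appear in only one dictionary is a defensive choice; if the two files are fully aligned, nothing is dropped.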

2. Software

2.1 Prerequisite

2.2 Manual annotations

Human annotations can be found here: Python and SQL. Both are pickle files.

2.3 How-to-do-it question type classifier

The script that extracts features for constructing a "how-to-do-it" question type classifier can be found here. The 250 manually annotated posts for Python and SQL can be found here (label '1' denotes "how-to-do-it"). For details, please refer to Section 2.2.1 in our paper.

2.4 Code snippet processing

The script for processing code snippets can be found here. For details, please read Section 5.1 in our paper. The implementation of the SQL parser is adapted from https://github.com/sriniiyer/codenn.

  1. Installing package
    cd data_processing/codenn/src/sqlparse/
    python setup.py install
  2. Processing code snippets (tokenization, normalizing variable name, etc.)
    cd data_processing
    The tokenize_code_corpus function receives a dictionary of code snippets and returns the parsing results. To test it, run python code_processing.py.
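
To illustrate what tokenization and variable-name normalization can look like for Python snippets, here is a small sketch built on the standard library `tokenize` module. It is not the repository's `tokenize_code_corpus` implementation, whose exact behavior is described in Section 5.1 of the paper. The function name `tokenize_and_normalize` and the `VAR_n` placeholder scheme are assumptions made for this example, and it normalizes every non-keyword identifier, including function names.

```python
import io
import keyword
import tokenize

# Layout-only tokens that carry no content for this illustration.
SKIPPED = (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT)


def tokenize_and_normalize(code):
    """Tokenize Python source and map identifiers to VAR_0, VAR_1, ..."""
    var_ids = {}
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # Reuse the placeholder if this identifier was seen before.
            tokens.append(var_ids.setdefault(tok.string, "VAR_%d" % len(var_ids)))
        else:
            tokens.append(tok.string)
    return tokens


example_tokens = tokenize_and_normalize("x = foo(y) + x\n")
```

Stack Overflow snippets are often incomplete, so `tokenize` may raise an error on them; a real pipeline would need a fallback for such cases.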

2.5 Run BiV-HNN

We provide processed training/validation/testing files in our experiments here.

  1. Before running, unzip the word embedding files for Python (code_word_embedding.gz*) as follows:
    cd data/data_hnn/python/train/
    cat code_word_embedding.gza* | zcat > rnn_partialcontext_word_embedding_code_150.pickle
    rm code_word_embedding.gza*
    Then go back to the code directory:
    cd ../../../../BiV_HNN/.

    No further steps are needed for the SQL data.

  2. Train:
    For Python data:

    python run.py --train --train_setting=1 --text_model=1 --code_model=1 --query_model=1 --text_model_setting="64-150-24379-0-1-0-1" --code_model_setting="64-150-218900-0-1-0-1" --query_model_setting="64-150-24379-0-1-0-1" --keep_prob=0.5
    

    For SQL data:

    python run.py --train --train_setting=2 --text_model=1 --code_model=1 --query_model=1 --text_model_setting="64-150-13698-0-1-0-1" --code_model_setting="64-150-33192-0-1-0-1" --query_model_setting="64-150-13698-0-1-0-1" --keep_prob=0.7
    

    The above program trains the BiV-HNN model. It will print the model's learning process on the training set, and its performance on the validation set and the testing set.

    For training Text-HNN, set:
    --code_model=0 --query_model=0 --code_model_setting=None --query_model_setting=None
    to disable the code and query modeling.

    For training Code-HNN, set:
    --text_model=0 --text_model_setting=None
    to disable the text modeling.

  3. Test:
    To test on other datasets, revise the test function in run.py and rerun the command above with --train replaced by --test.
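
The flag combinations described in the steps above can be assembled programmatically. The sketch below is a hypothetical helper (`build_command` is not part of this repository) that starts from the Python-data training flags, copied verbatim from the command above, and applies the documented overrides for Text-HNN and Code-HNN. The setting strings are treated as opaque values.

```python
# Flag values copied verbatim from the Python-data training command.
BASE_FLAGS = {
    "train_setting": "1",
    "text_model": "1",
    "code_model": "1",
    "query_model": "1",
    "text_model_setting": "64-150-24379-0-1-0-1",
    "code_model_setting": "64-150-218900-0-1-0-1",
    "query_model_setting": "64-150-24379-0-1-0-1",
    "keep_prob": "0.5",
}

# Overrides documented for the two single-view variants.
VARIANT_OVERRIDES = {
    "BiV-HNN": {},
    "Text-HNN": {"code_model": "0", "query_model": "0",
                 "code_model_setting": "None", "query_model_setting": "None"},
    "Code-HNN": {"text_model": "0", "text_model_setting": "None"},
}


def build_command(variant, mode="train"):
    """Return an argv list for run.py (hypothetical helper)."""
    if mode not in ("train", "test"):
        raise ValueError("mode must be 'train' or 'test'")
    flags = dict(BASE_FLAGS, **VARIANT_OVERRIDES[variant])
    argv = ["python", "run.py", "--" + mode]
    argv += ["--%s=%s" % (name, value) for name, value in flags.items()]
    return argv


text_hnn_train = build_command("Text-HNN")
code_hnn_test = build_command("Code-HNN", mode="test")
```

The resulting list can be joined with spaces for display or passed to `subprocess.run` from the BiV_HNN directory. For SQL data, the base flags would need to be replaced with the values from the SQL command.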

3. Cite

If you use the dataset or the code in your research, please cite the following paper:

StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow
Ziyu Yao, Daniel S. Weld, Wei-Peng Chen, Huan Sun

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
