DeepBugs is a framework for learning name-based bug detectors from an existing code corpus. See our OOPSLA'18 paper for a detailed description.
A quick and easy way to play with a simplified version of DeepBugs is a Jupyter notebook, which you can run on Google's Colaboratory. To use the full DeepBugs tool, read on.
- All commands are called from the main directory.
- Python code (most of the implementation) and JavaScript code (for extracting data from .js files) are in the /python and /javascript directories.
- All data to learn from, e.g., .js files, is expected to be in the /data directory.
- All generated data, e.g., intermediate representations, is written into the main directory. It is recommended to move these files into separate directories.
- All generated data files have a timestamp as part of the file name. Below, these files are referenced with a * wildcard. When running commands multiple times, make sure to use the most recent files.
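Because repeated runs leave several generations of timestamped files in the main directory, it can help to list the matches newest first before passing them to a command. The following is a minimal sketch, not part of DeepBugs: the helper name newest_first is hypothetical, and it orders by file modification time rather than by the timestamp in the file name, since the exact timestamp format is not described here. Note that a single run may produce more than one file.

```python
import glob
import os


def newest_first(pattern):
    """Return the paths matching `pattern`, most recently modified first.

    Assumption: modification time is a good proxy for the timestamp
    embedded in the generated file names.
    """
    return sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)


if __name__ == "__main__":
    # Run from the main directory to inspect the generated call data.
    for path in newest_first("calls_*.json"):
        print(path)
```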
- Node.js
- npm modules (install with npm install module_name): acorn, estraverse, walk-sync
- Python 3
- Python packages: keras, scipy, numpy, sklearn
- The full corpus can be downloaded here and is expected to be stored in data/js/programs_all. It consists of 100,000 training files, listed in data/js/programs_training.txt, and 50,000 files for validation, listed in data/js/programs_eval.txt.
- This repository contains only a very small subset of the corpus. It is stored in data/js/programs_50. Training and validation files for the small corpus are listed in data/js/programs_50_training.txt and data/js/programs_50_eval.txt.
Creating a bug detector consists of two main steps:
- Extract positive (i.e., likely correct) and negative (i.e., likely buggy) training examples from code.
- Train a classifier to distinguish correct from incorrect code examples.
Each bug detector addresses a particular bug pattern, e.g.:
- The SwappedArgs bug detector looks for accidentally swapped arguments of a function call, e.g., calling setPoint(y,x) instead of setPoint(x,y).
- The BinOperator bug detector looks for incorrect operators in binary operations, e.g., i <= len instead of i < len.
- The IncorrectBinaryOperand bug detector looks for incorrect operands in binary operations, e.g., height - x instead of height - y.
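To make the idea of positive and negative examples concrete for the SwappedArgs pattern, here is a small illustrative sketch. It assumes that a negative example is obtained by seeding the bug into existing code, i.e., by swapping the first two arguments of a call. The (callee, arguments) tuple and both function names are hypothetical and do not reflect the data format DeepBugs actually uses.

```python
def swap_first_two_args(call):
    """Return a copy of `call` with its first two arguments swapped."""
    callee, args = call
    if len(args) < 2:
        raise ValueError("need at least two arguments to swap")
    return (callee, [args[1], args[0]] + list(args[2:]))


def labeled_examples(calls):
    """Yield (call, label) pairs.

    Label 0 marks the call as written (likely correct),
    label 1 marks the variant with a seeded swap (likely buggy).
    """
    for callee, args in calls:
        if len(args) < 2:
            continue  # nothing to swap, so no example pair
        yield (callee, list(args)), 0
        yield swap_first_two_args((callee, args)), 1


if __name__ == "__main__":
    for example, label in labeled_examples([("setPoint", ["x", "y"])]):
        print(label, example)
```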
node javascript/extractFromJS.js calls --parallel 4 data/js/programs_50_training.txt data/js/programs_50
- The --parallel argument sets the number of processes to run.
- programs_50_training.txt contains files to include (one file per line). To extract data for validation, run the command with data/js/programs_50_eval.txt.
- The last argument is a directory that gets recursively scanned for .js files, considering only files listed in the file provided as the second argument.
- The command produces calls_*.json files, which contain data suitable for the SwappedArgs bug detector. For the other two bug detectors, replace calls with binOps in the above command.
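Since the same extraction command is run for training and validation data, and for both calls and binOps, a small wrapper can avoid retyping it. The sketch below only assembles the command line shown above; the function names and the processes parameter are hypothetical, and run_extraction assumes that node is on the PATH and that the script is started from the main directory.

```python
import subprocess


def extraction_command(what, file_list, root_dir, processes=4):
    """Build the argument list for javascript/extractFromJS.js.

    `what` is the kind of data to extract, e.g. "calls" or "binOps".
    """
    return [
        "node", "javascript/extractFromJS.js", what,
        "--parallel", str(processes),
        file_list, root_dir,
    ]


def run_extraction(what, file_list, root_dir, processes=4):
    """Run the extraction and raise if node exits with an error."""
    subprocess.run(
        extraction_command(what, file_list, root_dir, processes), check=True
    )


if __name__ == "__main__":
    # Print the four commands for the small corpus without running them.
    for what in ("calls", "binOps"):
        for split in ("training", "eval"):
            file_list = f"data/js/programs_50_{split}.txt"
            print(" ".join(extraction_command(what, file_list, "data/js/programs_50")))
```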
A) Train and validate the classifier
python3 python/BugLearnAndValidate.py --pattern SwappedArgs --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --training_data calls_xx*.json --validation_data calls_yy*.json
- The first argument selects the bug pattern.
- The next three arguments are vector representations for tokens (here: identifiers and literals), for types, and for AST node types. These files are provided in the repository.
- The remaining arguments are two lists of .json files. They contain the training and validation data extracted in Step 1.
- After learning the bug detector, the command measures accuracy and recall w.r.t. seeded bugs and writes a list of potential bugs in the unmodified validation code (see poss_anomalies.txt).
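For readers unfamiliar with the two metrics, the sketch below computes accuracy and recall in their standard form over a set of validation examples in which some bugs were seeded. It is not taken from BugLearnAndValidate.py: the function name, the input layout, and the 0.5 decision threshold are assumptions made for illustration.

```python
def accuracy_and_recall(scores, labels, threshold=0.5):
    """Compute accuracy and recall for a binary bug classifier.

    scores: predicted probability that each example is buggy.
    labels: 1 for an example with a seeded bug, 0 for original code.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equally long")
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    found_bugs = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    seeded_bugs = sum(labels)
    accuracy = correct / len(labels)
    # Recall: fraction of seeded bugs that the classifier flags.
    recall = found_bugs / seeded_bugs if seeded_bugs else 0.0
    return accuracy, recall


if __name__ == "__main__":
    print(accuracy_and_recall([0.9, 0.2, 0.4, 0.1], [1, 0, 1, 0]))
```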
B) Train a classifier for later use
python3 python/BugLearn.py --pattern SwappedArgs --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --training_data calls_xx*.json
- Optionally, pass --out some/dir to set the output directory for the trained model.
Note that learning a bug detector from the very small corpus of 50 programs will yield a classifier with low accuracy that is unlikely to be useful. To leverage the full power of DeepBugs, you'll need a larger code corpus, e.g., the JS150 corpus mentioned above.
Finding bugs in one or more source files consists of these two steps:
- Extract code pieces
- Use a trained classifier to identify bugs
node javascript/extractFromJS.js calls --files <list of files>
- <list of files> contains one or more files to be examined. Code pieces can be extracted from any JavaScript file (.js) given with a path relative to the main directory.
- The command produces calls_*.json files, which contain data suitable for the SwappedArgs bug detector. For the other two bug detectors, replace calls with binOps in the above command.
python3 python/BugFind.py --pattern SwappedArgs --threshold 0.95 --model some/dir --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --testing_data calls_xx*.json
- The first argument selects the bug pattern.
- 0.95 is the threshold for reporting bugs; higher means fewer warnings of higher certainty.
- --model sets the directory to load a trained model from.
- The next three arguments are vector representations for tokens (here: identifiers and literals), for types, and for AST node types. These files are provided in the repository.
- The remaining argument is a list of .json files. They contain the data extracted in Step 1.
- The command examines every code piece and writes a list of potential bugs, each with its probability of being incorrect.
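The effect of the threshold can be illustrated with a short sketch. It assumes that each code piece receives a probability of being incorrect and that a warning is reported when that probability is at least the threshold. Whether BugFind.py compares with >= or >, and the (code piece, probability) layout used here, are assumptions.

```python
def report_warnings(predictions, threshold=0.95):
    """Return the predictions at or above `threshold`, most suspicious first.

    predictions: iterable of (code_piece, probability_of_being_incorrect).
    """
    flagged = [(piece, prob) for piece, prob in predictions if prob >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical predictions for three code pieces.
    predictions = [("a.js:3", 0.97), ("b.js:10", 0.50), ("c.js:7", 0.99)]
    for piece, prob in report_warnings(predictions):
        print(f"{prob:.2f} {piece}")
```

Raising the threshold shrinks the list to the most certain warnings; lowering it reports more candidates at the cost of more false positives.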
The above bug detectors rely on a vector representation for identifier names and literals. The easiest way to use our framework is with the shipped token_to_vector.json file. Alternatively, you can learn the embeddings via Word2Vec as follows:
- Extract identifiers and tokens:
node javascript/extractFromJS.js tokens --parallel 4 data/js/programs_50_training.txt data/js/programs_50
- The command produces tokens_*.json files.
- Encode identifiers and literals with context into arrays of numbers (for faster reading during learning):
python3 python/TokensToTopTokens.py tokens_*.json
- The arguments are the files just created.
- The command produces encoded_tokens_*.json files and a file token_to_number_*.json that assigns a number to each identifier and literal.
- Learn embeddings for identifiers and literals:
python3 python/EmbeddingLearnerWord2Vec.py token_to_number_*.json encoded_tokens_*.json
- The arguments are the files just created.
- The command produces a file token_to_vector_*.json.
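The encoding step above, which assigns a number to each identifier and literal, can be pictured with the following sketch. It is a guess at the general idea, not the logic of TokensToTopTokens.py: it assumes that only the most frequent tokens get their own number and that all other tokens share the number 0. The function names, the max_tokens parameter, and the list-of-lists input are hypothetical.

```python
from collections import Counter


def build_token_to_number(token_sequences, max_tokens=10000):
    """Map the `max_tokens` most frequent tokens to the numbers 1, 2, ...

    The number 0 is left free for tokens outside this vocabulary.
    """
    counts = Counter(tok for seq in token_sequences for tok in seq)
    most_frequent = counts.most_common(max_tokens)
    return {tok: number for number, (tok, _) in enumerate(most_frequent, start=1)}


def encode(token_sequences, token_to_number, unknown=0):
    """Replace every token by its number, or by `unknown` if it has none."""
    return [[token_to_number.get(tok, unknown) for tok in seq]
            for seq in token_sequences]


if __name__ == "__main__":
    sequences = [["i", "len", "i"], ["len", "i", "x"]]
    mapping = build_token_to_number(sequences, max_tokens=2)
    print(mapping)
    print(encode(sequences, mapping))
```

Working on arrays of numbers instead of strings keeps the files smaller and faster to read during learning, which matches the motivation given for this step.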