Awesome Machine Learning On Source Code

A curated list of awesome machine learning frameworks and algorithms that work on top of source code. Inspired by Awesome Machine Learning.

Digests

Articles

Papers

Topic modeling of public repositories at scale using names in source code
Parameter-Free Probabilistic API Mining across GitHub
A Subsequence Interleaving Model for Sequential Pattern Mining
A Convolutional Attention Network for Extreme Summarization of Source Code
Tailored Mutants Fit Bugs Better
TASSAL: Autofolding for Source Code Summarization
Suggesting Accurate Method and Class Names
Mining idioms from source code
Mining Source Code Repositories at Massive Scale using Language Modeling
Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code
Latent Predictor Networks for Code Generation - Addresses the problem of generating programming code from a mixed natural language and structured specification. Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, Phil Blunsom.
Code Completion with Statistical Language Models - Veselin Raychev, Martin Vechev, Eran Yahav.
Using recurrent neural networks to predict next tokens in the Java solutions - Alex Skidanov, Illia Polosukhin.
Learning Python Code Suggestion with a Sparse Pointer Network - Avishkar Bhoopchand, Tim Rocktäschel, Earl Barr, Sebastian Riedel.
Learning Efficient Algorithms with Hierarchical Attentive Memory - Andrychowicz, Marcin, and Karol Kurach.
DeepCoder: Learning to Write Programs - Balog, Matej, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow.
Programming with a Differentiable Forth Interpreter - Bošnjak, Matko, Tim Rocktäschel, Jason Naradowsky, and Sebastian Riedel.
Learning to Superoptimize Programs - Workshop Version - Bunel, Rudy, Alban Desmaison, M. Pawan Kumar, Philip H. S. Torr, and Pushmeet Kohli.
Meta-Interpretive Learning of Efficient Logic Programs - Cropper, Andrew, and Stephen H. Muggleton.
Learning Operations on a Stack with Neural Turing Machines - Deleu, Tristan, and Joseph Dureau.
Neural Functional Programming - Feser, John K., Marc Brockschmidt, Alexander L. Gaunt, and Daniel Tarlow.
TerpreT: A Probabilistic Programming Language for Program Induction - Gaunt, Alexander L., Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow.
Neural Turing Machines - Graves, Alex, Greg Wayne, and Ivo Danihelka.
Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision (Short Version) - Liang, Chen, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao.
Probabilistic Neural Programs - Murray, Kenton W., and Jayant Krishnamurthy.
Neural Programmer: Inducing Latent Programs with Gradient Descent - Neelakantan, Arvind, Quoc V. Le, and Ilya Sutskever.
Divide and Conquer with Neural Networks - Nowak, Alex, and Joan Bruna.
Neural Programmer-Interpreters - Reed, Scott, and Nando de Freitas.
Programs as Black-Box Explanations - Singh, Sameer, Marco Tulio Ribeiro, and Carlos Guestrin.
A Differentiable Approach to Inductive Logic Programming - Yang, Fan, Zhilin Yang, and William W. Cohen.
From Machine Learning to Machine Reasoning - Bottou, Leon.
Learning Latent Multiscale Structure Using Recurrent Neural Networks - Chung, Junyoung, Sungjin Ahn, and Yoshua Bengio.
Lifelong Perceptual Programming By Example - Gaunt, Alexander L., Marc Brockschmidt, Nate Kushman, and Daniel Tarlow.
Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets - Joulin, Armand, and Tomas Mikolov.
Neural GPUs Learn Algorithms - Kaiser, Łukasz, and Ilya Sutskever.
API usage pattern recommendation for software development - Haoran Niu, Iman Keivanloo, Ying Zou.
Summarizing Source Code using a Neural Attention Model - University of Washington CSE, Seattle, WA, USA.
Program Synthesis from Natural Language Using Recurrent Neural Networks - University of Washington CSE, Seattle, WA, USA.
Exploring API Embedding for API Usages and Applications - Nguyen, Nguyen, Phan and Nguyen.
Neural Nets Can Learn Function Type Signatures From Binaries - Zheng Leong Chua, Shiqi Shen, Prateek Saxena, and Zhenkai Liang.
Deep Learning Code Fragments for Code Clone Detection - Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk.
Automated Identification of Security Issues from Commit Messages and Bug Reports - Yaqin Zhou and Asankhaya Sharma.

Posts

Talks

Software

Machine Learning

Differentiable Neural Computer (DNC) - TensorFlow implementation of the Differentiable Neural Computer.
sourced.ml - Abstracts feature extraction from source code syntax trees and working with ML models.
vecino - Finds similar Git repositories.
apollo - Source code deduplication at scale, research.
gemini - Source code deduplication at scale, production.
enry - Insanely fast file-based programming language detector.
Naturalize - Language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code.
Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
Summarizing Source Code using a Neural Attention Model - CODE-NN uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Torch over C#/SQL.
Probabilistic API Miner - Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
Interesting Sequence Miner - Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
TASSAL - Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.

Utilities

go-git - Highly extensible Git implementation in pure Go which is friendly to data mining.
hercules - Git repository mining framework with batteries on top of go-git.
bblfsh - Self-hosted server for source code parsing.
engine - Scalable and distributed data retrieval pipeline for source code.
minhashcuda - Weighted MinHash implementation on CUDA to efficiently find duplicates.
kmcuda - k-means on CUDA to cluster and to search for nearest neighbors in dense space.
wmd-relax - Python package which finds nearest neighbors at Word Mover's Distance.

Datasets

GitHub repositories - languages distribution - Programming languages distribution in 14,000,000 repositories on GitHub (October 2016).
452M commits on GitHub - ≈ 452M commits' metadata from 16M repositories on GitHub (October 2016).
GitHub readme files - Readme files of all GitHub repositories (16M) (October 2016).
from language X to Y - Cache file Erik Bernhardsson collected for his awesome blog post.
GitHub word2vec 120k - Sequences of identifiers extracted from top starred 120,000 GitHub repos.
GitHub Source Code Names - Names in source code extracted from 13M GitHub repositories, not people.
GitHub duplicate repositories - GitHub repositories not marked as forks but very similar to each other.
GitHub lng keyword frequencies - Programming language keyword frequency extracted from 16M GitHub repositories.
GitHub Java Corpus - GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
150k Python Dataset - Dataset consisting of 150'000 Python ASTs.
150k JavaScript Dataset - Dataset consisting of 150'000 JavaScript files and their parsed ASTs.
card2code - This dataset contains the language to code datasets described in the paper Latent Predictor Networks for Code Generation.
