A curated list of awesome machine learning frameworks and algorithms that work on top of source code. Inspired by Awesome Machine Learning.
- Topic modeling of public repositories at scale using names in source code
- Parameter-Free Probabilistic API Mining across GitHub
- A Subsequence Interleaving Model for Sequential Pattern Mining
- A Convolutional Attention Network for Extreme Summarization of Source Code
- Tailored Mutants Fit Bugs Better
- TASSAL: Autofolding for Source Code Summarization
- Suggesting Accurate Method and Class Names
- Mining idioms from source code
- Mining Source Code Repositories at Massive Scale using Language Modeling
- Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code
- Latent Predictor Networks for Code Generation - Addresses the problem of generating programming code from a mixed natural language and structured specification. Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, Phil Blunsom.
- Code Completion with Statistical Language Models - Veselin Raychev, Martin Vechev, Eran Yahav.
- Using recurrent neural networks to predict next tokens in the java solutions - Alex Skidanov, Illia Polosukhin.
- Learning Python Code Suggestion with a Sparse Pointer Network - Avishkar Bhoopchand, Tim Rocktäschel, Earl Barr, Sebastian Riedel.
- Learning Efficient Algorithms with Hierarchical Attentive Memory - Andrychowicz, Marcin, and Karol Kurach.
- DeepCoder: Learning to Write Programs - Balog, Matej, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow.
- Programming with a Differentiable Forth Interpreter - Bošnjak, Matko, Tim Rocktäschel, Jason Naradowsky, and Sebastian Riedel.
- Learning to Superoptimize Programs - Workshop Version - Bunel, Rudy, Alban Desmaison, M. Pawan Kumar, Philip H. S. Torr, and Pushmeet Kohli.
- Meta-Interpretive Learning of Efficient Logic Programs - Cropper, Andrew, and Stephen H. Muggleton.
- Learning Operations on a Stack with Neural Turing Machines - Deleu, Tristan, and Joseph Dureau.
- Neural Functional Programming - Feser, John K., Marc Brockschmidt, Alexander L. Gaunt, and Daniel Tarlow.
- TerpreT: A Probabilistic Programming Language for Program Induction - Gaunt, Alexander L., Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow.
- Neural Turing Machines - Graves, Alex, Greg Wayne, and Ivo Danihelka.
- Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision (Short Version) - Liang, Chen, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao.
- Probabilistic Neural Programs - Murray, Kenton W., and Jayant Krishnamurthy.
- Neural Programmer: Inducing Latent Programs with Gradient Descent - Neelakantan, Arvind, Quoc V. Le, and Ilya Sutskever.
- Divide and Conquer with Neural Networks - Nowak, Alex, and Joan Bruna.
- Neural Programmer-Interpreters - Reed, Scott, and Nando de Freitas.
- Programs as Black-Box Explanations - Singh, Sameer, Marco Tulio Ribeiro, and Carlos Guestrin.
- A Differentiable Approach to Inductive Logic Programming - Yang, Fan, Zhilin Yang, and William W. Cohen.
- From Machine Learning to Machine Reasoning - Bottou, Leon.
- Learning Latent Multiscale Structure Using Recurrent Neural Networks - Chung, Junyoung, Sungjin Ahn, and Yoshua Bengio.
- Lifelong Perceptual Programming By Example - Gaunt, Alexander L., Marc Brockschmidt, Nate Kushman, and Daniel Tarlow.
- Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets - Joulin, Armand, and Tomas Mikolov.
- Neural GPUs Learn Algorithms - Kaiser, Łukasz, and Ilya Sutskever.
- API usage pattern recommendation for software development - Haoran Niu, Iman Keivanloo, Ying Zou.
- Summarizing Source Code using a Neural Attention Model - University of Washington CSE, Seattle, WA, USA.
- Program Synthesis from Natural Language Using Recurrent Neural Networks - University of Washington CSE, Seattle, WA, USA.
- Exploring API Embedding for API Usages and Applications - Nguyen, Nguyen, Phan and Nguyen.
- Neural Nets Can Learn Function Type Signatures From Binaries - Zheng Leong Chua, Shiqi Shen, Prateek Saxena, and Zhenkai Liang.
- Deep Learning Code Fragments for Code Clone Detection - Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk.
- Automated Identification of Security Issues from Commit Messages and Bug Reports - Yaqin Zhou and Asankhaya Sharma.
- Weighted MinHash on GPU helps to find duplicate GitHub repositories.
- Source Code Identifier Embeddings
- The half-life of code & the ship of Theseus
- The eigenvector of "Why we moved from language X to language Y"
- Analyzing GitHub, How Developers Change Programming Languages Over Time
- Topic Modeling of GitHub Repositories
- Similarity of GitHub Repositories by Source Code Identifiers
- Using deep RNN to model source code
- Source code abstracts classification using CNN (1)
- Source code abstracts classification using CNN (2)
- Source code abstracts classification using CNN (3)
- Embedding the GitHub contribution graph
- Differentiable Neural Computer (DNC) - TensorFlow implementation of the Differentiable Neural Computer.
- sourced.ml - Abstracts feature extraction from source code syntax trees and working with ML models.
- vecino - Finds similar Git repositories.
- apollo - Source code deduplication at scale, research.
- gemini - Source code deduplication at scale, production.
- enry - Insanely fast file-based programming language detector.
- Naturalize - Language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code.
- Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
- Summarizing Source Code using a Neural Attention Model - CODE-NN, uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Torch over C#/SQL
- Probabilistic API Miner - Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
- Interesting Sequence Miner - Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
- TASSAL - Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
- JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
- go-git - Highly extensible Git implementation in pure Go which is friendly to data mining.
- hercules - Git repository mining framework with batteries on top of go-git.
- bblfsh - Self-hosted server for source code parsing.
- engine - Scalable and distributed data retrieval pipeline for source code.
- minhashcuda - Weighted MinHash implementation on CUDA to efficiently find duplicates.
- kmcuda - k-means on CUDA to cluster and to search for nearest neighbors in dense space.
- wmd-relax - Python package which finds nearest neighbors at Word Mover's Distance.
- GitHub repositories - languages distribution - Programming languages distribution in 14,000,000 repositories on GitHub (October 2016).
- 452M commits on GitHub - ≈ 452M commits' metadata from 16M repositories on GitHub (October 2016).
- GitHub readme files - Readme files of all GitHub repositories (16M) (October 2016).
- from language X to Y - Cache file Erik Bernhardsson collected for his awesome blog post.
- GitHub word2vec 120k - Sequences of identifiers extracted from top starred 120,000 GitHub repos.
- GitHub Source Code Names - Names in source code (identifiers, not people's names) extracted from 13M GitHub repositories.
- GitHub duplicate repositories - GitHub repositories not marked as forks but very similar to each other.
- GitHub lng keyword frequencies - Programming language keyword frequency extracted from 16M GitHub repositories.
- GitHub Java Corpus - GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
- 150k Python Dataset - Dataset consisting of 150'000 Python ASTs.
- 150k JavaScript Dataset - Dataset consisting of 150'000 JavaScript files and their parsed ASTs.
- card2code - This dataset contains the language to code datasets described in the paper Latent Predictor Networks for Code Generation.
- A lot of references and articles were taken from mast-group.
See CONTRIBUTING.md.