Improve dataset descriptions #106

Merged 1 commit on Feb 27, 2019
20 changes: 14 additions & 6 deletions README.md
@@ -1,13 +1,22 @@
-# datasets [![Build Status](https://travis-ci.org/src-d/datasets.svg?branch=master)](https://travis-ci.org/src-d/datasets) [![Build status](https://ci.appveyor.com/api/projects/status/b2en9yo9142qgadh?svg=true)](https://ci.appveyor.com/project/vmarkovtsev/datasets)
+# source{d} Datasets [![Build Status](https://travis-ci.org/src-d/datasets.svg?branch=master)](https://travis-ci.org/src-d/datasets) [![Build status](https://ci.appveyor.com/api/projects/status/b2en9yo9142qgadh?svg=true)](https://ci.appveyor.com/project/vmarkovtsev/datasets)

-source{d} datasets for source code analysis and machine learning on source code.
+source{d} datasets for source code analysis and [machine learning on source code (ML on Code)](https://github.com/src-d/awesome-machine-learning-on-source-code).

-This repository contains all the needed tools and scripts to reproduce the datasets.
+This repository contains all the needed tools and scripts to reproduce the datasets, as well as the academic papers they may relate to.

-List of available datasets:
+## Available datasets

-* [Public Git Archive](PublicGitArchive)
+### Public Git Archive

+- [Public Git Archive](PublicGitArchive)
+- Size: 3TB
+- Description: 180k+ top-bookmarked repositories from GitHub, consisting of 54M+ files and ~16 billion lines of code over 450+ programming languages.

+### Programming Language Identifiers

+- [Programming Language Identifiers](Identifiers)
+- Size: 1GB
+- Description: ~49M distinct identifiers extracted from 10+ programming languages.

## Contributions

@@ -16,4 +25,3 @@ Contributions are very welcome, please see [CONTRIBUTING.md](CONTRIBUTING.md) an
## License

The tools and scripts are licensed under Apache 2.0, see [LICENSE.md](LICENSE.md).