A dataset for programming language identification.
- Available languages are fetched from github/linguist's `languages.yml` and acmeism/RosettaCodeData's `Lang.yaml`.
- For each language, initial samples are fetched from GitHub as follows (a sketch of the search step follows this list):
  - The GitHub Search API is used to get a list of repositories.
  - Each repository is cloned and its languages are detected with github/linguist.
  - One sample is added from each repository.
- Samples are later reviewed by humans.
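For reference, the search step can be approximated with the public GitHub Search API. The following is a minimal sketch, assuming unauthenticated access; the query and the `find_repositories` helper are illustrative, not the actual `tools/harvest.py` implementation.

```python
# Minimal sketch of the repository search step using the public GitHub
# Search API. Query parameters and the helper name are illustrative;
# the real tools/harvest.py may work differently.
import requests

def find_repositories(language: str, per_page: int = 10) -> list[str]:
    """Return clone URLs of repositories GitHub tags with `language`."""
    response = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": f"language:{language}", "per_page": per_page},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    response.raise_for_status()
    return [item["clone_url"] for item in response.json()["items"]]

if __name__ == "__main__":
    for url in find_repositories("Ada"):
        print(url)
```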
The rules for sample inclusion are (a sketch of these checks follows the list):
- No more than one sample from each repository.
- Each sample is at least 500 bytes and at most 100 KB.
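These rules reduce to a size bound and a one-per-repository constraint. Below is a minimal sketch of the checks; the function names are illustrative, and the real enforcement lives in the tools.

```python
# Sketch of the inclusion rules: at most one sample per repository, and
# a size between 500 bytes and 100 KB. Names are illustrative.
from pathlib import Path

MIN_SIZE = 500          # bytes
MAX_SIZE = 100 * 1024   # bytes

def is_valid_sample(path: Path) -> bool:
    """Check the size rule for a candidate sample file."""
    return MIN_SIZE <= path.stat().st_size <= MAX_SIZE

def pick_one_per_repo(candidates: dict[str, list[Path]]) -> list[Path]:
    """Keep the first valid candidate from each repository."""
    picked = []
    for repo, files in candidates.items():
        for f in files:
            if is_valid_sample(f):
                picked.append(f)
                break
    return picked
```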
The dataset is stored in the `data` directory. It contains:
- `meta.yml`: metadata about the dataset and the available languages.
- `dataset.yml`: a collection of all samples, with pointers to sample paths relative to `data`.
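Both files are plain YAML and can be read with PyYAML. The record layout assumed below (a list of entries with a `path` key) is an illustration; inspect the files for the actual schema.

```python
# Sketch of loading the dataset files with PyYAML. The dataset.yml
# record layout (entries with a "path" key) is an assumption.
from pathlib import Path
import yaml

DATA = Path("data")

meta = yaml.safe_load((DATA / "meta.yml").read_text())
dataset = yaml.safe_load((DATA / "dataset.yml").read_text())

for entry in dataset:                # assumed: a list of sample records
    sample = DATA / entry["path"]    # assumed key: path relative to data
    text = sample.read_text(errors="replace")
```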
A summary of the dataset is available in REPORT.md.
For contribution guidelines, see CONTRIBUTING.md.
The `tools` directory contains various Python utilities to maintain the dataset:
- `tools/gen_meta.py`: Generates `data/meta.yml`. This is only needed when upgrading to a new github/linguist or acmeism/RosettaCodeData version.
- `tools/harvest.py`: Fetches samples from GitHub.
- `tools/vote.py`: Updates the `vote` annotation.
- `tools/lint.py`: Checks the dataset for potential problems.
- `tools/prepare_commit.py`: Updates generated files, required before any commit.
- `tools/classify_linguist.py`: Updates linguist labels.
- `tools/classify_pygments.py`: Updates pygments labels.
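As an illustration of the labeling step, pygments can guess a language from a filename and its contents. The `guess_lexer_for_filename` call below is a real pygments API, but the wrapper is a sketch, not the behavior of `tools/classify_pygments.py`.

```python
# Illustrative pygments-based labeling. guess_lexer_for_filename is the
# real pygments API; the wrapper and its error handling are assumptions.
from pygments.lexers import guess_lexer_for_filename
from pygments.util import ClassNotFound

def pygments_label(path: str, text: str) -> str | None:
    """Return pygments' best-guess language name, or None if unknown."""
    try:
        return guess_lexer_for_filename(path, text).name
    except ClassNotFound:
        return None
```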
To run the tools, first create the virtual environment:

```sh
pip install poetry
poetry install
```

Then run a tool with `python -m`:

```sh
poetry run python -m tools.gen_meta
```
Each sample in `data` has its own license; check its origin repository for details.
Everything else is licensed under the MIT License.