A dataset for programming language identification.
- Available languages are fetched from github/linguist's `languages.yml` and acmeism/RosettaCodeData's `Lang.yaml`.
- For each language, initial samples are fetched from GitHub as follows (a sketch of the search step follows this list):
  - The GitHub Search API is used to get a list of repositories.
  - Each repository is cloned and its languages are detected with github/linguist.
  - One sample is added from each repository.
- Samples are later reviewed by humans.
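For reference, the search step can be approximated with the public GitHub Search API. The following is a minimal sketch, assuming unauthenticated access; the query and the `find_repositories` helper are illustrative, not the actual `tools/harvest.py` implementation.

```python
# Minimal sketch of the repository search step using the public GitHub
# Search API. Query parameters and the helper name are illustrative;
# the real tools/harvest.py may work differently.
import requests

def find_repositories(language: str, per_page: int = 10) -> list[str]:
    """Return clone URLs of repositories GitHub tags with `language`."""
    response = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": f"language:{language}", "per_page": per_page},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    response.raise_for_status()
    return [item["clone_url"] for item in response.json()["items"]]

if __name__ == "__main__":
    for url in find_repositories("Ada"):
        print(url)
```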
The rules for sample inclusion are (a sketch of these checks follows the list):
- No more than one sample from each repository.
- Each sample is at least 500 bytes and at most 100 KB.
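These rules reduce to a size bound and a one-per-repository constraint. Below is a minimal sketch of the checks; the function names are illustrative, and the real enforcement lives in the tools.

```python
# Sketch of the inclusion rules: at most one sample per repository, and
# a size between 500 bytes and 100 KB. Names are illustrative.
from pathlib import Path

MIN_SIZE = 500          # bytes
MAX_SIZE = 100 * 1024   # bytes

def is_valid_sample(path: Path) -> bool:
    """Check the size rule for a candidate sample file."""
    return MIN_SIZE <= path.stat().st_size <= MAX_SIZE

def pick_one_per_repo(candidates: dict[str, list[Path]]) -> list[Path]:
    """Keep the first valid candidate from each repository."""
    picked = []
    for repo, files in candidates.items():
        for f in files:
            if is_valid_sample(f):
                picked.append(f)
                break
    return picked
```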
The dataset is stored in the `data` directory. It contains:
- `meta.yml`: metadata about the dataset and the available languages.
- `dataset.yml`: a collection of all samples, with pointers to sample paths relative to `data`.
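Both files are plain YAML and can be read with PyYAML. The record layout assumed below (a list of entries with a `path` key) is an illustration; inspect the files for the actual schema.

```python
# Sketch of loading the dataset files with PyYAML. The dataset.yml
# record layout (entries with a "path" key) is an assumption.
from pathlib import Path
import yaml

DATA = Path("data")

meta = yaml.safe_load((DATA / "meta.yml").read_text())
dataset = yaml.safe_load((DATA / "dataset.yml").read_text())

for entry in dataset:                # assumed: a list of sample records
    sample = DATA / entry["path"]    # assumed key: path relative to data
    text = sample.read_text(errors="replace")
```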
A summary of the dataset is available in REPORT.md.
For contribution guidelines, see CONTRIBUTING.md.
The `tools` directory contains various Python utilities to maintain the dataset:
- `tools/gen_meta.py`: Generates `data/meta.yml`. This is only needed when upgrading to a new github/linguist or acmeism/RosettaCodeData version.
- `tools/harvest.py`: Fetches samples from GitHub.
- `tools/vote.py`: Updates the `vote` annotation.
- `tools/lint.py`: Checks the dataset for potential problems.
- `tools/prepare_commit.py`: Updates generated files, required before any commit.
- `tools/classify_linguist.py`: Updates linguist labels.
- `tools/classify_pygments.py`: Updates pygments labels.
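As an illustration of the labeling step, pygments can guess a language from a filename and its contents. The `guess_lexer_for_filename` call below is a real pygments API, but the wrapper is a sketch, not the behavior of `tools/classify_pygments.py`.

```python
# Illustrative pygments-based labeling. guess_lexer_for_filename is the
# real pygments API; the wrapper and its error handling are assumptions.
from pygments.lexers import guess_lexer_for_filename
from pygments.util import ClassNotFound

def pygments_label(path: str, text: str) -> str | None:
    """Return pygments' best-guess language name, or None if unknown."""
    try:
        return guess_lexer_for_filename(path, text).name
    except ClassNotFound:
        return None
```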
To run the tools, first create the virtual environment:

```sh
pip install poetry
poetry install
```

Then run a tool with `python -m`:

```sh
poetry run python -m tools.gen_meta
```
Each sample in `data` has its own license; check its origin repository for details.
Everything else is licensed under the MIT License.