Attention! This repository is out of date. The repository and research have moved to h1alexbel/sr-detection.
SRdataset is an unlabeled dataset of GitHub repositories that contains SRs (sample repositories).
Motivation. While working on models for the samples-filter project, we discovered the need to automate the dataset-building process on remote servers, since we need to automatically collect a number of GitHub repositories whose data is productive for our research. To do this, we integrated ghminer with a few scripts and packaged all of that as a Docker container.
To build a new version of the dataset, run this:
docker run --detach --name=srdataset --rm --volume "$(pwd):/srdataset" \
-e "CSV=repos" \
-e "SEARCH_QUERY=<query>" \
-e "START_DATE=2019-01-01" \
-e "END_DATE=2024-05-01" \
-e "HF_TOKEN=xxx" \
-e "INFERENCE_CHECKPOINT=sentence-transformers/all-MiniLM-L6-v2" \
-e "PATS=pats.txt" \
--oom-kill-disable \
abialiauski/srdataset:0.0.1
Where <query> is the search query for the GitHub API; 2019-01-01 is the start date, so only repositories created on or after it are searched; 2024-05-01 is the end date, so only repositories created on or before it are searched; xxx is the Hugging Face token, required for accessing the inference endpoint in order to generate textual embeddings; and pats.txt is a file containing a number of GitHub PATs (personal access tokens).
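For example, pats.txt can be created like this (a minimal sketch; we assume one GitHub PAT per line, and the ghp_... values below are placeholders):

# pats.txt: one GitHub personal access token per line
cat > pats.txt <<'EOF'
ghp_your_first_token
ghp_your_second_token
EOF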
The building process can take a while. After it has completed, you should have these files:
results.csv with all collected repositories;
repos.csv with all preprocessed and filtered repositories;
texts.csv with repository textual metadata used for generating embeddings;
text-embeddings.csv with ready-to-cluster repositories with textual vectors only;
similar.csv with input textual examples and their top-5 most similar analogues from generated embeddings;
numerical.csv with ready-to-cluster repositories with numerical data only;
mix.csv with ready-to-cluster repositories that contain both numerical and textual vectors.
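Since the container writes into the mounted volume, you can sanity-check the outputs from the host once it finishes (a sketch, assuming the container was started from the current directory as shown above):

# confirm all expected CSV files exist and are non-empty
ls -lh results.csv repos.csv texts.csv text-embeddings.csv similar.csv numerical.csv mix.csv
# peek at the first rows of the filtered repositories
head -n 3 repos.csv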
If you run the container with -e PUSH_TO_HF=true, then after preprocessing we will push the output CSV files to the profile passed in -e HF_PROFILE, using the provided HF_TOKEN. All outputs will be pushed into datasets with the sr- prefix.
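A push-enabled run might look like this (a sketch based on the command above; <your-profile> stands for your Hugging Face profile name):

docker run --detach --name=srdataset --rm --volume "$(pwd):/srdataset" \
  -e "CSV=repos" \
  -e "SEARCH_QUERY=<query>" \
  -e "START_DATE=2019-01-01" \
  -e "END_DATE=2024-05-01" \
  -e "HF_TOKEN=xxx" \
  -e "INFERENCE_CHECKPOINT=sentence-transformers/all-MiniLM-L6-v2" \
  -e "PATS=pats.txt" \
  -e "PUSH_TO_HF=true" \
  -e "HF_PROFILE=<your-profile>" \
  --oom-kill-disable \
  abialiauski/srdataset:0.0.1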
If you run container with -e "CLUSTER=true"
, you should have one ZIP file
named like clusters-2024-06-21-18:22.zip
and containing these files:
agglomerative/
  mix/
    members/
    ...
  numerical/
    members/
    ...
  textual/
    members/
    ...
dbscan/... (the same structure)
gmm/...
kmeans/...
source/...
All experiments are grouped by model name: kmeans, dbscan, agglomerative, etc. In each model directory you should have a members directory and a set of plots. members contains a set of text files tagged with the output cluster label, e.g. 0.txt. In source you should have all the CSV files that were used to generate the clusters.
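To inspect the archive (a sketch; the ZIP name follows the timestamped pattern above, and we assume the mix experiment directory here):

# extract the archive and read the repositories assigned to cluster label 0
unzip clusters-2024-06-21-18:22.zip -d clusters
cat clusters/kmeans/mix/members/0.txt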
Fork the repository, make your changes, and send us a pull request. We will review your changes and apply them to the master branch shortly, provided they don't violate our quality standards. To avoid frustration, please run the full make build before sending us your pull request:
make env test