Attention! This repository is out of date. The repository and research have moved to h1alexbel/sr-detection.
SRdataset is an unlabeled dataset of GitHub repositories that contains SRs (sample repositories).
Motivation. While working on models for the samples-filter project, we discovered the need to automate the dataset-building process on remote servers, since we need to automatically collect a number of GitHub repositories whose data is productive for our research. To do this, we integrated ghminer with a few scripts and packaged all of that as a Docker container.
To build a new version of the dataset, run this:
docker run --detach --name=srdataset --rm --volume "$(pwd):/srdataset" \
-e "CSV=repos" \
-e "SEARCH_QUERY=<query>" \
-e "START_DATE=2019-01-01" \
-e "END_DATE=2024-05-01" \
-e "HF_TOKEN=xxx" \
-e "INFERENCE_CHECKPOINT=sentence-transformers/all-MiniLM-L6-v2" \
-e "PATS=pats.txt" \
--oom-kill-disable \
abialiauski/srdataset:0.0.1
Where <query> is the search query for the GitHub API; 2019-01-01 is the start date, so only repositories created on or after it are searched; 2024-05-01 is the end date, so only repositories created on or before it are searched; xxx is the Hugging Face token, required for accessing the inference endpoint in order to generate textual embeddings; and pats.txt is a file containing a number of GitHub PATs (personal access tokens).
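For example, pats.txt can be created like this (a minimal sketch; we assume one GitHub PAT per line, and the ghp_... values below are placeholders):

# pats.txt: one GitHub personal access token per line
cat > pats.txt <<'EOF'
ghp_your_first_token
ghp_your_second_token
EOF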
The building process can take a while. After it has completed, you should have these files:
results.csv with all collected repositories;
repos.csv with all preprocessed and filtered repositories;
texts.csv with repository textual metadata used for generating embeddings;
text-embeddings.csv with ready-to-cluster repositories with textual vectors only;
similar.csv with input textual examples and their top-5 most similar analogues from generated embeddings;
numerical.csv with ready-to-cluster repositories with numerical data only;
mix.csv with ready-to-cluster repositories that contain both numerical and textual vectors.
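Since the container writes into the mounted volume, you can sanity-check the outputs from the host once it finishes (a sketch, assuming the container was started from the current directory as shown above):

# confirm all expected CSV files exist and are non-empty
ls -lh results.csv repos.csv texts.csv text-embeddings.csv similar.csv numerical.csv mix.csv
# peek at the first rows of the filtered repositories
head -n 3 repos.csv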
If you run the container with -e PUSH_TO_HF=true, then after preprocessing we will push the output CSV files to the profile passed in -e HF_PROFILE, using the provided HF_TOKEN. All outputs will be pushed into datasets with the sr- prefix.
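A push-enabled run might look like this (a sketch based on the command above; <your-profile> stands for your Hugging Face profile name):

docker run --detach --name=srdataset --rm --volume "$(pwd):/srdataset" \
  -e "CSV=repos" \
  -e "SEARCH_QUERY=<query>" \
  -e "START_DATE=2019-01-01" \
  -e "END_DATE=2024-05-01" \
  -e "HF_TOKEN=xxx" \
  -e "INFERENCE_CHECKPOINT=sentence-transformers/all-MiniLM-L6-v2" \
  -e "PATS=pats.txt" \
  -e "PUSH_TO_HF=true" \
  -e "HF_PROFILE=<your-profile>" \
  --oom-kill-disable \
  abialiauski/srdataset:0.0.1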
If you run container with -e "CLUSTER=true"
, you should have one ZIP file
named like clusters-2024-06-21-18:22.zip
and containing these files:
agglomerative/
  mix/
    members/
    ...
  numerical/
    members/
    ...
  textual/
    members/
    ...
dbscan/... (the same structure)
gmm/...
kmeans/...
source/...
All experiments are grouped by model name: kmeans, dbscan, agglomerative, etc. In each model directory you should have a members directory and a set of plots. members contains a set of text files tagged with the output cluster label, e.g. 0.txt. In source you should have all the CSV files that were used to generate the clusters.
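To inspect the archive (a sketch; the ZIP name follows the timestamped pattern above, and we assume the mix experiment directory here):

# extract the archive and read the repositories assigned to cluster label 0
unzip clusters-2024-06-21-18:22.zip -d clusters
cat clusters/kmeans/mix/members/0.txt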
Fork the repository, make your changes, and send us a pull request. We will review your changes and apply them to the master branch shortly, provided they don't violate our quality standards. To avoid frustration, please run the full make build before sending us your pull request:
make env test