The goal of this study is to create a model that, by looking at a repository's README file and meta-information, can identify GitHub "sample repositories" (SR): repositories that mostly contain educational or demonstration materials meant to be copied, rather than reused as a dependency.
Motivation. During our work on the CaM project, we needed to filter out repositories with samples. No readily available technique or tool existed that could perform that function, so we conducted our own research on the subject.
The repository is structured as follows:
- `sr-data`, a module consisting of a set of tasks that filter collected metadata about GitHub repositories.
- `sr-train`, a module for training ML models.
- `sr-detector`, a trained, reusable model for SR detection.
- `sr-paper`, LaTeX sources of a paper on SR detection.
We collect two-fold metadata for each GitHub repository: numerical and textual. For numerical we collect:
- `releases`, the number of releases.
- `pulls`, the number of pull requests.
- `issues`, the total number of issues (opened + closed).
- `branches`, the number of branches.
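These counts can be obtained from the GitHub REST API. List endpoints such as `/releases` and `/branches` don't return totals directly, so one common trick is to request one item per page (`per_page=1`) and read the last page number from the `Link` response header. A minimal sketch of that trick (the helper name is ours, not part of `sr-data`):

```python
import re

def last_page(link_header: str) -> int:
    """Extract the last page number from a GitHub `Link` response header.

    With `per_page=1`, the last page number equals the total item count.
    Falls back to 1 when no `Link` header is present (zero or one item).
    """
    match = re.search(r'[?&]page=(\d+)>; rel="last"', link_header)
    return int(match.group(1)) if match else 1

# Example `Link` header, as returned by GET /repos/OWNER/REPO/releases?per_page=1
header = (
    '<https://api.github.com/repositories/1/releases?per_page=1&page=2>; rel="next", '
    '<https://api.github.com/repositories/1/releases?per_page=1&page=42>; rel="last"'
)
print(last_page(header))  # 42 releases in total
```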
To run this:

```bash
just collect
```

You should expect to have `sr-data/experiment/repos.csv` with collected repositories and their metadata.
To capture a smaller set of repositories, you can run this:

```bash
just test-collect
```

You should expect to have `sr-data/tmp/test-repos.csv`, with the same structure as `repos.csv`, but smaller.
You can run this step in GitHub Actions: `collect.yml`.
We filter the collected repositories. First, we remove repositories with an empty README file. Then, we convert each README file to plain text and detect the languages it is written in. We remove repositories whose README is not fully written in English.
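The empty-README part of this filter can be sketched in plain Python (the row layout is a hypothetical fragment of `repos.csv`; the language check is only described in a comment, since it requires a language-detection library):

```python
def filter_repos(rows: list[dict]) -> list[dict]:
    """Drop repositories whose README is empty or whitespace-only.

    The real sr-data filter additionally renders each README to plain
    text and drops repositories whose README is not fully in English;
    that step needs a language-detection library and is omitted here.
    """
    return [row for row in rows if row.get("readme", "").strip()]

rows = [
    {"repo": "a/b", "readme": "# Demo\nSample code."},
    {"repo": "c/d", "readme": "   "},
    {"repo": "e/f", "readme": ""},
]
print([r["repo"] for r in filter_repos(rows)])  # ['a/b']
```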
To run this:

```bash
just filter repos.csv
```

You should expect to have `sr-data/experiment/after-filter.csv`.
From each README file we extract all its headings (text after `#`). We remove English stop words from each heading. Then, we apply lemmatization to each word, filter words with the `^[a-zA-Z]+$` regex, and calculate up to the 5 most common words across the README headings.
For instance, this README:

```markdown
# Building web applications in Java with Spring Boot 3
...
## Agenda
...
## Who am I?
...
## Prerequisites
...
## Outcomes
...
## What is Spring?
...
## Resources
...
### Dan Vega
...
### Spring
...
### Documentation
...
### Books
...
### Podcasts
...
### YouTube
...
```

will be transformed to:

```text
['spring', 'build', 'web', 'application', 'java']
```
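The transformation can be sketched in plain Python. Note that this sketch uses a tiny illustrative stop-word list and skips lemmatization (which is why it would keep "building" rather than yield the lemmatized "build" shown above), and it assumes punctuation is stripped before the regex check:

```python
import re
from collections import Counter

# Illustrative stop-word subset; the real pipeline uses a full English list
STOP_WORDS = {"in", "with", "who", "am", "i", "what", "is", "the", "a", "of"}

def top_words(readme: str, limit: int = 5) -> list[str]:
    """Return up to `limit` most common words across README headings."""
    words = []
    for line in readme.splitlines():
        if line.startswith("#"):  # a markdown heading
            for word in line.lstrip("#").strip().lower().split():
                word = word.strip("?!.,:;")  # assumption: punctuation stripped first
                # keep purely alphabetic words, drop stop words
                if re.fullmatch(r"[a-zA-Z]+", word) and word not in STOP_WORDS:
                    words.append(word)
    return [word for word, _ in Counter(words).most_common(limit)]

readme = "\n".join([
    "# Building web applications in Java with Spring Boot 3",
    "## What is Spring?",
    "## Resources",
    "### Spring",
])
result = top_words(readme)
print(result)  # 'spring' appears three times, so it comes first
```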
To run this:

```bash
just extract after-filter.csv
```

You should expect to have `sr-data/experiment/after-extract.csv`.
For each repository, we aggregate all top words from the README headings into a single string. We then convert each string into three variants of embeddings: S-BERT, E5, and Embedv3.
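The aggregation is a simple join of the extracted words; the embedding step is shown only as a comment, since it downloads model weights, and the use of the `sentence-transformers` package and the `all-MiniLM-L6-v2` model is our assumption, not a confirmed detail of `sr-data`:

```python
def heading_text(top_words: list[str]) -> str:
    """Aggregate a repository's top heading words into one string."""
    return " ".join(top_words)

text = heading_text(["spring", "build", "web", "application", "java"])
print(text)  # spring build web application java

# The resulting strings are then embedded. With the `sentence-transformers`
# package (an assumption about the implementation), a 384-dimensional
# S-BERT vector could be produced like this:
#
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim S-BERT model
#   vector = model.encode(text)
```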
To run this:

```bash
just embed after-extract.csv
```

You should expect to have three files:
- `sr-data/experiment/embeddings-s-bert-384.csv`
- `sr-data/experiment/embeddings-e5-1024.csv`
- `sr-data/experiment/embeddings-embedv3-1024.csv`
We calculate an SR-score from the numerical metadata for each repository, and create seven datasets from the prepared data:
- `scores.csv`, dataset with SR-scores;
- `sbert.csv`, dataset from S-BERT-384 embeddings;
- `e5.csv`, dataset from E5-1024 embeddings;
- `embedv3.csv`, dataset from Embedv3 embeddings;
- `scores+sbert.csv`, combination of SR-score and S-BERT-384 embeddings;
- `scores+e5.csv`, combination of SR-score and E5-1024 embeddings;
- `scores+embedv3.csv`, combination of SR-score and Embedv3-1024 embeddings.
To run this:

```bash
just datasets
```

You should expect to have all seven files in the `sr-data/experiment` directory.
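The `scores+*` datasets combine the SR-score column with the embedding columns per repository; a minimal pandas sketch of such a combination (column and key names here are hypothetical, not taken from the actual CSV files):

```python
import pandas as pd

# Hypothetical fragments of scores.csv and sbert.csv, keyed by repository
scores = pd.DataFrame({"repo": ["a/b", "c/d"], "score": [0.9, 0.2]})
sbert = pd.DataFrame({"repo": ["a/b", "c/d"], "dim0": [0.1, 0.3], "dim1": [0.7, 0.5]})

# scores+sbert: join SR-scores with the embedding columns on the repo key
combined = scores.merge(sbert, on="repo")
print(combined.columns.tolist())  # ['repo', 'score', 'dim0', 'dim1']
```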
We apply clustering to the previously created datasets. We use the following algorithms: K-Means, Agglomerative Clustering, DBSCAN, and Gaussian Mixture Model (GMM). Each algorithm generates a set of clusters for each dataset.
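With scikit-learn (a plausible but assumed implementation choice; the parameters actually used are recorded per run in `config.json|txt`), the four algorithms can be sketched on a toy dataset:

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.mixture import GaussianMixture

# Toy stand-in for one dataset: four repositories, two obvious groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

labels = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    "agglomerative": AgglomerativeClustering(n_clusters=2).fit_predict(X),
    "dbscan": DBSCAN(eps=2.0, min_samples=2).fit_predict(X),
    "gmm": GaussianMixture(n_components=2, random_state=0).fit_predict(X),
}
for name, found in labels.items():
    # each algorithm should put the first two and last two points together
    print(name, found.tolist())
```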
To run this:

```bash
just cluster
```

You should expect to have the following directories inside the `experiment` directory:
- `kmeans`
- `agglomerative`
- `dbscan`
- `gmm`

Each directory has subdirectories named after the datasets: `e5`, `embedv3`, `scores+sbert`, etc. In each subdirectory you should have a `config.json|txt` file with the model parameters used, and a `clusters` directory with files containing the clustered repositories. Each file, for instance `0.txt`, where `0` is the cluster identifier, hosts a list of repositories in `OWNER/REPO` format, separated by new lines:

```text
Faceplugin-ltd/FaceRecognition-LivenessDetection-Android
LxxxSec/CTF-Java-Gadget
flutter-youni/flutter_youni_gromore
ax1sX/RouteCheck-Alpha
darksolopic/PasswordManagerGUI
borjavb/bq-lineage-tool
...
```
Make sure that you have Python 3.10+, `just`, and `npm` installed on your system. Then, fork this repository, make your changes, and send us a pull request. We will review your changes and apply them to the `master` branch shortly, provided they don't violate our quality standards. To avoid frustration, please run the full build before sending us your pull request:

```bash
just full
```