This repository contains the scripts and partial results needed to reproduce the experiments of a paper submitted to a conference. The name of the conference has been omitted on purpose to preserve the anonymity of the authors.
Pipeline to identify the presence of software in GitHub repositories and to classify that software into different categories: Workflows, Notebooks, Libraries, Services, Scripts, Benchmarks, and Others.
Python >= 3.9
This pipeline is composed of the following scripts:
- add_github_metadata.py: This script takes a list of GitHub repositories from a CSV file and creates a new CSV file with the following metadata fields from GitHub: programming languages used, description, and topics.
- multilabel_classifier.py: This script classifies the software into the following categories: Workflow, Benchmark, Library, Service, and Other. It has been trained using a DistilBERT model and uses the description field provided by GitHub.
- create_classification.py: This script classifies the software based on the file extensions and the text classifier model.
Additional scripts are provided in the repository to reproduce the experiments:
- calculate_distribution.py: This script calculates the distribution of the categories identified in the repositories.
- calculate_topics.py: This script identifies the topics associated with the repositories selected in our study and calculates their frequency (a rough sketch of this computation is shown after this list).
- corpus_classifier_construction.py: This script builds a corpus based on the categories used by the classifier. These categories are derived from the detected topics. Fifty samples from each category were selected. This script generates an initial version of the corpus, and a second corpus is produced after manual validation.
- insertData.py: This script uploads the OpenAIRE Research Graph dump for software artifacts to a MongoDB database.
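As a rough illustration of the topic-frequency computation performed by calculate_topics.py, the sketch below counts topics from the metadata CSV. It assumes pandas and that the topics column stores a serialized Python list; the actual script may differ:

```python
import ast
from collections import Counter

import pandas as pd

df = pd.read_csv("results/metadata_github_list.csv")

# Each row stores its topics as a serialized list, e.g. "['nlp', 'workflow']"
topic_counts = Counter()
for topics in df["topics"].dropna():
    topic_counts.update(ast.literal_eval(topics))

# Print the ten most frequent topics with their frequencies
for topic, freq in topic_counts.most_common(10):
    print(topic, freq)
```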
- corpora/corpus_to_annotate: It is the corpus generated by the script corpus_classifier_construction.py, built from the most frequent topics of software types.
- corpora/corpus_classifier: It is the corpus used to train and evaluate the multilabel classifier, generated after the manual validation of corpus_to_annotate.
- datasets/github_openaire.csv: This CSV contains the GitHub repositories present in the OpenAIRE Research Graph.
- datasets/languages.yml: It is the language database (YAML file) of the Linguist library, which specifies the type of a file based on its extension.
- results/metadata_github_list.csv: It is a CSV with the metadata extracted from the GitHub repositories.
- results/software_classification: It is the final classification of the GitHub repositories.
- results/topics_distribution: List of topics of the selected repositories, with each topic listed together with its frequency.
First, clone the repository:
git clone https://github.com/oeg-upm/software_classification
To execute the pipeline, some specific libraries must be installed.
pip install -r requirements.txt
Note that the scripts expect fixed, standardized file names; to use different names, you must edit the source code and update the corresponding variable.
The first step is to integrate the metadata from GitHub. The selected repositories are listed in the file github_openaire.csv. It is important to always use codeRepositoryUrl as the header.
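For reference, the input file should look like the following (the URLs here are only illustrative):

```
codeRepositoryUrl
https://github.com/example-org/example-repo
https://github.com/another-org/another-tool
```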
python src/add_github_metadata.py
A metadata_github_list.csv file will be generated. An example can be found in the results directory. This file contains the following fields: name, description, language, stars, url, topics, github_url, and languages. language contains the main language found, while languages lists the different languages detected by the Linguist library. The language categories are programming languages, data, prose, and markup. These languages help determine whether the repository is software or not.
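For context, this metadata is gathered through the public GitHub REST API. The sketch below shows the kind of calls involved; the variable and function names are illustrative, not the script's actual code:

```python
import requests

TOKEN = "<your GitHub token>"  # personal access token
HEADERS = {"Authorization": f"token {TOKEN}"}

def fetch_metadata(owner, repo):
    # The repository endpoint returns name, description, stars,
    # main language, and topics
    r = requests.get(f"https://api.github.com/repos/{owner}/{repo}",
                     headers=HEADERS)
    r.raise_for_status()
    data = r.json()
    # The languages endpoint returns bytes of code per detected language
    langs = requests.get(f"https://api.github.com/repos/{owner}/{repo}/languages",
                         headers=HEADERS).json()
    return {
        "name": data["name"],
        "description": data["description"],
        "language": data["language"],
        "stars": data["stargazers_count"],
        "topics": data.get("topics", []),
        "languages": list(langs),
    }
```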
The second step is to categorize these languages. For this, we need the language database provided by the Linguist library (datasets/languages.yml).
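As an illustration, an extension-to-type lookup can be built from languages.yml as follows (a minimal sketch assuming PyYAML; each entry in the file declares a type and, optionally, a list of extensions):

```python
import yaml

with open("datasets/languages.yml") as f:
    languages = yaml.safe_load(f)

# Map each file extension to its language type
# (programming, markup, data, or prose)
ext_to_type = {}
for name, info in languages.items():
    for ext in info.get("extensions", []):
        ext_to_type[ext.lower()] = info["type"]

print(ext_to_type[".py"])    # programming
print(ext_to_type[".md"])    # prose
print(ext_to_type[".csv"])   # data
print(ext_to_type[".html"])  # markup
```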
This script requires a GitHub token. Please replace the token in the code (variable token).
python src/corpus_classifier_construction.py
After execution, a file named corpus_to_annotate.csv will be generated. The structure of this file includes github_url, description, Library, Benchmark, Service, Workflow, and Other. These columns represent the categories, with a value of 1 indicating presence and 0 indicating absence.
A new corpus must be generated if you want to train the model. The name of this corpus is corpus_for_classifier.csv, and an example can be found in the corpora folder. This new corpus does not contain the github_url field, so that column must be removed after manual validation.
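Dropping the column can be done with a small sketch like this (assuming pandas):

```python
import pandas as pd

# Load the manually validated corpus and remove the URL column
corpus = pd.read_csv("corpora/corpus_to_annotate.csv")
corpus.drop(columns=["github_url"]).to_csv(
    "corpora/corpus_for_classifier.csv", index=False)
```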
To train the model, run the multilabel_classifier script.
python src/multilabel_classifier.py
Remember that you need the corpus_for_classifier.csv file, which you can find in the corpora folder.
The results of the training and evaluation process will be saved in a folder named results_classifier. The model and tokenizer will be available in the multilabel_software_classifier folder.
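The exact training setup lives in src/multilabel_classifier.py; for orientation, the sketch below shows a minimal multi-label fine-tuning loop with Hugging Face Transformers. The label set follows the corpus structure described above, while the hyperparameters and base checkpoint are assumptions:

```python
import pandas as pd
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["Library", "Benchmark", "Service", "Workflow", "Other"]

corpus = pd.read_csv("corpora/corpus_for_classifier.csv")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

class CorpusDataset(torch.utils.data.Dataset):
    """Pairs each tokenized description with its multi-hot label vector."""
    def __init__(self, df):
        self.enc = tokenizer(list(df["description"]), truncation=True,
                             padding=True, return_tensors="pt")
        self.labels = torch.tensor(df[LABELS].values, dtype=torch.float)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

# problem_type selects a sigmoid + BCE loss, one output per label
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(LABELS),
    problem_type="multi_label_classification")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="results_classifier",
                           num_train_epochs=3),
    train_dataset=CorpusDataset(corpus))
trainer.train()
trainer.save_model("multilabel_software_classifier")
tokenizer.save_pretrained("multilabel_software_classifier")
```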
Finally, to generate the classification of the software, execute:
python src/create_classification.py
A file named software_classification.csv will be generated. The classification can be found in the types_software column as a serialized dictionary. For example, "{'Library': 0, 'Benchmark': 0, 'Service': 0, 'Workflow': 1, 'Script': 1, 'Notebook': 0, 'Other': 0}" indicates that a workflow and scripts have been detected.
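Because the dictionary is serialized with single quotes, it is not strictly valid JSON; in Python, ast.literal_eval parses it safely. A small sketch (assuming pandas):

```python
import ast

import pandas as pd

df = pd.read_csv("software_classification.csv")

# Parse the serialized dictionary and list the detected categories per repository
df["types_software"] = df["types_software"].apply(ast.literal_eval)
df["detected"] = df["types_software"].apply(
    lambda d: [label for label, present in d.items() if present])
print(df[["detected"]].head())
```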