Repo Classifier is a web application realized with Django that uses Machine Learning algorithms to classify GitHub repositories into several categories. This project is developed by the team of the University of Stuttgart for the informatiCup2017.
- python 3.5+
- pip
- yarn
- Create a virtual environment with
python -m venv .venv - Activate the venv with
source .venv/bin/activateor installautoenvand create a.envfile as described below - Install the dependencies with
pip install -r requirements.txt - Setup the python SDK to
.venv/bin/pythonto also access the dependencies in the venv - Install web dependencies with
yarn - Copy the file
github.conf.templatetogithub.confand set your GitHub token or credentials
- Install a dependency in the venv with
pip install DEPENDENCY - Save a snapshot of the dependencies with
pip freeze > requirements.txt
After autoenv is installed you need to configure your bash to automatically use it. There are different ways to use it depending on the shell you use. For example
Simply create a file called .env in the root directory of the repository and paste this in it for Linux.
source .venv/bin/activate
echo "repo-classifier - NICE! (venv on)"
For classification we expect a file containing one line for each repository url.
https://github.com/briantemple/homeworkr
https://github.com/angular
...
For training we need a file which contains the label for a url. We support two formatted inputs for training:
Separated with a simple space.
https://github.com/briantemple/homeworkr DEV
...
or separated with a comma and first line will be ignored (becaused we used the django export):
url,category
https://github.com/owncloud/documentation,DOCS
...
- Run
python manage.py migrateto setup the database. - Execute
python manage.py runserverto start the live reloading webserver on localhost:8000. - You can now run the classifier by uploading a file and pressing
Classify. - (Optional) You can train a model by uploading a file as described above and pressing
Train.
After the complete setup described above open a terminal and cd into the root directory of the project.
Then just execute python main.py and a help output will guide you with the possible parameters.
For simple classification just run python main.py -f "path/to/file" -c which will run the classification (Attention: the model has to be available to run the classification).
If training is needed just replace the -c with -t: python main.py -f "path/to/file" -t
Klassifikation funktioniert nicht. Konsole wirft Fehler mit nicht passenden Featurezahlen oder ähnliches.
Das schließt darauf hin, dass sich die Featurereihenfolge oder die Featureanzahl verändert hat und ein altes Classification model genutzt. Für den Fall muss man das Training neu starten und dadurch ein neues model generieren lassen, dass dann verwendet werden kann.
Im root Verzeichnis befinden sich die db und ein model.
Die DB beinhaltet dabei gecachte Werte für Features und Repositories und erleichtert so neues Trainieren etc, da keine Request gesendet werden müssen.
Das Model ist das Klassifizierungsmodel und beinhaltet das Training, wodurch die Klassifizierung überhaupt erst ermöglicht wird.
Diese Dateien können auch kopiert und wiederverwendet werden, solange sich nichts an den Features selbst verändert.