This tool aims to predict relevant GitHub topics for repositories by analyzing their content. It collects repository data via the GitHub API, processes descriptive text and README files, and uses a BERT-based multi-label classifier to suggest appropriate topics. The system includes complete data collection and model training pipelines, with support for exporting trained models to ONNX format for deployment.
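To make the multi-label idea concrete, here is a minimal sketch of how per-topic scores could be turned into topic suggestions. The label list, the example logits, and the 0.5 threshold are assumptions for illustration and are not taken from topicgen; in practice the logits would come from the trained classifier.

```python
import math

# Hypothetical label set and threshold, for illustration only.
TOPIC_LABELS = ["machine-learning", "nlp", "web", "cli"]
THRESHOLD = 0.5


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_topics(logits, labels=TOPIC_LABELS, threshold=THRESHOLD):
    """Return every label whose sigmoid probability reaches the threshold.

    Each label is scored independently, so zero, one, or several topics
    can be selected for a single repository.
    """
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    return [label for label, z in zip(labels, logits) if sigmoid(z) >= threshold]


# Example logits, as a classifier head might emit for one repository.
print(decode_topics([2.1, 0.3, -1.7, -0.2]))
```

Scoring each label with its own sigmoid, rather than a softmax over all labels, is what lets a repository receive more than one topic.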
Data Collection Pipeline - Sample Database
- Collects GitHub repository data (metadata, topics, READMEs) via GitHub API
- Analyzes repository content to predict relevant topics using ML models
- Trains a BERT-based multi-label classifier for topic prediction
- Stores repository and topic data in SQLite for efficient retrieval
- Exports trained models to ONNX format for production deployment
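As a rough illustration of the SQLite storage mentioned above, the sketch below keeps repositories and topics in separate tables joined by a link table. The schema, table names, and helper functions are hypothetical and may differ from what topicgen actually uses.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS repositories (
    id INTEGER PRIMARY KEY,
    full_name TEXT UNIQUE NOT NULL,
    description TEXT,
    stars INTEGER
);
CREATE TABLE IF NOT EXISTS topics (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS repository_topics (
    repository_id INTEGER NOT NULL REFERENCES repositories(id),
    topic_id INTEGER NOT NULL REFERENCES topics(id),
    PRIMARY KEY (repository_id, topic_id)
);
"""


def save_repository(conn, full_name, description, stars, topics):
    """Insert one repository and link it to its topics (many-to-many)."""
    conn.execute(
        "INSERT OR IGNORE INTO repositories (full_name, description, stars) VALUES (?, ?, ?)",
        (full_name, description, stars),
    )
    repo_id = conn.execute(
        "SELECT id FROM repositories WHERE full_name = ?", (full_name,)
    ).fetchone()[0]
    for topic in topics:
        conn.execute("INSERT OR IGNORE INTO topics (name) VALUES (?)", (topic,))
        topic_id = conn.execute(
            "SELECT id FROM topics WHERE name = ?", (topic,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO repository_topics (repository_id, topic_id) VALUES (?, ?)",
            (repo_id, topic_id),
        )
    conn.commit()


def topics_for(conn, full_name):
    """Return the sorted topic names stored for a repository."""
    rows = conn.execute(
        """
        SELECT t.name FROM topics t
        JOIN repository_topics rt ON rt.topic_id = t.id
        JOIN repositories r ON r.id = rt.repository_id
        WHERE r.full_name = ?
        ORDER BY t.name
        """,
        (full_name,),
    ).fetchall()
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")  # in-memory database keeps the sketch self-contained
conn.executescript(SCHEMA)
save_repository(conn, "octocat/hello-world", "Example repository", 1200, ["python", "demo"])
print(topics_for(conn, "octocat/hello-world"))
```

The link table lets a topic be stored once and shared across many repositories, which keeps label counts easy to query when building a training set.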
git clone https://github.com/Namgyu-Youn/topicgen.git
cd topicgen
curl -sSL https://install.python-poetry.org | python3 - # Optional
poetry install
# Data Collection Pipeline
poetry run python -m topicgen.pipelines.data_collection_pipeline --min-stars 1000 --language python --max-repos 500
# Model Training Pipeline
poetry run python -m topicgen.pipelines.model_training_pipeline --base-model bert-base-uncased --num-epochs 5
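For readers wondering how flags such as `--min-stars`, `--language`, and `--max-repos` might map onto the GitHub API, the sketch below builds paged repository-search URLs. This is an assumption about one plausible approach, not a description of topicgen's internals; `build_search_urls` is a hypothetical helper, and no network request is made.

```python
import math
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"
PER_PAGE = 100  # maximum page size accepted by the GitHub search API


def build_search_urls(min_stars: int, language: str, max_repos: int):
    """Return the paged search URLs needed to cover max_repos results."""
    query = f"stars:>={min_stars} language:{language}"
    pages = math.ceil(max_repos / PER_PAGE)
    return [
        SEARCH_URL
        + "?"
        + urlencode(
            {
                "q": query,
                "sort": "stars",
                "order": "desc",
                "per_page": PER_PAGE,
                "page": page,
            }
        )
        for page in range(1, pages + 1)
    ]


urls = build_search_urls(min_stars=1000, language="python", max_repos=500)
print(len(urls))
print(urls[0])
```

Note that the GitHub search API only serves the first 1,000 results per query, so a larger `--max-repos` value would need the query split, for example by star range.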
# Build the Docker image
docker build -t github-topic-generator .
# Run data collection pipeline
docker run github-topic-generator python -m topicgen.pipelines.data_collection_pipeline
# Run model training pipeline
docker run github-topic-generator python -m topicgen.pipelines.model_training_pipeline
python -m venv env
# On Windows
env\Scripts\activate
# On macOS/Linux
source env/bin/activate
pip install -r requirements.txt
# Data Collection Pipeline
python -m topicgen.pipelines.data_collection_pipeline
# Model Training Pipeline
python -m topicgen.pipelines.model_training_pipeline
- Enter GitHub URL
- Select the main category and subcategory that best match your repository
- Click "Generate Topics" to get your results
- Enjoy the generated topics, each prefixed with '#'.
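The first and last steps above can be sketched in a few lines: splitting a GitHub URL into owner and repository name, and rendering predicted topics as '#' tags. Both helpers are hypothetical and only illustrate the idea; the topic list is made up.

```python
from urllib.parse import urlparse


def parse_repo_url(url: str):
    """Split a GitHub repository URL into (owner, repo)."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in {"github.com", "www.github.com"}:
        raise ValueError(f"not a GitHub URL: {url}")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected https://github.com/<owner>/<repo>, got: {url}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]  # drop the clone suffix if present
    return owner, repo


def format_topics(topics):
    """Render topics as a space-separated string of '#'-prefixed tags."""
    return " ".join(f"#{t}" for t in topics)


print(parse_repo_url("https://github.com/Namgyu-Youn/topicgen.git"))
print(format_topics(["python", "bert", "onnx"]))  # made-up example topics
```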
👥 Contribution guide: CONTRIBUTING.md
Thanks for your interest. I always enjoy meaningful collaboration.
Have a question or found a bug? Please open an issue. You can also tag it with one of the available labels (🏷️).