Skip to content

Analyze GitHub repository based on pre-trained model, then generate related topics('#').

License

Notifications You must be signed in to change notification settings

namgyu-youn/topicgen

Repository files navigation

topicgen

This tool aimes to predict relevant GitHub topics for repositories by analyzing their content. It collects repository data via the GitHub API, processes descriptive text and README files, and utilizes a BERT-based multi-label classifier to suggest appropriate topics. The system includes complete data collection and model training pipelines, with support for exporting trained models to ONNX format for deployment.

Project Structure

Project Overview

project_architecture

Data Collection Pipeline - Sample Database

data_collection_pipeline

Model Training Pipeline

training_pipeline

✨ Features

  • Collects GitHub repository data (metadata, topics, READMEs) via GitHub API
  • Analyzes repository content to predict relevant topics using ML models
  • Trains a BERT-based multi-label classifier for topic prediction
  • Stores repository and topic data in SQLite for efficient retrieval
  • Exports trained models to ONNX format for production deployment

🚩 How to use?

git clone https://github.com/Namgyu-Youn/topicgen.git
cd topicgen

Option 1: Using Poetry (Highly Recommended)

curl -sSL https://install.python-poetry.org | python3 - # Optional
poetry install

# Data Collection Pipeline
poetry run python -m topicgen.pipelines.data_collection_pipeline --min-stars 1000 --language python --max-repos 500

# Model Training Pipeline
poetry run python -m topicgen.pipelines.model_training_pipeline --base-model bert-base-uncased --num-epochs 5

Option 2: Using Docker

# Build the Docker image
docker build -t github-topic-generator .

# Run data collection pipeline
docker run github-topic-generator python -m topicgen.pipelines.data_collection_pipeline

# Run model training pipeline
docker run github-topic-generator python -m topicgen.pipelines.model_training_pipeline

Option 3: Standard Python Setup

python -m venv env

# On Windows
env\Scripts\activate
# On macOS/Linux
source env/bin/activate

pip install -r requirements.txt

# Data Collection Pipeline
python -m topicgen.pipelines.data_collection_pipeline

# Model Training Pipeline
python -m topicgen.pipelines.model_training_pipeline

🧐 Introduction about gradio UI

  1. Enter GitHub URL
  2. Select the main, sub category that best matches your repository
  3. Click "Generate Topics" to get your results
  4. Enjoy generated topics('#')! It can be used like this.

👥 Contribution guide : CONTRIBUTING.md

Thanks for your interest. I always enjoy meaningful collaboration.
Do you have any question or bug?? Then submit ISSUE. You can also use awesome labels(🏷️).