Welcome to the GitHub Repository Recommender System! This project fetches data from GitHub repositories, preprocesses it, and recommends repositories to users based on their preferences. Below is a guide on how to set up, run, and understand the project.
The pipeline has four stages: fetching repository data, preprocessing it, extracting relevant keywords, and generating recommendations from similarity scores between user preferences and repository features.
- Data Fetching: Retrieve repository data, README content, and issues/labels from GitHub.
- Data Preprocessing: Clean and preprocess the fetched data.
- Keyword Extraction: Extract keywords using TF-IDF, LDA, and BERT.
- Similarity Calculation: Compute similarity between user preferences and repository features.
- Recommendations: Generate and display repository recommendations for users.
Ensure you have the following installed:
- Python 3.7+
- Git
- Virtual Environment (optional but recommended)
- Clone the repository:
git clone https://github.com/your-username/github-recommender-system.git
cd github-recommender-system
- Install the required Python packages:
pip install -r requirements.txt
- Add your GitHub token to the environment:
export GITHUB_TOKEN='your_github_token'
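Once the token is exported, a script can read it from the environment. The following is a minimal sketch of how that might look, not the project's actual code: the helper name `github_headers` is hypothetical, and the header values follow GitHub's commonly documented bearer-token convention.

```python
import os


def github_headers(env=None):
    """Build HTTP headers from the GITHUB_TOKEN environment variable.

    `github_headers` is a hypothetical helper, not part of this project.
    `env` defaults to os.environ and can be replaced by a dict for testing.
    """
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN", "").strip()
    if not token:
        # Fail early with a clear message instead of sending anonymous requests.
        raise RuntimeError("GITHUB_TOKEN is not set; export it before fetching data")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
```

Failing early when the token is missing avoids silently falling back to unauthenticated requests, which GitHub rate-limits much more strictly.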
- Fetch Repository Data: Use `fetch_repo_data.py` to gather repository metadata, README content, languages, and topics.
- Fetch Issue Labels: Use `fetch_issue_labels.py` to scrape issue labels from repository pages.
- Fetch Trending Repositories: Use `fetch_trending_repos.py` to get trending repositories based on language and spoken language.
- Fetch Trending Metadata: Use `fetch_trending_repos_metadata.py` to gather metadata for trending repositories.
- Fetch Trending Issues Labels: Use `fetch_trending_issues_labels.py` to scrape issue labels for trending repositories.
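The repository fetch step can be pictured as a few calls to the GitHub REST API. The sketch below is an illustration under assumptions, not the contents of `fetch_repo_data.py`: the function names `repo_endpoints`, `fetch_json`, and `decode_readme` are hypothetical, and only the standard library is used.

```python
import base64
import json
import urllib.request

API_ROOT = "https://api.github.com"


def repo_endpoints(owner, repo):
    """Return the REST API URLs for one repository's metadata, README, languages, and topics."""
    base = f"{API_ROOT}/repos/{owner}/{repo}"
    return {
        "metadata": base,
        "readme": f"{base}/readme",
        "languages": f"{base}/languages",
        "topics": f"{base}/topics",
    }


def fetch_json(url, token=None):
    """GET a URL and parse the JSON body (hypothetical helper; performs a network call)."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def decode_readme(payload):
    """Decode the base64 `content` field that the README endpoint returns by default."""
    raw = base64.b64decode(payload["content"])
    return raw.decode("utf-8", errors="replace")
```

For example, `decode_readme(fetch_json(repo_endpoints("octocat", "Hello-World")["readme"], token))` would return the README text for that repository, assuming network access and a valid token.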
- Preprocess Data: Clean and preprocess the README content and issues.
- Extract Keywords: Use TF-IDF, LDA, and BERT to extract relevant keywords from the README and issues.
- Vectorize Data: Transform the preprocessed data into vectors using TF-IDF.
- Compute Similarity: Calculate cosine similarity between user preferences and repository vectors.
- Generate Recommendations: Recommend repositories to users based on the highest similarity scores.
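The vectorize, compare, and rank steps above can be sketched with the standard library alone. This is an illustrative sketch, not the project's implementation: the function names are hypothetical, the tokenizer is deliberately crude, the IDF uses a simple smoothed form, and the LDA and BERT keyword extraction steps are left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of three or more letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = sum(counts.values()) or 1
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(preferences, repos, top_n=3):
    """Rank repositories (name -> text) by similarity to the preference text."""
    names = list(repos)
    # The preference text is vectorized together with the repository texts
    # so that all vectors share the same IDF weights.
    vectors = tfidf_vectors([preferences] + [repos[name] for name in names])
    user_vec, repo_vecs = vectors[0], vectors[1:]
    scored = [(name, cosine(user_vec, vec)) for name, vec in zip(names, repo_vecs)]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Made-up repository descriptions, for illustration only.
    example_repos = {
        "ml-toolkit": "Machine learning toolkit with neural network training utilities",
        "web-server": "Fast HTTP web server with routing and middleware",
        "data-viz": "Plotting library for charts and data visualization",
    }
    for name, score in recommend("neural network machine learning", example_repos):
        print(f"{name}: {score:.3f}")
```

In this toy data only `ml-toolkit` shares terms with the preference text, so it ranks first and the other two score zero. In practice a library vectorizer would likely replace the hand-rolled TF-IDF, but the ranking logic stays the same.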