A Python-based web scraper that extracts GitHub topics and their top repositories, then saves the data into structured CSV files for further analysis.
This project scrapes GitHub Topics from the GitHub Topics Page and retrieves information about top repositories under each topic.
The scraped data is then saved as CSV files inside the Data/ directory.
This project was built using the following technologies:
- Python - The core programming language
- BeautifulSoup - For parsing and extracting HTML data
- Requests - To send HTTP requests and fetch web pages
- Pandas - For data manipulation and exporting to CSV
- Extracts a list of top GitHub topics (Machine Learning, Data Science, etc.)
- Retrieves top repositories under each topic
- Extracts repository details (username, repo name, stars, repo URL)
- Saves all data into CSV files for easy analysis
- Implements logging to track scraping progress and errors
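The topic-extraction step could be sketched as below. Note that the CSS class name used to locate topic titles is an assumption based on GitHub's markup at the time of writing and may change without notice; `parse_topic_titles` and `scrape_topics` are illustrative names, not functions from the notebook.

```python
import requests
from bs4 import BeautifulSoup

TOPICS_URL = "https://github.com/topics"

def parse_topic_titles(soup):
    """Extract topic titles from a parsed GitHub Topics page.

    The CSS class below is an assumption about GitHub's current
    markup; if GitHub changes its styling, this selector breaks.
    """
    title_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    tags = soup.find_all("p", class_=title_class)
    return [tag.text.strip() for tag in tags]

def scrape_topics(url=TOPICS_URL):
    """Download the Topics page and return a list of topic titles."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # surface HTTP errors early
    return parse_topic_titles(BeautifulSoup(response.text, "html.parser"))
```

Separating the parsing from the network call keeps the parser easy to test against saved HTML.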
├── GitHub_Scraper.ipynb   # Jupyter Notebook containing the scraper
├── LOGS/                  # Stores log files for debugging
├── Data/                  # Stores CSV files with scraped data
└── README.md              # Project documentation
- Scrape GitHub Topics - The script fetches the top topics from GitHub and extracts their names, descriptions, and URLs.
- Scrape Top Repositories - For each topic, the script scrapes top repositories, extracting details like username, repo name, star count, and repo URL.
- Save Data to CSV - Extracted data is stored in structured CSV files, one for each topic.
- Logging for Debugging - A logging system is implemented to track the scraping progress and handle errors.
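The save and logging steps might look like this minimal sketch. The Data/ and LOGS/ directory names follow the project layout; the CSV column names and the `save_topic_repos` helper are illustrative assumptions chosen to match the repository details listed above.

```python
import logging
import os

import pandas as pd

# Directory names follow the project layout described above.
os.makedirs("Data", exist_ok=True)
os.makedirs("LOGS", exist_ok=True)

logging.basicConfig(
    filename=os.path.join("LOGS", "scraper.log"),
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def save_topic_repos(topic_name, repos):
    """Write one topic's repositories to Data/<topic>.csv and log progress.

    `repos` is a list of dicts keyed by username, repo_name, stars, and
    repo_url (hypothetical keys matching the details listed above).
    """
    filename = topic_name.lower().replace(" ", "-") + ".csv"
    path = os.path.join("Data", filename)
    df = pd.DataFrame(repos, columns=["username", "repo_name", "stars", "repo_url"])
    df.to_csv(path, index=False)
    logging.info("Saved %d repositories for topic %r to %s", len(df), topic_name, path)
```

Writing one CSV per topic matches the per-topic files in Data/, and each write is logged so progress can be checked in LOGS/ while the scraper runs.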
Example 3D topic data stored in Data/3d.csv:
git clone https://github.com/Shruti-H/github-topic-repo-scraper.git
cd github-topic-repo-scraper
jupyter notebook scraping-github-topics-repositories.ipynb

If Jupyter Notebook is not installed, install it using:

pip install notebook

- Open scraping-github-topics-repositories.ipynb in Jupyter Notebook.
- Click "Run All" to execute all the cells.
- Further analyze CSV data using Pandas, SQL, or visualization tools.
- Improve scraping speed using parallel processing.
- Add database storage for structured data retrieval.
- Extract additional repository details (Forks, Issues, Contributors).
- Implement pagination to scrape more topics and repositories by iterating through multiple pages (?page=1, ?page=2, etc.).
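The pagination idea could be sketched as a URL builder like the one below. The ?page=N pattern follows the note above; the default cap of 5 pages is an arbitrary assumption, and a real scraper might instead stop when a page returns no entries.

```python
def topic_page_urls(base_url="https://github.com/topics", max_pages=5):
    """Build paginated topic-listing URLs (?page=1, ?page=2, ...).

    max_pages is an arbitrary cap for this sketch; fetching could
    stop earlier once a page yields no topic entries.
    """
    return [f"{base_url}?page={page}" for page in range(1, max_pages + 1)]
```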
