A real-time analytics platform for GitHub events data, providing insights into repository popularity, user activity, and programming language trends.
This project implements a complete streaming data pipeline with the following components:
- **Data Collection**
  - Python scripts fetch data from the GitHub Events API
  - Events are published to Apache Kafka topics
- **Stream Processing**
  - Apache Spark Streaming processes GitHub events in real time
  - Analyzes repository popularity, user activity, and language trends
  - Calculates metrics using sliding windows
- **Data Storage**
  - PostgreSQL database with an optimized schema
  - Tables for events, repositories, users, and aggregated metrics
- **Visualization**
  - Grafana dashboards for real-time analytics visualization
```
github-events-analytics/
├── data-collector/     # Python scripts for GitHub API data collection
├── kafka-setup/        # Kafka configuration and setup files
├── spark-jobs/         # Spark streaming applications
├── database/           # Database schema and migration scripts
├── grafana/            # Grafana dashboard definitions
├── docker/             # Docker compose files for the entire stack
├── .github/            # GitHub Actions workflows for CI/CD
├── start.sh            # Script to start all services
├── monitor.sh          # Script to monitor service health
├── cleanup.sh          # Script to clean up the environment
├── CONTRIBUTING.md     # Guidelines for contributing to the project
├── LICENSE             # MIT License
└── README.md           # This file
```
The system analyzes the following metrics:
- Repository popularity over time (stars, forks, watches)
- Event type distribution (push, pull request, issue, etc.)
- Active users and their contribution patterns
- Programming language trends and adoption
Prerequisites:

- Docker and Docker Compose
- Git
- GitHub API token (optional, for higher rate limits)
To get started:

1. Clone the repository:

   ```bash
   git clone https://github.com/javid912/github-events-analytics.git
   cd github-events-analytics
   ```

2. Create a `.env` file in the root directory with the following variables:

   ```
   GITHUB_API_TOKEN=your_github_token
   POSTGRES_USER=postgres
   POSTGRES_PASSWORD=postgres
   POSTGRES_DB=github_events
   ```

3. Start the entire stack using the start script:

   ```bash
   ./start.sh
   ```

4. Monitor the health of all services:

   ```bash
   ./monitor.sh
   ```

5. Access the Grafana dashboard at http://localhost:3000 (default credentials: `admin`/`admin`).
`data-collector/`: Python scripts that fetch data from the GitHub Events API and publish it to Kafka topics. The collector handles rate limiting, pagination, and error recovery.
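A minimal sketch of the collection loop, assuming the kafka-python client and an illustrative topic name `github-events` (the real topic names live in `kafka-setup/`; pagination is omitted for brevity):

```python
import json
import time

import requests
from kafka import KafkaProducer

GITHUB_EVENTS_URL = "https://api.github.com/events"  # public events feed
TOPIC = "github-events"  # illustrative topic name

def fetch_and_publish(producer, token=None):
    """Fetch one page of public events and publish each one to Kafka."""
    headers = {"Authorization": f"token {token}"} if token else {}
    resp = requests.get(GITHUB_EVENTS_URL, headers=headers, timeout=10)
    # Error recovery: back off until the rate-limit window resets
    if resp.status_code == 403 and resp.headers.get("X-RateLimit-Remaining") == "0":
        time.sleep(max(int(resp.headers["X-RateLimit-Reset"]) - time.time(), 0) + 1)
        return
    resp.raise_for_status()
    for event in resp.json():
        producer.send(TOPIC, event)
    producer.flush()

if __name__ == "__main__":
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    while True:
        fetch_and_publish(producer)
        time.sleep(10)  # poll gently to stay under the rate limit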
`kafka-setup/`: Kafka configuration for topic creation, retention policies, and consumer groups.
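For illustration, topics with a retention policy can also be created programmatically; this sketch assumes kafka-python and a hypothetical seven-day retention:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Hypothetical topic: 3 partitions, single replica, 7-day retention
topic = NewTopic(
    name="github-events",
    num_partitions=3,
    replication_factor=1,
    topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
)
admin.create_topics([topic])
```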
`spark-jobs/`: Spark Streaming applications that process the event stream, calculate metrics, and store results in PostgreSQL.
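A condensed sketch of the windowed aggregation, assuming Spark Structured Streaming, the `github-events` topic, and a simplified event schema (field and table names here are illustrative, not the repository's actual job):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("github-events").getOrCreate()

# Illustrative subset of the GitHub event payload
schema = StructType([
    StructField("type", StringType()),
    StructField("created_at", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "github-events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Event counts per type over a 10-minute window sliding every minute;
# the watermark lets append mode emit only finalized windows
counts = (
    events.withWatermark("created_at", "10 minutes")
    .groupBy(window(col("created_at"), "10 minutes", "1 minute"), col("type"))
    .count()
)

def write_batch(df, epoch_id):
    # Each micro-batch is appended to PostgreSQL over JDBC
    (df.write.format("jdbc")
       .option("url", "jdbc:postgresql://localhost:5432/github_events")
       .option("dbtable", "event_type_counts")
       .option("user", "postgres").option("password", "postgres")
       .mode("append").save())

(counts.writeStream.outputMode("append")
       .foreachBatch(write_batch).start().awaitTermination())
```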
`database/`: PostgreSQL database schema with tables for raw events, processed metrics, and aggregated data.
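The authoritative DDL lives in `database/`; as a rough illustration, the core tables might look like this (column names are assumptions):

```python
import psycopg2

# Illustrative schema only; see database/ for the real migrations
DDL = """
CREATE TABLE IF NOT EXISTS events (
    id          BIGINT PRIMARY KEY,
    type        TEXT NOT NULL,
    repo_name   TEXT,
    actor_login TEXT,
    created_at  TIMESTAMPTZ NOT NULL
);
CREATE TABLE IF NOT EXISTS event_type_counts (
    window_start TIMESTAMPTZ,
    window_end   TIMESTAMPTZ,
    type         TEXT,
    count        BIGINT
);
CREATE INDEX IF NOT EXISTS idx_events_created_at ON events (created_at);
"""

conn = psycopg2.connect(dbname="github_events", user="postgres",
                        password="postgres", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute(DDL)
conn.close()
```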
`grafana/`: Pre-configured dashboards for visualizing GitHub event metrics and trends.
The project includes comprehensive unit tests for all components:
- Run all tests:

  ```bash
  ./run_tests.sh
  ```

- Run specific component tests:

  ```bash
  cd data-collector
  python -m unittest test_github_events_collector.py
  cd ../spark-jobs
  python -m unittest test_process_github_events.py
  ```

- Run tests with coverage:

  ```bash
  pytest --cov=data-collector --cov=spark-jobs
  ```
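As a rough illustration of the test style, a collector test might stub the HTTP layer and assert that fetched events are published; the function under test here is the hypothetical `fetch_and_publish` from the sketch above, not the repository's actual code:

```python
import unittest
from unittest.mock import MagicMock, patch

# Hypothetical import; the real module lives in data-collector/
from github_events_collector import fetch_and_publish

class TestCollector(unittest.TestCase):
    @patch("github_events_collector.requests.get")
    def test_publishes_each_event(self, mock_get):
        # Stub a successful API response containing one event
        mock_get.return_value.status_code = 200
        mock_get.return_value.json.return_value = [{"type": "PushEvent"}]
        producer = MagicMock()
        fetch_and_publish(producer)
        producer.send.assert_called_once()
        producer.flush.assert_called_once()

if __name__ == "__main__":
    unittest.main()
```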
This project uses GitHub Actions for CI/CD:
- Automated tests run on every push and pull request
- Code quality checks (linting, formatting)
- Docker images are built and pushed to Docker Hub on successful merges to main
To monitor the health of all services:

```bash
./monitor.sh
```
This script checks:
- Service status (running, healthy, unhealthy)
- Kafka topics existence
- PostgreSQL tables
- Grafana availability
- Data flow through the system
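For instance, the Grafana and PostgreSQL checks can be reproduced in a few lines of Python (credentials and table expectations here are assumptions matching the `.env` example above):

```python
import psycopg2
import requests

# Grafana exposes a health endpoint on its HTTP port
grafana_ok = requests.get("http://localhost:3000/api/health", timeout=5).ok

# List the tables present in PostgreSQL's public schema
conn = psycopg2.connect(dbname="github_events", user="postgres",
                        password="postgres", host="localhost")
with conn.cursor() as cur:
    cur.execute("SELECT table_name FROM information_schema.tables "
                "WHERE table_schema = 'public'")
    tables = {row[0] for row in cur.fetchall()}
conn.close()

print("Grafana:", "healthy" if grafana_ok else "unreachable")
print("Tables:", tables or "none found")
```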
- To stop all services:

  ```bash
  docker-compose -f docker/docker-compose.yml down
  ```

- To clean up the environment (stop services, remove volumes and temporary files):

  ```bash
  ./cleanup.sh
  ```
To run individual components during development:
- Data Collector:

  ```bash
  cd data-collector
  python -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  python github_events_collector.py
  ```

- Spark Jobs:

  ```bash
  cd spark-jobs
  python -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  python process_github_events.py
  ```
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.