This project focuses on analyzing GitHub events data to gain insights and track activities within GitHub repositories. It utilizes Java codes to download data from the GitHub Archive site, processes it through a Kafka pipeline, and stores it in Elasticsearch for further analysis. The resulting data is visualized using Kibana dashboards, providing users with a comprehensive overview of GitHub activities.
- DataBase: Elasticsearch
- Streaming System: Redpanda
- Dashboard: Kibana
- Monitoring: Redpanda console and Elasticvue
GitHub hosts a vast amount of data regarding user activities, repository changes, and community interactions. Analyzing this data efficiently can provide valuable insights into project trends, community engagement, and developer behavior. However, managing and processing large volumes of GitHub event data in real-time poses significant challenges. This project addresses these challenges by providing a streamlined pipeline for ingesting, processing, and visualizing GitHub events data.
To ensure scalability and accessibility, this project leverages cloud services for hosting Elasticsearch and Kibana nodes. By deploying Elasticsearch and Kibana in the cloud, users can access the Kibana dashboard from anywhere via the internet, facilitating easy data visualization and analysis.
The data ingestion process begins by fetching GitHub events data from the GitHub Archive site using Java code. This data is then streamed to a Redpanda node using the Java Kafka client. Redpanda provides a high-performance, Kafka-compatible streaming platform, ensuring reliable and efficient data transfer from the source to the destination.
Elasticsearch serves as the primary database for storing and indexing the GitHub events data. Leveraging Elasticsearch's powerful indexing and search capabilities, the processed data becomes easily searchable and retrievable. Additionally, Elasticsearch's integration with Kibana enables seamless data visualization and exploration through custom dashboards and visualizations.
Once the GitHub events data is published into Redpanda, a Java code consumes the data from the designated topic in Redpanda. This code performs necessary transformations to convert the raw event data into Java objects containing relevant information. These transformed records are then indexed in Elasticsearch, ensuring that the data is structured and ready for analysis.
Kibana plays a crucial role in visualizing and analyzing the processed GitHub events data. A custom dashboard is created within Kibana to provide users with a comprehensive view of GitHub activities. Leveraging Kibana's intuitive interface and visualization tools, users can explore trends, track changes, and gain insights into various aspects of GitHub repositories and community interactions.
After creating the following dashboard, I attempted to compare one of the most interesting results with this GitHub report about the state of open source in 2023. Although I only used GitHub events data for a single day, the comparison yielded striking similarities. In the dashboard, the top 5 most used programming languages were observed to be 1. Typescript, 2. Python, 3. Javascript, 4. Java, and 5. Golang. The GitHub report had a similar result, with one difference: Golang was ranked 10th, and C# was in 5th place. However, in my analysis, C# occupied the 6th place. It's worth noting that while the GitHub report covers a full year of data, my analysis was based on a single day's worth of data. Nonetheless, the remarkable similarity between the two datasets is both intriguing and fascinating.
you can check the kibana dashboard using this link:
- dashboard url: https://de-zoomcamp-project2-dashboard.darkube.app
- username:
visitor - password:
visitorpassword
- install docker / open docker desktop. install Maven.
- clone this repo using the below command
git clone https://github.com/shayansm2/DE-zoomcamp-playground.git- go to the project folder
cd DE-zoomcamp-playground/github-events-analyzer- run the following docker command
docker compose up -d- install maven dependencies with this command
mvn clean install- go to the java project directory and run the kafka producer
cd src/main/java/org/example/
javac GitHubEventsKafkaPublisher.java
java GitHubEventsKafkaPublisher- run the kafka consumer and elasticsearch indexer
javac GitHubEventElasticSearchIndexer.java
java GitHubEventElasticSearchIndexer- checkout the project using this table
| address | usage |
|---|---|
| localhost:9092 | redpanda node |
| localhost:8080 | redpanda console |
| localhost:9200 | elasticsearch node |
| localhost:5601 | kibana (local) |
| elasticvue | elasticvue site for downloading its app or its browser extension |
| https://data.gharchive.org/{year}-{month}-{day}-{hour}.json.gz | downloading github events data fro a specific time |
| https://de-zoomcamp-project2-dashboard.darkube.app | kibana (cloud) |
have in mind that you should have a running docker in your system for building this project.





