# Weather Data Pipeline

Welcome to the Weather Data Pipeline repository! This project combines Apache Airflow and Apache Spark to create a robust ETL (Extract, Transform, Load) pipeline. It pulls weather data from the OpenWeather API and stores it in a PostgreSQL database.
## Table of Contents

- Introduction
- Features
- Technologies Used
- Getting Started
- Setup Instructions
- Usage
- Contributing
- License
- Contact
## Introduction

In today’s data-driven world, access to reliable weather data can significantly impact many industries. This project streamlines the process of gathering and processing weather data using modern technologies. The pipeline automates the data flow, making it easier to analyze and visualize weather trends.
## Features

- Automated Data Collection: Fetches real-time weather data from the OpenWeather API.
- Data Transformation: Processes and cleans the data using Apache Spark (a sketch of this step follows the list).
- Data Storage: Stores the processed data in a PostgreSQL database.
- Scheduling: Uses Apache Airflow to schedule and monitor ETL tasks.
- Dockerized Environment: Simplifies deployment with Docker and Docker Compose.
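As a rough illustration of the transformation step above, the following PySpark sketch flattens a raw OpenWeather response into tabular form. The input path, selected fields, and script name are assumptions for illustration, not the repository's actual code:

```python
# transform_weather.py -- illustrative sketch; paths and fields are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("weather-transform").getOrCreate()

# Read the raw JSON responses fetched from the OpenWeather API.
raw = spark.read.json("/data/raw/weather/*.json")

# Flatten the nested payload and convert the temperature from Kelvin to Celsius.
clean = raw.select(
    F.col("name").alias("city"),
    (F.col("main.temp") - 273.15).alias("temp_celsius"),
    F.col("main.humidity").alias("humidity"),
    F.col("wind.speed").alias("wind_speed"),
    F.from_unixtime("dt").alias("observed_at"),
).dropna(subset=["city", "temp_celsius"])

clean.show()
```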
## Technologies Used

- Apache Airflow: For orchestrating the ETL pipeline.
- Apache Spark: For big data processing and transformation.
- PostgreSQL: For storing the weather data.
- Docker: For containerization.
- Python: For scripting and data manipulation.
- OpenWeather API: For accessing weather data.
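To illustrate how the extract side of the pipeline talks to the OpenWeather API, here is a minimal Python sketch. It reads the `OPENWEATHER_API_KEY` variable described in the setup instructions below; the city and script name are placeholders, not the repository's actual code:

```python
# fetch_weather.py -- illustrative only; the repository's extract task may differ.
import os
import requests

API_KEY = os.environ["OPENWEATHER_API_KEY"]  # loaded from the .env file
CITY = "London"  # placeholder city

# Call the OpenWeather current-weather endpoint.
response = requests.get(
    "https://api.openweathermap.org/data/2.5/weather",
    params={"q": CITY, "appid": API_KEY, "units": "metric"},
    timeout=10,
)
response.raise_for_status()
data = response.json()

print(data["name"], data["main"]["temp"], data["weather"][0]["description"])
```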
## Getting Started

To get started with the Weather Data Pipeline, follow the setup instructions below. Make sure you have Docker and Docker Compose installed on your machine.
## Setup Instructions

- **Clone the Repository**

  Open your terminal and run:

  ```bash
  git clone https://github.com/loheshRK/weather-data-pipeline.git
  cd weather-data-pipeline
  ```

- **Create a `.env` File**

  Create a `.env` file in the root directory and add your OpenWeather API key:

  ```
  OPENWEATHER_API_KEY=your_api_key_here
  ```

- **Build and Run the Docker Containers**

  Use Docker Compose to build and run the containers:

  ```bash
  docker-compose up --build
  ```

- **Access Apache Airflow**

  Once the containers are running, you can access the Airflow web interface at `http://localhost:8080`. The default credentials are:

  - Username: `airflow`
  - Password: `airflow`

- **Configure Your DAG**

  Navigate to the `dags` folder in your project and configure your Directed Acyclic Graph (DAG) as needed.
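For orientation, a DAG in the `dags` folder might be shaped like the sketch below. The DAG id, schedule, and task callables are illustrative assumptions rather than the repository's actual code:

```python
# weather_etl_dag.py -- illustrative sketch only; the actual DAG in dags/ may differ.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    """Placeholder: fetch raw weather data from the OpenWeather API."""

def transform():
    """Placeholder: clean and reshape the raw data with Spark."""

def load():
    """Placeholder: write the transformed rows into PostgreSQL."""

with DAG(
    dag_id="weather_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # assumed cadence
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run the ETL stages in order.
    extract_task >> transform_task >> load_task
```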
## Usage

After setting up the environment, you can start the ETL process by triggering the DAG from the Airflow interface. The pipeline will automatically fetch the weather data, process it, and store it in the PostgreSQL database.
To view the data, you can connect to your PostgreSQL database using any SQL client.
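For example, you could inspect the results from Python with a small script like the one below. The connection parameters and table name are assumptions; substitute the values from your own `docker-compose.yml`:

```python
# query_weather.py -- hypothetical example; host, credentials, and table name
# are assumptions and should be replaced with your actual Postgres settings.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="airflow",    # assumed database name
    user="airflow",      # assumed user
    password="airflow",  # assumed password
)

with conn, conn.cursor() as cur:
    # 'weather_data' is a placeholder table name.
    cur.execute("SELECT * FROM weather_data ORDER BY 1 DESC LIMIT 10;")
    for row in cur.fetchall():
        print(row)

conn.close()
```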
## Contributing

We welcome contributions to enhance the Weather Data Pipeline. To contribute, please follow these steps:

- Fork the repository.
- Create a new branch (`git checkout -b feature-branch`).
- Make your changes.
- Commit your changes (`git commit -m 'Add new feature'`).
- Push to the branch (`git push origin feature-branch`).
- Create a pull request.
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Contact

For any questions or feedback, feel free to reach out:
- GitHub: loheshRK
- Email: your-email@example.com
Don't forget to check the Releases section for updates and new features!