This repository contains an assignment designed to test your proficiency with Spark SQL using a flight dataset. You will perform advanced data analysis using Spark DataFrames and SQL queries.
The dataset consists of three CSV files:
- flights.csv: Contains flight details such as flight number, origin, destination, and scheduled and actual departure and arrival times.
- airports.csv: Contains airport information including the airport code, name, and city.
- carriers.csv: Contains airline carrier information including carrier code and carrier name.
You will complete the following four tasks using Spark SQL:
- Flights with the Largest Discrepancy Between Scheduled and Actual Travel Time.
- Most Consistently On-Time Airlines Using Standard Deviation.
- Origin-Destination Pairs with the Highest Percentage of Canceled Flights.
- Carrier Performance Based on Time of Day.
Each task should be implemented as a Spark SQL query, and its results should be written to a separate CSV file.
- Fork the GitHub repository containing the starter code using the GitHub Classroom invitation posted on Canvas.
- Open your forked repository in GitHub Codespaces.
- Ensure that the GitHub Codespace has Java and Python installed (both are installed by default in GitHub Codespaces).
The assignment requires PySpark to run Spark jobs. Install it using pip.
In the GitHub Codespace terminal, run the following command:
pip install pyspark

The dataset (flights.csv, airports.csv, carriers.csv) is already included in the repository. You don't need to load it separately; the provided Python code reads the CSV files directly.
- Open the provided Python code files.
- Implement the Spark SQL queries for each task in the corresponding function.
- Each task has its own function in the provided boilerplate code; you write the SQL query for each task (a sketch of the overall pattern follows this list).
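To give you a mental model of the flow, here is a minimal sketch of the pattern the task functions follow. The file paths, read options, and function name below are assumptions for illustration; defer to the provided boilerplate for the exact signatures.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FlightDataAnalysis").getOrCreate()

# Read the bundled CSV files (paths and options are assumptions;
# check the starter code for the values it actually uses).
flights = spark.read.csv("flights.csv", header=True, inferSchema=True)
airports = spark.read.csv("airports.csv", header=True, inferSchema=True)
carriers = spark.read.csv("carriers.csv", header=True, inferSchema=True)

# Register temporary views so each task can be expressed as a SQL string.
flights.createOrReplaceTempView("flights")
airports.createOrReplaceTempView("airports")
carriers.createOrReplaceTempView("carriers")

def task1(spark):
    # Placeholder query; replace it with your Task 1 SQL.
    result = spark.sql("SELECT * FROM flights LIMIT 10")
    result.write.mode("overwrite").csv(
        "output/task1_largest_discrepancy.csv", header=True)
```

Registering the DataFrames as temporary views is what lets each task be written as a plain SQL query passed to spark.sql.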
Task 1: Flights with the Largest Discrepancy Between Scheduled and Actual Travel Time
- Objective: Calculate the difference between the scheduled and actual travel time for each flight and identify the flights with the largest discrepancies.
- Instructions:
- Write the SQL query to calculate the scheduled and actual travel times.
- Compute the absolute difference between these times and find the largest discrepancies.
- Use a window function to rank flights by discrepancy (see the sketch below).
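As a starting point, one possible shape for this query is sketched below. The column names (FlightNum, Origin, Destination, ScheduledDeparture, ActualDeparture, ScheduledArrival, ActualArrival) are hypothetical, and the subtraction assumes the times are stored as numbers (e.g., minutes); adapt both to the actual schema of flights.csv.

```python
# Hypothetical column names; if the times are timestamps rather than
# numbers, convert them first (e.g., with unix_timestamp()).
task1_query = """
SELECT FlightNum, Origin, Destination,
       (ScheduledArrival - ScheduledDeparture) AS scheduled_travel_time,
       (ActualArrival - ActualDeparture) AS actual_travel_time,
       ABS((ActualArrival - ActualDeparture) -
           (ScheduledArrival - ScheduledDeparture)) AS discrepancy,
       -- Window function: rank all flights by discrepancy, largest first.
       RANK() OVER (
           ORDER BY ABS((ActualArrival - ActualDeparture) -
                        (ScheduledArrival - ScheduledDeparture)) DESC
       ) AS discrepancy_rank
FROM flights
"""
```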
Expected Output:
Task 1 output written to output/task1_largest_discrepancy.csv

Task 2: Most Consistently On-Time Airlines Using Standard Deviation
- Objective: Calculate the standard deviation of departure delays for each carrier and rank carriers by their consistency.
- Instructions:
- Write a SQL query to calculate the standard deviation of departure delays.
- Rank airlines by the lowest standard deviation (most consistent first).
- Include only carriers with more than 100 flights (see the sketch below).
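A hedged sketch of what this could look like, assuming hypothetical column names (DepDelay and Carrier on flights; Code and Description on carriers); verify them against the actual CSV headers:

```python
task2_query = """
SELECT c.Description AS carrier_name,
       COUNT(*) AS num_flights,
       -- STDDEV is Spark SQL's sample standard deviation (stddev_samp).
       STDDEV(f.DepDelay) AS dep_delay_stddev
FROM flights f
JOIN carriers c ON f.Carrier = c.Code
GROUP BY c.Description
HAVING COUNT(*) > 100            -- keep only carriers with enough flights
ORDER BY dep_delay_stddev ASC    -- smallest spread = most consistent
"""
```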
Expected Output:
Task 2 output written to output/task2_consistent_airlines.csv

Task 3: Origin-Destination Pairs with the Highest Percentage of Canceled Flights
- Objective: Calculate the cancellation rate for each origin-destination route and rank routes by the highest cancellation percentage.
- Instructions:
- Write a SQL query to find the percentage of canceled flights for each origin-destination pair.
- Rank the pairs by the highest cancellation rate.
- Join the result with the airports dataset to show the full airport names (see the sketch below).
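One possible structure, assuming flights has a 0/1 Cancelled flag and airports has Code and Name columns (all hypothetical names; check the files):

```python
task3_query = """
SELECT o.Name AS origin_airport,
       d.Name AS destination_airport,
       -- Average of a 0/1 flag is the cancellation fraction; scale to percent.
       ROUND(AVG(CASE WHEN f.Cancelled = 1 THEN 1.0 ELSE 0.0 END) * 100, 2)
           AS cancellation_pct
FROM flights f
JOIN airports o ON f.Origin = o.Code         -- join airports twice,
JOIN airports d ON f.Destination = d.Code    -- once per route endpoint
GROUP BY o.Name, d.Name
ORDER BY cancellation_pct DESC
"""
```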
Expected Output:
Task 3 output written to output/task3_canceled_routes.csv

Task 4: Carrier Performance Based on Time of Day
- Objective: Analyze the performance of airlines at different times of the day.
- Instructions:
- Group flights into morning, afternoon, evening, and night based on their scheduled departure times.
- For each carrier and time of day, calculate the average departure delay.
- Rank airlines by their performance within each time-of-day group (see the sketch below).
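A sketch of the bucketing-plus-ranking pattern. Both the HHMM integer format assumed for ScheduledDeparture and the hour boundaries are illustrative assumptions to adjust:

```python
task4_query = """
WITH bucketed AS (
    SELECT Carrier, DepDelay,
           CASE  -- hour boundaries are an assumption; adjust as needed
               WHEN ScheduledDeparture BETWEEN  600 AND 1159 THEN 'morning'
               WHEN ScheduledDeparture BETWEEN 1200 AND 1759 THEN 'afternoon'
               WHEN ScheduledDeparture BETWEEN 1800 AND 2359 THEN 'evening'
               ELSE 'night'
           END AS time_of_day
    FROM flights
)
SELECT Carrier, time_of_day,
       AVG(DepDelay) AS avg_dep_delay,
       -- Rank carriers within each time-of-day bucket by average delay.
       RANK() OVER (PARTITION BY time_of_day
                    ORDER BY AVG(DepDelay) ASC) AS performance_rank
FROM bucketed
GROUP BY Carrier, time_of_day
"""
```

Because the window's ORDER BY references AVG(DepDelay), the aggregation runs first and the ranking is then computed within each time_of_day partition.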
Expected Output:
Task 4 output written to output/task4_carrier_performance_time_of_day.csv

After implementing the tasks, execute your Python script with spark-submit.
Run the following command in the terminal:
spark-submit <your_python_script.py>

Replace <your_python_script.py> with the name of the Python script you are running.
Once you have completed the tasks and generated the output files:
- Add the changes to your GitHub repository:
git add .
git commit -m "Completed Spark SQL Assignment"
git push origin main

- Ensure that all code and output files (CSV files) are pushed to your repository.
- Correctness: Your SQL queries must correctly accomplish the objective of each task.
- Output: The CSV output for each task must be correct and match the instructions.
- Use of Spark SQL: Ensure that the queries utilize Spark SQL effectively and demonstrate proper usage of SQL functions (aggregations, window functions, joins, etc.).
If you have any questions or run into issues, feel free to reach out during office hours or post your queries in the course discussion forum.
Good luck, and happy coding!