This repository was forked from the repository for the LinkedIn Learning course High-Performance PySpark: Advanced Strategies for Optimal Data Processing. The full course is available on LinkedIn Learning.
I extended the repository with additional PySpark code covering the following topics:
- Data Cleaning
- Defining Schema
- Compression Techniques
- Repartitioning
- Clustering Model