divithraju/divith-raju-PySpark-Projects

Project: PySpark Data Processing and Code Generation Suite

This repository contains a collection of PySpark applications for common data processing tasks, along with a code generation tool that simplifies Spark join operations. The suite demonstrates practical data engineering techniques in PySpark and is aimed at readers interested in big data, distributed computing, and data transformations. The examples cover three kinds of work: categorizing data, analyzing trends, and generating code.

Contents

1. Age Categorization Using UDFs: Categorizes ages into groups ('Youth', 'Adult', 'Senior') using a User Defined Function (UDF) in PySpark.

2. Top 3 Movies Based on Ratings: Analyzes movie ratings from different users and finds the top 3 movies based on their average rating.

3. Unique Website Visitors Count: Calculates the count of unique visitors to a website per day using aggregation techniques in PySpark.

4. Spark Join Code Generator: A Python script that generates PySpark code for performing join operations on two datasets. The tool also offers an optional GitHub release feature to automatically deploy the generated code.
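
As an illustration of the first item, the following is a minimal sketch of age categorization with a UDF. It is not taken from the repository's script: the function names `categorize_age` and `add_age_group`, the column names `age` and `age_group`, and the age cut-offs (under 18, 60 and over) are all assumptions chosen for the example.

```python
def categorize_age(age):
    """Map an age to 'Youth', 'Adult', or 'Senior'.

    The cut-offs used here (under 18, 60 and over) are assumptions
    for illustration; the repository may use different boundaries.
    """
    if age is None:
        # Spark passes null column values to Python UDFs as None.
        return None
    if age < 18:
        return "Youth"
    if age < 60:
        return "Adult"
    return "Senior"


def add_age_group(df, age_col="age"):
    """Return `df` with an extra 'age_group' column computed by a UDF.

    `df` is expected to be a PySpark DataFrame; the column names are
    hypothetical.
    """
    # Imported lazily so categorize_age can be used and tested without Spark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    # Wrap the plain Python function as a UDF that returns a string column.
    categorize_udf = udf(categorize_age, StringType())
    return df.withColumn("age_group", categorize_udf(col(age_col)))
```

Keeping the categorization logic in a plain Python function means it can be unit-tested without a Spark session; `add_age_group` only wraps it for use on a DataFrame.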

Detailed Descriptions

1. Age Categorization Using UDFs: A practical use case for UDFs, where age data is grouped into relevant categories for easy interpretation.

2. Top 3 Movies Based on Ratings: Helps understand aggregation and sorting in PySpark, demonstrating how to work with multiple DataFrames.

3. Unique Website Visitors Count: Shows how to handle date-based aggregations and distinct counts efficiently.

4. Spark Join Code Generator: Automates the creation of complex join operations, saving time for developers by generating reusable code.
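
To make the code generation idea concrete, here is a minimal sketch of how a join code generator could work. It is an assumption about the general approach, not the repository's actual script: the function name `generate_join_code`, its parameters, and the shape of the generated program are hypothetical, and the GitHub release feature is not covered.

```python
def generate_join_code(left_path, right_path, join_key, how="inner", file_format="csv"):
    """Return PySpark source code (as a string) that joins two datasets.

    Hypothetical helper for illustration; every parameter name is an assumption.
    """
    # A subset of the join types accepted by DataFrame.join.
    allowed = {"inner", "left", "right", "outer", "left_semi", "left_anti"}
    if how not in allowed:
        raise ValueError(f"Unsupported join type: {how!r}")

    # repr() (via !r) quotes and escapes the values so the output is valid Python.
    lines = [
        "from pyspark.sql import SparkSession",
        "",
        'spark = SparkSession.builder.appName("GeneratedJoin").getOrCreate()',
        "",
        f"left_df = spark.read.format({file_format!r}).option('header', 'true').load({left_path!r})",
        f"right_df = spark.read.format({file_format!r}).option('header', 'true').load({right_path!r})",
        "",
        f"joined_df = left_df.join(right_df, on={join_key!r}, how={how!r})",
        "joined_df.show()",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a generated script for two hypothetical CSV files.
    print(generate_join_code("customers.csv", "orders.csv", "customer_id", how="left"))
```

The generated text can be written to a `.py` file and run with `spark-submit`. Validating the join type up front keeps obviously invalid input out of the generated script.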

License

This project is licensed under the MIT License.