⚡ PySpark Basics for Beginners

A beginner-friendly PySpark repository containing essential functions, transformations, and actions to help you quickly learn and apply Apache Spark for big data processing and distributed computing.

This repo is designed as a hands-on guide to start learning PySpark from scratch — covering DataFrames, RDDs, SQL, and commonly used operations.


📌 What You’ll Learn Here

  • 🔹 Setting up PySpark environment
  • 🔹 Creating RDDs & DataFrames (see the sketch after this list)
  • 🔹 Basic operations: select, filter, withColumn, groupBy, agg, etc.
  • 🔹 Handling missing values & data cleaning (sketch below)
  • 🔹 Joins & SQL queries with SparkSQL
  • 🔹 Common transformations & actions
  • 🔹 Saving and loading data (CSV, Parquet, JSON)
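
As a taste of two items above, here is a minimal sketch of creating an RDD and cleaning up missing values (the data and app name are illustrative, not from the repo):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LearnSketch").getOrCreate()

# Creating an RDD, then applying a transformation (map) and an action (collect)
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())  # [2, 4, 6, 8]

# Handling missing values: None stands in for missing data
people = spark.createDataFrame([("Alice", 25), ("Bob", None)], ["Name", "Age"])
people.dropna(subset=["Age"]).show()  # drop rows where Age is null
people.fillna({"Age": 0}).show()      # or fill nulls with a default value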

📁 Repository Structure

pyspark-basics/
│
├── notebooks/                # Jupyter notebooks for practice
│   ├── 01_intro.ipynb
│   ├── 02_dataframes.ipynb
│   ├── 03_transformations.ipynb
│   └── 04_spark_sql.ipynb
│
├── examples/                 # Python scripts with basic functions
│   ├── dataframe_examples.py
│   ├── rdd_examples.py
│   └── spark_sql_examples.py
│
├── requirements.txt          # Dependencies
└── README.md                 # You are here 🚀

⚙️ Installation & Setup

1️⃣ Install PySpark (locally):

pip install pyspark

2️⃣ Verify Installation:

python -c "import pyspark; print(pyspark.__version__)"

3️⃣ Start Jupyter Notebook to practice:

jupyter notebook

🚀 Quick Start Example

from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("BeginnerApp").getOrCreate()

# Create DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Show data
df.show()

# Basic operations
df.select("Name").show()
df.filter(df.Age > 28).show()

Output of df.show():

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+
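
Continuing in the same session, a short sketch of a join, an aggregation, and a SparkSQL query (the second DataFrame and the view name are illustrative):

# A second DataFrame to join against
cities = spark.createDataFrame([("Alice", "Paris"), ("Bob", "Berlin")], ["Name", "City"])

# Inner join on the shared Name column
df.join(cities, on="Name", how="inner").show()

# Aggregation: average age per name (trivial here, but it shows the pattern)
df.groupBy("Name").agg({"Age": "avg"}).show()

# Register a temporary view and query it with SparkSQL
df.createOrReplaceTempView("people")
spark.sql("SELECT Name, Age FROM people WHERE Age > 28").show()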

📊 Topics Covered

  • ✅ DataFrames & RDDs
  • ✅ Transformations & Actions
  • ✅ Joins & Aggregations
  • ✅ SparkSQL queries
  • ✅ File I/O (CSV, JSON, Parquet), sketched below
  • ✅ Beginner-friendly examples
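
A brief sketch of the file I/O pattern, reusing df from the Quick Start (the output paths are hypothetical):

# Write the DataFrame in three formats
df.write.mode("overwrite").csv("output/people_csv", header=True)
df.write.mode("overwrite").json("output/people_json")
df.write.mode("overwrite").parquet("output/people_parquet")

# Read the files back into DataFrames
csv_df = spark.read.csv("output/people_csv", header=True, inferSchema=True)
parquet_df = spark.read.parquet("output/people_parquet")
parquet_df.show()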

🧑‍💻 Who is this for?

  • 🔹 Students learning Big Data & Spark
  • 🔹 Data engineers starting with ETL pipelines
  • 🔹 Beginners who want to practice PySpark functions hands-on

📌 Future Enhancements

  • Add MLlib (Spark Machine Learning) basics
  • Add streaming examples (Spark Streaming)
  • Add optimization tips for large datasets

🪪 License

This project is licensed under the MIT License – feel free to use and contribute.


Happy Learning PySpark!
