This repository contains my Yelp dataset analysis project. The goal is to perform:
- Business-level analysis (Milestone 1) – focusing on attributes, ratings, locations, etc.
- User-level analysis (Milestone 2) – focusing on user behavior, sentiment, and user influence.
We use Apache Hadoop for distributed file storage and Apache Spark (PySpark) for data processing. The dataset is the Yelp Academic Dataset filtered to Arizona (AZ) businesses.
- A pre-configured VM (Ubuntu 22.04) is available with Hadoop, Spark, and PySpark already installed.
- Username/password: `dps`
- If you need instructions, see `docs/VM-setup.md` or the course instructions for VirtualBox/UTM usage.
- Hadoop: v3.x
- Start services:

  ```bash
  hdfs namenode -format   # first time only
  start-dfs.sh
  start-yarn.sh
  ```

- Web UIs:
  - HDFS: http://localhost:9870
  - YARN: http://localhost:8088
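Once the daemons are up, a quick sanity check (assuming a standard Hadoop install, as in the course VM) is:

```shell
# List running JVM daemons; after start-dfs.sh / start-yarn.sh you should
# see NameNode, DataNode, ResourceManager, and NodeManager among them.
jps

# Confirm HDFS is responding by listing its root directory.
hdfs dfs -ls /
```

If `jps` is missing a daemon, check the logs under `$HADOOP_HOME/logs` before proceeding.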
- Spark: v3.x
- Interactive shell (Scala): `spark-shell`
- PySpark shell: `pyspark`
- Python: 3.8+ recommended; `pyspark` is installed in the VM.
- Optionally, use Jupyter Notebook for an interactive environment:

  ```bash
  jupyter notebook   # or: pyspark
  ```
- Official link: the Yelp Academic Dataset is not included in this repo due to its size.
- We filter data for Arizona (`state = 'AZ'`).
- `business_id` and `user_id` are common keys across the `business`, `review`, `user`, `checkin`, and `tip` JSON files.
- `yelp_academic_dataset_business.json`
- `yelp_academic_dataset_user.json`
- `yelp_academic_dataset_review.json`
- `yelp_academic_dataset_checkin.json`
- `yelp_academic_dataset_tip.json`
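Once downloaded, the JSON files can be loaded into HDFS so Spark can read them. A minimal sketch — the target path `/yelp` is an arbitrary choice, not something mandated by the project:

```shell
# Create a directory in HDFS and copy the Yelp JSON files into it.
hdfs dfs -mkdir -p /yelp
hdfs dfs -put yelp_academic_dataset_business.json /yelp/
hdfs dfs -put yelp_academic_dataset_user.json     /yelp/
hdfs dfs -put yelp_academic_dataset_review.json   /yelp/
hdfs dfs -put yelp_academic_dataset_checkin.json  /yelp/
hdfs dfs -put yelp_academic_dataset_tip.json      /yelp/

# Verify the upload.
hdfs dfs -ls /yelp
```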
- Clone the repo:

  ```bash
  git clone https://github.com/<YourUser>/yelp-arizona-business-analysis.git
  cd yelp-arizona-business-analysis
  ```
- Launch VM (if using course-provided OVA/UTM).
- Start Hadoop & Spark (in the VM):

  ```bash
  hdfs namenode -format   # first time only
  start-dfs.sh
  start-yarn.sh
  pyspark
  ```
- Open the notebook:

  ```bash
  cd Milestone1-Business
  jupyter notebook Project1Milestone1.ipynb
  ```

- Run the cells to see queries & analysis. Do the same for `Milestone2-User`.
- Objective: Analyze AZ businesses, focusing on attributes, ratings, categories, location patterns.
- Approach:
- Convert JSON to Parquet or a suitable Spark format.
- Filter to `state = 'AZ'`.
- Run SQL-style queries in Spark (e.g., `spark.sql("SELECT ... FROM ... WHERE ...")`).
- Generate graphs & insights.
- Queries:
- At least 5 queries, of which at least 3 are complex queries combining multiple datasets.
- Example: “Top 10 highest-rated businesses in the ‘Restaurants’ category within Phoenix.”
- Objective: Analyze user behavior (reviews, tips, sentiment, influence).
- Approach:
- Focus on users who reviewed the AZ businesses from Milestone 1.
- Optionally run sentiment analysis on the `text` field of `review` or `tip`.
- Check user attributes (average stars, friend count, compliment counts).
- Queries:
  - 10 total, 6 of which combine multiple datasets.
  - Example: “Which users have the most influence (highest fans or compliment counts) in a specific business category?”
- Data filtering & approach
- Spark queries (with code snippets, no direct copy from provided notebooks)
- Graphs & figures
- Key insights from business-level and user-level analysis
- Dataset, test cases, etc. provided by Dr. Samira Ghayekhloo from Arizona State University.
This project is released under the MIT License. That means you’re free to use, modify, and distribute the code, but you do so at your own risk.
Author: Varshith Dupati
GitHub: @dvarshith
Email: dvarshith942@gmail.com
Issues: Please open an issue on this repo if you have questions or find bugs.