varshithdupati/yelp-business-analysis

Yelp Dataset Analysis for Arizona Businesses


Hadoop Spark Python


Overview

This repository contains my Yelp dataset analysis project. The goal is to perform:

  1. Business-level analysis (Milestone 1) – focusing on attributes, ratings, locations, etc.
  2. User-level analysis (Milestone 2) – focusing on user behavior, sentiment, and user influence.

We use Apache Hadoop for distributed file storage and Apache Spark (PySpark) for data processing. The dataset is the Yelp Academic Dataset filtered to Arizona (AZ) businesses.


Requirements & Setup

1. Virtual Machine (Provided by Course)

  • A pre-configured VM (Ubuntu 22.04) is available with Hadoop, Spark, and PySpark already installed.
  • Username/Password: dps
  • For setup instructions, see docs/VM-setup.md or the course instructions for VirtualBox/UTM.

2. Hadoop & Spark

  • Hadoop: v3.x
  • Spark: v3.x
    • Interactive Scala shell:
      spark-shell
    • Interactive PySpark shell:
      pyspark

3. Python / PySpark

  • Python 3.8+ recommended
  • pyspark is already installed in the VM
  • Optionally, use Jupyter Notebook as an interactive environment:
    jupyter notebook
    # or: pyspark
    

Dataset

Yelp Academic Dataset:

  • Official Link – not included in this repo due to size.
  • We filter data for Arizona (state = 'AZ').
  • business_id and user_id are the join keys: business_id appears in the business, review, checkin, and tip files; user_id appears in the user, review, and tip files.
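
To make the join keys concrete, here is a small sketch in plain Python. The mapping reflects my reading of the public Yelp schema (an assumption worth checking against the files themselves), and `JOIN_KEYS` and `shared_keys` are hypothetical names used only for illustration.

```python
from typing import Dict, List, Set

# Key columns carried by each dataset file (assumed from the public Yelp schema).
JOIN_KEYS: Dict[str, Set[str]] = {
    "business": {"business_id"},
    "review": {"business_id", "user_id"},
    "user": {"user_id"},
    "checkin": {"business_id"},
    "tip": {"business_id", "user_id"},
}


def shared_keys(left: str, right: str) -> List[str]:
    """Return the key columns two datasets can be joined on, sorted by name."""
    return sorted(JOIN_KEYS[left] & JOIN_KEYS[right])


print(shared_keys("business", "review"))  # ['business_id']
print(shared_keys("user", "checkin"))     # [] (no shared key; go through review or tip)
```

An empty result means the two files cannot be joined directly, so a query linking users to check-ins would need review or tip as a bridge.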

Data Files

  • yelp_academic_dataset_business.json
  • yelp_academic_dataset_user.json
  • yelp_academic_dataset_review.json
  • yelp_academic_dataset_checkin.json
  • yelp_academic_dataset_tip.json

How to Run

  • Clone Repo:
    git clone https://github.com/<YourUser>/yelp-arizona-business-analysis.git
    cd yelp-arizona-business-analysis
    
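  • Launch the VM (if using the course-provided OVA/UTM image).
  • Start Hadoop & Spark (in the VM):
    hdfs namenode -format   # first time only; this erases existing HDFS metadata
    start-dfs.sh            # start the HDFS daemons
    start-yarn.sh           # start the YARN daemons
    pyspark                 # optional: open an interactive PySpark shell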
    
  • Open the Notebook:
    cd Milestone1-Business
    jupyter notebook Project1Milestone1.ipynb
    
  • Run the cells to see the queries and analysis. Follow the same steps for the notebook in Milestone2-User.

Milestone 1: Business-Level Analysis

  • Objective: Analyze AZ businesses, focusing on attributes, ratings, categories, location patterns.
  • Approach:
    • Convert JSON to Parquet or a suitable Spark format.
    • Filter to state='AZ'.
    • Perform SQL-like queries in Spark (e.g., spark.sql("SELECT ... FROM ... WHERE ...")).
    • Generate graphs & insights.
  • Queries:
    • At least 5 in total, of which at least 3 are complex queries combining multiple datasets.
    • Example: “Top 10 highest-rated businesses in the ‘Restaurants’ category within Phoenix.”

Milestone 2: User-Level Analysis

  • Objective: Analyze user behavior (reviews, tips, sentiment, influence).
  • Approach:
    • Focus on users who reviewed the AZ businesses from Milestone 1.
    • Possibly do sentiment analysis on review.txt or tip.txt.
    • Check user attributes (average stars, friend count, compliment counts).
  • Queries:
    • 10 total, 6 of which combine multiple datasets.
    • Example: “Which users have the most influence (highest fans or compliment counts) in a specific business category?”

Results

  • Data filtering & approach
  • Spark queries (with code snippets, no direct copy from provided notebooks)
  • Graphs & figures
  • Key insights from business-level and user-level analysis

Acknowledgments

  • Dataset, test cases, etc. provided by Dr. Samira Ghayekhloo from Arizona State University.

License

This project is released under the MIT License. That means you’re free to use, modify, and distribute the code, but you do so at your own risk.


Contact

Author: Varshith Dupati
GitHub: @dvarshith
Email: dvarshith942@gmail.com
Issues: Please open an issue on this repo if you have questions or find bugs.
