Large-scale social network analysis system processing Twitter data with PageRank algorithms on AWS EMR.

misran3/twitter-social-network-analysis

Twitter Social Network Analysis

Overview

This project analyzes Twitter social networks to find influential users and generate follower recommendations. It uses Apache Spark to run large-scale graph analytics over follower relationships and user interests: PageRank scores identify key influencers, and topic-based recommendations suggest connections between users who share interests such as games, movies, and music.

Tech Architecture

The system runs on AWS, using EMR clusters with Spark for distributed processing. Input data and analysis results are both stored in S3 buckets. The infrastructure is deployed with AWS CDK, which sets up the EMR cluster, security groups, IAM roles, and S3 buckets automatically.

The Scala applications use Spark's GraphX and SQL libraries to process the social network data. Each analysis type has its own processor that implements the relevant algorithms. Results are written back to S3 in Parquet format for easy querying and further analysis.
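As an illustration of how GraphX and Spark SQL can fit together in this kind of pipeline, here is a minimal sketch. It is not the project's actual processor: the object name `PageRankSketch`, the column names `userId` and `pagerank`, and the tolerance value are all hypothetical. It also assumes that user IDs are numeric (GraphX requires `Long` vertex IDs) and that an edge `source\tdestination` means "source follows destination", so rank flows toward the followed user.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.desc

object PageRankSketch {

  // Reads a tab-separated edge list and returns users ordered by PageRank.
  // Assumption: both columns are numeric user IDs.
  def rankUsers(spark: SparkSession, edgesPath: String): DataFrame = {
    import spark.implicits._

    val edges = spark.sparkContext
      .textFile(edgesPath)
      .map(_.split("\t"))
      .filter(_.length == 2) // skip malformed lines
      .map(fields => Edge(fields(0).toLong, fields(1).toLong, 1))

    // Vertices are derived from the edges; 0 is a placeholder vertex attribute.
    val graph = Graph.fromEdges(edges, 0)

    // Iterate until ranks change by less than the tolerance (value is illustrative).
    graph.pageRank(0.0001).vertices
      .map { case (userId, rank) => (userId, rank) }
      .toDF("userId", "pagerank")
      .orderBy(desc("pagerank"))
  }

  // Expects two arguments: the input edges path and the output path.
  def main(args: Array[String]): Unit = {
    val Array(edgesPath, outputPath) = args
    val spark = SparkSession.builder().appName("pagerank-sketch").getOrCreate()
    rankUsers(spark, edgesPath).write.mode("overwrite").parquet(outputPath)
    spark.stop()
  }
}
```

On EMR, both paths would typically be `s3://` URIs. Keeping the ranking logic in `rankUsers` separate from `main` makes it possible to exercise the same code against a small local file with a `local[1]` Spark session.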

Usage

First, deploy the AWS infrastructure using CDK:

cd cdk
npm install
cdk deploy

This creates your EMR cluster, S3 buckets, and security groups. Once deployed, run the analysis using the provided script:

chmod +x ./run-emr-step.sh
./run-emr-step.sh

The script builds the project, uploads the JAR to S3, and prompts you to choose one of three analysis types:

  1. Basic Network Analysis - Graph statistics and top users by follower count
  2. PageRank Analysis - User influence scoring using the PageRank algorithm
  3. User Recommendations - Follower suggestions based on topic interests
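
To make the first analysis type concrete, the sketch below ranks users by follower count using plain Scala collections on a small in-memory edge list. It only illustrates the idea and is not the project's Spark implementation. The object and method names are hypothetical, and it assumes each pair means "source follows destination".

```scala
object FollowerCountSketch {

  // Each pair is (source, destination); assumed to mean "source follows destination".
  // Returns the n users with the most followers, ties broken by user ID.
  def topByFollowers(edges: Seq[(String, String)], n: Int): Seq[(String, Int)] =
    edges
      .groupBy { case (_, destination) => destination }
      .map { case (user, incoming) => (user, incoming.size) }
      .toSeq
      .sortBy { case (user, count) => (-count, user) }
      .take(n)
}
```

For example, given the edges `("a", "c")`, `("b", "c")`, and `("a", "b")`, calling `topByFollowers(edges, 2)` ranks `c` first with two followers and `b` second with one.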

Input data must be provided as tab-separated files. Each line of the edges file holds one follower relationship (source\tdestination), and each line of the topics file holds one user's interests (userId\tgames\tmovies\tmusic).
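
The two line formats described above could be parsed along these lines. This is a hedged sketch: the case class and method names are hypothetical, and every field is kept as a raw string because the value types of the ID and topic columns are not specified here.

```scala
object InputParsingSketch {

  // One follower relationship: source\tdestination
  final case class FollowEdge(source: String, destination: String)

  // One user's interests: userId\tgames\tmovies\tmusic (values kept as raw strings)
  final case class UserTopics(userId: String, games: String, movies: String, music: String)

  // Returns None for lines that do not have exactly two non-empty fields.
  def parseEdge(line: String): Option[FollowEdge] =
    line.split("\t", -1) match {
      case Array(src, dst) if src.nonEmpty && dst.nonEmpty => Some(FollowEdge(src, dst))
      case _ => None
    }

  // Returns None for lines that do not have exactly four fields or lack a user ID.
  def parseTopics(line: String): Option[UserTopics] =
    line.split("\t", -1) match {
      case Array(id, games, movies, music) if id.nonEmpty =>
        Some(UserTopics(id, games, movies, music))
      case _ => None
    }
}
```

Returning `Option` lets a caller drop malformed lines with `flatMap` rather than failing the whole job on a single bad record.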

Results are saved to the specified S3 output path in Parquet format.

Running tests

To run unit tests for the Scala applications, use the following command:

sbt test
