CSYE7200-Final-Project

Abstract

The goal of the project is to stream tweets every day, extract each user's Wordle result for that day's puzzle, and find, for each user, the top 10 other users whose Wordle patterns most closely match theirs. This shows who guessed the word in the same pattern as a given user, and is motivated by the general curiosity of users who share their Wordle results.

Methodology

Tweets were streamed through twint every day and stored in a MySQL database. The streamed tweets were then preprocessed to calculate all-pairs user distances and find the top 10 users who played most similarly to a given user.
The data extracted via twint was extremely raw, Unicode-formatted text. A series of transformations was applied to compute Wordle sequence vectors, which were then used to compute distances.
Tweets acquired through twint were in JSON format. A sample JSON parser was built to parse the input correctly.
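
The parsing and transformation steps above can be illustrated with a minimal Python sketch. It is not the project's parser or its vector format: the JSON field names `username` and `tweet`, the numeric codes for each square, the 30-element layout, and the `UNUSED` padding value are all assumptions made for illustration.

```python
import json
import re

# Assumed encoding: 0 = miss (black or white square), 1 = present (yellow), 2 = correct (green).
SQUARE_CODES = {"\u2b1b": 0, "\u2b1c": 0, "\U0001f7e8": 1, "\U0001f7e9": 2}

# A guess row is exactly five result squares in a row.
ROW_PATTERN = re.compile("[\u2b1b\u2b1c\U0001f7e8\U0001f7e9]{5}")

UNUSED = 3  # hypothetical padding value for guesses the player did not need
GRID_SIZE = 30  # 6 guesses x 5 letters


def parse_tweet(line):
    """Parse one JSON-encoded tweet; return (username, text) or None if unusable."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    username = record.get("username")  # assumed field name
    text = record.get("tweet")         # assumed field name
    if not username or not text:
        return None
    return username, text


def wordle_vector(text):
    """Flatten the emoji grid in a tweet into a 30-element vector, or None if absent."""
    rows = ROW_PATTERN.findall(text)
    if not 1 <= len(rows) <= 6:
        return None
    vector = [SQUARE_CODES[ch] for row in rows for ch in row]
    vector.extend([UNUSED] * (GRID_SIZE - len(vector)))
    return vector
```

Calling `parse_tweet` on each streamed line and then `wordle_vector` on the returned text yields one fixed-length vector per user, which makes any element-wise distance straightforward to compute.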

Instructions to run project

Prerequisites: ZooKeeper, Kafka, MySQL
Instructions to install Kafka: https://kafka.apache.org/quickstart
Start the Kafka server.
Run producer.py from kafka-streamGen to start scraping tweets using twint.
Run the main function in streaming.stream.StreamTweets.scala. This streams tweets from the Kafka topic and stores the processed tweets in the MySQL database.
Run the main function in streaming.stream.ProcessTweets.scala. This calculates all-pairs user distances and stores the results to be used for visualization.
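
To make the last step more concrete, here is a minimal Python sketch of an all-pairs top-10 computation. It is not the Scala code in ProcessTweets: the README does not name the distance metric, so Hamming distance is used here as an assumption, and the function names `hamming` and `top_matches` are hypothetical.

```python
import heapq


def hamming(a, b):
    """Count the positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def top_matches(vectors, k=10):
    """For every user, return the k closest other users as (distance, username) pairs.

    `vectors` maps a username to that user's Wordle sequence vector.
    Ties in distance are broken alphabetically by username.
    """
    result = {}
    for user, vec in vectors.items():
        candidates = (
            (hamming(vec, other_vec), other)
            for other, other_vec in vectors.items()
            if other != user
        )
        result[user] = heapq.nsmallest(k, candidates)
    return result
```

This compares every user against every other user, so the work grows quadratically with the number of users for a given day.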

Running Databricks visualization:

Databricks Community Edition does not seem to make sharing notebooks between users simple. The following instructions describe how to access Databricks, create the compute cluster, run the notebook, and visualize the dashboard.

Navigate to https://community.cloud.databricks.com/login.html and log in with the username found in the DatabricksLink.txt file. The password will be provided to the Professor and TA privately.

Once you have logged in to Databricks, navigate to the compute tab and create a new compute cluster. In the Community Edition, a new cluster needs to be created after 2 hours of inactivity. It may take up to 5 minutes for the cluster to be created.

Next, navigate to the notebook by clicking "Workspace" -> "Users" -> "CSYE-7200 Project"

You should now be able to view and run the code within the Databricks notebook.

To view the dashboard, click "View" and under "Dashboards" select "FinalProject".

Once the dashboard is open, you should see the top 10 most similar Wordle results for the user whose username was entered in the textbox at the top of the dashboard. Manually narrow the columns until they are only wide enough to display 5 blocks. Click the arrow next to the Wordle pattern to visualize it in the familiar 5 x 6 matrix. This is a workaround for the fact that you cannot set fixed column widths in Databricks. Entering a different username in the textbox causes the dashboard to refresh, and the column widths will need to be set manually again.
