Social media is an integral part of our lives: it is omnipresent and influences our day-to-day decisions. By collecting and analyzing this data, we can gain insights into a variety of situations. Web repository platforms make it easy to host and manage open-source projects, and posting issues and getting them resolved is part and parcel of open-source development. For development-related issues, support, interesting findings, and geeky discussions, users turn to social media platforms. We collect data from two sources, Reddit and GitHub, and use a third-party API, moderatehatespeech.com, to predict the toxicity of each record. Together, this data is used to explore the engagement and overall friendliness of online communities, which can help users decide on a technology stack. We present our findings through an interactive dashboard built with Streamlit, an open-source Python library that makes it easy to create and share custom web apps for machine learning and data science.
- Akash Rasal
- Apoorv Yadav
python
- The project is developed and tested using Python ([Python website](https://www.python.org/)).

streamlit
- The dashboard is built with Streamlit ([Streamlit website](https://streamlit.io/)).

Additionally, while developing projects 1 and 2, we used the following tools:

postgres
- The primary database used for the project ([Postgres website](https://www.postgresql.org/)).

faktory
- The primary job queue used for project 2 ([Faktory website](https://contribsys.com/faktory/)).

moderatehatespeech
- The primary API used for project 2 ([ModerateHatespeech website](https://moderatehatespeech.com/)).

matplotlib
- The primary plotting library used for the project ([Matplotlib website](https://matplotlib.org/)).
tl;dr: project 2 needs to be running for the live data page to work; for all the other charts, restoring the database is more than enough.
[ Entire System ] To get everything running, project 2 must be run first. Project 2 is a data pipeline that collects data from the various sources, scores each record for toxicity, and stores it in the database; it is an upgrade over project 1, which only stored the data. The steps to get project 2 running are listed in the project 2 README. The prerequisite for running it is that the system has Postgres installed and running. Since the steps can change over time, they can be found here.
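For reference, the toxicity score attached to each record comes from the moderatehatespeech.com API. The sketch below shows roughly what such a call looks like; the endpoint path, payload shape, and response fields are assumptions based on the public API documentation, and the actual collection code lives in the project 2 repository.

```python
import requests

# Assumed endpoint and payload shape; see moderatehatespeech.com for the
# authoritative API documentation and to obtain a token.
MHS_ENDPOINT = "https://api.moderatehatespeech.com/api/v1/moderate/"
MHS_TOKEN = "<your-api-token>"

def classify_toxicity(text: str) -> dict:
    """Ask the ModerateHatespeech API whether a piece of text is toxic."""
    response = requests.post(
        MHS_ENDPOINT,
        json={"token": MHS_TOKEN, "text": text},
        timeout=10,
    )
    response.raise_for_status()
    # Expected to contain a "class" ("flag" or "normal") and a "confidence"
    # score; the field names are an assumption based on the API docs.
    return response.json()
```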
[ restore database ]
On the other hand, if only the dashboard needs to be run, we have provided the data in the `data` folder. This data was collected by project 2. The only caveat is that the live data page will not update, since the data is static. The steps to run the dashboard are listed below:
- Create a virtual environment:
  `python3 -m venv dashboardenv`
- Activate the virtual environment (the command below is for Linux & macOS):
  `source dashboardenv/bin/activate`
- Clone the repository and change the working directory:
  `git clone git@github.com:2023-Fall-CS-415-515/project-3-implementation-ayadav7-arasal2.git`
  `cd project-3-implementation-ayadav7-arasal2`
- Install the requirements:
  `pip install -r requirements.txt`
- Restore the database using the `pg_restore` command. The restore documentation can be found here. Since the parameters can change, the command below is provided only as an example. `pg_restore` requires an existing database in Postgres; one can be created from the Postgres CLI with `create database <db_name>;`:
  `pg_restore -h localhost -p 5432 -U postgres -d <db_name> -v <backup>`
- Create a `.env` file in the root directory and add the following line to it, keeping `<db_name>` the same as the database you restored into:
  `DATABASE_URL=postgres://postgres:<your_password>@localhost:5432/<db_name>`
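Once the `.env` file is in place, a quick way to sanity-check that the connection string and the restored database line up is a minimal script like the one below. It assumes `psycopg2` and `python-dotenv` are available (both are common choices, though the repository's `connect_to_db.py` may be implemented differently):

```python
import os

import psycopg2
from dotenv import load_dotenv  # provided by the python-dotenv package

def get_connection():
    """Read DATABASE_URL from .env and open a Postgres connection."""
    load_dotenv()  # exports the variables in .env into os.environ
    # psycopg2 accepts a libpq connection URI of the form
    # postgres://user:password@host:port/dbname directly.
    return psycopg2.connect(os.environ["DATABASE_URL"])

if __name__ == "__main__":
    with get_connection() as conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM information_schema.tables;")
        print("tables visible:", cur.fetchone()[0])  # non-zero after a restore
```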
- Run the dashboard:
  `streamlit run app.py`
  or, to avoid sending usage statistics to Streamlit:
  `streamlit run app.py --browser.gatherUsageStats False`
- Open the browser and go to `http://localhost:8501/`, or to the URL shown in the terminal as `Network URL: http://<ip>:8501`.
app.py
- The main file that runs the dashboard and displays the home page.

data
- The folder that contains the data for the dashboard. To restore it, we need Postgres and `pg_restore`; Postgres 16 is required for it to work.

live_data.py
- The code for the live data page, which updates automatically every 5 minutes (a sketch of one refresh mechanism follows this list).

all_data.py
- The code for the data overview page, where the user can check overall data collection over time for both Reddit and GitHub.

toxicity.py
- The code for the toxicity page, where the user can select a date range and a particular subreddit for a detailed view (a sketch of such a query also follows this list).

about.py
- The code for the about page.

utils.py
- The utility functions for the dashboard.

connect_to_db.py
- The code to connect to the database.

requirements.txt
- The dependencies for the dashboard.
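As a reference for the 5-minute refresh on the live data page, the sketch below shows one common way to make a Streamlit page re-run on a timer, using the third-party `streamlit-autorefresh` package. This is an assumption for illustration; the actual `live_data.py` may use a different mechanism.

```python
import streamlit as st
from streamlit_autorefresh import st_autorefresh  # third-party package; an assumption

# Re-run this script every 5 minutes (the interval is in milliseconds).
st_autorefresh(interval=5 * 60 * 1000, key="live_data_refresh")

st.title("Live Data")
# Because the whole script re-runs on each refresh, any database query placed
# here is re-executed, so the charts below it always reflect current data.
```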
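Similarly, the toxicity page's date-range and subreddit filters boil down to a parameterized SQL query. The sketch below illustrates the idea; the table and column names (`reddit_comments`, `subreddit`, `created_at`, `toxicity_class`), the placeholder subreddits, and the `get_connection` helper are hypothetical, and the real schema ships in the `data` dump.

```python
import datetime as dt

import pandas as pd
import streamlit as st

from connect_to_db import get_connection  # hypothetical helper name

start = st.date_input("Start date", dt.date(2023, 11, 1))
end = st.date_input("End date", dt.date(2023, 11, 30))
subreddit = st.selectbox("Subreddit", ["technology", "programming"])  # placeholders

# Table and column names are assumptions; adjust them to the restored schema.
query = """
    SELECT created_at, toxicity_class
    FROM reddit_comments
    WHERE subreddit = %s AND created_at BETWEEN %s AND %s
"""
with get_connection() as conn:
    df = pd.read_sql(query, conn, params=(subreddit, start, end))

st.bar_chart(df["toxicity_class"].value_counts())
```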
The dashboard has the following pages:
- Home page
- Data Overview page
- Live Data page
- Toxicity page
- Activity page
- About page
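For orientation, the sketch below shows one way an `app.py` entry point can wire the page files above together. Recent Streamlit versions provide `st.navigation` and `st.Page` for this; the repository may instead rely on the older `pages/` directory convention, so treat this as an illustrative assumption.

```python
import streamlit as st

# Register each page script with a human-readable title, then run the
# page the user selected from the generated sidebar navigation.
pages = st.navigation([
    st.Page("live_data.py", title="Live Data"),
    st.Page("all_data.py", title="Data Overview"),
    st.Page("toxicity.py", title="Toxicity"),
    st.Page("about.py", title="About"),
])
pages.run()
```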