apoorvyadav1111/analyze-open-source-communities

Analytical Approach in Making Technical Stack Selection

Abstract

Social media is an integral part of our lives: it is omnipresent and influences our day-to-day decisions. By collecting and analyzing this data, we can gain insights into a wide range of situations. Web-based repository platforms make it easy to host and manage open-source projects, and posting issues and getting them resolved is part and parcel of open-source development. For development questions, support, interesting findings, and technical discussion, however, users turn to social media. We collect data from two sources, Reddit and GitHub, and use a third-party API, moderatehatespeech.com, to predict the toxicity of each record. Together, these data are used to explore the engagement and overall friendliness of online communities, which can help users when choosing a technology stack. We further this with an interactive dashboard that presents our findings, built with Streamlit, an open-source Python library that makes it easy to create and share custom web apps for machine learning and data science.

Team

  • Akash Rasal
  • Apoorv Yadav

Tools used

  • python - The project is developed and tested in Python. Python Website
  • streamlit - The dashboard is built with Streamlit. Streamlit Website

Additionally, while developing projects 1 and 2, we used the following tools:

  • postgres - The primary database used for the project. Postgres Website
  • faktory - The primary job queue used in project 2. Faktory Website
  • moderatehatespeech - The primary toxicity-classification API used in project 2. Moderate Hate Speech Website
  • matplotlib - The primary plotting library used for the project. Matplotlib Website

Running this dashboard

tl;dr: project 2 must be running for the live data page to work; for all other charts, restoring the database is enough.

[ Entire System ] To get the system completely running, project 2 must be run first. Project 2 is a data pipeline that collects data from various sources, scores it for toxicity, and stores it in the database; it is an upgrade to project 1, where we only stored the data. The steps to get project 2 running are listed in the project 2 README. The prerequisite is that the system has Postgres installed and running. Since the steps can change over time, they can be found here.
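The toxicity-scoring step of that pipeline can be sketched as a small function that posts each record's text to the moderatehatespeech.com API. This is an illustrative sketch only: the endpoint URL, payload, and response fields shown here are assumptions based on this README, not verified against the real API or the project 2 code, and the HTTP POST callable is injected so the sketch can run without network access.

```python
# Hypothetical sketch of project 2's toxicity step. The endpoint,
# payload shape, and response fields below are assumptions.
API_URL = "https://api.moderatehatespeech.com/api/v1/moderate/"

def score_toxicity(text, token, post):
    """Classify `text` via the (assumed) API.

    `post` is an injected HTTP POST callable (e.g. requests.post),
    which keeps the sketch testable without a live connection.
    """
    payload = {"token": token, "text": text}
    resp = post(API_URL, json=payload)
    body = resp.json()
    # Assumed response shape: {"class": "flag"|"normal", "confidence": float}
    return body.get("class"), float(body.get("confidence", 0.0))
```

In the real pipeline, the returned class and confidence would be written to Postgres alongside the record so the dashboard's toxicity page can query them later.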

[ restore database ] If only the dashboard needs to be run, we have provided the data in the data folder, collected from project 2. The only caveat is that the live data page will not update, since the data is static. The steps to run the dashboard are listed below:

  1. Create a virtual environment
python3 -m venv dashboardenv
  2. Activate the virtual environment (the command below is for Linux and macOS)
source dashboardenv/bin/activate
  3. Clone the repository and change the working directory
git clone git@github.com:2023-Fall-CS-415-515/project-3-implementation-ayadav7-arasal2.git
cd project-3-implementation-ayadav7-arasal2 
  4. Install the requirements
pip install -r requirements.txt
  5. Restore the database using the pg_restore command. The documentation for restoring can be found here. Since the parameters can change, the command below is only an example. The command requires an existing database in Postgres; one can be created in the postgres CLI with create database <db_name>;
pg_restore -h localhost -p 5432 -U postgres -d <db_name> -v <backup>
  6. Create a .env file in the root directory and add the following line to it, keeping db_name the same as the database you restored into:
DATABASE_URL=postgres://postgres:<ur_password>@localhost:5432/<db_name>
  7. Run the dashboard
streamlit run app.py

or, to avoid sending usage stats to Streamlit:

streamlit run app.py --browser.gatherUsageStats False
  8. Open a browser and go to http://localhost:8501/, or to the URL shown in the terminal as Network URL: http://<ip>:8501
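The DATABASE_URL line from the .env step can be broken into the individual connection parameters that a Postgres driver expects. A minimal sketch using only the standard library (`parse_database_url` is a hypothetical helper for illustration, not a function from this repository):

```python
from urllib.parse import urlparse

def parse_database_url(url):
    """Split a postgres:// URL into psycopg2-style connection parameters."""
    parts = urlparse(url)
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,       # default Postgres port
        "user": parts.username,
        "password": parts.password,
        "dbname": parts.path.lstrip("/"),  # path component holds the db name
    }

params = parse_database_url("postgres://postgres:secret@localhost:5432/mydb")
# params["dbname"] == "mydb" and params["port"] == 5432
```

Many drivers (including psycopg2) also accept the URL directly, so this split is only needed when individual fields are required.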

Project Structure

  • app.py - The main file that runs the dashboard and displays home page.
  • data - The folder that contains the data for the dashboard. Restoring it requires Postgres and pg_restore, and it needs PostgreSQL 16 to work.
  • live_data.py - The file that contains the code for the live data page. The page updates every 5 minutes automatically.
  • all_data.py - The file that contains the code for the data overview page. The user can check overall data collection over time for both Reddit and GitHub.
  • toxicity.py - The file that contains the code for the toxicity page. The user can select a date range and a particular subreddit for a detailed view.
  • about.py - The file that contains the code for the about page.
  • utils.py - The file that contains the utility functions for the dashboard.
  • connect_to_db.py - The file that contains the code to connect to the database.
  • requirements.txt - The file that contains the dependencies for the dashboard.
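The date-range and subreddit filtering that the toxicity page performs can be sketched as a small helper in the spirit of utils.py. This is a hypothetical illustration: the record shape (a dict with `created` and `subreddit` keys) is an assumption, not the repository's actual schema, and in practice the filtering would more likely happen in the SQL query.

```python
from datetime import date

def filter_records(records, start, end, subreddit=None):
    """Keep records whose `created` date falls in [start, end],
    optionally restricted to a single subreddit.

    Hypothetical helper; the record shape is assumed for illustration.
    """
    out = []
    for rec in records:
        if not (start <= rec["created"] <= end):
            continue
        if subreddit is not None and rec.get("subreddit") != subreddit:
            continue
        out.append(rec)
    return out

rows = [
    {"created": date(2023, 11, 1), "subreddit": "python", "toxic": False},
    {"created": date(2023, 11, 5), "subreddit": "rust", "toxic": True},
    {"created": date(2023, 12, 1), "subreddit": "python", "toxic": True},
]
november = filter_records(rows, date(2023, 11, 1), date(2023, 11, 30))
# november contains the first two rows only
```

A page would then feed the filtered rows to a chart, e.g. counting toxic vs. non-toxic records per day.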

System Architecture

(System architecture diagram)

Running Dashboard Screenshots

  1. Home page
  2. Data Overview page
  3. Live Data page
  4. Toxicity page
  5. Activity page
  6. About page
