Social media is an integral part of our lives: it is omnipresent and influences our day-to-day decisions. By collecting and analyzing this data, we can gain insights into a variety of situations. Web repository platforms make it easy to host and manage open-source projects, and posting issues and getting them resolved is part and parcel of open-source development. For development-related issues, support, interesting findings, and geeky discussions, users turn to social media platforms. We collect data from two sources, Reddit and GitHub, and use a third-party API, moderatehatespeech.com, to predict the toxicity of each record. Together, this data is used to explore the engagement and overall friendliness of online communities, which can help users decide on a technology stack. We present our findings through an interactive dashboard built with Streamlit, an open-source Python library that makes it easy to create and share custom web apps for machine learning and data science.
- Akash Rasal
- Apoorv Yadav
python
- The project is developed and tested using Python ([Python website](https://www.python.org/)).

streamlit
- The dashboard is built with Streamlit ([Streamlit website](https://streamlit.io/)).

Additionally, while developing projects 1 and 2, we used the following tools:

postgres
- The primary database used for the project ([Postgres website](https://www.postgresql.org/)).

faktory
- The primary job queue used for project 2 ([Faktory website](https://contribsys.com/faktory/)).

moderatehatespeech
- The primary API used for project 2 ([ModerateHatespeech website](https://moderatehatespeech.com/)).

matplotlib
- The primary plotting library used for the project ([Matplotlib website](https://matplotlib.org/)).
tl;dr: project 2 needs to be running for the live data page to work; for all the other charts, restoring the database is more than enough.
[ Entire System ] To get everything running, project 2 must be run first. Project 2 is a data pipeline that collects data from the various sources, scores each record for toxicity, and stores it in the database; it is an upgrade over project 1, which only stored the data. The steps to get project 2 running are listed in the project 2 README. The prerequisite for running it is that the system has Postgres installed and running. Since the steps can change over time, they can be found here.
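For reference, the toxicity score attached to each record comes from the moderatehatespeech.com API. The sketch below shows roughly what such a call looks like; the endpoint path, payload shape, and response fields are assumptions based on the public API documentation, and the actual collection code lives in the project 2 repository.

```python
import requests

# Assumed endpoint and payload shape; see moderatehatespeech.com for the
# authoritative API documentation and to obtain a token.
MHS_ENDPOINT = "https://api.moderatehatespeech.com/api/v1/moderate/"
MHS_TOKEN = "<your-api-token>"

def classify_toxicity(text: str) -> dict:
    """Ask the ModerateHatespeech API whether a piece of text is toxic."""
    response = requests.post(
        MHS_ENDPOINT,
        json={"token": MHS_TOKEN, "text": text},
        timeout=10,
    )
    response.raise_for_status()
    # Expected to contain a "class" ("flag" or "normal") and a "confidence"
    # score; the field names are an assumption based on the API docs.
    return response.json()
```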
[ restore database ]
On the other hand, if only the dashboard needs to be run, we have provided the data in the `data` folder. This data was collected by project 2. The only caveat is that the live data page will not update, since the data is static. The steps to run the dashboard are listed below:
- Create a virtual environment:
  `python3 -m venv dashboardenv`
- Activate the virtual environment (the command below is for Linux & macOS):
  `source dashboardenv/bin/activate`
- Clone the repository and change the working directory:
  `git clone git@github.com:2023-Fall-CS-415-515/project-3-implementation-ayadav7-arasal2.git`
  `cd project-3-implementation-ayadav7-arasal2`
- Install the requirements:
  `pip install -r requirements.txt`
- Restore the database using the `pg_restore` command. The restore documentation can be found here. Since the parameters can change, the command below is provided only as an example. `pg_restore` requires an existing database in Postgres; one can be created from the Postgres CLI with `create database <db_name>;`:
  `pg_restore -h localhost -p 5432 -U postgres -d <db_name> -v <backup>`
- Create a `.env` file in the root directory and add the following line to it, keeping `<db_name>` the same as the database you restored into:
  `DATABASE_URL=postgres://postgres:<your_password>@localhost:5432/<db_name>`
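Once the `.env` file is in place, a quick way to sanity-check that the connection string and the restored database line up is a minimal script like the one below. It assumes `psycopg2` and `python-dotenv` are available (both are common choices, though the repository's `connect_to_db.py` may be implemented differently):

```python
import os

import psycopg2
from dotenv import load_dotenv  # provided by the python-dotenv package

def get_connection():
    """Read DATABASE_URL from .env and open a Postgres connection."""
    load_dotenv()  # exports the variables in .env into os.environ
    # psycopg2 accepts a libpq connection URI of the form
    # postgres://user:password@host:port/dbname directly.
    return psycopg2.connect(os.environ["DATABASE_URL"])

if __name__ == "__main__":
    with get_connection() as conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM information_schema.tables;")
        print("tables visible:", cur.fetchone()[0])  # non-zero after a restore
```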
- Run the dashboard:
  `streamlit run app.py`
  or, to avoid sending usage statistics to Streamlit:
  `streamlit run app.py --browser.gatherUsageStats False`
- Open the browser and go to `http://localhost:8501/`, or to the URL shown in the terminal as `Network URL: http://<ip>:8501`.
app.py
- The main file that runs the dashboard and displays the home page.

data
- The folder that contains the data for the dashboard. To restore it, we need Postgres and `pg_restore`; Postgres 16 is required for it to work.

live_data.py
- The code for the live data page, which updates automatically every 5 minutes (a sketch of one refresh mechanism follows this list).

all_data.py
- The code for the data overview page, where the user can check overall data collection over time for both Reddit and GitHub.

toxicity.py
- The code for the toxicity page, where the user can select a date range and a particular subreddit for a detailed view (a sketch of such a query also follows this list).

about.py
- The code for the about page.

utils.py
- The utility functions for the dashboard.

connect_to_db.py
- The code to connect to the database.

requirements.txt
- The dependencies for the dashboard.
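As a reference for the 5-minute refresh on the live data page, the sketch below shows one common way to make a Streamlit page re-run on a timer, using the third-party `streamlit-autorefresh` package. This is an assumption for illustration; the actual `live_data.py` may use a different mechanism.

```python
import streamlit as st
from streamlit_autorefresh import st_autorefresh  # third-party package; an assumption

# Re-run this script every 5 minutes (the interval is in milliseconds).
st_autorefresh(interval=5 * 60 * 1000, key="live_data_refresh")

st.title("Live Data")
# Because the whole script re-runs on each refresh, any database query placed
# here is re-executed, so the charts below it always reflect current data.
```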
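Similarly, the toxicity page's date-range and subreddit filters boil down to a parameterized SQL query. The sketch below illustrates the idea; the table and column names (`reddit_comments`, `subreddit`, `created_at`, `toxicity_class`), the placeholder subreddits, and the `get_connection` helper are hypothetical, and the real schema ships in the `data` dump.

```python
import datetime as dt

import pandas as pd
import streamlit as st

from connect_to_db import get_connection  # hypothetical helper name

start = st.date_input("Start date", dt.date(2023, 11, 1))
end = st.date_input("End date", dt.date(2023, 11, 30))
subreddit = st.selectbox("Subreddit", ["technology", "programming"])  # placeholders

# Table and column names are assumptions; adjust them to the restored schema.
query = """
    SELECT created_at, toxicity_class
    FROM reddit_comments
    WHERE subreddit = %s AND created_at BETWEEN %s AND %s
"""
with get_connection() as conn:
    df = pd.read_sql(query, conn, params=(subreddit, start, end))

st.bar_chart(df["toxicity_class"].value_counts())
```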
The dashboard has the following pages:
- Home page
- Data Overview page
- Live Data page
- Toxicity page
- Activity page
- About page
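For orientation, the sketch below shows one way an `app.py` entry point can wire the page files above together. Recent Streamlit versions provide `st.navigation` and `st.Page` for this; the repository may instead rely on the older `pages/` directory convention, so treat this as an illustrative assumption.

```python
import streamlit as st

# Register each page script with a human-readable title, then run the
# page the user selected from the generated sidebar navigation.
pages = st.navigation([
    st.Page("live_data.py", title="Live Data"),
    st.Page("all_data.py", title="Data Overview"),
    st.Page("toxicity.py", title="Toxicity"),
    st.Page("about.py", title="About"),
])
pages.run()
```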