We're thrilled you're interested in joining SimPPL! This assignment is designed to give you a practical, hands-on experience in social media analysis, mirroring the kind of work you'd be doing with us. It's structured like a mini-research project, challenging you to explore how information spreads across social networks, specifically focusing on content from potentially unreliable sources. Instead of building a data collection tool from scratch for this initial exercise, you'll be provided with existing social media data. Your task is to design and build an interactive dashboard to analyze and visualize this data, uncovering patterns and insights about how specific links, hashtags, keywords, or topics are being shared and discussed. This will allow you to focus on your data science, machine learning, and analysis skills, which are crucial to the research we conduct at SimPPL. The plots you create and the technologies you choose will be valuable learning experiences, and directly relevant to the work we do.
We have built tools for collecting and analyzing data from Reddit and Twitter, including Parrot, which studies the sharing of news from certain unreliable Russian media providers. To help you understand how to extend such platforms, and to broaden your understanding of the wider social media ecosystem, we would like you to construct an analysis similar to Parrot's by studying the other publicly accessible platforms listed above. We would like you to present an analysis of a broad range of viewpoints from different (apolitical or politically biased) groups. You may also pick a case study to present, for example a relevant controversy, campaign, or civic event.
In the long run, this research intends to accomplish the following objectives:
- Track different popular trends to understand how public content is propagated on different social media platforms.
- Identify posts containing misleading information with the use of claims verification mechanisms.
- Analyze the trends across a large number of influential accounts over time in order to report on the influence of a narrative.
- Visualize Insights: Tell a story with a graph, building intuitive and engaging data visualizations.
- Apply AI/ML: Use LLMs and machine learning to generate metrics and enhance your analysis.
- Build and Deploy a Dashboard: Develop and (ideally) host an interactive dashboard to showcase your analysis.
The hosted web demos below tell a story with data and are worth studying. Some are blog posts, but they include graphs that we would want you to develop in an interactive dashboard. We do not expect you to replicate or copy any of these; they are meant to help you better understand the "tell us a story with data" goal of this assignment:
- Fabio Giglietto's TikTok Coordinated Behavior Report
- Integrity Institute's Misinformation Amplification Dashboard
- News Literacy Project Misinformation Dashboard
- Tableau examples (note: we don't use Tableau, and expect you to use Python or JavaScript for this assignment, but these are interesting examples for inspiration)
Take a look at Parrot, which we previously built as a visualization platform for Twitter data. It does not have search integrations, but it is otherwise a good example of a solution. Below is the rubric we will use for your evaluation, provided as a checklist so that you can assess your own assignment before you submit it to us.
- IMPORTANT: Is the solution well-documented such that it is easy to understand its usage?
- IMPORTANT: Is the solution hosted (on a publicly accessible web dashboard) with a neatly designed frontend?
- IMPORTANT: Does the solution visualize summary statistics for the results? For example:
a. Time series of the number of posts matching a search query
b. Time series of key topics, themes, or trends in the content
c. Pie chart of communities (or accounts) on the social media platform that are key contributors to a set of results
d. Network visualization of accounts that have shared a particular keyword, hashtag, or URL using additional data they may have shared
- Unique features (optional, but here are some creative and useful features past applicants have built that resulted in successful outcomes):
a. Topic models embedding all the content of the results, using TensorFlow Projector (free, basic), Datamapplot (free, advanced), or Nomic (paid) as a platform to visualize the semantic map of the posts.
b. GenAI summaries of the time-series plots for non-technical audiences to understand the trends better.
c. Chatbot to query the data and answer questions that the user inputs about the trends for particular topics, themes, narratives, and news articles.
d. Connecting offline events from news articles with the online sharing of posts on social media for specific searches (for example, using Wikipedia to find key events in the Russian invasion of Ukraine and mapping them to the online narratives that are shared; this is somewhat manual and not easy to automate, but extremely useful nevertheless).
e. Connecting multiple platform datasets together to search for data across multiple social platforms.
f. Semantic search after retrieving all posts matching a URL so that the retrieved results can be queried beyond keyword matching.
As a reminder, we expect you to host your Jupyter Notebook or JS dashboard on a publicly accessible website.
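The summary statistics named in the rubric (items a, c, and d) could be computed along the following lines before they are handed to a plotting library. This is a minimal sketch under stated assumptions, not a reference solution: the record fields (author, community, created_utc, text, url), the sample posts, and all helper names are hypothetical and will likely differ from the dataset you are given. It uses only the Python standard library.

```python
from collections import Counter
from datetime import datetime, timezone
from itertools import combinations

# Hypothetical sample records; field names are assumptions, not the real schema.
POSTS = [
    {"author": "alice", "community": "news", "created_utc": 1704067200,
     "text": "Report on #election claims", "url": "https://example.com/a"},
    {"author": "bob", "community": "politics", "created_utc": 1704070800,
     "text": "More #election coverage", "url": "https://example.com/a"},
    {"author": "carol", "community": "news", "created_utc": 1704153600,
     "text": "Fact check of an #Election rumor", "url": "https://example.com/a"},
    {"author": "alice", "community": "news", "created_utc": 1704157200,
     "text": "Weather update", "url": "https://example.com/b"},
    {"author": "dave", "community": "politics", "created_utc": 1704160800,
     "text": "#election results thread", "url": "https://example.com/b"},
]


def matching(posts, query):
    """Return the posts whose text contains the query, ignoring case."""
    q = query.lower()
    return [p for p in posts if q in p["text"].lower()]


def daily_counts(posts):
    """Rubric item a: number of posts per UTC day, sorted by date."""
    days = Counter(
        datetime.fromtimestamp(p["created_utc"], tz=timezone.utc).date().isoformat()
        for p in posts
    )
    return dict(sorted(days.items()))


def community_shares(posts):
    """Rubric item c: fraction of posts contributed by each community."""
    counts = Counter(p["community"] for p in posts)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.most_common()}


def co_share_edges(posts):
    """Rubric item d: weighted edges between accounts that shared the same URL."""
    authors_by_url = {}
    for p in posts:
        authors_by_url.setdefault(p["url"], set()).add(p["author"])
    edges = Counter()
    for authors in authors_by_url.values():
        # Sorting gives each undirected pair a single canonical key.
        for a, b in combinations(sorted(authors), 2):
            edges[(a, b)] += 1
    return dict(edges)


hits = matching(POSTS, "#election")
print(daily_counts(hits))      # input for a time-series plot
print(community_shares(hits))  # input for a pie chart
print(co_share_edges(hits))    # edge list for a network graph
```

Each function returns plain dictionaries, so the same results can feed whichever Python or JavaScript charting layer you choose for the dashboard.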
These instructions outline how to use GitHub for this assignment. Please follow them carefully to ensure your work is properly submitted.
- Fork the Repository:
  - Go to the assignment repository provided by the instructor: [Insert Repository Link Here]
  - Click the "Fork" button in the top right corner of the page. This creates a copy of the repository in your GitHub account.
- Clone Your Fork:
  - Go to your forked repository (it will be in your GitHub account).
  - Click the "Code" button (the green one) and copy the URL. This will be a git URL (ending in .git).
  - Open a terminal or Git Bash on your local machine.
  - Navigate to the directory where you want to work on the assignment using the cd command. For example: cd /path/to/your/projects
  - Clone your forked repository using the following command: git clone <your_forked_repository_url> (Replace <your_forked_repository_url> with the URL you copied.) This will download the repository to your local machine.
- Develop Your Solution:
  - Work on your assignment within the cloned repository. Create your code files, visualizations, and any other required deliverables. Make sure to save your work regularly.
- Commit Your Changes:
  - After making changes, you need to "stage" them for commit. This tells Git which changes you want to include in the next snapshot.
  - To stage all changes in the current directory: git add .
  - Or, to stage specific files: git add ...
  - Commit your staged changes with a descriptive message: git commit -m "Your commit message here" (Replace "Your commit message here" with a brief description of the changes you made. Be clear and concise!)
  - Push your commits back to your forked repository on GitHub: git push origin main (If you're working on a branch other than main, replace main with your branch name. origin refers to the remote repository you cloned from.)
- Please notify us of your submission by emailing simppl.collabs@gmail.com with the subject line "Submitting Research Engineer Intern Assignment for SimPPL".
Please ensure you include:
- A detailed README file (with screenshots of your solution and a URL to your publicly accessible hosted web platform).
- A text-based explanation of your code and the thought process underlying your system design.
- A link to a video recording of your dashboard hosted on YouTube or Google Drive. You can talk and explain your idea as you walk us through the platform.
The last two items make it easier for us to run your code and evaluate the assignment.
- OSINT Tools
- Colly
- AppWorld
- Scrapling
- Selenium
- Puppeteer
- DuckDB
- Cloudflare Workers
- Apache Superset
- Terraform
Focus on the analysis you are presenting and the story you are telling us through it. A well-designed and scalable system is more important than a complex one with a ton of features. Consider using innovative technologies in a user-friendly manner to create unique features for your platform, such as AI-generated summaries that adapt to the data a user searches for.
Presentation matters! Make sure your submission is easy to understand. Create an intuitive and meaningful README file or Wiki that can be used to review your solution. Host your solution so that it is accessible to anyone. Share a video demo even if the solution is hosted, so that users understand how to interpret the insights you present.
At SimPPL, we're building tools to analyze how information spreads on social media, especially from unreliable sources. Your work will help inform how to scale our analysis to a wider range of platforms and handle larger datasets. This is crucial for tracking trends, identifying misinformation, and understanding how narratives spread online.
We're excited to see your solution!