Package on PyPI: https://pypi.org/project/tweetfinder/
Code: https://github.com/dataculturegroup/Tweet-Finder
Documentation: https://tweet-finder.readthedocs.io
A small Python library for finding Tweets embedded in online news articles, and mentions of Tweets. We wrote this because we suspected that current research approaches were significantly under-counting the number of Tweets embedded in online news stories. Our initial evaluation confirms this.
Install with pip: pip install tweetfinder
from tweetfinder import Article
my_article = Article(url="http://my.news/article") # this will load and parse the article
# you can discover all the tweets that are embedded in the HTML
num_embedded = my_article.count_embedded_tweets()
tweets_embedded = my_article.list_embedded_tweets() # metadata about tweets that are embedded
# you can also discover any mentions of twitter (in English), like "tweeted that" or "in a retweet"
num_mentions = my_article.count_mentioned_tweets()
tweet_mentions = my_article.list_mentioned_tweets() # list of text snippets that mention a tweet
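To make the idea of a "mention" concrete, here is a minimal sketch of phrase-based matching on plain text. It is not the library's implementation: the phrase list, the find_mentions helper, and the window size are assumptions for illustration, and the library ships its own patterns.

```python
import re

# Hypothetical phrase list for illustration only; longer phrases come first
# so that "tweeted that" wins over the shorter "tweeted".
MENTION_PATTERN = re.compile(
    r"\b(tweeted that|in a retweet|retweeted|tweeted)\b",
    re.IGNORECASE,  # match "Tweeted" as well as "tweeted"
)


def find_mentions(text, window=40):
    """Return one dict per matched phrase, with surrounding context."""
    mentions = []
    for match in MENTION_PATTERN.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        mentions.append({
            "phrase": match.group(0).lower(),
            "context": text[start:end],
            "content_start_index": match.start(),
        })
    return mentions


sample = "She Tweeted that it was fine, and he replied in a retweet."
for mention in find_mentions(sample):
    print(mention["phrase"], "at", mention["content_start_index"])
```

Matching is case-insensitive here so that capitalised forms are not skipped.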
Why are embedded tweets being undercounted? Two main reasons:
- Not everyone embeds tweets following the blockquote guidelines from Twitter
- Many new websites render their content via JavaScript rather than raw HTML, so unless you run a browser and execute the JavaScript, you won't see the embedded tweets in the page source
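Both reasons can be illustrated with a short sketch using only the standard library. This is not how tweetfinder works internally: the BlockquoteTweetParser class, the find_blockquote_tweets helper, and the sample HTML strings are made up for this example. The strict mode only accepts blockquotes carrying Twitter's twitter-tweet class; the loose mode accepts any blockquote that links to a tweet.

```python
import re
from html.parser import HTMLParser

# Matches links such as https://twitter.com/<user>/status/<id>
STATUS_URL = re.compile(r"twitter\.com/\w+/status(?:es)?/(\d+)")


class BlockquoteTweetParser(HTMLParser):
    """Collect tweet ids from links that sit inside a blockquote."""

    def __init__(self, require_class=True):
        super().__init__()
        self.require_class = require_class
        self.depth = 0  # greater than 0 while inside a qualifying blockquote
        self.tweet_ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "blockquote":
            classes = (attrs.get("class") or "").split()
            if self.depth or not self.require_class or "twitter-tweet" in classes:
                self.depth += 1
        elif tag == "a" and self.depth:
            match = STATUS_URL.search(attrs.get("href") or "")
            if match:
                self.tweet_ids.append(match.group(1))

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth:
            self.depth -= 1


def find_blockquote_tweets(html, require_class=True):
    parser = BlockquoteTweetParser(require_class=require_class)
    parser.feed(html)
    return parser.tweet_ids


# Made-up sample pages
GUIDELINE = ('<blockquote class="twitter-tweet"><p>Hello</p>'
             '<a href="https://twitter.com/example/status/1234567890">date</a></blockquote>')
NONSTANDARD = '<blockquote><a href="https://twitter.com/example/status/1234567890">a tweet</a></blockquote>'
JS_ONLY = '<div id="root"></div><script src="/bundle.js"></script>'

print(find_blockquote_tweets(GUIDELINE))                       # strict mode finds it
print(find_blockquote_tweets(NONSTANDARD))                     # strict mode misses it
print(find_blockquote_tweets(NONSTANDARD, require_class=False))
print(find_blockquote_tweets(JS_ONLY, require_class=False))    # nothing in raw HTML
```

The last page shows the second problem: when the embed is injected by JavaScript, no amount of raw HTML parsing will find it.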
Some of our initial numbers behind this:
- Out of 1000 stories that mentioned twitter, our library found 640 embedded tweets in raw HTML
- Goose3, which is what current papers seem to use, found 518 in the same set of stories (i.e. it missed about 20%)
- If you add in support for processing JavaScript-based embeds, we found 859 (35% more), including tweets that traditional raw HTML-based counting approaches miss
These to-be-published results confirm our suspicion: most large quantitative news projects are under-counting embedded Tweets by around 35% or more. This library is our attempt to help fix that.
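The rounded percentages follow from the counts listed above. A quick check of the arithmetic, assuming both rates are measured against the 640 tweets found in raw HTML (that baseline is our reading of the figures; the variable names are ours):

```python
library_raw_html = 640   # embedded tweets our library found in raw HTML
goose3 = 518             # embedded tweets Goose3 found in the same stories
with_javascript = 859    # embedded tweets found once JavaScript embeds are processed

# Share of the raw HTML embeds that Goose3 did not find
missed_by_goose3 = (library_raw_html - goose3) / library_raw_html

# Extra embeds gained by processing JavaScript, relative to raw HTML counting
gain_from_javascript = (with_javascript - library_raw_html) / library_raw_html

print(f"Goose3 missed {missed_by_goose3:.1%}")                 # 19.1%, "about 20%"
print(f"JavaScript support adds {gain_from_javascript:.1%}")   # 34.2%, "around 35%"
```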
Why does that matter? Understanding how Twitter (and other platforms) is used in news media is critical for building a better map of how the media ecosystem functions. News shapes how we see the world; studying the architectures of information flows around us is critical for preventing the spread of hate speech and misinformation, and for supporting newsrooms and democracy.
When you create an Article, the HTML is downloaded (if needed) and parsed immediately to find any mentions of twitter and any embedded tweets. There are a number of methods to return the information found:
Return True or False depending on whether there are any tweets embedded in the article.
count_embedded_tweets(): Return the number of tweets embedded in the article.
list_embedded_tweets(): Return a list of dicts with information about the tweets found. The properties in each dict depend on how we found the tweet. It could look like this:
[{
    'tweet_id': '//twitter.com/sliccard',
    'html_source': 'blockquote url fallback',
    'username': '',
    'full_url': 'https://twitter.com/sliccardo',
}]
Properties:
- tweet_id: the unique id of the tweet, can be used in concert with Twitter's API to pull more metadata (always included)
- html_source: a string indicating which method the tweet was found with (always included)
- full_url: the complete URL to the tweet on Twitter (sometimes included)
- username: the twitter username of the author of the tweet, including the "@" (sometimes included)
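Because only tweet_id and html_source are always present, code that consumes these dicts should treat the other keys as optional. A small sketch, using plain dicts rather than a live Article; the summarize_embeds helper and the second sample entry (including its html_source value) are hypothetical:

```python
from collections import Counter


def summarize_embeds(tweets):
    """Count tweets per detection method and collect any known usernames."""
    by_source = Counter(tweet["html_source"] for tweet in tweets)
    # username may be missing or an empty string, so use .get() and filter
    usernames = [tweet["username"] for tweet in tweets if tweet.get("username")]
    return by_source, usernames


sample = [
    # the example entry shown above
    {
        "tweet_id": "//twitter.com/sliccard",
        "html_source": "blockquote url fallback",
        "username": "",
        "full_url": "https://twitter.com/sliccardo",
    },
    # hypothetical entry with made-up values, without a full_url
    {
        "tweet_id": "1234567890",
        "html_source": "hypothetical other method",
        "username": "@example_user",
    },
]

by_source, usernames = summarize_embeds(sample)
print(by_source)
print(usernames)
```

With a real article you would pass the result of my_article.list_embedded_tweets() instead of the sample list.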
Return True or False depending on whether there are any mentions of tweets in the article.
count_mentioned_tweets(): Return the number of mentions of tweets in the article.
list_mentioned_tweets(): Return a list of dicts with information about each mention of a tweet. It will look like this:
[{
'phrase': 'tweeted',
'context': 'in March last year. He decided to comfort himself by bingeing on a favourite TV show. “I randomly tweeted something about putting on the first episode of a TV series. I’m slightly afraid to say that it was',
'content_start_index': '670',
}]
Properties:
- phrase: the phrase matched as a mention of twitter
- context: a window of characters around the phrase to help you understand where it occurred
- content_start_index: the index into my_article.get_content() you can use to find the match
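A sketch of how content_start_index might be used to pull the match back out of the article text. It assumes the index marks where the phrase starts, and it converts the value with int() because the example above shows it as a string. The locate_mention helper and the sample content are ours; with a real article, content would come from my_article.get_content().

```python
def locate_mention(content, mention, window=30):
    """Slice the matched phrase and its surroundings out of the content."""
    start = int(mention["content_start_index"])  # may arrive as a string
    end = start + len(mention["phrase"])
    return {
        "matched_text": content[start:end],
        "before": content[max(0, start - window):start],
        "after": content[end:end + window],
    }


# Made-up content standing in for my_article.get_content()
content = "He randomly tweeted something about a TV series."
mention = {
    "phrase": "tweeted",
    "content_start_index": str(content.index("tweeted")),
}

located = locate_mention(content, mention)
print(located["before"] + "[" + located["matched_text"] + "]" + located["after"])
```

If matched_text does not equal the phrase (ignoring case), the assumption about what the index points at does not hold for your version of the library.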
If you want to work on this module, clone the repo and install dependencies: make requirements-dev
- Run make test to make sure all the tests pass
- Update the version number in tweetfinder/__init__.py
- Make a brief note in the version history section below about the changes
- Run make sphinx-docs to update the documentation
- Run make build-release to create an install package
- Run make release-test to upload it to PyPI's test platform
- Run make release to upload it to PyPI
- v1.0.1: fix packaging to include data files required
- v1.0.0: added documentation and evaluation scripts
- v0.2.1: fix case-related bug in finding mentions
- v0.2.0: better documentation
- v0.1.0: initial release for testing
This library is part of the Media Cloud project, and is supported by the Co-Lab for Data Impact and the Data Culture Group at Northeastern University.
- Rahul Bhargava
- Dina Zemlyanker