Development of a Python/Flask web application for automatic fact-checking of tweets from Italy against two well-known Italian debunking websites.
This application lets you download tweets using Tweepy and test what a person thinks of a given tweet: is it fake news or not? A POS-tagging and analysis algorithm based on spaCy tries to automatically detect whether a tweet is, or talks about, fake news by comparing it with Italian articles from fact-checking websites. Currently, the application scrapes fake-news debunking articles from bufale.net and BUTAC.
Twitter requires developer authentication to use its APIs. Before starting, please make sure to request your keys at https://developer.twitter.com/. Note that, according to Twitter, approval can take about 14 days.
Clone this repository and copy your keys into the config.py file. You will need:
- consumer_key
- consumer_secret
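For reference, config.py could look like the following minimal sketch (the placeholder values are illustrative; the file in the repository may define additional settings):

# config.py -- Twitter API credentials used by Tweepy
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"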
Create a virtual environment:
python3 -m venv project-venv/
activate it:
source project-venv/bin/activate
then install the requirements:
pip install -r requirements.txt
For language analysis we use spaCy (GitHub), a free, open-source library for advanced Natural Language Processing (NLP) in Python that also supports Italian. Beginners can have a look at https://spacy.io/usage/spacy-101.
If there is a problem installing 'it-core-news-lg==2.3.0' [1], use this command:
python -m spacy download it_core_news_lg
and reinstall requirements.
[1] The _lg model is a multi-task CNN trained on UD Italian ISDT and WikiNER. It assigns context-specific token vectors, POS tags, dependency parses and named entities. The smaller _sm model was not sufficient for some of the advanced analysis.
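As a quick sanity check that the Italian model is installed correctly, you can run a snippet like the following (a minimal sketch, independent of the application code):

import spacy

# Load the large Italian model used by the application
nlp = spacy.load("it_core_news_lg")

# Print each token with its lemma, part-of-speech tag and stop-word flag
doc = nlp("Il vaccino contro il Covid è arrivato in Italia.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)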
To run the application:
flask run
then open the website at http://localhost:5000 in your browser.
The homepage contains some instructions explaining what the tester is meant to do. When you launch the application for the first time, you have to download some tweets using the form in the Tweet Sets page. Once you have collected some sets, you can start a new test from the Start Test page.
The Tweet Sets page is an administration page for managing tweet sets.
To download a new set you have to specify:
- Name of the set
- A search query (with or without hashtags, e.g. #word, hello world, etc.)
- Number of tweets
When the submit button is pressed, the application adds the new set to a .json file containing all the previously downloaded sets:
[
    {
        "id": "amadeus_1612474452",
        "set_name": "amadeus",
        "search_query": "sanremo",
        "tweets_number": 5
    },
    {
        "id": "covid_1612630237",
        "set_name": "covid",
        "search_query": "vaccini",
        "tweets_number": 5
    }
]
Note that the set id is the concatenation of the name and the creation timestamp.
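Tweet retrieval is built on Tweepy. The sketch below shows how a set could be downloaded and identified; the function name download_set is illustrative and the calls refer to the Tweepy 3.x API, so the actual repository code may differ:

import time
import tweepy
from config import consumer_key, consumer_secret

# Application-only authentication with the keys from config.py (Tweepy 3.x)
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth)

def download_set(set_name, search_query, tweets_number):
    # The set id is the concatenation of the name and the creation timestamp
    set_id = set_name + "_" + str(int(time.time()))
    tweets = []
    for status in tweepy.Cursor(api.search, q=search_query,
                                lang="it", tweet_mode="extended").items(tweets_number):
        tweets.append({
            "source": status.user.screen_name,
            "text": status.full_text,
            "created_at": str(status.created_at),
            "id": status.id,
        })
    return set_id, tweets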
The code will then compare all the downloaded tweets with articles from two Italian fact-checking websites. The similarity methods are the following (a sketch of one of them is shown after the list):
- test1: stop-words and punctuation removal + lemmatization. The tweet is compared with the article's title.
- test2: stop-words and punctuation removal + lemmatization. The tweet is compared with the article's body.
- test3: keep only nouns. The tweet is compared with the article's title.
- test4: keep only nouns. The tweet is compared with the article's body.
- test5: stop-words and punctuation removal + lemmatization. If len(tweet) = n, the tweet is compared with the first n words in the article's body.
- test6: stop-words and punctuation removal + lemmatization. The tweet is compared with the 30% of the words from the article's body with the highest TF-IDF score.
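As an illustration, test1 could be implemented with spaCy roughly as follows (a sketch of the preprocessing and comparison, not the exact code used in the repository):

import spacy

nlp = spacy.load("it_core_news_lg")

def clean(text):
    # Remove stop-words and punctuation, keep the lemma of every remaining token
    doc = nlp(text)
    return " ".join(t.lemma_ for t in doc if not t.is_stop and not t.is_punct)

def title_similarity(tweet_text, article_title):
    # Word-vector similarity between the cleaned tweet and the cleaned article title
    return nlp(clean(tweet_text)).similarity(nlp(clean(article_title)))

The other tests differ only in the preprocessing (e.g. keeping only tokens with token.pos_ == "NOUN") and in which portion of the article body is compared.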
Finally, the results are stored in a set-specific .json file, structured like the following:
[
    {
        "source": "tweet_author",
        "text": "This is a tweet!",
        "created_at": "2021-02-10 15:58:46",
        "id": 1359532267572453377,
        "progressive": "t0",
        "fact_checking": {
            "test1": {
                "similarity": 0.58,
                "fn_url": "https://www.bufale.net/link_to_the_article"
            },
            "test2": {
                "similarity": 0.6,
                "fn_url": "https://www.bufale.net/link_to_the_article"
            },
            "test3": {
                "similarity": 0.39,
                "fn_url": "https://www.bufale.net/link_to_the_article"
            },
            "test4": {
                "similarity": 0.57,
                "fn_url": "https://www.bufale.net/link_to_the_article"
            }
        }
    }
]
Each tweet in the .json contains information about the source, the text, the similarity obtained from each method and, for every test, the URL of the article with the highest similarity.
The subject is asked to choose a username and one of the previously downloaded tweet sets.
The tweets contained in the selected set are displayed and, for each tweet, the user chooses between True, Maybe and Fake.
After finishing the test, a .json containing all the session info is saved:
{
    "id": "andrea_1612475354",
    "username": "andrea",
    "tweets_set_id": "covid_1612630237",
    "start_timestamp": 1612973659,
    "finish_timestamp": 1612973701,
    "user_choices": {
        "1359532267572453377": "Maybe",
        "1358095041076617217": "Fake",
        "1358094894586359809": "True",
        "1358094801757999107": "Maybe",
        "1358094548744998913": "True"
    }
}
and the user is added to a .json containing all the test sessions:
[
    {
        "id": "andrea_1612475354",
        "username": "andrea"
    },
    {
        "id": "edoardo_1612475580",
        "username": "edoardo"
    }
]
Note that a user/session id is the concatenation of the username and the start timestamp.
Before using it, we invite you to read the LICENSE.
This file is distributed under the terms of the GNU General Public License v3.0
Permissions of this strong copyleft license are conditioned on making available complete source code of licensed works and modifications, which include larger works using a licensed work, under the same license. Copyright and license notices must be preserved. Contributors provide an express grant of patent rights.
Visit http://www.gnu.org/licenses/ for further information.
Semantic Web (Web Semantico) course
A.Y. 2019/2020
University of Verona (Italy)
Repository Authors:
Edoardo Pieropan
Andrea Toaiari