Semantic Web project. Tweet analysis that compares tweet text with articles from fact-checking websites.

edoardopieropan/twitter_fake_news

Twitter Fake News - Fact Checking

License: GPL v3 made-with-python

University project - Semantic Web

A web application built with Python and Flask that automatically fact-checks tweets from Italy against articles from two well-known Italian debunking websites.

What is it?

This application lets you download tweets using Tweepy and record what a person thinks of a given tweet: is it fake news or not? An algorithm based on spaCy POS tagging and text analysis then tries to detect automatically whether a tweet is, or talks about, fake news by comparing it with Italian articles from fact-checking websites. Currently, the debunking articles are retrieved by scraping bufale.net and BUTAC.

Setup & Run

Twitter Developer APIs

Twitter requires developer authentication to use its APIs. Before you start, make sure to request your keys at https://developer.twitter.com/. Note that, according to Twitter, approval takes about 14 days.

Clone this repository and copy your keys into the config.py file. You will need:

  1. consumer_key
  2. consumer_secret
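As a rough illustration, config.py could look like the sketch below. Only the two names consumer_key and consumer_secret come from this README; the placeholder values and the keys_are_set helper are assumptions added for the example, and the real file in the repository may be laid out differently.

```python
# config.py -- sketch with placeholder values; replace them with your own keys.
consumer_key = "YOUR_CONSUMER_KEY"        # API key issued by the Twitter developer portal
consumer_secret = "YOUR_CONSUMER_SECRET"  # API secret key issued alongside it


def keys_are_set() -> bool:
    """Hypothetical helper: True once both placeholders have been replaced."""
    return not (
        consumer_key.startswith("YOUR_") or consumer_secret.startswith("YOUR_")
    )


if not keys_are_set():
    print("Fill in consumer_key and consumer_secret before downloading tweets.")
```

Keep real keys out of version control, for example by leaving the placeholders in the committed file and editing a local copy.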

Run the app

Create a virtual environment:

python3 -m venv project-venv/

activate it:

source project-venv/bin/activate

then install the requirements:

pip install -r requirements.txt

For language analysis we use spaCy, a free, open-source library for advanced Natural Language Processing (NLP) in Python. It also supports Italian. Beginners can have a look at https://spacy.io/usage/spacy-101.

If there is a problem installing 'it-core-news-lg==2.3.0' [1], use this command:

python -m spacy download it_core_news_lg

and then reinstall the requirements.

[1] The _lg model is a multi-task CNN trained on UD Italian ISDT and WikiNER. It assigns context-specific token vectors, POS tags, dependency parses and named entities. The smaller _sm model was not enough for some of the more advanced analysis.

To run the application:

flask run

then open the website at localhost:5000 in your browser.

Application structure

> Homepage

The homepage contains instructions explaining what the tester is meant to do. When you launch the application for the first time, you have to download some tweets using the form on the Tweet Sets page. Once you have collected some sets, you can start a new test from the Start Test page.

> Tweet sets

An administration page for managing tweet sets.

To download a new set you have to specify:

  1. Name of the set
  2. A search query (with or without hashtags, e.g. #word, hello world, etc.)
  3. Number of tweets

When the submit button is pressed, the application adds the new set to a .json file that lists all the previously downloaded sets:

[
    {
        "id": "amadeus_1612474452",
        "set_name": "amadeus",
        "search_query": "sanremo",
        "tweets_number": 5
    },
    {
        "id": "covid_1612630237",
        "set_name": "covid",
        "search_query": "vaccini",
        "tweets_number": 5
    }
]

Note that the set id is the concatenation of the name and the creation timestamp.
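A minimal sketch of how such an entry could be built and appended to the index file. The function names and the index path argument are hypothetical, the field names follow the "covid" entry above, and the application's actual code may differ.

```python
import json
import time
from pathlib import Path


def make_set_entry(set_name, search_query, tweets_number, now=None):
    """Build one index entry; the id is '<set name>_<creation timestamp>'."""
    timestamp = int(time.time() if now is None else now)
    return {
        "id": f"{set_name}_{timestamp}",
        "set_name": set_name,
        "search_query": search_query,
        "tweets_number": tweets_number,
    }


def append_to_index(index_path, entry):
    """Append an entry to the JSON list at index_path, creating the file if missing."""
    path = Path(index_path)
    entries = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    entries.append(entry)
    path.write_text(
        json.dumps(entries, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return entries


# Fixed timestamp so the example reproduces the id shown in the README.
entry = make_set_entry("covid", "vaccini", 5, now=1612630237)
print(entry["id"])  # covid_1612630237
```

Using a whole-second timestamp keeps ids readable, but two sets with the same name created within the same second would collide.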

The code then compares all the retrieved tweets with articles from the two Italian fact-checking websites. The similarity methods are the following:

  1. test1: stop-words and punctuation removal + lemmatization. The tweet is compared with the article's title.
  2. test2: stop-words and punctuation removal + lemmatization. The tweet is compared with the article's body.
  3. test3: keep only nouns. The tweet is compared with the article's title.
  4. test4: keep only nouns. The tweet is compared with the article's body.
  5. test5: stop-words and punctuation removal + lemmatization. If len(tweet) = n, the tweet is compared with the first n words in the article's body.
  6. test6: stop-words and punctuation removal + lemmatization. The tweet is compared with the 30% of the words in the article's body that have the highest TF-IDF scores.
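The word-selection steps of test5 and test6 can be sketched in plain Python as follows. This is an illustration under assumptions, not the project's code: the inputs are taken to be lists of already lemmatized tokens, the 30% is applied to distinct words, and the smoothed IDF formula is one common variant chosen for the example. The spaCy preprocessing and the similarity computation itself are left out.

```python
import math
from collections import Counter


def first_n_words(tweet_tokens, body_tokens):
    """test5-style truncation: keep as many body tokens as the tweet has."""
    return body_tokens[: len(tweet_tokens)]


def top_tfidf_words(body_tokens, corpus, fraction=0.3):
    """test6-style selection: the share of distinct body words with the highest TF-IDF.

    corpus is a list of token lists, one per article.
    """
    term_counts = Counter(body_tokens)
    n_docs = len(corpus)
    # Number of articles in which each word appears at least once.
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    scores = {
        word: (count / len(body_tokens))
        * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)  # smoothed IDF
        for word, count in term_counts.items()
    }
    keep = max(1, math.ceil(len(scores) * fraction))
    # Highest score first; ties are broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:keep]


corpus = [
    ["vaccino", "covid", "microchip", "vaccino"],
    ["covid", "lockdown"],
    ["covid", "sanremo"],
]
print(first_n_words(["vaccino", "microchip"], corpus[0]))  # ['vaccino', 'covid']
print(top_tfidf_words(corpus[0], corpus))  # ['vaccino']
```

In the example, "covid" appears in every article, so its IDF is low and it is dropped in favour of the more distinctive "vaccino".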

Finally, the results are stored in a set-specific .json file, structured as follows:

[
    {
        "source": "tweet_author",
        "text": "This is a tweet!",
        "created_at": "2021-02-10 15:58:46",
        "id": 1359532267572453377,
        "progressive": "t0",
        "fact_checking": {
            "test1": {
                "similarity": 0.58,
                "fn_url": "https://www.bufale.net/link_to_the article"
            },
            "test2": {
                "similarity": 0.6,
                "fn_url": "https://www.bufale.net/link_to_the article"
            },
            "test3": {
                "similarity": 0.39,
                "fn_url": "https://www.bufale.net/link_to_the article"
            },
            "test4": {
                "similarity": 0.57,
                "fn_url": "https://www.bufale.net/link_to_the article"
            }
        }
    }
]

Each tweet in the .json file contains information about the source, the text and, for every test, the similarity obtained and the URL of the article with the highest similarity.
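As an illustration of how this file can be consumed, the sketch below loads a record shaped like the example and picks the test with the highest similarity. The best_match helper is hypothetical, and the sample data is a trimmed copy of the values shown above.

```python
import json

# Sample record shaped like the README example, trimmed to two tests.
RESULTS_JSON = """
[
    {
        "source": "tweet_author",
        "text": "This is a tweet!",
        "id": 1359532267572453377,
        "fact_checking": {
            "test1": {"similarity": 0.58, "fn_url": "https://www.bufale.net/link_to_the article"},
            "test2": {"similarity": 0.6, "fn_url": "https://www.bufale.net/link_to_the article"}
        }
    }
]
"""


def best_match(tweet):
    """Return (test_name, similarity, fn_url) for the highest-scoring test."""
    name, result = max(
        tweet["fact_checking"].items(), key=lambda item: item[1]["similarity"]
    )
    return name, result["similarity"], result["fn_url"]


tweets = json.loads(RESULTS_JSON)
for tweet in tweets:
    test_name, similarity, url = best_match(tweet)
    print(tweet["id"], test_name, similarity, url)
```

Scores from different tests are not necessarily comparable with one another, so treating the maximum as "the" match is a simplification.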

> Test page

The tester is asked to choose a username and one of the previously downloaded tweet sets.

The tweets in the selected set are then displayed, and the tester labels each one as True, Maybe or Fake.

After the test is finished, a .json file containing all the session info is saved:

{
    "id": "andrea_1612475354",
    "username": "andrea",
    "tweets_set_id": "covid_1612630237",
    "start_timestamp": 1612973659,
    "finish_timestamp": 1612973701,
    "user_choices": {
        "1359532267572453377": "Maybe",
        "1358095041076617217": "Fake",
        "1358094894586359809": "True",
        "1358094801757999107": "Maybe",
        "1358094548744998913": "True"
    }
}

and the user is added to a .json file that lists all the test sessions:

[
    {
        "id": "andrea_1612475354",
        "username": "andrea"
    },
    {
        "id": "edoardo_1612475580",
        "username": "edoardo"
    }
]

Note that a user/session id is the concatenation of the username and the start timestamp.
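A minimal sketch of how a session record with these fields could be assembled. The build_session function and the validation of the three labels are assumptions made for the example; the id follows the rule stated above (username plus start timestamp).

```python
import time

ALLOWED_CHOICES = {"True", "Maybe", "Fake"}


def build_session(username, tweets_set_id, user_choices, start_timestamp,
                  finish_timestamp=None):
    """Build a session record; the id is '<username>_<start timestamp>'."""
    invalid = sorted({c for c in user_choices.values() if c not in ALLOWED_CHOICES})
    if invalid:
        raise ValueError(f"Unexpected choices: {invalid}")
    start = int(start_timestamp)
    finish = int(time.time() if finish_timestamp is None else finish_timestamp)
    return {
        "id": f"{username}_{start}",
        "username": username,
        "tweets_set_id": tweets_set_id,
        "start_timestamp": start,
        "finish_timestamp": finish,
        # JSON object keys are strings, so tweet ids are stored as strings.
        "user_choices": {str(tid): choice for tid, choice in user_choices.items()},
    }


session = build_session(
    "andrea",
    "covid_1612630237",
    {1359532267572453377: "Maybe", 1358095041076617217: "Fake"},
    start_timestamp=1612973659,
    finish_timestamp=1612973701,
)
print(session["id"])  # andrea_1612973659
print(session["finish_timestamp"] - session["start_timestamp"])  # 42
```

The entry added to the list of sessions would then only need the id and username fields of this record.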

License

Before using it, we invite you to read the LICENSE.

This project is distributed under the terms of the GNU General Public License v3.0.
Permissions of this strong copyleft license are conditioned on making available complete source code of licensed works and modifications, which include larger works using a licensed work, under the same license. Copyright and license notices must be preserved. Contributors provide an express grant of patent rights.


Visit http://www.gnu.org/licenses/ for further information.

References

Semantic Web (Web Semantico)
A.Y. 2019/2020
University of Verona (Italy)

Repository Authors:
Edoardo Pieropan
Andrea Toaiari