Sample project that uses stylometry to deanonymize the author of a Twitter account.
$ pip install tweepy numpy unidecode nltk scipy scikit-learn
$ python3
In Python, import nltk and download the punkt model.
>>> import nltk
>>> nltk.download('punkt')
$ git clone https://github.com/ViliamV/stylometry.git
$ cd stylometry/
Follow these steps:
- Enter your credentials in twitter-API.txt.
- Create accounts.txt in the main directory and list the account names to download, one per line. Put the unknown author's account last.
- Create a directory named data in the main directory.
- Run tweet-downloader.py and wait. Because of Twitter API rate limits, this might take a while.
- Verify that data contains the downloaded tweets.
- Edit classification.py and change the value of UNKNOWN (line 28) to the unknown author's account:
  UNKNOWN="example_account"
- Run classification.py.
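The accounts.txt convention and the verification step above can be checked with a short script. This is only a sketch: the helper names read_accounts and missing_downloads are hypothetical, and it assumes that each file in data has the account name somewhere in its file name, which may not match how tweet-downloader.py actually names its output.

```python
from pathlib import Path


def read_accounts(path="accounts.txt"):
    """Return account names, one per non-empty line, in file order."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def missing_downloads(accounts, data_dir="data"):
    """Return accounts that have no non-empty file in data_dir.

    Assumption: a downloaded file contains the account name in its
    file name. Adjust the match if the downloader names files differently.
    """
    files = [
        p for p in Path(data_dir).iterdir()
        if p.is_file() and p.stat().st_size > 0
    ]
    return [
        account for account in accounts
        if not any(account.lower() in p.name.lower() for p in files)
    ]
```

Calling read_accounts() from the main directory returns the list of accounts, whose last element is the unknown author; passing that list to missing_downloads() returns the accounts that still appear to lack data.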
The code uses a bag-of-words model to extract features from the text. A great introduction to implementing this model can be found here.
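To make the idea concrete, here is a minimal bag-of-words sketch in plain Python. It is not the code from classification.py; the function names and the regex tokenizer are assumptions chosen for illustration. Each text becomes a vector of token counts over a shared vocabulary, ignoring word order.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Count how often each vocabulary token occurs in the text."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens outside the vocabulary are ignored
            vector[vocabulary[tok]] = n
    return vector


docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(docs)   # columns: cat, dog, sat, saw, the
vectors = [vectorize(d, vocab) for d in docs]
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Vectors like these can be fed to a classifier, with one row per tweet or per author.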
The code also uses Czech stopwords and a Czech tokenizer; however, both are quite simple to change.
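One way to keep the language swappable is to pass it as a parameter. The sketch below is an assumption about how this could look, not the project's actual code: STOPWORDS holds tiny illustrative samples rather than complete lists, and content_tokens is a hypothetical helper that uses a simple regex in place of a real tokenizer.

```python
import re

# Tiny illustrative samples only; real stop-word lists are much longer.
STOPWORDS = {
    "czech": {"a", "i", "je", "na", "se", "to", "v", "že"},
    "english": {"a", "and", "in", "is", "of", "the", "to"},
}


def content_tokens(text, language="czech"):
    """Tokenize the text and drop the stop words of the given language."""
    stop = STOPWORDS[language]
    return [tok for tok in re.findall(r"\w+", text.lower()) if tok not in stop]


print(content_tokens("Pes je v parku", "czech"))
print(content_tokens("The cat is in the garden", "english"))
```

Switching to another language then only requires adding an entry to STOPWORDS and passing its key.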