An algorithm for extracting posts on Wikipedia page deletion discussions.
The scripts directory contains many scripts for data analysis.
Mainly, it allows running an SVM, a multinomial naïve Bayes and a language model classifier on the data.
The classifiers' performance can be compared, and the most influential features can be inspected to deduce the impact that words may have on the likelihood of a user being blocked afterwards.
The scripts allow full text (i.e. without stop word removal) classification as well as a classification restricted to a list of function words.
For more information, especially concerning this software's use and the results obtained with it, please refer to the corresponding master's thesis:
“Did I Say Something Wrong?” A Word-Level Analysis of Wikipedia Articles for Deletion Discussions
WikiWho has been tested on Arch Linux running Python 3.4.3 and Ubuntu 12.04 running Python 3.2.3.
WikiWho utilises the MediaWiki Utilities library to process the revisioned content extracted from Wikipedia. These functions can be downloaded from the official MediaWiki Utilities repository (under the MIT license) at the following link:
This file is the core of this project. By default, it extracts deletion discussions from Wikipedia page dumps, attributes their authorship and writes them to disk together with the number of seconds between the creation of each post and the blocking of its author.
A value of -1 indicates that the user has not been blocked afterwards.
If the condition parameter isRegisteredUserTalk is passed, it parses user talk pages and writes a block file according to the warning templates it finds on each user's talk page.
Only a subset of templates is actually considered.
See the writeUserWarning(text, revision, pageName) method for more information.
-i [source_file_name or directory] (complete history dump of articles, either as XML, bzip2, gzip, LZMA or 7zip. Alternatively, if a directory is specified, all files residing in it, matching one of the supported file types, will be processed.)
-b [<block log>] (optional when the condition is isRegisteredUserTalk. The block log constructed from the Wikipedia data dumps' logging dump through 0nse/WikiParser.)
-c [<condition>] (optional. It can be isDeletionDiscussion for AfD or isRegisteredUserTalk for user talk. The default is isDeletionDiscussion.)
python WikiwhoRelationships.py -i randomArticle.xml -b blockLog.csv
Returns the text introduced in each revision of any given deletion discussion in randomArticle.xml.
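The handling of the -i argument described above (a single dump file, or every supported file in a directory) could look roughly like the following. This is a minimal sketch and not the project's code: the function name collect_dump_files and the set of file extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed extensions for the supported dump formats (XML, bzip2, gzip,
# LZMA, 7zip); the actual script may recognise a different set.
SUPPORTED_SUFFIXES = {".xml", ".bz2", ".gz", ".lzma", ".7z"}


def collect_dump_files(source):
    """Return the dump files to process for a file or directory path.

    Hypothetical helper: a directory yields its direct children that have
    a supported suffix; a single file is returned as given.
    """
    path = Path(source)
    if path.is_dir():
        return sorted(
            child
            for child in path.iterdir()
            if child.is_file() and child.suffix.lower() in SUPPORTED_SUFFIXES
        )
    if path.is_file():
        return [path]
    raise FileNotFoundError(source)
```

Sorting the result only makes the processing order reproducible; nothing in the description above requires a particular order.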
Calculates the time between the creation of a post and the moment its author was blocked. This data is written as an additional column to the revision log. This file is used by WikiWho.py. Its standalone purpose is to migrate CSV files from former WikiWho DiscussionParser revisions.
[block log] (the block log constructed from the Wikipedia data dumps' logging dump through 0nse/WikiParser.)
[revision log file] (processed revisions with authorship but without calculated time until the next block.)
[output file] (the path to which the new CSV file should be written.)
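The time calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the script's actual implementation: the function name, the use of datetime objects, and the choice to count only blocks strictly later than the post are all assumptions. The -1 value for authors who were never blocked afterwards is taken from the description of the output.

```python
from bisect import bisect_right
from datetime import datetime

# -1 marks a post whose author has not been blocked afterwards.
NOT_BLOCKED = -1


def seconds_until_next_block(post_time, block_times):
    """Return the seconds from a post to its author's next block.

    block_times is assumed to be the author's block timestamps sorted in
    ascending order. Only blocks strictly after post_time are considered
    (an assumption of this sketch).
    """
    index = bisect_right(block_times, post_time)
    if index == len(block_times):
        return NOT_BLOCKED
    return int((block_times[index] - post_time).total_seconds())


# Example usage with made-up timestamps.
post = datetime(2015, 1, 1, 12, 0, 0)
blocks = [datetime(2014, 12, 31, 0, 0, 0), datetime(2015, 1, 1, 13, 0, 0)]
print(seconds_until_next_block(post, blocks))
```

Keeping each author's block timestamps sorted lets the lookup use a binary search, which matters when the same block log is queried once per post.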
Reads Wikipedia dumps and writes uncompressed XML dumps that only contain Articles for Deletion (AfD), user pages or user talk pages of registered users. With the AfD pages preprocessed this way, the actual WikiWho DiscussionParser can run considerably faster, as it has to neither decompress the dumps nor filter for the relevant articles.
[page dump path] (the path to the dumps. It can also be a concrete file. The filtered file will be generated with the suffix _afd.xml or _users.xml for AfD or user pages respectively.)
-c [<condition>] (optional. It can be isDeletionDiscussion for AfD, isRegisteredUser for users or isRegisteredUserTalk for user talk pages. The default is isDeletionDiscussion.)
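As an illustration of what the three conditions might check, the sketch below decides by page title alone. It is an assumption that the filter works on titles like this; the function names and the rule that an IP address as user name means "not registered" are hypothetical and not taken from the script. Only the condition names in the final mapping come from the parameter description above.

```python
import ipaddress

# Title prefixes as used on the English Wikipedia.
AFD_PREFIX = "Wikipedia:Articles for deletion/"
USER_PREFIX = "User:"
USER_TALK_PREFIX = "User talk:"


def _is_ip_address(name):
    """Return True if name parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False


def _belongs_to_registered_user(title, prefix):
    if not title.startswith(prefix):
        return False
    # Subpages such as "User:Alice/sandbox" belong to the user "Alice".
    user_name = title[len(prefix):].split("/", 1)[0]
    # Assumption: anonymous editors are identified by their IP address.
    return bool(user_name) and not _is_ip_address(user_name)


def is_deletion_discussion(title):
    return title.startswith(AFD_PREFIX)


def is_registered_user(title):
    return _belongs_to_registered_user(title, USER_PREFIX)


def is_registered_user_talk(title):
    return _belongs_to_registered_user(title, USER_TALK_PREFIX)


# Maps the values accepted by -c to the hypothetical predicates above.
CONDITIONS = {
    "isDeletionDiscussion": is_deletion_discussion,
    "isRegisteredUser": is_registered_user,
    "isRegisteredUserTalk": is_registered_user_talk,
}
```

A dictionary keyed by the condition name keeps the command-line value and the predicate in one place, so an unknown condition fails with a plain lookup error.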
The scripts directory contains multiple scripts for post-processing the data generated by WikiWho. It includes the option to run multiple classifiers on the data. More information is given by the README in the scripts directory itself.
This work is released under a GPLv3 licence. It is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
For more information, please refer to the LICENSE file which you should have received with your copy of the program. If this is not the case, please refer to http://www.gnu.org/licenses/.