File 'discussions_text.csv': each line represents a comment or structural element of the discussion, with the associated metadata. 
Description of all the fields in the file:

global_id	-> global id of the node in the dataset (each ID is unique in the file)       
parent_global_id-> global id of the parent node in the dataset
id      	-> local id of the node in the discussion (each id is unique within the article. id 0 represents the article itself)
parent_id	-> local id of the parent node in the discussion       
level   	-> level of indentation (negative numbers represent structural nodes, level 1 represents comments which are not a reply to another comment)
article 	-> id of the article with which the discussion is associated (see the list of ids and corresponding titles in file "article_titles.csv"). Negative numbers represent articles whose official Wikipedia id is not known
discussion      -> id of the talk page in which the comment was written (more than one talk page can be associated with the same article)
timestamp       -> UNIX timestamp in minutes: multiply by 60 to get the value in seconds (the standard epoch timestamp). 0 for structural nodes, -1 for undated comments
day     	-> day, starting from day 1 = January 1st, 2001. 0 for structural nodes, -1 for undated comments
author  	-> author's id. 
			0: no author (structural nodes) 
			-1: unsigned comment 
			>0 & <16M: user id generated by the software (not the official Wikipedia user id)
			>16M: ip-signed comments
parent_author   -> id of the parent comment's author
author_name     -> author's user name
parent_author_name -> user name of the parent comment's author
date    	-> date (string)
text		-> text of the comment. To write one comment per line, newlines (\n) have been replaced with "	<LF>	" and tabs (\t) with "  <TAB>  ". Structural nodes: texts delimited by <> represent articles' and talk pages' titles, while texts delimited by "=" represent thread titles. The other texts represent comments, where signature and date have been removed.
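
The conventions described above for the text, timestamp, day and author fields can be sketched in Python as follows. This is only an illustrative sketch that works on field values already read from the file; it does not parse the file itself, since the column delimiter is not stated here. All function names are hypothetical. Two assumptions are made: spaces or tabs directly around a <LF> or <TAB> marker are treated as padding and removed with it (which may also remove original whitespace next to a newline), and "16M" is taken to mean 16,000,000 (adjust the threshold if the dataset uses another value).

```python
import re
from datetime import date, datetime, timedelta, timezone

# Assumption: spaces/tabs directly around a marker are padding added when the
# file was written, so they are dropped together with the marker.
_MARKER = re.compile(r"[ \t]*<(LF|TAB)>[ \t]*")

# Assumption: "16M" in the field description means 16,000,000.
IP_THRESHOLD = 16_000_000


def restore_text(text):
    """Turn <LF> and <TAB> markers back into newline and tab characters."""
    return _MARKER.sub(lambda m: "\n" if m.group(1) == "LF" else "\t", text)


def minutes_to_datetime(timestamp):
    """Convert the 'timestamp' field (minutes since the epoch) to a UTC datetime.

    Returns None for 0 (structural node) and -1 (undated comment).
    """
    if timestamp <= 0:
        return None
    return datetime.fromtimestamp(timestamp * 60, tz=timezone.utc)


def day_to_date(day):
    """Convert the 'day' field to a calendar date (day 1 = January 1st, 2001).

    Returns None for 0 (structural node) and -1 (undated comment).
    """
    if day <= 0:
        return None
    return date(2001, 1, 1) + timedelta(days=day - 1)


def author_kind(author, ip_threshold=IP_THRESHOLD):
    """Classify an 'author' id according to the ranges listed for that field."""
    if author == 0:
        return "structural"   # no author
    if author == -1:
        return "unsigned"     # unsigned comment
    if 0 < author < ip_threshold:
        return "user"         # software-generated user id
    if author > ip_threshold:
        return "ip"           # ip-signed comment
    return "unknown"          # value not covered by the description
```

For example, restore_text("first line\t<LF>\tsecond line") gives the two lines separated by a real newline, and day_to_date(1) gives 2001-01-01.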