discussions
File 'discussions_text.csv': each line represents a comment or a structural element of the discussion, with the associated metadata.

Description of all the fields in the file:

- global_id -> global id of the node in the dataset (each id is unique in the file)
- parent_global_id -> global id of the parent node in the dataset
- id -> local id of the node in the discussion (each id is unique within the article; id 0 represents the article itself)
- parent_id -> local id of the parent node in the discussion
- level -> level of indentation (negative numbers represent structural nodes; level 1 represents comments that are not a reply to another comment)
- article -> id of the article with which the discussion is associated (see the list of ids and corresponding titles in file "article_titles.csv"). Negative numbers represent articles for which we do not know the official id in Wikipedia
- discussion -> id of the talk page in which the comment was written (several talk pages can be associated with the same article)
- timestamp -> UNIX timestamp in minutes: multiply by 60 to get the value in seconds (the standard epoch timestamp). 0 for structural nodes, -1 for undated comments
- day -> day number, starting from day 1 = January 1st, 2001. 0 for structural nodes, -1 for undated comments
- author -> author's id:
  - 0: no author (structural nodes)
  - -1: unsigned comment
  - >0 & <16M: user id generated by the software (not the official Wikipedia user id)
  - >16M: ip-signed comments
- parent_author -> id of the parent comment's author
- author_name -> author's user name
- parent_author_name -> user name of the parent comment's author
- date -> date (string)
- text -> text of the comment. To write one comment per line, newlines (\n) have been replaced with " <LF> " and tabs (\t) with " <TAB> "

Structural nodes: texts delimited by <> represent articles' and talk pages' titles, while texts delimited by "=" represent thread titles. All other texts are comments, from which signature and date have been removed.
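The encoded fields described above can be decoded with a few small helpers. The following is only a minimal sketch in Python, not part of the dataset: the function names are hypothetical, "16M" is assumed to mean 16,000,000, and timestamps are interpreted as UTC. The behaviour for an author id of exactly 16,000,000 is not specified in the description, so the sketch reports it as "unknown". The CSV delimiter is not stated either, so the helpers operate on already-parsed field values.

```python
from datetime import date, datetime, timedelta, timezone

IP_THRESHOLD = 16_000_000  # assumption: "16M" in the description means 16,000,000


def timestamp_to_datetime(minutes):
    """Convert the 'timestamp' field (UNIX time in minutes) to a UTC datetime.

    Returns None for structural nodes (0) and undated comments (-1).
    """
    if minutes <= 0:
        return None
    return datetime.fromtimestamp(minutes * 60, tz=timezone.utc)


def day_to_date(day):
    """Convert the 'day' field to a calendar date (day 1 = January 1st, 2001).

    Returns None for structural nodes (0) and undated comments (-1).
    """
    if day <= 0:
        return None
    return date(2001, 1, 1) + timedelta(days=day - 1)


def author_kind(author_id):
    """Classify the 'author' field according to the ranges in the description."""
    if author_id == 0:
        return "structural"   # no author
    if author_id == -1:
        return "unsigned"
    if 0 < author_id < IP_THRESHOLD:
        return "user"         # id generated by the software
    if author_id > IP_THRESHOLD:
        return "ip"           # ip-signed comment
    return "unknown"          # value not covered by the description


def restore_text(text):
    """Undo the one-comment-per-line encoding of newlines and tabs."""
    return text.replace(" <LF> ", "\n").replace(" <TAB> ", "\t")
```

For example, `day_to_date(1)` gives `date(2001, 1, 1)`, and `restore_text("first line <LF> second line")` gives the two lines separated by a real newline.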