Supplementary materials for: Are online news comments like face-to-face conversation? A multi-dimensional analysis of an emerging register (version 1.0)
https://zenodo.org/badge/latestdoi/224676981
This repository contains the original data, script, and additional statistical material for conducting a multi-dimensional analysis of online news comments vis-à-vis traditional English registers, as described in:
- Ehret, Katharina, and Maite Taboada. (2020). “Are online news comments like face-to-face conversation? A multi-dimensional analysis of an emerging register”. Register Studies, 2 (1): 1-36. DOI: 10.1075/rs.19012.ehr
This publication focuses on the question of whether online news comments are like face-to-face conversation or not. It is a widespread view that online comments are like "dialogue", with comments often being referred to as "conversations". These assumptions, however, lack empirical back-up. In order to answer this question, register-relevant properties of online news comments are systematically explored using multi-dimensional analysis (MDA) techniques. Specifically, MDA is used to establish what online comments are like by describing their linguistic features and comparing them to traditional registers (e.g. face-to-face conversation, academic writing).
The dataset published in this repository is based on comments and opinion articles from the Simon Fraser University Opinion and Comments Corpus (SOCC) as well as the Canadian component of the International Corpus of English (ICE). The original corpus data was annotated with part-of-speech tags using the Multidimensional Analysis Tagger, and the tags were retrieved with a custom-made Python script (available from https://github.com/sfu-discourse-lab/MDA_project). A detailed description of the data (henceforth called the SOCC-ICE dataset) and of the statistical procedure, i.e. factor analysis, can be found in the related publication.
This CSV file contains the unrotated eigenvalues, which were calculated from the correlation matrix of the normalised feature frequencies.
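As a rough illustration of what unrotated eigenvalues of a correlation matrix are, the following minimal sketch computes them for a small made-up matrix. It assumes NumPy is installed; the function name `unrotated_eigenvalues` and the toy numbers are hypothetical and are not taken from the R script in this repository, which remains the authoritative implementation.

```python
import numpy as np


def unrotated_eigenvalues(freqs):
    """Return the eigenvalues of the feature correlation matrix, largest first.

    `freqs` is a texts-by-features array of normalised frequencies
    (rows are texts, columns are features).
    """
    corr = np.corrcoef(freqs, rowvar=False)  # correlate columns (features)
    eigvals = np.linalg.eigvalsh(corr)       # symmetric input, ascending order
    return eigvals[::-1]                     # reverse to descending order


# Toy data only: 5 texts x 3 features, invented for illustration.
toy = np.array([
    [12.0, 3.5, 40.1],
    [10.2, 4.1, 38.0],
    [15.3, 2.2, 45.9],
    [9.8, 5.0, 36.5],
    [13.1, 3.0, 41.7],
])
eigenvalues = unrotated_eigenvalues(toy)
print(eigenvalues)
```

Because the diagonal of a correlation matrix consists of ones, the eigenvalues sum to the number of features, which is a quick sanity check on any such output.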
This CSV file contains the factor scores of a six-factor solution for each of the 660 individual texts in the SOCC-ICE dataset. The first column contains the file names.
This CSV file contains the normalised frequencies of 67 lexico-grammatical features (see POS-tag_description) which were automatically retrieved from each text in the SOCC-ICE dataset. The frequencies were normalised per 1000 word tokens. This file serves as input for the factor analysis.
This CSV file contains the raw frequencies of 67 lexico-grammatical features (see POS-tag_description) which were automatically retrieved from each text in the SOCC-ICE dataset.
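The relationship between raw and normalised frequencies can be sketched as follows. This is a minimal illustration of normalisation per 1000 word tokens; the function name, the tags `TAG_A` and `TAG_B`, and the token count are hypothetical. It also assumes that the token count of each text is known, which the raw-frequency file may or may not provide.

```python
def normalise_per_1000(raw_counts, n_tokens):
    """Convert raw feature counts for one text to frequencies per 1000 tokens.

    `raw_counts` maps a feature tag to its raw count in the text;
    `n_tokens` is the number of word tokens in that text.
    """
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return {tag: count * 1000 / n_tokens for tag, count in raw_counts.items()}


# Hypothetical text with 400 word tokens and two made-up feature tags.
toy_counts = {"TAG_A": 12, "TAG_B": 3}
normalised = normalise_per_1000(toy_counts, 400)
print(normalised)  # {'TAG_A': 30.0, 'TAG_B': 7.5}
```

Normalising in this way makes texts of different lengths comparable, which is why the normalised rather than the raw frequencies are used as input for the factor analysis.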
The loadings of each feature on each factor in the six-factor solution. The first column lists the feature tags (see POS-tag_description.csv for a description of the features).
A CSV file containing the mean factor scores for each register in the SOCC-ICE dataset. The mean factor scores were calculated from the factor scores of the 660 individual texts in the dataset. A mean factor score is the average across all texts belonging to one register (see texts_per_register.csv).
The standard deviations accompanying the mean factor scores per register.
A list of 67 lexico-grammatical features. The first column provides the tag, the second column gives a brief description of the feature.
The R script containing the commands for conducting all statistics and the factor analysis described in the related publication.
A list of all individual texts in the SOCC-ICE dataset (first column) and information on their register (second column) and mode of communication (third column). This file is necessary for calculating mean factor scores per register.
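The way per-text factor scores and the register list combine into per-register means and standard deviations can be sketched as follows. This is a minimal illustration using only the Python standard library. The file names, register labels, and scores are invented, the function name `summarise_by_register` is hypothetical, and the use of the sample standard deviation is an assumption rather than something stated in this repository; the R script remains the authoritative implementation.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarise_by_register(factor_scores, registers):
    """Return {register: (means, sds)} with one value per factor.

    `factor_scores` maps a file name to its list of factor scores;
    `registers` maps the same file names to a register label.
    """
    grouped = defaultdict(list)
    for file_name, scores in factor_scores.items():
        grouped[registers[file_name]].append(scores)

    summary = {}
    for register, rows in grouped.items():
        columns = list(zip(*rows))  # one tuple of scores per factor
        means = [mean(col) for col in columns]
        # Sample standard deviation (assumption); undefined for a single text.
        sds = [stdev(col) if len(col) > 1 else float("nan") for col in columns]
        summary[register] = (means, sds)
    return summary


# Invented example: three texts, two factors, two registers.
toy_scores = {
    "text_001": [1.0, -0.5],
    "text_002": [3.0, 0.5],
    "text_003": [-2.0, 4.0],
}
toy_registers = {
    "text_001": "comments",
    "text_002": "comments",
    "text_003": "conversation",
}
summary = summarise_by_register(toy_scores, toy_registers)
print(summary["comments"][0])  # [2.0, 0.0]
```

A register represented by a single text has no defined sample standard deviation, so the sketch returns NaN in that case rather than raising an error.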