ProtoQA Dataset

This repository contains the dataset for ProtoQA ("Family Feud"). See the paper for details on dataset creation.

Data Files:

Each line is a json dictionary, in which:

question contains the question (in original and a normalized form)
answers (where available) contains:
- raw original answers provided by survey respondents (when available) with their counts
- clusters which include the score for each cluster and the strings included in that cluster

For a full description of the data format, see DATAFORMAT.md.

data/train/train.jsonl: 8781 instances for training or fine-tuning scraped from Family Feud fan sites (see paper). Scraped data has answer clusters with sizes, but only has a single string per cluster (corresponding to the original cluster name.
data/dev/dev.scraped.jsonl: 979 instances sampled from the same Family Feud data, for use in model validation and development.
data/dev/dev.crowdsourced.jsonl: 51 questions collected with exhaustive answer collection and manual clustering, matching the details of the eval test set (roughly 100 human answers per question).
data/test/test.questions.jsonl 102 questions for evaluation. (Note that the test set contains questions only.)

This repository contains a data statement (based on Datasheets for Datasets (Gebru et al. 2020) and earlier NLP-specific work (Bender and Friedman 2018)) to provide transparency in data use and encourage others to do so. This is a preliminary version of the statement; please post issues in the repository or contact the authors if you have questions regarding the data details or suggestions regarding the dataset use.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
data		data
.gitignore		.gitignore
DATAFORMAT.md		DATAFORMAT.md
DATASTATEMENT.md		DATASTATEMENT.md
LICENSE		LICENSE
README.md		README.md