This repository contains chapters (randomly sampled) from novels. The novels themselves are randomly sampled (without replacement) from two populations:
- Novels published in the British Isles for the first time between 1837 and 1901 (Random Canon). The sampling frame is restricted to novels for which there exist page images of the first edition as of 2018-12-31 11:59:59 UTC. Internet Archive, Google Books/HathiTrust, and the British Library are the only sources which are checked for page images.
- Novels published in the British Isles for the first time between 1837 and 1901 which are still available ("in print") in 2017 from Penguin, Oxford, or Broadview (Reprint Canon).
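
Sampling without replacement means that no novel can be drawn twice. The following is a minimal sketch of that idea only, not the procedure actually used to build the samples (the data behind that is in supplementary-materials). The function name `sample_titles`, the seed, and the placeholder ids are assumptions made for illustration.

```python
import random


def sample_titles(frame, k, seed=None):
    """Draw k distinct titles from a sampling frame (no replacement)."""
    rng = random.Random(seed)  # a seeded generator makes the draw repeatable
    return rng.sample(list(frame), k)


# Hypothetical sampling frame: placeholder ids, not real ATCL title ids.
frame = [f"title-{i:03d}" for i in range(1, 201)]
chosen = sample_titles(frame, k=10, seed=1837)
```

Because `random.Random.sample` never returns the same element twice, `chosen` contains ten distinct titles from the frame.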
The encoding procedure used is described in `novel-encoding-instructions.md`.

- `random_canon.csv` contains metadata about the Random Canon titles.
- `reprint_canon.csv` contains metadata about the Reprint Canon titles.
- `other_novels.csv` contains metadata about texts included in the repository incidentally. These are not part of either Canon. These novels survive but there are no page scans of the first edition as of the end of 2018.
- `scripts/quality_control_checks.py` checks to see if a text has been properly encoded.
- `supplementary-materials` contains data used to construct the random samples.
- `texts` contains chapters. Filenames begin with ATCL title ids. Texts are encoded using HTML5.
- `novel-encoding-instructions.md` describes the encoding process.
- `adding-an-encoded-novel.md` describes how to add a novel to this repository.
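
Since the chapters are encoded as HTML5, most analyses will first need plain text. The sketch below uses only the Python standard library and makes no assumption about which elements the encoding instructions prescribe. The class name `ChapterText`, the function `chapter_to_text`, the list of block-level tags, and the sample markup are all hypothetical and not taken from the repository.

```python
from html.parser import HTMLParser

# Tags whose contents are not part of the running text.
SKIPPED = {"head", "script", "style"}
# Block-level tags after which a break is inserted (an assumed, partial list).
BLOCKS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "li"}


class ChapterText(HTMLParser):
    """Collect the text of an HTML5 document, ignoring the head."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCKS:
            self.parts.append("\n")  # keep adjacent blocks from running together

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def chapter_to_text(html):
    """Return the text of one chapter with whitespace collapsed to single spaces."""
    parser = ChapterText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Invented sample markup, not a file from the texts directory.
sample = (
    "<!DOCTYPE html><html><head><title>Sample</title></head>"
    "<body><h1>Chapter I</h1><p>It was a <em>dark</em> night.</p></body></html>"
)
text = chapter_to_text(sample)
```

To process a real chapter, read a file from `texts` as UTF-8 and pass its contents to `chapter_to_text`.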
- Curly quotation marks (‘…’ and “…”) are not always entered correctly. Analyses should not require distinguishing between, say, “ and ”. Counts of punctuation marks per sentence or paragraph, however, should be reliable.
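
One way to avoid depending on the direction of a quotation mark is to fold opening and closing marks into a single character before analysis. This is a minimal sketch and not part of the repository's scripts; the names `QUOTE_MAP` and `fold_quotes` are assumptions. The mapping is one-to-one per character, so punctuation counts are unchanged.

```python
# Map each curly quotation mark to its straight equivalent.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also used as an apostrophe)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def fold_quotes(text):
    """Replace curly quotation marks with straight ones, one character for one."""
    return text.translate(QUOTE_MAP)


original = "\u201cYes,\u201d she said. \u2018No.\u2019"
folded = fold_quotes(original)
```

After folding, a misentered opening or closing mark no longer affects the result, while the number of quotation marks per sentence or paragraph stays the same.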
All texts in this repository are in the public domain.