This project is an attempt to convert the annotations compiled by the Tang dynasty scholar Lu Deming (陸德明) in the Jingdian Shiwen (经典释文) into a structured form that separates phonology, glosses, and references to secondary sources. A spaCy pipeline is configured to parse and tag the annotations, and prodigy is used for guided annotation of the training data. The project is part of a broader effort to build a linguistic model of Old Chinese (上古漢語) that incorporates phonology.
The Jingdian Shiwen comprises Lu's annotations on most of the "Thirteen Classics" (十三經) of the Confucian tradition, as well as some Daoist texts. We use the edition of the Jingdian Shiwen found in the Collectanea of the Four Categories (四部叢刊), which includes high-quality lithographic reproductions of many ancient texts. The annotations given in the Jingdian Shiwen are paired with the source texts to which they apply; for this we predominantly use the definitive (正文) editions published by the Kanseki Repository.
work | title | source | Jingdian Shiwen chapters (卷) |
---|---|---|---|
周易 | Book of Changes | KR1a0001 | 2 |
尚書 | Book of Documents | KR1b0001 | 3-4 |
毛詩 | Mao Commentary on the Book of Odes | KR1c0001 | 5-7 |
周禮 | Rites of Zhou | KR1d0001 | 8-9 |
儀禮 | Etiquette and Ceremonial | CH1e0873* | 10 |
禮記 | Book of Rites | KR1d0052 | 11-14 |
春秋左傳 | Commentary of Zuo on the Spring and Autumn Annals | KR1e0001 | 15-20 |
春秋公羊傳 | Commentary of Gongyang on the Spring and Autumn Annals | CH1e0877* | 21 |
春秋穀梁傳 | Commentary of Guliang on the Spring and Autumn Annals | KR1e0008 | 22 |
孝經 | Classic of Filial Piety | KR1f0001 | 23 |
論語 | Analects of Confucius | KR1h0004 | 24 |
老子 | Laozi | KR5c0057 | 25 |
莊子 | Zhuangzi | KR5c0126 | 26-28 |
*This data is sourced with permission from the China Ancient Texts (CHANT) database.
We omit chapter 1 of the Jingdian Shiwen (the preface, 序錄) and chapters 29-30, which correspond to the Erya (爾雅). All digital sources have been preprocessed to remove punctuation, whitespace, and non-Chinese characters. Kanseki Repository data is generously licensed CC-BY.
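The preprocessing step described above could look roughly like the following. This is a minimal sketch, not the project's actual script: the function name `keep_chinese` is hypothetical, and the choice of Unicode ranges that count as "Chinese characters" is an assumption.

```python
import re

# Assumption: "Chinese characters" means CJK Unified Ideographs, Extension A,
# the CJK Compatibility Ideographs, and the supplementary-plane extensions.
# Everything outside these ranges (punctuation, whitespace, Latin, digits)
# is matched by the negated character class and removed.
_NON_CJK = re.compile(
    "[^\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\U00020000-\U0003134f]"
)


def keep_chinese(text: str) -> str:
    """Drop punctuation, whitespace, and any other non-CJK character."""
    return _NON_CJK.sub("", text)


print(keep_chinese("子曰:「學而時習之,不亦說乎?」 (1.1)"))
# prints: 子曰學而時習之不亦說乎
```

Whitelisting ideograph ranges, rather than blacklisting punctuation, means stray markup or editorial symbols in the digital sources are also stripped.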
After processing, the labeled output data is saved in JSON-lines (.jsonl) format, to be used for machine learning, natural language processing, and other computational applications.
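As an illustration of working with JSON-lines output, the sketch below writes and reads one record per line. The record layout is an assumption: the `text` and `spans` fields follow the general shape of span annotations, but the label names `HEADWORD` and `PHONOLOGY` and the file name are hypothetical, not the project's actual schema.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line, keeping Chinese characters unescaped."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON-lines file into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical record: an annotation with character-offset spans.
example = {
    "text": "說音悅",
    "spans": [
        {"start": 0, "end": 1, "label": "HEADWORD"},
        {"start": 1, "end": 3, "label": "PHONOLOGY"},
    ],
}

# Round-trip the record through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "annotations.jsonl"
    write_jsonl(out, [example])
    assert read_jsonl(out) == [example]
```

Passing `ensure_ascii=False` keeps the Chinese text readable in the file instead of being written as `\uXXXX` escapes.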
To annotate training data, you need to have spacy installed in your python environment:
pip install spacy
You also need a copy of prodigy. Once you have the appropriate wheel, install it with:
# example: prodigy version 1.11.8 for python 3.10 on windows
pip install prodigy-1.11.8-cp310-cp310-win_amd64.whl
Then, verify the project assets are downloaded:
spacy project assets
Install python dependencies needed for annotation:
spacy project run install
Then, choose a task (see "commands" below). Invoke it with e.g.:
# annotate data by correcting predictions
spacy project run annotate-spans
The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.
The following commands are defined by the project. They can be executed using spacy project run [name]. Commands are only re-run if their inputs have changed.
Command | Description |
---|---|
install | Install dependencies |
annotate-spans | Annotate spans by correcting predictions based on heuristics |
export | Export training data from prodigy's database for use with spaCy |
train | Train a spaCy pipeline |
The following assets are defined by the project. They can be fetched by running spacy project assets in the project directory.
File | Source | Description |
---|---|---|
assets/docs.csv | Local | Table mapping each chapter in a source text to its location in the Jingdian Shiwen |
assets/variants.json | Local | Equivalency table for graphic variants of characters |
assets/treebank | Git | Universal Dependencies treebank for Classical Chinese |
Parameter | Description |
---|---|
embedding | Choose an embedding layer implementation (spaCy's Tok2Vec or Transformer) |
suggester | Choose between two span suggester architectures (SpanFinder, Ngram) |
tranformer_model_name | Choose a transformer model from HuggingFace (if using Transformer as the embedding layer) |
gpu_id | Choose whether you want to use your GPU (device number) or CPU (-1) |