Researcher: Wen Wang (Visiting PhD Scholar, Harvard University)
Affiliation: China Biographical Database Project (CBDB)
This project aims to develop a structured and reproducible workflow for extracting prosopographical data from premodern Chinese biographical texts.
The primary goal is to transform unstructured narrative biographies into structured data compatible with the core relational tables of the China Biographical Database (CBDB), including personal information, social relations, offices, kinship relations, and textual records.
The project emphasizes methodological transparency, data traceability, and strict source fidelity.
This repository focuses on:
- Prosopographical data extraction from historical Chinese biographical texts
- Schema-driven information modeling aligned with CBDB data structures
- Evaluation of large language models (LLMs) for structured historical data extraction
- Documentation of extraction accuracy, ambiguity, and edge cases
The project does not aim to replace existing CBDB editorial workflows, but rather to explore computationally assisted methods that may support future data expansion and quality control.
Current sources under investigation include:
- 《中国文学家大辞典·唐五代卷》
- Selected premodern Chinese biographical and literary reference works
All extracted data are strictly derived from the original texts.
No external historical knowledge is introduced during the extraction process.
The project follows several core principles:
- Source-bound extraction: all extracted information must be explicitly attested in the given text.
- Schema-first modeling: extraction is guided by predefined field schemas aligned with CBDB tables.
- No inferential completion: missing information is not inferred or supplemented using external knowledge.
- Explicit uncertainty marking: ambiguous or unclear expressions are documented rather than normalized.
- Reproducibility: all prompts, schemas, and scripts are documented to ensure repeatability.
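The source-bound and uncertainty-marking principles can be illustrated with a minimal Python sketch. It assumes each extracted field carries a verbatim evidence quotation, and it flags any field whose evidence does not occur in the source text. The class `ExtractedField`, the function `check_source_bound`, the field names, and the sample sentence are all hypothetical and invented for illustration; they are not taken from the project's schemas or from the dictionary.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    """One extracted value plus the verbatim source span that attests it."""
    name: str                # hypothetical field name, not a CBDB column
    value: str               # the value as extracted
    evidence: str            # verbatim quotation from the source text
    uncertain: bool = False  # True when the source wording is ambiguous
    note: str = ""           # description of the ambiguity, if any


def check_source_bound(source_text: str, fields: list[ExtractedField]) -> list[str]:
    """Return the names of fields whose evidence is missing or not found verbatim."""
    return [
        f.name
        for f in fields
        if not f.evidence or f.evidence not in source_text
    ]


# Invented sample sentence, used only to exercise the check.
SAMPLE_TEXT = "王某，字某之，太原人。開元中進士及第。"

sample_fields = [
    ExtractedField("name", "王某", "王某"),
    ExtractedField("native_place", "太原", "太原人"),
    ExtractedField(
        "exam_date", "開元中", "開元中進士及第",
        uncertain=True, note="reign period only; no exact year in the text",
    ),
    # A value with no attesting quotation, as an inferred value would be.
    ExtractedField("death_year", "756", ""),
]

unattested = check_source_bound(SAMPLE_TEXT, sample_fields)
print(unattested)
```

In this sketch the ambiguous date is kept as written and flagged instead of being converted to a year, while the unattested `death_year` is reported so it can be removed before review.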
The extraction framework is designed to align with the following CBDB tables:
- BIOG_MAIN
- ENTRY_DATA
- ASSOC_DATA
- KIN_DATA
- POSTED_TO_OFFICE_DATA
- BIOG_TEXT_DATA
Mapping details and field definitions are documented in the methodology files.
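As a rough illustration of schema-first output, the sketch below writes extracted rows to CSV against a fixed column list per target table. Only the table names come from CBDB; the column names (`person_ref`, `source_quote`, and so on) and the function `rows_to_csv` are placeholders assumed for this example and are not the field definitions documented in the methodology files.

```python
import csv
import io

# Table names are CBDB tables; the column lists are illustrative placeholders,
# NOT the actual CBDB field definitions.
TARGET_TABLES = {
    "BIOG_MAIN": ["person_ref", "name", "source_quote"],
    "ENTRY_DATA": ["person_ref", "entry_type", "source_quote"],
    "ASSOC_DATA": ["person_ref", "assoc_ref", "assoc_type", "source_quote"],
    "KIN_DATA": ["person_ref", "kin_ref", "kin_term", "source_quote"],
    "POSTED_TO_OFFICE_DATA": ["person_ref", "office", "source_quote"],
    "BIOG_TEXT_DATA": ["person_ref", "title", "source_quote"],
}


def rows_to_csv(table: str, rows: list[dict]) -> str:
    """Serialize rows for one table; unknown fields raise, missing fields stay empty."""
    columns = TARGET_TABLES[table]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        restval="",              # missing values are left blank, never filled in
        extrasaction="raise",    # fields outside the schema are rejected
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv("BIOG_MAIN", [{"person_ref": "p1", "name": "王某"}])
print(csv_text)
```

Rejecting unknown fields keeps the output tied to the predefined schema, and leaving absent values blank mirrors the rule against inferential completion.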
The repository is organized as follows:

    cbdb-prosopographical-extraction/
    │
    ├── data/
    │   ├── raw/                # Original source texts
    │   ├── cleaned/            # Preprocessed texts
    │   └── extracted/          # Structured extraction results
    │
    ├── scripts/
    │   ├── prompts/            # Extraction prompts and few-shot templates
    │   ├── python/             # Processing scripts
    │   └── sql/                # CBDB-related SQL queries
    │
    ├── documentation/
    │   ├── methodology.md      # Extraction framework and design rationale
    │   ├── data_schema.md      # Field definitions and mappings
    │   └── progress_log.md     # Research progress log
    │
    └── outputs/
        ├── csv/
        └── evaluation/
The project proceeds through the following stages:
- Repository initialization
- Definition of extraction scope
- Field schema formalization
- Prompt framework construction
- Batch extraction experiments
- Accuracy evaluation and error analysis
Progress updates are recorded in documentation/progress_log.md.
Intended outputs of the project include:
- A documented prosopographical extraction workflow
- Reusable prompt and schema templates
- Empirical evaluation of LLM-assisted extraction performance
- Structured datasets suitable for CBDB review and discussion
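One way to quantify extraction performance is field-level precision, recall, and F1 against a manually annotated gold set. The sketch below assumes exact-match scoring over `(person, field, value)` triples; the choice of metric, the function `field_level_scores`, and the sample triples are assumptions made for illustration and do not describe the project's actual evaluation protocol.

```python
Triple = tuple[str, str, str]  # (person reference, field name, value)


def field_level_scores(gold: set[Triple], predicted: set[Triple]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 over extracted triples."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented triples, used only to exercise the function.
gold_triples = {
    ("p1", "name", "王某"),
    ("p1", "native_place", "太原"),
    ("p1", "exam_date", "開元中"),
    ("p1", "courtesy_name", "某之"),
}
predicted_triples = {
    ("p1", "name", "王某"),
    ("p1", "native_place", "太原"),
    ("p1", "exam_date", "開元中"),
    ("p1", "death_year", "756"),  # not in the gold set: counts against precision
}

scores = field_level_scores(gold_triples, predicted_triples)
print(scores)
```

Exact matching is strict: a value that differs from the gold annotation only in normalization counts as an error, so such cases may be better logged separately as ambiguities than folded into the score.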
Wen Wang
Visiting Scholar, Harvard University
China Biographical Database Project (CBDB)