CBDB Prosopographical Extraction Project

Researcher: Wen Wang (Visiting PhD Scholar, Harvard University)
Affiliation: China Biographical Database Project (CBDB)

Project Overview

This project aims to develop a structured and reproducible workflow for extracting prosopographical data from premodern Chinese biographical texts.

The primary goal is to transform unstructured narrative biographies into structured data compatible with the core relational tables of the China Biographical Database (CBDB), including personal information, social relations, offices, kinship relations, and textual records.

The project emphasizes methodological transparency, data traceability, and strict source fidelity.

Research Scope

This repository focuses on:

Prosopographical data extraction from historical Chinese biographical texts
Schema-driven information modeling aligned with CBDB data structures
Evaluation of large language models (LLMs) for structured historical data extraction
Documentation of extraction accuracy, ambiguity, and edge cases

The project does not aim to replace existing CBDB editorial workflows, but rather to explore computationally assisted methods that may support future data expansion and quality control.
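One way to make the evaluation of LLM extraction concrete is to compare predicted (field, value) pairs against a manually annotated gold set. The sketch below is a minimal illustration, not the project's actual evaluation script. The function name `score`, the field names, and the sample pairs are assumptions introduced for this example.

```python
def score(predicted, gold):
    """Compute precision, recall, and F1 over sets of (field, value) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)  # pairs that match the gold annotation exactly
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example pairs; field names are illustrative only.
gold = {("name", "李白"), ("courtesy_name", "太白")}
predicted = {("name", "李白"), ("native_place", "蜀")}
result = score(predicted, gold)
print(result)
```

Exact matching is deliberately strict. Partial matches and ambiguous cases would need to be logged separately as edge cases rather than counted as correct.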

Primary Sources

Current sources under investigation include:

《中国文学家大辞典·唐五代卷》
Selected premodern Chinese biographical and literary reference works

All extracted data are strictly derived from the original texts.
No external historical knowledge is introduced during the extraction process.

Methodological Principles

The project follows several core principles:

Source-bound extraction: All extracted information must be explicitly attested in the given text.
Schema-first modeling: Extraction is guided by predefined field schemas aligned with CBDB tables.
No inferential completion: Missing information is not inferred or supplemented using external knowledge.
Explicit uncertainty marking: Ambiguous or unclear expressions are documented rather than normalized.
Reproducibility: All prompts, schemas, and scripts are documented to ensure repeatability.
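The principles above can be enforced mechanically on each extracted item. The following is a minimal sketch under stated assumptions: the `ExtractedField` record, the `ALLOWED_FIELDS` set, and the `validate` function are hypothetical names invented for illustration, and the field names are not CBDB column names. The sample sentence is a constructed example, not a quotation from the project's sources.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical schema: illustrative field names, not CBDB columns.
ALLOWED_FIELDS = {"name", "courtesy_name", "native_place", "office", "kin_relation"}


@dataclass
class ExtractedField:
    field: str                       # must appear in ALLOWED_FIELDS (schema-first)
    value: Optional[str]             # None when the text does not state it (no inference)
    evidence: Optional[str] = None   # verbatim span copied from the source text
    uncertain: bool = False          # True when the source wording is ambiguous
    note: str = ""                   # description of the ambiguity, if any


def validate(item: ExtractedField, source_text: str) -> list[str]:
    """Return a list of principle violations; an empty list means the item passes."""
    problems = []
    if item.field not in ALLOWED_FIELDS:
        problems.append(f"unknown field: {item.field}")
    if item.value is not None:
        if not item.evidence:
            problems.append("value given without evidence span")
        elif item.evidence not in source_text:
            problems.append("evidence span not found in source text")
    if item.uncertain and not item.note:
        problems.append("uncertain item lacks a note")
    return problems


# Constructed sample sentence for illustration only.
text = "李白，字太白。"
attested = ExtractedField("courtesy_name", "太白", evidence="字太白")
unattested = ExtractedField("native_place", "隴西", evidence="隴西人")
print(validate(attested, text))    # []
print(validate(unattested, text))  # ['evidence span not found in source text']
```

Requiring a verbatim evidence span for every non-empty value is one simple way to keep extraction traceable: an item whose evidence cannot be located in the source is rejected rather than silently accepted.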

Target CBDB Tables

The extraction framework is designed to align with the following CBDB tables:

BIOG_MAIN
ENTRY_DATA
ASSOC_DATA
KIN_DATA
POSTED_TO_OFFICE_DATA
BIOG_TEXT_DATA

Mapping details and field definitions are documented in documentation/methodology.md and documentation/data_schema.md.
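As a rough illustration of how extracted records might be routed to the tables listed above, the sketch below groups records by target table and serializes each group to CSV text. Only the table names come from this README. The record layout (`table` and `fields` keys), the field names, and the helper functions `group_by_table` and `to_csv` are assumptions made for this example.

```python
import csv
import io
from collections import defaultdict

# Table names as listed in this README.
TARGET_TABLES = {
    "BIOG_MAIN", "ENTRY_DATA", "ASSOC_DATA",
    "KIN_DATA", "POSTED_TO_OFFICE_DATA", "BIOG_TEXT_DATA",
}


def group_by_table(records):
    """Group extracted records by target table, rejecting unknown table names."""
    grouped = defaultdict(list)
    for rec in records:
        table = rec["table"]
        if table not in TARGET_TABLES:
            raise ValueError(f"not a target table: {table}")
        grouped[table].append(rec["fields"])
    return dict(grouped)


def to_csv(rows):
    """Serialize one table's rows to CSV text; absent values stay empty."""
    columns = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records; field names are placeholders, not CBDB columns.
records = [
    {"table": "BIOG_MAIN", "fields": {"name": "李白"}},
    {"table": "BIOG_MAIN", "fields": {"name": "杜甫", "courtesy_name": "子美"}},
    {"table": "KIN_DATA", "fields": {"person": "杜甫", "kin_relation": "祖"}},
]
grouped = group_by_table(records)
print(to_csv(grouped["BIOG_MAIN"]))
```

Leaving absent values as empty cells, rather than filling defaults, keeps the output consistent with the rule that missing information is never inferred.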

Repository Structure

cbdb-prosopographical-extraction/
│
├── data/
│   ├── raw/                # Original source texts
│   ├── cleaned/            # Preprocessed texts
│   └── extracted/          # Structured extraction results
│
├── scripts/
│   ├── prompts/            # Extraction prompts and few-shot templates
│   ├── python/             # Processing scripts
│   └── sql/                # CBDB-related SQL queries
│
├── documentation/
│   ├── methodology.md      # Extraction framework and design rationale
│   ├── data_schema.md      # Field definitions and mappings
│   └── progress_log.md     # Research progress log
│
└── outputs/
    ├── csv/
    └── evaluation/

Current Status

Repository initialization
Definition of extraction scope
Field schema formalization
Prompt framework construction
Batch extraction experiments
Accuracy evaluation and error analysis

Progress updates are recorded in documentation/progress_log.md.

Expected Outcomes

A documented prosopographical extraction workflow
Reusable prompt and schema templates
Empirical evaluation of LLM-assisted extraction performance
Structured datasets suitable for CBDB review and discussion

Contact

Wen Wang
Visiting Scholar, Harvard University
China Biographical Database Project (CBDB)
