A research project for extracting structured prosopographical data from premodern Chinese biographical texts for CBDB.

cbdb-project/cbdb-prosopographical-extraction

CBDB Prosopographical Extraction Project

Researcher: Wen Wang (Visiting PhD Scholar, Harvard University)
Affiliation: China Biographical Database Project (CBDB)


Project Overview

This project aims to develop a structured and reproducible workflow for extracting prosopographical data from premodern Chinese biographical texts.

The primary goal is to transform unstructured narrative biographies into structured data compatible with the core relational tables of the China Biographical Database (CBDB), including personal information, social relations, offices, kinship relations, and textual records.

The project emphasizes methodological transparency, data traceability, and strict source fidelity.


Research Scope

This repository focuses on:

  • Prosopographical data extraction from historical Chinese biographical texts
  • Schema-driven information modeling aligned with CBDB data structures
  • Evaluation of large language models (LLMs) for structured historical data extraction
  • Documentation of extraction accuracy, ambiguity, and edge cases

The project does not aim to replace existing CBDB editorial workflows. It explores computationally assisted methods that may support future data expansion and quality control.


Primary Sources

Current sources under investigation include:

  • 《中国文学家大辞典·唐五代卷》
  • Selected premodern Chinese biographical and literary reference works

All extracted data are strictly derived from the original texts.
No external historical knowledge is introduced during the extraction process.


Methodological Principles

The project follows several core principles:

  1. Source-bound extraction
    All extracted information must be explicitly attested in the given text.

  2. Schema-first modeling
    Extraction is guided by predefined field schemas aligned with CBDB tables.

  3. No inferential completion
    Missing information is not inferred or supplemented using external knowledge.

  4. Explicit uncertainty marking
    Ambiguous or unclear expressions are documented rather than normalized.

  5. Reproducibility
    All prompts, schemas, and scripts are documented to ensure repeatability.
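
The principles above can be sketched as a small validation step. This is a minimal illustration, not the project's actual code: the `ExtractedField` class, the `check_field` helper, the field names in `SCHEMA`, and the sample passage are all assumptions introduced here for demonstration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical field names for illustration; not the project's actual schema.
SCHEMA = {"name", "courtesy_name", "native_place"}


@dataclass(frozen=True)
class ExtractedField:
    name: str                       # must be a key defined in SCHEMA (schema-first)
    value: Optional[str]            # None means "not stated in the text" (no inference)
    evidence: Optional[str] = None  # verbatim span from the source supporting the value
    uncertain: bool = False         # True when the source wording is ambiguous
    note: str = ""                  # documents the ambiguity instead of normalizing it


def check_field(field: ExtractedField, source_text: str) -> List[str]:
    """Return a list of principle violations for one extracted field."""
    problems = []
    if field.name not in SCHEMA:
        problems.append("field not in schema")
    if field.value is not None:
        # Source-bound extraction: the evidence must occur verbatim in the text.
        if not field.evidence or field.evidence not in source_text:
            problems.append("value lacks verbatim evidence in source")
    if field.uncertain and not field.note:
        problems.append("uncertain value lacks an explanatory note")
    return problems


# Invented passage, used only to exercise the checks.
source = "李某，字子明，洛陽人。"
ok = ExtractedField("courtesy_name", "子明", evidence="字子明")
missing = ExtractedField("native_place", None)
bad = ExtractedField("native_place", "長安", evidence="長安人")

print(check_field(ok, source))
print(check_field(missing, source))
print(check_field(bad, source))
```

A field left as `None` passes the check, so missing information stays missing rather than being filled in. A value whose evidence cannot be found in the passage is flagged for review.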


Target CBDB Tables

The extraction framework is designed to align with the following CBDB tables:

  • BIOG_MAIN
  • ENTRY_DATA
  • ASSOC_DATA
  • KIN_DATA
  • POSTED_TO_OFFICE_DATA
  • BIOG_TEXT_DATA

Mapping details and field definitions are documented in the methodology files.
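
As a rough sketch of how extracted records might be routed to these tables, the example below groups rows by target table and serializes each group as CSV text. Only the table names come from the list above. The record layout, the column names (`person`, `kin`, `relation`), and the helper functions are hypothetical and do not reflect CBDB's actual column definitions.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# Table names as listed in this README.
CBDB_TABLES = {
    "BIOG_MAIN",
    "ENTRY_DATA",
    "ASSOC_DATA",
    "KIN_DATA",
    "POSTED_TO_OFFICE_DATA",
    "BIOG_TEXT_DATA",
}


def group_by_table(records: List[dict]) -> Dict[str, List[dict]]:
    """Group extracted rows by target table, rejecting unknown table names."""
    grouped = defaultdict(list)
    for record in records:
        table = record["table"]
        if table not in CBDB_TABLES:
            raise ValueError(f"unknown target table: {table}")
        grouped[table].append(record["fields"])
    return dict(grouped)


def to_csv_text(rows: List[dict]) -> str:
    """Serialize rows as CSV; columns are the sorted union of all row keys."""
    columns = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # keys absent from a row are written as empty cells
    return buffer.getvalue()


# Hypothetical records with placeholder values.
records = [
    {"table": "KIN_DATA", "fields": {"person": "A", "kin": "B", "relation": "father"}},
    {"table": "BIOG_MAIN", "fields": {"person": "A"}},
]
grouped = group_by_table(records)
print(to_csv_text(grouped["KIN_DATA"]))
```

Rejecting unknown table names early keeps the output aligned with the schema, so a typo in a prompt template cannot silently create a new table.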


Repository Structure

cbdb-prosopographical-extraction/
│
├── data/
│   ├── raw/                # Original source texts
│   ├── cleaned/            # Preprocessed texts
│   └── extracted/          # Structured extraction results
│
├── scripts/
│   ├── prompts/            # Extraction prompts and few-shot templates
│   ├── python/             # Processing scripts
│   └── sql/                # CBDB-related SQL queries
│
├── documentation/
│   ├── methodology.md      # Extraction framework and design rationale
│   ├── data_schema.md      # Field definitions and mappings
│   └── progress_log.md     # Research progress log
│
└── outputs/
    ├── csv/
    └── evaluation/


Current Status

  • Repository initialization
  • Definition of extraction scope
  • Field schema formalization
  • Prompt framework construction
  • Batch extraction experiments
  • Accuracy evaluation and error analysis

Progress updates are recorded in documentation/progress_log.md.


Expected Outcomes

  • A documented prosopographical extraction workflow
  • Reusable prompt and schema templates
  • Empirical evaluation of LLM-assisted extraction performance
  • Structured datasets suitable for CBDB review and discussion
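
One possible way to quantify extraction performance is field-level precision, recall, and F1 against a manually annotated reference. The sketch below assumes exact-match scoring on (field, value) pairs. The function name, field names, and sample values are illustrative assumptions, not the project's evaluation protocol.

```python
from typing import Dict, Optional


def field_scores(
    gold: Dict[str, Optional[str]], predicted: Dict[str, Optional[str]]
) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over non-empty (field, value) pairs."""
    gold_items = {(k, v) for k, v in gold.items() if v is not None}
    pred_items = {(k, v) for k, v in predicted.items() if v is not None}
    correct = len(gold_items & pred_items)
    precision = correct / len(pred_items) if pred_items else 0.0
    recall = correct / len(gold_items) if gold_items else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented reference and model output for a single biography.
gold = {"name": "李某", "courtesy_name": "子明", "native_place": "洛陽"}
predicted = {
    "name": "李某",
    "courtesy_name": "子明",
    "native_place": "長安",  # wrong value: counts against precision and recall
    "office": "刺史",        # not in the reference: counts against precision
}
scores = field_scores(gold, predicted)
print(scores)
```

Exact matching is deliberately strict. A value that was inferred or normalized rather than copied from the text counts as an error, which is consistent with the source-fidelity principle.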

Contact

Wen Wang
Visiting Scholar, Harvard University
China Biographical Database Project (CBDB)
