Hosts the code and data to support LLM-based subject indexing of the TIB technical library's collection of traffic reports

jd-coderepos/traffic-llms4subjects

Data preprocessing scripts

  1. Read the raw data and extract the individual records.

scripts/traffic-records-xml-parser.py --input raw-data-folder --output data
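
As a rough illustration of what "extracting the individual records" can look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not the repository's parser: the element name `record`, the function `extract_records`, and the sample XML are assumptions made for this example, and the real raw data may use a different schema.

```python
import xml.etree.ElementTree as ET


def extract_records(xml_text, record_tag="record"):
    """Return every <record_tag> element in xml_text as its own XML string.

    `record_tag` is a hypothetical element name; adjust it to the real schema.
    """
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so records are found at any nesting depth
    return [ET.tostring(elem, encoding="unicode") for elem in root.iter(record_tag)]


# Invented sample input, for illustration only
sample = (
    "<collection>"
    "<record><title>A</title></record>"
    "<record><title>B</title></record>"
    "</collection>"
)
records = extract_records(sample)
print(len(records))
```

Each returned string could then be written to its own file in the output folder.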

Data statistics generation scripts

  1. Read the individual records and print the counts for each unique document genre/type combination.

scripts/print-document-type-stats.py --input data
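
The counting step can be sketched with `collections.Counter`. This assumes each record has already been reduced to a dict with `genre` and `type` keys; those key names, the function `count_genre_type`, and the sample values are hypothetical and not taken from the repository.

```python
from collections import Counter


def count_genre_type(records):
    """Count how often each (genre, type) pair occurs in a list of record dicts."""
    return Counter((r.get("genre"), r.get("type")) for r in records)


# Invented sample records, for illustration only
sample_records = [
    {"genre": "report", "type": "text"},
    {"genre": "report", "type": "text"},
    {"genre": "thesis", "type": "text"},
]
combo_counts = count_genre_type(sample_records)
for (genre, doc_type), n in combo_counts.most_common():
    print(f"{genre}\t{doc_type}\t{n}")
```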

  1. Read the individual records and write the unique GND subjects with their occurrence counts to the CSV file given by --output.

scripts/count-gnd-subjects.py --input data --output data-stats/gnd_subject_counts.csv
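
A minimal sketch of counting subjects and writing them as CSV, assuming the subjects of each record are already available as a list of strings. The column names `subject` and `count`, the function `write_subject_counts`, and the sample subjects are assumptions for this example; the actual layout of `gnd_subject_counts.csv` may differ.

```python
import csv
import io
from collections import Counter


def write_subject_counts(subject_lists, out):
    """Count subjects across all records and write 'subject,count' rows to `out`."""
    counts = Counter(s for subjects in subject_lists for s in subjects)
    writer = csv.writer(out)
    writer.writerow(["subject", "count"])  # hypothetical header
    writer.writerows(counts.most_common())  # most frequent subject first
    return counts


# Invented sample data; an in-memory buffer stands in for the output file
buf = io.StringIO()
subject_counts = write_subject_counts([["Verkehr", "Straße"], ["Verkehr"]], buf)
print(buf.getvalue())
```

When writing to a real file, opening it with `newline=""` and `encoding="utf-8"` keeps the CSV line endings and umlauts intact.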

  1. GND subject mapping and frequency analysis: extract and count the GND subjects from the XML files, map them to the LLMs4Subjects human-readable GND taxonomy to validate the classification, and write separate files for matched and unmatched entries with their occurrence frequencies.

scripts/validate-subjects-in-gnd.py
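
The matched/unmatched split described above can be sketched as a dictionary lookup. This assumes the taxonomy has been loaded into a dict from GND identifier to human-readable label; the function `split_by_taxonomy`, the identifiers, and the label below are placeholders invented for illustration, not values from the LLMs4Subjects taxonomy.

```python
def split_by_taxonomy(subject_counts, taxonomy):
    """Split {gnd_id: count} into matched and unmatched entries.

    matched maps gnd_id -> (label, count); unmatched maps gnd_id -> count.
    """
    matched, unmatched = {}, {}
    for gnd_id, count in subject_counts.items():
        if gnd_id in taxonomy:
            matched[gnd_id] = (taxonomy[gnd_id], count)
        else:
            unmatched[gnd_id] = count
    return matched, unmatched


# Placeholder identifiers and labels, for illustration only
taxonomy = {"gnd:A": "Verkehr"}
counts_by_id = {"gnd:A": 5, "gnd:B": 2}
matched, unmatched = split_by_taxonomy(counts_by_id, taxonomy)
print(matched)
print(unmatched)
```

The two dictionaries could then be written to separate output files, one for matched and one for unmatched subjects.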
