Version 0.1.1
PROV OAI Harvester is a Ruby script designed to harvest metadata records from OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) endpoints. It specifically targets the Public Record Office Victoria (PROV) OAI-PMH endpoint but can be adapted for other OAI-PMH services.
- Harvests metadata records from OAI-PMH endpoints
- Supports resumption tokens for paginated requests
- Saves raw XML responses
- Generates a combined, sorted XML file of all records
- Produces a sorted JSON file of all records
- Can process previously saved XML files instead of making new requests
- Ruby 2.7 or higher
- Nokogiri gem
- OpenURI gem
- Other standard Ruby libraries (uri, json, date, fileutils, optparse)
- Ensure you have Ruby installed on your system.
- Install the required gem:
bundle install
Run the script from the command line:
bundle exec ruby harvest-prov-oai.rb [options]
-s, --save-raw-xml DIR
: Save raw XML to the specified directory.-u, --use-saved-raw-xml DIR
: Use saved XML from the specified directory instead of making new requests.--split
: Split into separate files for agencies, functions and series-h, --help
: Print help information.-v, --version
: Print version information.
To harvest records from the PROV OAI-PMH endpoint:
bundle exec ruby harvest-prov-oai.rb
To process previously saved XML files:
bundle exec ruby harvest-prov-oai.rb -u /path/to/saved/xml/directory
The script generates the following outputs:
- A JSON file containing all harvested records, sorted by identifier.
- An XML file containing all harvested records, sorted by identifier.
Output files are named with the current date, e.g., prov-oai-2024-10-10.json
.
If --save-xml DIR
is specified it will also create that DIR and save the raw XML responses inside it.
To use this script with a different OAI-PMH endpoint, modify the base_url
variable in the script:
base_url = 'http://your-oai-endpoint-url'
This project is open source and is licensed under the Apache License, Version 2.0, (LICENSE or https://www.apache.org/licenses/LICENSE-2.0).