Tools, ideas, and data.
Semantics: EQUELLA objects are items with attachments. Invenio objects are records with files. EQUELLA has taxonomies; Invenio has vocabularies. We use these terms consistently so it's clear what format an object is in (e.g. python migrate/record.py item.json > record.json converts an item into a record).
uv sync # install dependencies; takes a while due to spaCy's en_core_web_lg model
uv run pytest -v migrate/tests.py # run tests
Migrate scripts that create records require an INVENIO_TOKEN or TOKEN variable in our environment or .env file. To create a token: sign in as an admin and go to Applications > Personal access tokens.
Invenio uses vocabularies to represent a number of fixtures beyond just subject headings, like names, description types, and creator roles. They're stored under the app_data directory and loaded when an instance is initialized. Many of our controlled lists in contribution wizards and EQUELLA taxonomies will be mapped to vocabularies.
The taxos dir contains exported EQUELLA taxonomies and tools for working with them. The vocab dir contains YAML files for Invenio vocabularies.
Notable scripts that create Invenio vocabularies:
- taxos/users.py creates the names.yaml and users.yaml fixtures
- taxos/roles.py creates the Invenio relator vocabularies creatorsroles and contributorsroles in a file named roles.yaml
We create two subject vocabularies: one for Library of Congress subjects with URIs from one of their authorities and one for CCA local subjects not present in any LC authority.
Download our subjects sheet and run python migrate/mk_subjects.py data/subjects.csv. This creates the YAML vocabularies in the vocab dir (lc.yaml and cca_local.yaml) as well as migrate/subjects_map.json, which is used to convert the text of VAULT subject terms into Invenio identifiers or ID-less keyword subjects.
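A minimal sketch of how such a map could be applied. It assumes subjects_map.json is a flat JSON object mapping term text to a vocabulary id; the real file's structure may differ. The function name and the sample id are hypothetical.

```python
def to_invenio_subject(term: str, subjects_map: dict[str, str]) -> dict[str, str]:
    """Convert a VAULT subject term into an Invenio subject entry."""
    key = term.strip()
    if key in subjects_map:
        # Term exists in a loaded vocabulary: reference it by identifier
        return {"id": subjects_map[key]}
    # No match: keep the text as an ID-less keyword subject
    return {"subject": key}


# Hypothetical excerpt of migrate/subjects_map.json (assumed shape: term -> id)
sample_map = {"Art schools": "https://example.org/subjects/art-schools"}

print(to_invenio_subject("Art schools", sample_map))
print(to_invenio_subject("Zines", sample_map))
```

Terms found in the map become id references, which Invenio resolves against the loaded vocabulary; everything else stays as free text.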
Copy the YAML vocabularies into the app_data/vocabularies directory of our Invenio instance. The site needs to be rebuilt to load the changes (invenio-cli services destroy and then invenio-cli services setup again). Eventually (Invenio v12) there will be a CLI command to alter vocabularies without rebuilding the site.
- migrate/record.py: Converts EQUELLA item JSON into Invenio record JSON
- migrate/api.py: Converts an item and POSTs it to Invenio to create a record
- migrate/import.py: Imports an item directory (created by the export tool) with its attachments to Invenio
To use these scripts, we must create a personal access token for an administrator account in Invenio:
- Sign in as an admin
- Go to Applications > Personal access tokens
- Create one (its name and the user:email scope, as of v12, do not matter)
- Copy it to clipboard and Save
- Paste in .env and/or set it as an env var, e.g. set -x INVENIO_TOKEN xyz in fish
Below, we migrate a VAULT item to an Invenio record and post it to Invenio.
set -x INVENIO_TOKEN your_token_here && set -x HOST 127.0.0.1:5000
python migrate/api.py items/item.json # example output below
HTTP 201
https://127.0.0.1:5000/api/records/k7qk8-fqq15/draft
HTTP 202
{"id": "k7qk8-fqq15", "created": "2024-05-31T15:26:17.972009+00:00", ...
https://127.0.0.1:5000/records/k7qk8-fqq15
You can sometimes trip over yourself if the .env file in the project root is loaded and contains an outdated personal access token. If API calls fail with 403 errors, check that the TOKEN or INVENIO_TOKEN variable is set correctly.
Rerunning a "migrate" script with the same input creates a new record; it doesn't update the existing one.
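The 201 and 202 status codes in the example output correspond to the two calls in InvenioRDM's REST API for creating a draft and then publishing it. Below is a standard-library sketch of those two requests. It is not the code in migrate/api.py, and the function names are hypothetical. The requests are only built, not sent.

```python
import json
import urllib.request


def create_draft_request(host: str, token: str, record: dict) -> urllib.request.Request:
    """Build the POST that creates a draft (Invenio answers HTTP 201)."""
    return urllib.request.Request(
        f"https://{host}/api/records",
        data=json.dumps(record).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def publish_request(host: str, token: str, record_id: str) -> urllib.request.Request:
    """Build the POST that publishes a draft (Invenio answers HTTP 202)."""
    return urllib.request.Request(
        f"https://{host}/api/records/{record_id}/draft/actions/publish",
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )
```

Because the first request always creates a fresh draft with a new id, sending the same item twice yields two records. Updating an existing record would require keeping track of the id returned by the first call.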
We could write scripts to directly take an item from EQUELLA using its API, perform a metadata crosswalk, and post it to Invenio. Alternatively, we could work with local copies of items, perhaps created by the equella_scripts collection export tool.
We need to load the necessary fixtures, including user accounts, before adding records to Invenio. For instance, the item owner needs to already be in Invenio before we can add them as owner of a record. If we attempt to load a record with a subject id that doesn't exist yet, we get a 500 error.
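One way to catch the missing-subject case before posting is to compare a record's subject ids against the ids in the loaded vocabularies. This is a sketch only: it assumes known_ids has already been collected from the YAML vocabularies (not shown), and the function name is hypothetical.

```python
def missing_subject_ids(record: dict, known_ids: set[str]) -> list[str]:
    """List subject ids used by a record that are absent from the known ids."""
    subjects = record.get("metadata", {}).get("subjects", [])
    # Keyword subjects ({"subject": "..."}) carry no id and need no fixture
    return [s["id"] for s in subjects if "id" in s and s["id"] not in known_ids]


record = {
    "metadata": {
        "subjects": [
            {"id": "https://example.org/subjects/a"},  # hypothetical ids
            {"id": "https://example.org/subjects/b"},
            {"subject": "free-text keyword"},
        ]
    }
}
print(missing_subject_ids(record, {"https://example.org/subjects/a"}))
```

An empty list means every id-based subject on the record has a matching fixture.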
We download metadata for all items using equella-cli and a script like this:
#!/usr/bin/env fish
set total (eq search -l 1 | jq '.available')
set length 50 # can only download up to 50 at a time
set pages (math floor $total / $length)
for i in (seq 0 $pages)
    set start (math $i x $length)
    echo "Downloading items $start to" (math $start + $length)
    # NOTE: no attachment info, use "--info all" for both attachments & metadata
    eq search -l $length --info metadata --start $start > json/$i.json
end
We can use the item.metadata XML of existing VAULT items for testing. Generally, run python migrate/record.py items/item.json | jq to see the JSON Invenio record. See our crosswalk diagrams.
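For quick experiments, the metadata XML can be pulled out of an item with the standard library. This sketch assumes the item JSON stores its XML as a string under a "metadata" key and that the title lives at mods/titleInfo/title. Both are assumptions, and the sample item is invented.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed item; real VAULT metadata has many more fields
item_json = (
    '{"uuid": "example-uuid", "metadata": '
    '"<xml><mods><titleInfo><title>Sample Title</title></titleInfo></mods></xml>"}'
)

item = json.loads(item_json)
root = ET.fromstring(item["metadata"])  # the XML is a string inside the JSON
title = root.findtext("mods/titleInfo/title", default="")
print(title)
```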
Schemas:
It's likely our schema is outdated/inaccurate in places.
How to map a field:
- Add a brief description to the mermaid diagram in docs/crosswalk.html
- Write a test in tests.py with your input XML and expected record output
- Write a Record method in migrate.py & use it in the Record::get() dict
- Run tests, optionally run a record migration as described above
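The steps above might look like the following for a single field. The Record class here is a hypothetical, minimal stand-in for the real one, and the mods/abstract path and the description target are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


class Record:
    """Hypothetical, minimal stand-in for the real Record class."""

    def __init__(self, item: dict):
        self.xml = ET.fromstring(item["metadata"])

    def abstract(self) -> str:
        """Map mods/abstract to the record's description (assumed XML path)."""
        return (self.xml.findtext("mods/abstract") or "").strip()

    def get(self) -> dict:
        # Each mapped field is added to this dict by calling its method
        return {"metadata": {"description": self.abstract()}}


def test_abstract():
    # Input XML and expected record output, as a pytest-style test
    item = {"metadata": "<xml><mods><abstract> A short summary. </abstract></mods></xml>"}
    assert Record(item).get()["metadata"]["description"] == "A short summary."
```

Writing the test first keeps each mapping small and makes regressions visible when a later field touches the same part of the XML.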