The goal of Iridium is to provide an easy-to-use Python interface for interacting with InvenioRDM.
InvenioRDM has a REST API, but this API is "[...] intended for advanced users, and developers of InvenioRDM that have some experience with using REST APIs [...]".
Iridium is intended for everyone else who needs a programmatic interface to InvenioRDM, e.g.:
- domain researchers and data scientists with basic user-level Python competence who would like to use an environment such as Jupyter notebooks e.g. in order to reuse or analyse available data.
- developers who need to build lightweight external tooling around InvenioRDM e.g. as a part of bigger domain-specific solutions and workflows that use InvenioRDM as the underlying repository.
In this tutorial you will learn how to use most of the Iridium interface.
For a deeper look, consult the more technical documentation of the classes in the iridium package.
IMPORTANT: To use all the APIs, you need an API token from the InvenioRDM instance you are going to use. To get one, sign in to your InvenioRDM instance and go to Settings -> Applications -> Personal access tokens. Without a token you only have read-only access to published records.
Start with the following imports:
from iridium import Repository
from iridium.inveniordm.models import *
A Repository object represents the top-level entry point from which all exposed
functionality can be accessed. Get access to the InvenioRDM instance:
rdm = Repository.connect("https://www.your-invenio-rdm.org", "YOUR_API_TOKEN")
Remark: If you are using a test instance of InvenioRDM that uses a self-signed
certificate, you need to pass the extra argument verify=False to the connect method.
Otherwise the connection will fail for security reasons.
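To keep the API token out of notebooks and scripts, the connection arguments can be read from environment variables. The following is only a sketch: the helper connection_settings and the variable names INVENIORDM_URL, INVENIORDM_TOKEN and INVENIORDM_VERIFY are made up for this example and are not part of Iridium.

```python
import os


def connection_settings(env=None):
    """Collect the arguments for Repository.connect from environment variables.

    The variable names used here are hypothetical, chosen for this sketch.
    """
    env = os.environ if env is None else env
    url = env.get("INVENIORDM_URL", "https://www.your-invenio-rdm.org")
    token = env["INVENIORDM_TOKEN"]  # raises KeyError if the token is not set
    # Certificate checks stay enabled unless they are explicitly switched off.
    verify = env.get("INVENIORDM_VERIFY", "true").lower() != "false"
    return url, token, verify


# Demonstration with an explicit mapping instead of the real environment:
url, token, verify = connection_settings(
    {"INVENIORDM_TOKEN": "YOUR_API_TOKEN", "INVENIORDM_VERIFY": "false"}
)
print(url, verify)

# With Iridium installed, the values would then be passed on like this:
# rdm = Repository.connect(url, token, verify=verify)
```

Disabling verification only when it is requested explicitly keeps the safe behaviour as the default for real instances.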
In Invenio, the workflow to update records always goes through record drafts. Create a draft for a new record like this:
draft = rdm.drafts.create()
When you print or evaluate draft in your notebook, you will see something like:
{ 'access': { 'embargo': {'active': False},
'files': 'public',
'record': 'public',
'status': 'metadata-only'},
'created': '2022-02-14T10:21:36.081522+00:00',
'expires_at': '2022-02-14T10:21:36.081560',
'files': [],
'id': 'k86r9-7b355',
'is_published': False,
'metadata': {},
'updated': '2022-02-14T10:21:36.100746+00:00',
'versions': {'index': 1, 'is_latest': False, 'is_latest_draft': True}}
Remark: This is a slightly censored view of what InvenioRDM stores about drafts and records. Iridium hides some fields that are confusing or too technical (you can still access them if you know what you are doing).
Both save() and publish() work exactly as you know them from the web interface.
This means you must save() the changes you make to the metadata, otherwise they are lost
once you get rid of your draft object.
Also, changes you make to drafts are visible only to you until you publish() the draft.
Note that publish() will also automatically save() your changes.
We can try publishing the draft without adding any metadata:
draft.publish()
We will get back a number of validation errors from InvenioRDM:
{ 'files.enabled': 'Missing uploaded files. To disable files for this record '
'please mark it as metadata-only.',
'metadata.creators': 'Missing data for required field.',
'metadata.publication_date': 'Missing data for required field.',
'metadata.resource_type': 'Missing data for required field.',
'metadata.title': 'Missing data for required field.'}
In order to fix the first problem, you either have to set draft.files.enabled = False,
thus confirming that you want to create a metadata-only record, or you can add at least
one file. We will take the second option here.
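The validation errors shown above are keyed by dotted field paths, which makes it easy to get an overview of what is still missing in each section. The sketch below assumes that the result can be treated like a plain dict (the values are copied from the output above); group_by_section is a helper written for this example, not an Iridium function.

```python
from collections import defaultdict

# Copied from the output above; assumed here to behave like a plain dict.
errors = {
    "files.enabled": "Missing uploaded files. To disable files for this record "
    "please mark it as metadata-only.",
    "metadata.creators": "Missing data for required field.",
    "metadata.publication_date": "Missing data for required field.",
    "metadata.resource_type": "Missing data for required field.",
    "metadata.title": "Missing data for required field.",
}


def group_by_section(errs):
    """Group dotted field paths by their first component."""
    grouped = defaultdict(list)
    for path in errs:
        section, _, field = path.partition(".")
        grouped[section].append(field)
    return dict(grouped)


missing = group_by_section(errors)
print(missing["metadata"])
# prints: ['creators', 'publication_date', 'resource_type', 'title']
```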
To upload and attach my_file.zip to the draft, run:
draft.files.upload("my_file.zip")
Remark: If you don't have a file stored on disk or want to store it under a different
name in the record, you can use draft.files.upload("target_filename.zip", data),
where data can be an arbitrary binary stream. To create a suitable stream from a file,
use open(PATH_TO_FILE, "rb"). For example,
draft.files.upload("renamed.zip", open("my_file.zip", "rb"))
would upload the same file as above, but save it as renamed.zip in the draft.
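A binary stream does not have to come from a file on disk. As a sketch, the following builds a small zip archive entirely in memory using only the standard library; the file names generated.zip and readme.txt are arbitrary examples, and the final upload call is commented out because it needs a real draft.

```python
import io
import zipfile

# Build a zip archive in memory instead of writing it to disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", "Generated in memory, never written to disk.")
buffer.seek(0)  # rewind so that a reader starts at the first byte

print(len(buffer.getvalue()), "bytes ready for upload")

# With a draft at hand, the stream would be passed as the second argument:
# draft.files.upload("generated.zip", buffer)
```

When the stream does come from a file, opening it in a with block (with open("my_file.zip", "rb") as stream: ...) makes sure the file is closed again after the upload.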
Now let us inspect draft.files:
{'my_file.zip': FileMetadata(...)}
We see that the new file is registered in the draft. We can also access the information
that is stored alongside the uploaded file by inspecting draft.files["my_file.zip"]:
{ 'bucket_id': '263bcf0e-f74e-4d98-ab3a-3560b30c4c8b',
'checksum': 'md5:1a6954f71cb8e867c6ea67b1d01c725b',
'created': '2022-02-14T11:01:24.484029+00:00',
'file_id': 'b529e0c8-ce4b-4213-8d4b-ef3744aa4a5b',
'key': 'my_file.zip',
'links': { 'commit': 'https://127.0.0.1:5000/api/records/k86r9-7b355/draft/files/README.md/commit',
'content': 'https://127.0.0.1:5000/api/records/k86r9-7b355/draft/files/README.md/content',
'self': 'https://127.0.0.1:5000/api/records/k86r9-7b355/draft/files/README.md'},
'metadata': {},
'mimetype': 'text/markdown',
'size': 2686,
'status': 'completed',
'storage_class': 'S',
'updated': '2022-02-14T11:01:24.590777+00:00',
'version_id': 'e5b11c0a-60c3-42f7-be26-d332cd776310'}
The most interesting information is probably the checksum, which you can use to verify
that no file corruption happened during upload (you could e.g. run md5sum on your file
in the terminal and compare its output with the part of the checksum after the md5:
prefix; the two strings must be equal).
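The same check can be done in Python. The sketch below computes the MD5 digest of a local file with hashlib and formats it with the md5: prefix seen in the output above. The helper md5_checksum is written for this example, and the attribute access in the last comment is an assumption about how the stored value is exposed.

```python
import hashlib
import os
import tempfile


def md5_checksum(path, chunk_size=1 << 20):
    """Return the checksum of a file in the 'md5:<hex digest>' form shown above."""
    digest = hashlib.md5()
    with open(path, "rb") as stream:
        # Read in chunks so that large files do not have to fit into memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return "md5:" + digest.hexdigest()


# Demonstration with a throwaway file; in practice pass the path of your upload.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
local = md5_checksum(tmp.name)
os.remove(tmp.name)
print(local)  # prints: md5:5d41402abc4b2a76b9719d911017c592

# Comparison with the stored value (attribute name assumed from the output above):
# assert local == draft.files["my_file.zip"].checksum
```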
In order to modify access restrictions or bibliographic metadata of a draft,
you can edit the draft.access and draft.metadata fields directly
(don't forget to save() afterwards).
Now let us add the missing information InvenioRDM was complaining about.
Iridium does not hide away or simplify the internal metadata model, but it provides
classes that help you construct the required entities (that is why we imported
iridium.inveniordm.models at the start).
from datetime import date
# add an arbitrary title
draft.metadata.title = "My amazing new dataset"
# publication date must be of the shape YYYY[-MM][-DD]
draft.metadata.publication_date = date.today().isoformat() # e.g.: 2022-10-20
# you can check the existing types with list(rdm.vocabulary[VocType.resource_types])
draft.metadata.resource_type = VocabularyRef(id="dataset")
draft.metadata.creators = [
    Creator(
        role=CreatorRole(id="contactperson"),
        affiliations=[Affiliation(name="CERN")],
        person_or_org=PersonOrOrg(family_name="Doe", given_name="John", type="personal"),
    )
]
draft.save() # should return no validation errors now!
draft.publish()
Notice that after publish() succeeds, the object we called draft actually
becomes a non-draft record object. You can check its record id in draft.id.
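The publication date in the metadata example must have the shape YYYY[-MM][-DD]. If the date comes from user input, it can be checked locally before saving. The following sketch reads the shape as a year, optionally followed by a month, optionally followed by a day; has_publication_date_shape is a helper written for this example, and it checks the textual shape only.

```python
import re
from datetime import date

# YYYY, optionally followed by -MM, optionally followed by -DD
_DATE_SHAPE = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")


def has_publication_date_shape(value):
    """Check the textual shape only; this does not reject dates like 2022-02-31."""
    return _DATE_SHAPE.fullmatch(value) is not None


candidates = ["2022", "2022-10", "2022-10-20", "20.10.2022", date.today().isoformat()]
for candidate in candidates:
    print(candidate, has_publication_date_shape(candidate))
```

InvenioRDM remains the authority on what it accepts; a local check like this merely catches obvious typos before a round trip to the server.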
Now let us verify that our new published record can be accessed:
rec = rdm.records[draft.id]
print(rec.metadata.title)
You should see My amazing new dataset printed out.
But what if we notice that we made a mistake? If the mistake was only in the metadata and not in the files, then we can easily fix it. For example, let us change the title of the record that we created:
rec = rdm.records[draft.id]
rec.edit()
rec.metadata.title = "My corrected new dataset"
rec.publish()
So we access the record, set it into editable mode (technically, we switch to a draft),
update the metadata and publish() the changes - that's it.
If our mistake is in the files that we uploaded, though, there is some more work involved. InvenioRDM only allows updating the files attached to a record by creating a new version of that record. The old files will forever remain accessible in the previous versions.
Currently, our fresh record has just one version:
print(len(rec.versions)) # should print: 1
print(rec.versions) # should print a list containing just the value of rec.id
Now let us create a new version (which is another draft, but one linked to the previous version):
rec_new = rec.versions.create()
We can use rec_new.save() to get the validation errors:
{ 'files.enabled': 'Missing uploaded files. To disable files for this record '
'please mark it as metadata-only.',
'metadata.publication_date': 'Missing data for required field.'}
From this you learn two things:
- In a new version, the publication date is removed from the draft, forcing you to consciously update it.
- By default, the files from the previous version are not included in the new version.
If you want to keep (a subset of) the files from the previous version in the record,
use rec_new.files.import_old() to import them into the draft, so you will not have to
upload them again. You can use rec_new.files.delete(filename) to remove such imported
files as well as any files that you uploaded into an unpublished draft.
To add a new file, proceed as described above, e.g. rec_new.files.upload("other.data").
Now you can set the new publication date and publish the new version.
After this there should be two versions of your record and rec.versions.latest()
should point to the new version, i.e. rec.versions.latest().id now equals rec_new.id.
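The steps for creating a new version can be collected in one small function. This is a sketch that uses only the calls introduced in this tutorial; the function publish_new_version itself is not part of Iridium, and it assumes that rec is a published record object as obtained from rdm.records.

```python
from datetime import date


def publish_new_version(rec, new_file):
    """Publish a new version of rec that keeps the old files and adds new_file.

    rec is assumed to be a published record object, e.g. rdm.records[some_id].
    """
    rec_new = rec.versions.create()   # new draft linked to the previous version
    rec_new.files.import_old()        # reuse the files of the previous version
    rec_new.files.upload(new_file)    # attach the additional file
    # The publication date is removed in a new version and must be set again.
    rec_new.metadata.publication_date = date.today().isoformat()
    rec_new.publish()                 # publish() also saves the changes
    return rec_new


# Usage, given a connected repository rdm and an existing record id:
# rec_new = publish_new_version(rdm.records["k86r9-7b355"], "other.data")
```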
Most entities that can be queried conform to one common interface:
they are subclasses of iridium.generic.AccessProxy.
This interface unifies dict-like retrieval of individual entities and list-like access
to query results into one convenient object.
In the following we will look at how you can use this for records (via rdm.records)
as an example, but access to most other queryable entities, such as
- record drafts (rdm.drafts),
- record versions (rdm.records[rec_id].versions),
- record access links (rdm.records[rec_id].access_links),
- and vocabularies (rdm.vocabulary[VocType.some_vocabulary]),
works in the same way, the only difference being the query parameters that are accepted.
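To get a feeling for what such a combined interface looks like, here is a toy class that offers dict-like lookup, membership tests, iteration and filtering by call. It is emphatically not the implementation of iridium.generic.AccessProxy: the class ToyProxy, its in-memory data and its exact-match filtering are invented purely for illustration.

```python
class ToyProxy:
    """Toy stand-in, NOT iridium.generic.AccessProxy; it only mimics the usage pattern."""

    def __init__(self, items, **filters):
        self._items = items        # maps id -> entity (plain dicts here)
        self._filters = filters    # key/value pairs every result must match

    def __getitem__(self, key):    # dict-like lookup, KeyError if missing
        return self._items[key]

    def __contains__(self, key):   # supports: "some_id" in proxy
        return key in self._items

    def __call__(self, **filters):  # proxy(...) returns a filtered view
        return ToyProxy(self._items, **{**self._filters, **filters})

    def __iter__(self):            # list-like traversal of the matching entities
        for entity in self._items.values():
            if all(entity.get(k) == v for k, v in self._filters.items()):
                yield entity


records = ToyProxy({
    "a1": {"title": "My amazing new dataset", "type": "dataset"},
    "b2": {"title": "Some paper", "type": "publication"},
})
print(records["a1"]["title"])                          # lookup by id
print("zz" in records)                                 # membership test
print([r["title"] for r in records(type="dataset")])   # filtered iteration
```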
You have already seen that you can access records by their id with
rdm.records["record_id"]. On failure, you will get a KeyError, like when using a dict.
You can also use "record_id" in rdm.records to check whether a record with the
corresponding id exists, just like you can check for the existence of a key in a dict.
When you use rdm.records as the Iterable for a for-loop, you will iterate through
all existing public records. For example, you could list all record titles like this:
for rec in rdm.records:
    print(rec.metadata.title)
Using rdm.records like this is effectively a query without any filters. To apply
filters, you can pass query parameters as call arguments to this object:
for rec in rdm.records(q="amazing"):
    print(rec.metadata.title)
The q parameter corresponds to the "search text field" in the web interface and
supports both free-text queries and special ElasticSearch syntax (see here).
You can also apply filters you know from the web interface, e.g. the query
rdm.records(resource_type=["dataset", "publication"])
will return only records with one of the two specified resource types, and
rdm.records(access_status=["metadata-only"])
will return only records that have no files attached.
Internally, InvenioRDM does not return all results at once, but paginates them, i.e. it groups them into pages of a fixed size and lets you request these pages one at a time. While this is a natural fit for a web interface, it is a technical detail you most likely don't want to think much about. Therefore, Iridium takes care of loading result pages on demand for you. A result page is only loaded once your code wants to access a result on that page.
This is especially useful if you e.g. want to access the first couple of results out of a
thousand. But if you in fact want to traverse all results, the automatic page size might
be suboptimal, slowing your code down (due to a larger number of network requests).
Therefore, in most query interfaces you can provide the parameter size to control the
number of results that are loaded "at once", so you can adjust this to your use case:
for rec in rdm.records(q="amazing", size=200):
    print(rec.metadata.title)
This does the same as above, but your code will load 200 results at once
instead of the InvenioRDM API default of 10. Iridium keeps 10 as the
default for record queries (where you often want only the "top results"), but
increases the page size for vocabulary queries. You can experiment with the size
argument to find out what works best for you.
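The trade-off between page size and number of requests can be illustrated without a server. The generator below is a simplified sketch of on-demand page loading, not Iridium's actual implementation; paged_results, fake_fetch and the in-memory DATA list are stand-ins invented for this example.

```python
from itertools import islice


def paged_results(fetch_page, size=10):
    """Yield results one by one, requesting a new page only when it is needed.

    fetch_page(page, size) is a hypothetical callable returning a list of results.
    """
    page = 1
    while True:
        batch = fetch_page(page, size)
        yield from batch
        if len(batch) < size:  # a short page means there is nothing left
            return
        page += 1


DATA = list(range(1000))  # stand-in for a thousand search hits
requests_made = []        # records which pages were requested


def fake_fetch(page, size):
    requests_made.append(page)
    start = (page - 1) * size
    return DATA[start:start + size]


# Taking only the first five results triggers a single request.
first_five = list(islice(paged_results(fake_fetch, size=10), 5))
print(first_five, requests_made)  # prints: [0, 1, 2, 3, 4] [1]

# Traversing everything with a larger page size needs only a few requests.
requests_made.clear()
total = sum(1 for _ in paged_results(fake_fetch, size=200))
print(total, len(requests_made))  # prints: 1000 6
```

In this toy model the sixth request returns an empty page, which is how the generator notices that the results are exhausted.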
InvenioRDM provides a number of vocabularies that can be queried.
See the VocType class for a list of the supported vocabularies.
For example, we can print all software licenses as follows:
for l in rdm.vocabulary[VocType.licenses](tags="software"):
    print(l.id)
Or we can see how many listed languages are extinct:
print(len(rdm.vocabulary[VocType.languages](tags="extinct")))
Or we can look at specific entries in more detail:
print(rdm.vocabulary[VocType.resource_types]["dataset"])
resulting in an object like this:
{ 'created': '2022-01-11T09:15:42.699516+00:00',
'icon': 'table',
'id': 'dataset',
'links': { 'self': 'https://127.0.0.1:5000/api/vocabularies/resourcetypes/dataset'},
'props': { 'csl': 'dataset',
'datacite_general': 'Dataset',
'datacite_type': '',
'eurepo': 'info:eu-repo/semantics/other',
'openaire_resourceType': '21',
'openaire_type': 'dataset',
'schema.org': 'https://schema.org/Dataset',
'subtype': '',
'type': 'dataset'},
'revision_id': 1,
'tags': ['depositable', 'linkable'],
'title': {'en': 'Dataset'},
'type': 'resourcetypes',
'updated': '2022-01-11T09:15:42.753576+00:00'}
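Note that the title of a vocabulary entry is a mapping from language codes to strings. As a sketch, the following picks a title in a preferred language and falls back to any available one. It works on a plain dict with a few fields copied from the output above; whether the real entry object supports the same dict-style access is an assumption, and localized_title is a helper written for this example.

```python
# A few fields copied from the output above, as a plain dict.
entry = {
    "id": "dataset",
    "tags": ["depositable", "linkable"],
    "title": {"en": "Dataset"},
}


def localized_title(entry, lang="en"):
    """Return the title in the requested language, or any available one."""
    titles = entry["title"]
    if lang in titles:
        return titles[lang]
    return next(iter(titles.values()))  # fall back to the first available title


print(localized_title(entry))        # prints: Dataset
print(localized_title(entry, "de"))  # no German title here, prints: Dataset
```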