A web service to extract metadata from a digital-native PDF report.
Requires Python 3.11
First time:
git clone git@github.com:NationalLibraryOfNorway/meteor.git
cd meteor
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Copy the example env file, then edit the copy to set MOUNT_FOLDER to the local path that will be available to the web service's /file endpoint:
cp .env.example .env
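Before starting the service, it can help to confirm that MOUNT_FOLDER points at a real directory. The following is a minimal sketch, assuming the .env file uses plain KEY=VALUE lines; the helper names read_env and check_mount_folder are hypothetical and not part of this repository.

```python
from pathlib import Path


def read_env(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def check_mount_folder(env_path=".env"):
    """Return MOUNT_FOLDER as a Path, or raise if unset or not a directory."""
    folder = read_env(env_path).get("MOUNT_FOLDER", "")
    if not folder:
        raise ValueError("MOUNT_FOLDER is not set in " + str(env_path))
    if not Path(folder).is_dir():
        raise ValueError("MOUNT_FOLDER is not an existing directory: " + folder)
    return Path(folder)
```

Calling check_mount_folder() from the repository root reads the .env file created above.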
Finally, start the web app with uvicorn (the --reload flag is optional; it restarts the service automatically after a file change):
uvicorn main:app --reload --port=5000
Then open http://127.0.0.1:5000 in a browser, or use curl:
curl http://127.0.0.1:5000/file/<name of file in MOUNT_FOLDER>
curl -F fileInput=@/path/to/file.pdf http://127.0.0.1:5000/json
curl -d fileUrl=https://www.link.to/report.pdf http://127.0.0.1:5000/json
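The first and third curl calls can also be made from Python. This is a rough sketch using only the standard library, assuming the service is running locally on port 5000 as above and that responses are UTF-8 text; the function names are illustrative. The file upload (curl -F) is left out because the standard library has no multipart helper.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # local address used in the commands above


def file_request(name, base_url=BASE_URL):
    """Build a GET request for a file that already sits in MOUNT_FOLDER."""
    return urllib.request.Request(base_url + "/file/" + urllib.parse.quote(name))


def url_request(pdf_url, base_url=BASE_URL):
    """Build a form-encoded POST to /json with a fileUrl field (like curl -d)."""
    body = urllib.parse.urlencode({"fileUrl": pdf_url}).encode("ascii")
    return urllib.request.Request(base_url + "/json", data=body)


def send(request):
    """Send the request and return the response body as text (assumes UTF-8)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

For example, send(url_request("https://www.link.to/report.pdf")) would return the response body as a string.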
After installing requirements, run pre-commit install. This adds a pre-commit PEP8 compliance check.
To install the core module metadata_extract, run:
python3 -m pip install .
Usage:
>>> from metadata_extract import meteor
>>> m = meteor.Meteor()
>>> results = m.run('/path/to/file.pdf')
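To process several reports in one go, the same call can be wrapped in a loop. The sketch below is one possible approach, not part of the package: extract_all is a hypothetical helper that takes the extraction function as an argument and makes no assumption about the shape of the returned results.

```python
from pathlib import Path


def extract_all(folder, run):
    """Call run(path) for every PDF directly inside folder.

    Returns a dict mapping each file name to whatever run returned.
    """
    results = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        results[pdf.name] = run(str(pdf))
    return results
```

With the package installed, this could be called as extract_all('/path/to/reports', meteor.Meteor().run).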
For now, the program attempts to identify:
- ISBN
- ISSN
- Title
- Publisher
- Publication year
- Language
- Authors
- Publication type
Publisher names can be looked up in the Norwegian Authority File.
To build a database from the registry's API, define the environment variables as described in .env.example, then run script.sh in the registry directory.
The resulting database will contain all corporation entries (those where MARC field 110 is present) with quality level kat2 or kat3 (in MARC field 901).
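The selection rule above can be written as a small predicate. This is only an illustration: it assumes each record is a dict mapping MARC tags to lists of string values, which is a hypothetical shape and not necessarily how script.sh or the registry's API represents records.

```python
WANTED_LEVELS = {"kat2", "kat3"}


def keep_record(record):
    """Return True for a corporation entry of quality level kat2 or kat3.

    Assumes record is a dict mapping MARC tags to lists of string values
    (a hypothetical shape chosen for this sketch).
    """
    has_corporate_name = bool(record.get("110"))  # field 110 present
    levels = set(record.get("901", []))           # quality level(s) in field 901
    return has_corporate_name and bool(levels & WANTED_LEVELS)
```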