METEOR - Metadata extraction from PDF reports

A web service to extract metadata from a digital-native PDF report.

Start the program

Requires Python 3.11

First time:

git clone git@github.com:NationalLibraryOfNorway/meteor.git
cd meteor
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Copy the example env file, then edit the copy to set MOUNT_FOLDER to the local path that will be available to the web service's /file endpoint.

cp .env.example .env

Finally, start the web service with uvicorn (the --reload flag is optional; it restarts the service automatically after a file change):

uvicorn main:app --reload --port=5000

Then open http://127.0.0.1:5000 in a browser,

or use curl:

curl http://127.0.0.1:5000/file/<name of file in MOUNT_FOLDER>

curl -F fileInput=@/path/to/file.pdf http://127.0.0.1:5000/json

curl -d fileUrl=https://www.link.to/report.pdf http://127.0.0.1:5000/json
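The same /json request can also be made from Python. The sketch below uses only the standard library and mirrors the fileUrl form of the curl call. The helper names build_json_request and fetch_metadata are illustrative and not part of METEOR, and the response is assumed to be JSON, as the endpoint name suggests.

```python
import json
import urllib.parse
import urllib.request


def build_json_request(base_url: str, file_url: str) -> urllib.request.Request:
    """Build a POST request comparable to `curl -d fileUrl=... <base_url>/json`."""
    # The body is sent as application/x-www-form-urlencoded form data,
    # the same content type curl uses for -d.
    body = urllib.parse.urlencode({"fileUrl": file_url}).encode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + "/json",
        data=body,
        method="POST",
    )


def fetch_metadata(base_url: str, file_url: str, timeout: float = 60.0):
    """Send the request and decode the response (assumed to be JSON)."""
    request = build_json_request(base_url, file_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the service running locally, fetch_metadata("http://127.0.0.1:5000", "https://www.link.to/report.pdf") would send the request shown in the last curl example.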

Local development

After installing the requirements, run pre-commit install. This installs a Git pre-commit hook that checks PEP 8 compliance.

Installing the python module

To install the core module metadata_extract, run:

python3 -m pip install .

Usage:

>>> from metadata_extract import meteor
>>> m = meteor.Meteor()
>>> results = m.run('/path/to/file.pdf')
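To process several reports at once, the call above can be wrapped in a loop. This is a minimal sketch: run_on_folder is a hypothetical helper, not part of metadata_extract, and it takes the extraction function as an argument (for example m.run) so that nothing is assumed about the shape of the results beyond what is shown above.

```python
from pathlib import Path
from typing import Any, Callable, Dict


def run_on_folder(extract: Callable[[str], Any], folder: str) -> Dict[str, Any]:
    """Apply `extract` to every PDF located directly inside `folder`.

    Returns a dict mapping each file name to whatever `extract` returned.
    """
    results: Dict[str, Any] = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        # Pass the path as a string, matching the m.run('/path/to/file.pdf') usage.
        results[pdf_path.name] = extract(str(pdf_path))
    return results
```

For example, run_on_folder(meteor.Meteor().run, '/path/to/reports') would collect one result per PDF in that folder.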

Extracted fields

For now, the program attempts to identify:

ISBN
ISSN
Title
Publisher
Publication year
Language
Authors
Publication type

Norwegian Authority File

Publisher names can be looked up in the Norwegian Authority File.

To build a database from the registry's API, define the environment variables as described in .env.example, then run script.sh in the registry directory.

The resulting database will contain all corporate entries (those where MARC field 110 is present) with quality level kat2 or kat3 (in MARC field 901).
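The selection rule above can be expressed as a small predicate. This is only a sketch of the rule as described, not the code used by the registry script: the record representation (a dict mapping MARC field tags to lists of string values) and the name is_selected are assumptions made for illustration.

```python
from typing import Dict, List

# Quality levels kept in the database, as stated above.
ACCEPTED_LEVELS = {"kat2", "kat3"}


def is_selected(record: Dict[str, List[str]]) -> bool:
    """Return True for corporate entries of quality level kat2 or kat3.

    `record` is a hypothetical, simplified representation: each MARC field
    tag maps to the list of values found in that field.
    """
    # An entry counts as a corporation when field 110 is present and non-empty.
    is_corporation = bool(record.get("110"))
    levels = record.get("901", [])
    return is_corporation and any(level in ACCEPTED_LEVELS for level in levels)
```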
