GitHub - elshimone/paperetl at refs/heads/add_grobid_concurrency_to

ETL processes for medical and scientific papers

paperetl is an ETL library for processing medical and scientific papers.

paperetl supports the following sources:

- File formats:
  - PDF
  - XML (arXiv, PubMed, TEI)
  - CSV
- COVID-19 Research Dataset (CORD-19)

paperetl supports the following output options for storing articles:

SQLite
Elasticsearch
JSON files
YAML files

Installation

The easiest way to install is via pip and PyPI:

pip install paperetl

Python 3.8+ is supported. Using a Python virtual environment is recommended.

paperetl can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+https://github.com/neuml/paperetl

Additional dependencies

PDF parsing relies on an existing GROBID instance to be up and running. It is assumed that this is running locally on the ETL server. This is only necessary for PDF files.

Note that the concurrency setting for the GROBID service is 10 by default. Depending on the number of CPUs in your system, paperetl may exhaust the GROBID engine pool, resulting in a 503 Service Unavailable error response when parsing PDFs. You can avoid this by increasing the concurrency setting in the GROBID configuration file, as described in the GROBID documentation.
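To make the sizing concern concrete, here is a minimal sketch of capping the number of parallel PDF workers at the GROBID concurrency limit. It assumes each worker holds at most one GROBID request at a time. The name `safe_workers` and its parameters are hypothetical and are not part of paperetl.

```python
import os


def safe_workers(cpus=None, grobid_concurrency=10):
    """Return a worker count that does not exceed the GROBID engine pool size.

    Hypothetical helper for illustration only, not a paperetl API.
    """
    # Fall back to the local CPU count (os.cpu_count() may return None)
    if cpus is None:
        cpus = os.cpu_count() or 1
    # Use at least one worker, and never more than GROBID is configured to serve
    return max(1, min(cpus, grobid_concurrency))
```

On a 32-core machine with the concurrency setting left at 10, this would limit the client to 10 workers. Raising the GROBID setting, as described above, is the alternative when more parallelism is wanted.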

Docker

A Dockerfile with commands to install paperetl, all dependencies and scripts is available in this repository.

wget https://raw.githubusercontent.com/neuml/paperetl/master/docker/Dockerfile
docker build -t paperetl -f Dockerfile .
docker run --name paperetl --rm -it paperetl

This will bring up a paperetl command shell. Standard Docker commands can be used to copy files over or commands can be run directly in the shell to retrieve input content.

Examples

Notebooks

Notebook	Description
Introducing paperetl	Overview of the functionality provided by paperetl

Load Articles into SQLite

The following example shows how to use paperetl to load a set of medical/scientific articles into a SQLite database.

Download the desired medical/scientific articles into a local directory. For this example, it is assumed the articles are in a directory named paperetl/data.

Build the database

python -m paperetl.file paperetl/data paperetl/models

Once complete, there will be an articles.sqlite file in paperetl/models
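A quick way to confirm the build produced data is to list the tables in the resulting file along with their row counts. The sketch below uses only the Python standard library and makes no assumption about the schema. `table_counts` is a hypothetical helper name.

```python
import sqlite3


def table_counts(path):
    """Return {table name: row count} for every table in a SQLite file."""
    # Note: sqlite3.connect creates an empty database if the file does not exist
    connection = sqlite3.connect(path)
    try:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Quote the identifier so unusual table names are handled safely
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = connection.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        connection.close()
```

For the example above, this would be called as `table_counts("paperetl/models/articles.sqlite")`.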

Load into Elasticsearch

Elasticsearch is also a supported datastore, as shown below. This example assumes Elasticsearch is running locally. Change the URL to point to a remote server as appropriate.

python -m paperetl.file paperetl/data http://localhost:9200

Once complete, there will be an articles index in Elasticsearch with the metadata and full text stored.
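One way to check the load is to ask Elasticsearch how many documents the index holds through its `_count` endpoint. This is a minimal sketch, assuming the index is named articles as stated above and that the server accepts unauthenticated HTTP requests. `count_url` and `document_count` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def count_url(base_url, index="articles"):
    """Build the URL of the Elasticsearch _count endpoint for an index."""
    return f"{base_url.rstrip('/')}/{index}/_count"


def document_count(base_url, index="articles", timeout=10):
    """Return the number of documents in the index (requires a reachable server)."""
    with urlopen(count_url(base_url, index), timeout=timeout) as response:
        # The _count response is a JSON object with a top-level "count" field
        return json.load(response)["count"]
```

With a local server, `document_count("http://localhost:9200")` would return the number of indexed documents. A secured cluster would additionally need credentials, which this sketch does not handle.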

Convert articles to JSON/YAML

paperetl can also be used to convert articles into JSON or YAML files. This is useful if the data is to be fed into another system or for manual inspection/debugging of a single file.

JSON:

python -m paperetl.file paperetl/data json://paperetl/json

YAML:

python -m paperetl.file paperetl/data yaml://paperetl/yaml

Converted files will be stored in paperetl/(json|yaml)
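For manual inspection, the converted JSON files can be read back with the standard library. The sketch below assumes one `.json` file per article in the output directory and does not assume any particular field names. `load_json_articles` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_json_articles(directory):
    """Yield (file name, parsed content) for each .json file in a directory."""
    # Sort so the iteration order is stable across runs
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield path.name, json.load(handle)
```

Iterating over `load_json_articles("paperetl/json")` and printing the top-level keys of each parsed object is a simple way to see what was extracted.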

Load CORD-19

Note: The final version of CORD-19 was released on 2022-06-22, but it is still a large, valuable set of medical documents.

The following example shows how to use paperetl to load the CORD-19 dataset into a SQLite database.

Download and extract the dataset from Allen Institute for AI CORD-19 Release Page.
```
scripts/getcord19.sh cord19/data
```
The script above retrieves and unpacks the latest copy of CORD-19 into a directory named cord19/data. An optional second argument sets a specific date of the dataset in the format YYYY-MM-DD (ex. 2021-01-01) which defaults to the latest date.
Generate entry-dates.csv for the current version of the dataset.
```
python -m paperetl.cord19.entry cord19/data
```
An optional second argument sets a specific date of the dataset in the format YYYY-MM-DD (ex. 2021-01-01), which defaults to the latest date. This should match the date used when downloading the dataset.
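Because the same date is passed to more than one step, it can help to validate it once before use. This is a minimal sketch of a strict YYYY-MM-DD check. `parse_release_date` is a hypothetical helper and is not part of paperetl.

```python
from datetime import datetime


def parse_release_date(value):
    """Parse a strict YYYY-MM-DD string and return a datetime.date.

    Raises ValueError for anything else, including non-padded forms like 2021-1-1.
    """
    parsed = datetime.strptime(value, "%Y-%m-%d").date()
    # strptime tolerates missing zero padding, so compare against the ISO form
    if parsed.isoformat() != value:
        raise ValueError(f"expected YYYY-MM-DD, got {value!r}")
    return parsed
```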
Build database
```
python -m paperetl.cord19 cord19/data cord19/models
```
Once complete, there will be an articles.sqlite file in cord19/models. As with earlier examples, the data can also be loaded into Elasticsearch.
```
python -m paperetl.cord19 cord19/data http://localhost:9200
```
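The CORD-19 steps above can be collected into a single list of commands to run in order. The sketch below only builds the argument lists shown in this section. `cord19_commands` is a hypothetical helper, and using `sys.executable` in place of `python` is an assumption made so the current interpreter is reused.

```python
import sys


def cord19_commands(data, output, date=None):
    """Return the download, entry-dates and build commands as argument lists."""
    # The optional date is passed to both the download and entry-dates steps
    extra = [date] if date else []
    return [
        ["scripts/getcord19.sh", data, *extra],
        [sys.executable, "-m", "paperetl.cord19.entry", data, *extra],
        [sys.executable, "-m", "paperetl.cord19", data, output],
    ]
```

Each list could then be passed to `subprocess.run(command, check=True)` from the repository root, so that a failing step stops the sequence.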
