Update documentation
davidmezzetti committed Jan 21, 2023
1 parent efaed1c commit 3452165
Showing 4 changed files with 1,542 additions and 38 deletions.
89 changes: 51 additions & 38 deletions README.md
@@ -29,7 +29,12 @@

-------------------------------------------------------------------------------------------------------------------------------------------------------

paperetl is an ETL library for processing medical and scientific papers. It supports the following sources:
paperetl is an ETL library for processing medical and scientific papers.

![architecture](https://raw.githubusercontent.com/neuml/paperetl/master/images/architecture.png#gh-light-mode-only)
![architecture](https://raw.githubusercontent.com/neuml/paperetl/master/images/architecture-dark.png#gh-dark-mode-only)

paperetl supports the following sources:

- File formats:
- PDF
@@ -48,13 +53,17 @@ paperetl supports the following output options for storing articles:

The easiest way to install is via pip and PyPI

pip install paperetl
```
pip install paperetl
```

Python 3.7+ is supported. Using a Python [virtual environment](https://docs.python.org/3/library/venv.html) is recommended.

paperetl can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+https://github.com/neuml/paperetl
```
pip install git+https://github.com/neuml/paperetl
```

### Additional dependencies

@@ -70,7 +79,7 @@ A Dockerfile with commands to install paperetl, all dependencies and scripts is

Clone this git repository and run the following to build and run the Docker image.

```
docker build -t paperetl -f docker/Dockerfile .
docker run --name paperetl --rm -it paperetl
```
@@ -83,8 +92,6 @@ This will bring up a paperetl command shell. Standard Docker commands can be use

| Notebook | Description |
|:----------|:-------------|
| [CORD-19 Article Entry Dates](https://www.kaggle.com/davidmezzetti/cord-19-article-entry-dates) | Generates CORD-19 entry-dates.csv file |
| [CORD-19 ETL](https://www.kaggle.com/davidmezzetti/cord-19-etl) | Builds an article.sqlite database for CORD-19 data |

### Load Articles into SQLite

@@ -94,27 +101,57 @@ The following example shows how to use paperetl to load a set of medical/scienti

2. Build the database

```
python -m paperetl.file paperetl/data paperetl/models paperetl/models
```
Once complete, there will be an articles.sqlite file in paperetl/models
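
One way to confirm the build worked is to open the resulting file with Python's built-in `sqlite3` module and list its tables. The sketch below is only an illustration: `list_tables` is a hypothetical helper, not part of paperetl, and it makes no assumption about which tables paperetl creates.

```python
import sqlite3
from contextlib import closing


def list_tables(path):
    """Return the names of all tables in the SQLite database at path."""
    # closing() releases the connection when done; sqlite3's own context
    # manager only manages transactions, it does not close the connection.
    with closing(sqlite3.connect(path)) as connection:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Assuming the output directory from the command above, `list_tables("paperetl/models/articles.sqlite")` returns the table names in the new database. Note that `sqlite3.connect` creates an empty file if the path does not exist yet, so an empty list may simply mean the build has not run.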
### Load CORD-19 into SQLite
### Load into Elasticsearch
Elasticsearch is also a supported datastore, as shown below. This example assumes Elasticsearch is running locally; change the URL to a remote server as appropriate.
```
python -m paperetl.file paperetl/data http://localhost:9200 paperetl/models
```
Once complete, there will be an articles index in Elasticsearch with the metadata and full text stored.
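
A quick sanity check on the load is to ask Elasticsearch how many documents the index holds through its count API. The following is a minimal sketch using only the standard library. It assumes an unsecured server reachable at the URL used above and the `articles` index name mentioned in the text; the helper names are hypothetical and not part of paperetl.

```python
import json
from urllib.request import urlopen


def count_url(base_url, index="articles"):
    """Build the URL of the Elasticsearch count API for an index."""
    return f"{base_url.rstrip('/')}/{index}/_count"


def parse_count(body):
    """Extract the document count from a count API JSON response."""
    return json.loads(body)["count"]


def count_documents(base_url, index="articles"):
    """Ask a running Elasticsearch server how many documents an index holds."""
    # Assumption: no authentication or TLS configuration is required
    with urlopen(count_url(base_url, index)) as response:
        return parse_count(response.read())
```

With a local server running, `count_documents("http://localhost:9200")` should return the number of articles that were indexed. A secured cluster would need credentials, which this sketch leaves out.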
### Convert articles to JSON/YAML
paperetl can also be used to convert articles into JSON or YAML files. This is useful if the data is to be fed into another system or for manual inspection/debugging of a single file.
JSON:
```
python -m paperetl.file paperetl/data json://paperetl/json paperetl/models
```
YAML:
```
python -m paperetl.file paperetl/data yaml://paperetl/yaml paperetl/models
```
Converted files will be stored in paperetl/(json|yaml)
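
For manual inspection, it can help to see which top-level fields each converted file carries before reading the full text. The sketch below assumes the converted files end in `.json`, which the text above does not state explicitly, and makes no assumption about the fields paperetl writes; `summarize_json_files` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_json_files(directory):
    """Map each .json file name in directory to its sorted top-level keys."""
    summary = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            document = json.load(handle)
        # Only dict documents have keys; report the type name for anything else
        summary[path.name] = (
            sorted(document) if isinstance(document, dict) else type(document).__name__
        )
    return summary
```

Calling `summarize_json_files("paperetl/json")` after the JSON conversion above gives a quick overview of what each file contains.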
### Load CORD-19
_Note: The final version of CORD-19 was released on 2022-06-22, but it remains a large, valuable set of medical documents._
The following example shows how to use paperetl to load the CORD-19 dataset into a SQLite database.
1. Download and extract the dataset from [Allen Institute for AI CORD-19 Release Page](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html).
```
scripts/getcord19.sh cord19/data
```
The script above retrieves and unpacks the latest copy of CORD-19 into a directory named `cord19/data`. An optional second argument sets a specific date of the dataset in the format YYYY-MM-DD (ex. 2021-01-01) which defaults to the latest date.
2. Generate entry-dates.csv for the current version of the dataset
```
python -m paperetl.cord19.entry cord19/data
```
@@ -123,36 +160,12 @@ The following example shows how to use paperetl to load the CORD-19 dataset into
3. Build database
```
python -m paperetl.cord19 cord19/data cord19/models
```
Once complete, there will be an articles.sqlite file in cord19/models

### Load into Elasticsearch

Both of the examples above also support storing data in Elasticsearch with the following changes. These examples assume Elasticsearch is running locally, change the URL to a remote server as appropriate.

Articles:

python -m paperetl.file paperetl/data http://localhost:9200 paperetl/models

CORD-19:
Once complete, there will be an articles.sqlite file in cord19/models. As with earlier examples, the data can also be loaded into Elasticsearch.
```
python -m paperetl.cord19 cord19/data http://localhost:9200
```

Once complete, there will be an articles index in elasticsearch with the metadata and full text stored.

### Convert articles to JSON/YAML

paperetl can also be used to convert articles into JSON or YAML files. This is useful if the data is to be fed into another system or for manual inspection/debugging of a single file.

JSON:

python -m paperetl.file paperetl/data json://paperetl/json paperetl/models

YAML:

python -m paperetl.file paperetl/data yaml://paperetl/yaml paperetl/models

Converted files will be stored in paperetl/(json|yaml)
Binary file added images/architecture-dark.png