Pure API → SQLite Ingestion Tool

A schema-driven tool that automatically ingests data from the Elsevier Pure API into a queryable SQLite database, with full support for nested data structures, discriminated unions, and localized content.

Features

🎯 Schema-Driven Design

  • Zero hardcoded assumptions - Everything driven by your OpenAPI specification
  • Works with any Pure endpoint - No per-endpoint customization needed
  • Future-proof - Adapts automatically to API changes

🔄 Deep Nested Array Support

  • Recursive unpacking to arbitrary depth (configurable)
  • Handles complex nested structures like datasets → contributors → organizations
  • Creates proper foreign key relationships for efficient JOINs
  • Preserves array ordering with ord columns

🌍 Localization-Aware

  • Automatically detects and unpacks localized strings (FormattedLocalizedString)
  • Creates separate columns for each locale (e.g., title_en_gb, title_da_dk)
  • Handles locale dictionaries, locale arrays, and nested locale values

🔀 Discriminated Union Support

  • Fully resolves OpenAPI oneOf/anyOf/allOf schemas
  • Handles discriminator mappings (OpenAPI 3.0 style)
  • Merges all variant properties to capture the complete data model
  • Examples: Internal vs External contributors, prize receivers, etc.

💾 Idempotent Updates

  • Stores content hash to detect changes
  • Updates only when data changes
  • Tracks first_seen and last_seen timestamps
  • Preserves raw JSON for complete audit trail

Installation

Prerequisites

pip install pyyaml requests

Setup

  1. Download your Pure OpenAPI specification (usually openapi.yaml)
  2. Create a .env file with your credentials:
# Required settings
PURE_BASE_URL=https://your-institution.pure.elsevier.com/ws/api
PURE_API_KEY=your_api_key_here

# Optional settings (defaults shown)
PAGE_SIZE=100               # Number of items per API request
RATE_LIMIT_RPS=3            # Requests per second (adjust based on your API limits)
TIMEOUT_SECONDS=60          # HTTP request timeout
RETRY_MAX=5                 # Maximum retries on 5xx/429 errors

Note: The tool also accepts these settings as command-line arguments, which override .env values:

  • --page-size 200 - Override PAGE_SIZE
  • --rps 5.0 - Override RATE_LIMIT_RPS
  • --timeout 120 - Override TIMEOUT_SECONDS
  • --retry-max 10 - Override RETRY_MAX

Usage

Basic Usage - Ingest All Endpoints

python pure_openapi_ingest.py \
    --openapi ./openapi.yaml \
    --db ./pure.sqlite \
    --discover

Ingest Specific Endpoints

python pure_openapi_ingest.py \
    --openapi ./openapi.yaml \
    --db ./pure.sqlite \
    --paths /data-sets /research-outputs /prizes

Testing with Limited Data

python pure_openapi_ingest.py \
    --openapi ./openapi.yaml \
    --db ./pure_test.sqlite \
    --paths /data-sets \
    --limit 100

This will ingest only the first 100 datasets, perfect for testing!

Advanced Options

python pure_openapi_ingest.py \
    --openapi ./openapi.yaml \
    --db ./pure.sqlite \
    --discover \
    --max-depth 10 \
    --page-size 200 \
    --rps 5.0 \
    --limit 1000 \
    --verbose

  • --max-depth 10 - Allow deeper nesting (default: 5)
  • --page-size 200 - Larger pages (default: 100)
  • --rps 5.0 - Higher rate limit (default: 3.0)
  • --limit 1000 - Ingest only 1,000 items per endpoint
  • --verbose - Show detailed progress and log to the console

Logging and Monitoring

Automatic Log Files

Every ingestion run creates a timestamped log file: pure_ingest_YYYYMMDD_HHMMSS.log

The log file contains:

  • Configuration details (base URL, database path, rate limits)
  • Table creation events
  • Pagination progress (logged every 30 seconds)
  • Individual item errors with stack traces
  • Final statistics summary

Example log output:

2025-10-24 10:15:32 - INFO - Pure API Ingestion Started
2025-10-24 10:15:32 - INFO - Base URL: https://your-institution.pure.elsevier.com/ws/api
2025-10-24 10:15:32 - INFO - Database: ./pure.sqlite
2025-10-24 10:15:33 - INFO - Found 24 list endpoints
2025-10-24 10:15:35 - INFO - Created parent table: data_sets
2025-10-24 10:15:35 - INFO - Created child table: data_sets__contributors
2025-10-24 10:16:05 - INFO - Page 30: Processed 100 items in 2.1s | Total: 3000/50000 (6.0%) | ETA: 45.2m
2025-10-24 11:00:47 - INFO - ✓ Completed /data-sets: 50000 items ingested, 47 tables created, 0 errors, 2712.3s

Progress Tracking

During ingestion, the tool displays real-time progress:

Page 42: Processed 100 items in 1.8s | Total: 4200/50000 (8.4%) | ETA: 38.6m

This shows:

  • Page number: Current API page being processed
  • Items processed: Items in this page (and failed count if any)
  • Duration: Time to process this page
  • Total progress: Items processed / total items (percentage)
  • ETA: Estimated time remaining, in minutes or hours (computed as sketched below)
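
The ETA is plain arithmetic over the average throughput observed so far. A minimal illustration of the calculation (the function name and exact formula are illustrative, not the tool's internals):

def format_eta(items_done: int, items_total: int, elapsed_seconds: float) -> str:
    """Estimate remaining time from the average rate observed so far."""
    if items_done == 0:
        return "unknown"
    rate = items_done / elapsed_seconds                  # items per second
    remaining_seconds = (items_total - items_done) / rate
    if remaining_seconds >= 3600:
        return f"{remaining_seconds / 3600:.1f}h"
    return f"{remaining_seconds / 60:.1f}m"

print(format_eta(4200, 50000, 210.0))  # 4200 items at 20 items/s -> "38.2m"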

Error Handling

  • Automatic retries: Failed HTTP requests are retried up to 5 times with exponential backoff (see the sketch after this list)
  • Per-item error isolation: If one item fails, the others on the same page continue processing
  • Error logging: All errors are logged with context (item ID, error message, stack trace)
  • Partial success: An endpoint that stops partway still keeps every successfully processed item
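
A sketch of the retry pattern described above (exponential backoff on 429 and 5xx responses); this is illustrative only, and the tool's own implementation may differ in detail:

import time
import requests

def get_with_retries(url: str, headers: dict, retry_max: int = 5,
                     timeout: int = 60) -> requests.Response:
    """GET with exponential backoff on 429 and 5xx responses."""
    for attempt in range(retry_max + 1):
        try:
            resp = requests.get(url, headers=headers, timeout=timeout)
            if resp.status_code == 429 or resp.status_code >= 500:
                raise requests.HTTPError(f"HTTP {resp.status_code}")
            return resp
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == retry_max:
                raise
            time.sleep(2 ** attempt)   # 1s, 2s, 4s, 8s, 16s
    raise AssertionError("unreachable")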

Final Statistics

At the end of each run:

============================================================
INGESTION COMPLETE
============================================================
Total duration: 3245.2s (54.1m)
Endpoints processed: 24/24
Total items ingested: 1,234,567
Total pages fetched: 12,346
Total tables created: 387
Average rate: 380.5 items/second
============================================================

Log file: pure_ingest_20251024_101532.log

How It Works

1. Schema Discovery

The tool reads your openapi.yaml and:

  • Discovers all list endpoints (GET methods with pagination)
  • Resolves schema references ($ref)
  • Merges discriminated unions (oneOf, anyOf, allOf)
  • Flattens nested object structures
  • Identifies array fields that need child tables
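
A simplified sketch of the discovery step (the helper names are hypothetical, and the real resolver also merges oneOf/anyOf/allOf variants and caches results). Here a path counts as a list endpoint when its GET operation takes paging parameters, which is an assumption about Pure-style specs:

import yaml

def resolve_ref(spec: dict, ref: str, seen=None) -> dict:
    """Follow a local $ref like '#/components/schemas/DataSet', guarding cycles."""
    seen = set() if seen is None else seen
    if ref in seen:
        return {}                              # circular reference: stop descending
    seen.add(ref)
    node = spec
    for part in ref.lstrip("#/").split("/"):   # '#/components/schemas/X' -> keys
        node = node[part]
    if isinstance(node, dict) and "$ref" in node:
        return resolve_ref(spec, node["$ref"], seen)
    return node

def discover_list_endpoints(spec: dict) -> list:
    """Treat GET operations that accept paging parameters as list endpoints."""
    endpoints = []
    for path, ops in spec.get("paths", {}).items():
        get = ops.get("get") or {}
        params = {p.get("name") for p in get.get("parameters", [])}
        if {"size", "offset"} & params:        # assumed Pure-style paging params
            endpoints.append(path)
    return endpoints

with open("openapi.yaml") as f:
    print(discover_list_endpoints(yaml.safe_load(f)))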

2. Table Generation

For each endpoint, creates:

  • Parent table with scalar fields and localized columns
  • Child tables for each array field (with __ separator)
  • Nested child tables for arrays within arrays (recursive)

Example: the /data-sets endpoint creates:

data_sets                                    (main table)
├── data_sets__contributors                  (person associations)
│   ├── data_sets__contributors__organizations           (internal orgs)
│   └── data_sets__contributors__external_organizations  (external orgs)
├── data_sets__persons                       (alternative person field)
│   ├── data_sets__persons__organizations
│   └── data_sets__persons__external_organizations
├── data_sets__descriptions                  (localized descriptions)
├── data_sets__links                         (related URLs)
└── ... (40+ tables for complete data model)
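
The naming is purely mechanical: each child table is its parent table's name plus the array field, joined by the __ separator, so the nesting depth is visible in the name itself. A one-line illustration:

def child_table_name(parent_table: str, array_field: str, sep: str = "__") -> str:
    return f"{parent_table}{sep}{array_field}"

print(child_table_name("data_sets", "contributors"))
# data_sets__contributors
print(child_table_name("data_sets__contributors", "organizations"))
# data_sets__contributors__organizations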

3. Data Unpacking

For each item retrieved from the API:

Localized Strings:

{"title": {"en_GB": "Dataset Title", "da_DK": "Datasæt Titel"}}

→ Creates columns: title_en_gb, title_da_dk

Nested Objects:

{"role": {"uri": "/path", "term": {"en_GB": "Recipient"}}}

→ Creates columns: role_uri, role_term_en_gb

Person References:

{"person": {"systemName": "Person", "uuid": "abc-123"}}

→ Creates column: person_uuid

Arrays:

{"organizations": [
  {"uuid": "org-1"},
  {"uuid": "org-2"}
]}

→ Creates child table rows with foreign keys back to parent
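
Conceptually, all four cases above fall out of one recursive flatten over each item: dict values recurse with an extended prefix, locale keys become lowercased column suffixes, and arrays are routed to child tables instead of columns. A minimal illustration (not the tool's exact code):

def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into column -> value pairs.

    Locale keys like 'en_GB' are lowercased into column suffixes,
    mirroring the title_en_gb / role_term_en_gb columns shown above.
    """
    columns = {}
    for key, value in obj.items():
        name = f"{prefix}{key}".lower()
        if isinstance(value, dict):
            columns.update(flatten(value, prefix=f"{name}_"))
        elif isinstance(value, list):
            pass  # arrays become child-table rows, not columns
        else:
            columns[name] = value
    return columns

item = {"title": {"en_GB": "Dataset Title", "da_DK": "Datasæt Titel"},
        "role": {"uri": "/path", "term": {"en_GB": "Recipient"}}}
print(flatten(item))
# {'title_en_gb': 'Dataset Title', 'title_da_dk': 'Datasæt Titel',
#  'role_uri': '/path', 'role_term_en_gb': 'Recipient'}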

4. Relationship Management

All child tables include:

  • Composite foreign keys to parent (handles nested tables)
  • ord column to preserve array order
  • Raw JSON for complex objects (audit trail)
  • Automatic cascade deletes when the parent row is removed

Database Schema

Parent Table Structure

CREATE TABLE "data_sets" (
  "uuid" TEXT PRIMARY KEY,           -- Primary key (auto-detected)
  "raw_json" TEXT,                   -- Full JSON from API
  "hash" TEXT,                       -- Content hash for change detection
  "first_seen" TEXT,                 -- ISO timestamp
  "last_seen" TEXT,                  -- ISO timestamp
  "title_en_gb" TEXT,                -- Localized title
  "title_da_dk" TEXT,
  "type_uri" TEXT,                   -- Flattened nested object
  "type_term_en_gb" TEXT,
  "managing_organization_uuid" TEXT, -- Reference extraction
  -- ... all other scalar fields
);

Child Table Structure

CREATE TABLE "data_sets__contributors" (
  "data_sets_uuid" TEXT,             -- FK to parent
  "ord" INTEGER,                     -- Array position
  "name_first_name" TEXT,            -- Unpacked fields
  "name_last_name" TEXT,
  "person_uuid" TEXT,
  "role_uri" TEXT,
  "role_term_en_gb" TEXT,
  "type_discriminator" TEXT,         -- Union type identifier
  "raw_json" TEXT,
  PRIMARY KEY ("data_sets_uuid", "ord"),
  FOREIGN KEY ("data_sets_uuid") REFERENCES "data_sets"("uuid") ON DELETE CASCADE
);

Nested Child Table Structure

CREATE TABLE "data_sets__contributors__organizations" (
  "data_sets__contributors_data_sets_uuid" TEXT,  -- FK part 1
  "data_sets__contributors_ord" TEXT,             -- FK part 2
  "ord" INTEGER,                                  -- This array's position
  "uuid" TEXT,                                    -- Organization UUID
  PRIMARY KEY (
    "data_sets__contributors_data_sets_uuid",
    "data_sets__contributors_ord",
    "ord"
  ),
  FOREIGN KEY (
    "data_sets__contributors_data_sets_uuid",
    "data_sets__contributors_ord"
  ) REFERENCES "data_sets__contributors"(
    "data_sets_uuid",
    "ord"
  ) ON DELETE CASCADE
);

Querying the Data

Basic Query: Get Dataset with Contributors

SELECT
    ds.title_en_gb,
    c.name_first_name || ' ' || c.name_last_name as contributor_name,
    c.person_uuid,
    c.role_term_en_gb as role
FROM data_sets ds
JOIN data_sets__contributors c ON ds.uuid = c.data_sets_uuid
WHERE ds.uuid = '2342b9ed-44cd-4c82-a1cb-79aa2f2aeba5'
ORDER BY c.ord;

Advanced Query: Contributors with Organizations

SELECT
    ds.title_en_gb as dataset_title,
    c.name_first_name || ' ' || c.name_last_name as contributor_name,
    c.type_discriminator,
    org.uuid as org_uuid,
    'Internal' as org_type
FROM data_sets ds
JOIN data_sets__contributors c
    ON ds.uuid = c.data_sets_uuid
LEFT JOIN data_sets__contributors__organizations org
    ON org.data_sets__contributors_data_sets_uuid = c.data_sets_uuid
    AND org.data_sets__contributors_ord = c.ord
WHERE ds.uuid = '2342b9ed-44cd-4c82-a1cb-79aa2f2aeba5'

UNION ALL

SELECT
    ds.title_en_gb,
    c.name_first_name || ' ' || c.name_last_name,
    c.type_discriminator,
    ext_org.uuid,
    'External' as org_type
FROM data_sets ds
JOIN data_sets__contributors c
    ON ds.uuid = c.data_sets_uuid
LEFT JOIN data_sets__contributors__external_organizations ext_org
    ON ext_org.data_sets__contributors_data_sets_uuid = c.data_sets_uuid
    AND ext_org.data_sets__contributors_ord = c.ord
WHERE ds.uuid = '2342b9ed-44cd-4c82-a1cb-79aa2f2aeba5'
ORDER BY contributor_name, org_type;

Aggregation Query: Dataset Statistics

SELECT
    ds.title_en_gb,
    COUNT(DISTINCT c.person_uuid) as num_contributors,
    COUNT(DISTINCT org.uuid) as num_internal_orgs,
    COUNT(DISTINCT ext_org.uuid) as num_external_orgs
FROM data_sets ds
LEFT JOIN data_sets__contributors c ON ds.uuid = c.data_sets_uuid
LEFT JOIN data_sets__contributors__organizations org
    ON org.data_sets__contributors_data_sets_uuid = c.data_sets_uuid
    AND org.data_sets__contributors_ord = c.ord
LEFT JOIN data_sets__contributors__external_organizations ext_org
    ON ext_org.data_sets__contributors_data_sets_uuid = c.data_sets_uuid
    AND ext_org.data_sets__contributors_ord = c.ord
WHERE ds.uuid = '2342b9ed-44cd-4c82-a1cb-79aa2f2aeba5'
GROUP BY ds.uuid, ds.title_en_gb;

Resumption and Idempotency

Safe Re-runs

The tool is fully idempotent - you can safely re-run the same command multiple times:

# Run 1: Partial completion (network issues, Ctrl+C, etc.)
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --discover

# Run 2: Resume from where you left off - no duplicates!
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --discover

How It Works

Content Hash Tracking (sketched below):

  • Each item gets a SHA-256 hash of its JSON content
  • On upsert, the tool checks if the hash has changed
  • Only updates the database if content actually changed
  • Tracks first_seen and last_seen timestamps
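
A minimal sketch of that check, assuming the data_sets layout shown under Database Schema (illustrative only, not the tool's exact code):

import hashlib
import json
import sqlite3
from datetime import datetime, timezone

def upsert_item(conn: sqlite3.Connection, item: dict) -> None:
    """Insert new items; update existing ones only when their content hash changed."""
    raw = json.dumps(item, sort_keys=True)
    digest = hashlib.sha256(raw.encode()).hexdigest()
    now = datetime.now(timezone.utc).isoformat()
    row = conn.execute("SELECT hash FROM data_sets WHERE uuid = ?",
                       (item["uuid"],)).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO data_sets (uuid, raw_json, hash, first_seen, last_seen)"
            " VALUES (?, ?, ?, ?, ?)",
            (item["uuid"], raw, digest, now, now))
    elif row[0] != digest:
        conn.execute(
            "UPDATE data_sets SET raw_json = ?, hash = ?, last_seen = ?"
            " WHERE uuid = ?",
            (raw, digest, now, item["uuid"]))
    else:
        conn.execute("UPDATE data_sets SET last_seen = ? WHERE uuid = ?",
                     (now, item["uuid"]))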

Result:

  • ✅ No duplicate data
  • ✅ Detects and updates changed records
  • ✅ Safe to interrupt and resume
  • ✅ Incremental updates work perfectly

Example Use Cases:

  1. Interrupted runs: Press Ctrl+C, then re-run the same command
  2. Network failures: Automatic retry handles transient errors, re-run handles persistent failures
  3. Incremental updates: Run weekly to capture new/changed records
  4. Testing: Use --limit 100 first, then run without limit to complete

Multi-Endpoint Parallelization

For even faster ingestion, you can run multiple endpoints in parallel using separate processes:

# Terminal 1
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --paths /data-sets

# Terminal 2
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --paths /research-outputs

# Terminal 3
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --paths /activities

SQLite's WAL mode lets multiple processes share one database file safely: readers never block the writer, and writes from separate processes are serialized, with each writer waiting its turn.
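
When sharing one database file this way, enable WAL and a busy timeout on every connection so concurrent writers wait for the lock instead of failing; a sketch:

import sqlite3

conn = sqlite3.connect("pure.sqlite", timeout=30)   # wait up to 30s for locks
conn.execute("PRAGMA journal_mode=WAL;")    # readers no longer block the writer
conn.execute("PRAGMA busy_timeout=30000;")  # writers retry instead of erroring
conn.execute("PRAGMA foreign_keys=ON;")     # enforce the cascade deletes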

Configuration

Adjustable Parameters

  • max_depth (default: 5) - Maximum nesting depth for recursive unpacking
  • prefix_separator (default: "_") - Separator for nested field names
  • array_table_separator (default: "__") - Separator for child table names
  • skip_empty_arrays (default: True) - Don't insert rows for empty arrays
  • create_indexes (default: True) - Create indexes on foreign keys

Troubleshooting

No Data in Child Tables

Check if the field is being detected as an array in the schema:

-- Should show tables like parent__fieldname
SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;

Foreign Key Errors

Ensure foreign keys are enabled:

PRAGMA foreign_keys=ON;

Missing Columns

The tool only creates columns for fields that exist in the OpenAPI schema. If a field appears in the JSON but not in the database:

  1. Check if the field is defined in the schema
  2. Verify the schema resolver is properly handling allOf/oneOf
  3. Use --verbose to see which fields are being detected

Performance Issues

For large datasets:

  • Increase --page-size (up to 1000)
  • Adjust --rps based on your API rate limits
  • Use --limit for testing before full ingestion
  • Consider --no-indexes during bulk load, add indexes after

Technical Details

Recursion Safety

  • Maximum depth limit prevents infinite recursion
  • Cycle detection breaks circular schema references
  • Stack tracking prevents duplicate resolution

Schema Resolution

  • Caches resolved schemas to avoid recomputation
  • Handles $ref, allOf, oneOf, anyOf
  • Expands discriminator mappings
  • Merges inherited properties from base schemas

Type Detection

  • Auto-detects dates, enums, references
  • Recognizes locale dictionaries by key pattern (see the sketch after this list)
  • Distinguishes scalars vs objects vs arrays
  • Maps OpenAPI types to SQLite types
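
For instance, locale-dictionary recognition can be as simple as a key-pattern check; a hypothetical heuristic along those lines:

import re

LOCALE_KEY = re.compile(r"^[a-z]{2}_[A-Z]{2}$")    # e.g. en_GB, da_DK

def is_locale_dict(value) -> bool:
    """True for non-empty dicts whose keys all look like locale codes."""
    return (isinstance(value, dict) and bool(value)
            and all(isinstance(k, str) and LOCALE_KEY.match(k) for k in value))

print(is_locale_dict({"en_GB": "Dataset Title", "da_DK": "Datasæt Titel"}))  # True
print(is_locale_dict({"uri": "/path", "term": "x"}))                         # False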

License

This tool is provided under the MIT License. See LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Support

For issues or questions:

  1. Check the SQLite database structure matches expectations
  2. Use --verbose for detailed execution logs
  3. Test with --limit 10 before full ingestion
  4. Review the OpenAPI schema for field definitions
  5. Check existing issues on GitHub before creating new ones

Acknowledgments

This tool was developed for use with Elsevier Pure API systems. It is not officially affiliated with or endorsed by Elsevier.
