Pure API → SQLite Ingestion Tool

A schema-driven tool that automatically ingests data from the Elsevier Pure API into a queryable SQLite database, with full support for nested data structures, discriminated unions, and localized content.

Features

🎯 Schema-Driven Design

  • Zero hardcoded assumptions - Everything driven by your OpenAPI specification
  • Works with any Pure endpoint - No per-endpoint customization needed
  • Future-proof - Adapts automatically to API changes

🔄 Deep Nested Array Support

  • Recursive unpacking to arbitrary depth (configurable)
  • Handles complex nested structures like datasets → contributors → organizations
  • Creates proper foreign key relationships for efficient JOINs
  • Preserves array ordering with ord columns

🌍 Localization-Aware

  • Automatically detects and unpacks localized strings (FormattedLocalizedString)
  • Creates separate columns for each locale (e.g., title_en_gb, title_da_dk)
  • Handles locale dictionaries, locale arrays, and nested locale values

🔀 Discriminated Union Support

  • Fully resolves OpenAPI oneOf/anyOf/allOf schemas
  • Handles discriminator mappings (OpenAPI 3.0 style)
  • Merges all variant properties to capture the complete data model
  • Examples: Internal vs External contributors, prize receivers, etc.

💾 Idempotent Updates

  • Stores content hash to detect changes
  • Updates only when data changes
  • Tracks first_seen and last_seen timestamps
  • Preserves raw JSON for complete audit trail

Installation

Prerequisites

pip install pyyaml requests

Setup

  1. Download your Pure OpenAPI specification (usually openapi.yaml)
  2. Create a .env file with your credentials:
# Required settings
PURE_BASE_URL=https://your-institution.pure.elsevier.com/ws/api
PURE_API_KEY=your_api_key_here

# Optional settings (defaults shown)
PAGE_SIZE=100               # Number of items per API request
RATE_LIMIT_RPS=3            # Requests per second (adjust based on your API limits)
TIMEOUT_SECONDS=60          # HTTP request timeout
RETRY_MAX=5                 # Maximum retries on 5xx/429 errors

Note: The tool also accepts these settings as command-line arguments, which override .env values:

  • --page-size 200 - Override PAGE_SIZE
  • --rps 5.0 - Override RATE_LIMIT_RPS
  • --timeout 120 - Override TIMEOUT_SECONDS
  • --retry-max 10 - Override RETRY_MAX

Usage

Basic Usage - Ingest All Endpoints

python pure_openapi_ingest.py \
    --openapi ./openapi.yaml \
    --db ./pure.sqlite \
    --discover

Ingest Specific Endpoints

python pure_openapi_ingest.py \
    --openapi ./openapi.yaml \
    --db ./pure.sqlite \
    --paths /data-sets /research-outputs /prizes

Testing with Limited Data

python pure_openapi_ingest.py \
    --openapi ./openapi.yaml \
    --db ./pure_test.sqlite \
    --paths /data-sets \
    --limit 100

This will ingest only the first 100 datasets, perfect for testing!

Advanced Options

python pure_openapi_ingest.py \
    --openapi ./openapi.yaml \
    --db ./pure.sqlite \
    --discover \
    --max-depth 10 \
    --page-size 200 \
    --rps 5.0 \
    --limit 1000 \
    --verbose

  • --max-depth 10 - Allow deeper nesting (default: 5)
  • --page-size 200 - Larger pages (default: 100)
  • --rps 5.0 - Higher rate limit (default: 3.0)
  • --limit 1000 - Ingest only 1,000 items per endpoint
  • --verbose - Show detailed progress and log to the console

Logging and Monitoring

Automatic Log Files

Every ingestion run creates a timestamped log file: pure_ingest_YYYYMMDD_HHMMSS.log

The log file contains:

  • Configuration details (base URL, database path, rate limits)
  • Table creation events
  • Pagination progress (logged every 30 seconds)
  • Individual item errors with stack traces
  • Final statistics summary

Example log output:

2025-10-24 10:15:32 - INFO - Pure API Ingestion Started
2025-10-24 10:15:32 - INFO - Base URL: https://your-institution.pure.elsevier.com/ws/api
2025-10-24 10:15:32 - INFO - Database: ./pure.sqlite
2025-10-24 10:15:33 - INFO - Found 24 list endpoints
2025-10-24 10:15:35 - INFO - Created parent table: data_sets
2025-10-24 10:15:35 - INFO - Created child table: data_sets__contributors
2025-10-24 10:16:05 - INFO - Page 30: Processed 100 items in 2.1s | Total: 3000/50000 (6.0%) | ETA: 45.2m
2025-10-24 11:00:47 - INFO - ✓ Completed /data-sets: 50000 items ingested, 47 tables created, 0 errors, 2712.3s

Progress Tracking

During ingestion, the tool displays real-time progress:

Page 42: Processed 100 items in 1.8s | Total: 4200/50000 (8.4%) | ETA: 38.6m

This shows:

  • Page number: Current API page being processed
  • Items processed: Items in this page (and failed count if any)
  • Duration: Time to process this page
  • Total progress: Items processed / total items (percentage)
  • ETA: Estimated time remaining, in minutes or hours (computed as sketched below)
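
The ETA is plain arithmetic over the average throughput observed so far. A minimal illustration of the calculation (the function name and exact formula are illustrative, not the tool's internals):

def format_eta(items_done: int, items_total: int, elapsed_seconds: float) -> str:
    """Estimate remaining time from the average rate observed so far."""
    if items_done == 0:
        return "unknown"
    rate = items_done / elapsed_seconds                  # items per second
    remaining_seconds = (items_total - items_done) / rate
    if remaining_seconds >= 3600:
        return f"{remaining_seconds / 3600:.1f}h"
    return f"{remaining_seconds / 60:.1f}m"

print(format_eta(4200, 50000, 210.0))  # 4200 items at 20 items/s -> "38.2m"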

Error Handling

  • Automatic retries: Failed HTTP requests are retried up to 5 times with exponential backoff (see the sketch after this list)
  • Per-item error isolation: If one item fails, the others on the same page continue processing
  • Error logging: All errors are logged with context (item ID, error message, stack trace)
  • Partial success: An endpoint that stops partway still keeps every successfully processed item
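
A sketch of the retry pattern described above (exponential backoff on 429 and 5xx responses); this is illustrative only, and the tool's own implementation may differ in detail:

import time
import requests

def get_with_retries(url: str, headers: dict, retry_max: int = 5,
                     timeout: int = 60) -> requests.Response:
    """GET with exponential backoff on 429 and 5xx responses."""
    for attempt in range(retry_max + 1):
        try:
            resp = requests.get(url, headers=headers, timeout=timeout)
            if resp.status_code == 429 or resp.status_code >= 500:
                raise requests.HTTPError(f"HTTP {resp.status_code}")
            return resp
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == retry_max:
                raise
            time.sleep(2 ** attempt)   # 1s, 2s, 4s, 8s, 16s
    raise AssertionError("unreachable")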

Final Statistics

At the end of each run:

============================================================
INGESTION COMPLETE
============================================================
Total duration: 3245.2s (54.1m)
Endpoints processed: 24/24
Total items ingested: 1,234,567
Total pages fetched: 12,346
Total tables created: 387
Average rate: 380.5 items/second
============================================================

Log file: pure_ingest_20251024_101532.log

How It Works

1. Schema Discovery

The tool reads your openapi.yaml and:

  • Discovers all list endpoints (GET methods with pagination)
  • Resolves schema references ($ref)
  • Merges discriminated unions (oneOf, anyOf, allOf)
  • Flattens nested object structures
  • Identifies array fields that need child tables
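
A simplified sketch of the discovery step (the helper names are hypothetical, and the real resolver also merges oneOf/anyOf/allOf variants and caches results). Here a path counts as a list endpoint when its GET operation takes paging parameters, which is an assumption about Pure-style specs:

import yaml

def resolve_ref(spec: dict, ref: str, seen=None) -> dict:
    """Follow a local $ref like '#/components/schemas/DataSet', guarding cycles."""
    seen = set() if seen is None else seen
    if ref in seen:
        return {}                              # circular reference: stop descending
    seen.add(ref)
    node = spec
    for part in ref.lstrip("#/").split("/"):   # '#/components/schemas/X' -> keys
        node = node[part]
    if isinstance(node, dict) and "$ref" in node:
        return resolve_ref(spec, node["$ref"], seen)
    return node

def discover_list_endpoints(spec: dict) -> list:
    """Treat GET operations that accept paging parameters as list endpoints."""
    endpoints = []
    for path, ops in spec.get("paths", {}).items():
        get = ops.get("get") or {}
        params = {p.get("name") for p in get.get("parameters", [])}
        if {"size", "offset"} & params:        # assumed Pure-style paging params
            endpoints.append(path)
    return endpoints

with open("openapi.yaml") as f:
    print(discover_list_endpoints(yaml.safe_load(f)))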

2. Table Generation

For each endpoint, creates:

  • Parent table with scalar fields and localized columns
  • Child tables for each array field (with __ separator)
  • Nested child tables for arrays within arrays (recursive)

Example: the /data-sets endpoint creates:

data_sets                                    (main table)
├── data_sets__contributors                  (person associations)
│   ├── data_sets__contributors__organizations           (internal orgs)
│   └── data_sets__contributors__external_organizations  (external orgs)
├── data_sets__persons                       (alternative person field)
│   ├── data_sets__persons__organizations
│   └── data_sets__persons__external_organizations
├── data_sets__descriptions                  (localized descriptions)
├── data_sets__links                         (related URLs)
└── ... (40+ tables for complete data model)
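
The naming is purely mechanical: each child table is its parent table's name plus the array field, joined by the __ separator, so the nesting depth is visible in the name itself. A one-line illustration:

def child_table_name(parent_table: str, array_field: str, sep: str = "__") -> str:
    return f"{parent_table}{sep}{array_field}"

print(child_table_name("data_sets", "contributors"))
# data_sets__contributors
print(child_table_name("data_sets__contributors", "organizations"))
# data_sets__contributors__organizations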

3. Data Unpacking

For each item retrieved from the API:

Localized Strings:

{"title": {"en_GB": "Dataset Title", "da_DK": "Datasæt Titel"}}

→ Creates columns: title_en_gb, title_da_dk

Nested Objects:

{"role": {"uri": "/path", "term": {"en_GB": "Recipient"}}}

→ Creates columns: role_uri, role_term_en_gb

Person References:

{"person": {"systemName": "Person", "uuid": "abc-123"}}

→ Creates column: person_uuid

Arrays:

{"organizations": [
  {"uuid": "org-1"},
  {"uuid": "org-2"}
]}

→ Creates child table rows with foreign keys back to parent
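
Conceptually, all four cases above fall out of one recursive flatten over each item: dict values recurse with an extended prefix, locale keys become lowercased column suffixes, and arrays are routed to child tables instead of columns. A minimal illustration (not the tool's exact code):

def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into column -> value pairs.

    Locale keys like 'en_GB' are lowercased into column suffixes,
    mirroring the title_en_gb / role_term_en_gb columns shown above.
    """
    columns = {}
    for key, value in obj.items():
        name = f"{prefix}{key}".lower()
        if isinstance(value, dict):
            columns.update(flatten(value, prefix=f"{name}_"))
        elif isinstance(value, list):
            pass  # arrays become child-table rows, not columns
        else:
            columns[name] = value
    return columns

item = {"title": {"en_GB": "Dataset Title", "da_DK": "Datasæt Titel"},
        "role": {"uri": "/path", "term": {"en_GB": "Recipient"}}}
print(flatten(item))
# {'title_en_gb': 'Dataset Title', 'title_da_dk': 'Datasæt Titel',
#  'role_uri': '/path', 'role_term_en_gb': 'Recipient'}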

4. Relationship Management

All child tables include:

  • Composite foreign keys to parent (handles nested tables)
  • ord column to preserve array order
  • Raw JSON for complex objects (audit trail)
  • Automatic cascade deletes when the parent row is removed

Database Schema

Parent Table Structure

CREATE TABLE "data_sets" (
  "uuid" TEXT PRIMARY KEY,           -- Primary key (auto-detected)
  "raw_json" TEXT,                   -- Full JSON from API
  "hash" TEXT,                       -- Content hash for change detection
  "first_seen" TEXT,                 -- ISO timestamp
  "last_seen" TEXT,                  -- ISO timestamp
  "title_en_gb" TEXT,                -- Localized title
  "title_da_dk" TEXT,
  "type_uri" TEXT,                   -- Flattened nested object
  "type_term_en_gb" TEXT,
  "managing_organization_uuid" TEXT, -- Reference extraction
  -- ... all other scalar fields
);

Child Table Structure

CREATE TABLE "data_sets__contributors" (
  "data_sets_uuid" TEXT,             -- FK to parent
  "ord" INTEGER,                     -- Array position
  "name_first_name" TEXT,            -- Unpacked fields
  "name_last_name" TEXT,
  "person_uuid" TEXT,
  "role_uri" TEXT,
  "role_term_en_gb" TEXT,
  "type_discriminator" TEXT,         -- Union type identifier
  "raw_json" TEXT,
  PRIMARY KEY ("data_sets_uuid", "ord"),
  FOREIGN KEY ("data_sets_uuid") REFERENCES "data_sets"("uuid") ON DELETE CASCADE
);

Nested Child Table Structure

CREATE TABLE "data_sets__contributors__organizations" (
  "data_sets__contributors_data_sets_uuid" TEXT,  -- FK part 1
  "data_sets__contributors_ord" TEXT,             -- FK part 2
  "ord" INTEGER,                                  -- This array's position
  "uuid" TEXT,                                    -- Organization UUID
  PRIMARY KEY (
    "data_sets__contributors_data_sets_uuid",
    "data_sets__contributors_ord",
    "ord"
  ),
  FOREIGN KEY (
    "data_sets__contributors_data_sets_uuid",
    "data_sets__contributors_ord"
  ) REFERENCES "data_sets__contributors"(
    "data_sets_uuid",
    "ord"
  ) ON DELETE CASCADE
);

Querying the Data

Basic Query: Get Dataset with Contributors

SELECT
    ds.title_en_gb,
    c.name_first_name || ' ' || c.name_last_name as contributor_name,
    c.person_uuid,
    c.role_term_en_gb as role
FROM data_sets ds
JOIN data_sets__contributors c ON ds.uuid = c.data_sets_uuid
WHERE ds.uuid = '2342b9ed-44cd-4c82-a1cb-79aa2f2aeba5'
ORDER BY c.ord;

Advanced Query: Contributors with Organizations

SELECT
    ds.title_en_gb as dataset_title,
    c.name_first_name || ' ' || c.name_last_name as contributor_name,
    c.type_discriminator,
    org.uuid as org_uuid,
    'Internal' as org_type
FROM data_sets ds
JOIN data_sets__contributors c
    ON ds.uuid = c.data_sets_uuid
LEFT JOIN data_sets__contributors__organizations org
    ON org.data_sets__contributors_data_sets_uuid = c.data_sets_uuid
    AND org.data_sets__contributors_ord = c.ord
WHERE ds.uuid = '2342b9ed-44cd-4c82-a1cb-79aa2f2aeba5'

UNION ALL

SELECT
    ds.title_en_gb,
    c.name_first_name || ' ' || c.name_last_name,
    c.type_discriminator,
    ext_org.uuid,
    'External' as org_type
FROM data_sets ds
JOIN data_sets__contributors c
    ON ds.uuid = c.data_sets_uuid
LEFT JOIN data_sets__contributors__external_organizations ext_org
    ON ext_org.data_sets__contributors_data_sets_uuid = c.data_sets_uuid
    AND ext_org.data_sets__contributors_ord = c.ord
WHERE ds.uuid = '2342b9ed-44cd-4c82-a1cb-79aa2f2aeba5'
ORDER BY contributor_name, org_type;

Aggregation Query: Dataset Statistics

SELECT
    ds.title_en_gb,
    COUNT(DISTINCT c.person_uuid) as num_contributors,
    COUNT(DISTINCT org.uuid) as num_internal_orgs,
    COUNT(DISTINCT ext_org.uuid) as num_external_orgs
FROM data_sets ds
LEFT JOIN data_sets__contributors c ON ds.uuid = c.data_sets_uuid
LEFT JOIN data_sets__contributors__organizations org
    ON org.data_sets__contributors_data_sets_uuid = c.data_sets_uuid
    AND org.data_sets__contributors_ord = c.ord
LEFT JOIN data_sets__contributors__external_organizations ext_org
    ON ext_org.data_sets__contributors_data_sets_uuid = c.data_sets_uuid
    AND ext_org.data_sets__contributors_ord = c.ord
WHERE ds.uuid = '2342b9ed-44cd-4c82-a1cb-79aa2f2aeba5'
GROUP BY ds.uuid, ds.title_en_gb;

Resumption and Idempotency

Safe Re-runs

The tool is fully idempotent - you can safely re-run the same command multiple times:

# Run 1: Partial completion (network issues, Ctrl+C, etc.)
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --discover

# Run 2: Resume from where you left off - no duplicates!
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --discover

How It Works

Content Hash Tracking (sketched below):

  • Each item gets a SHA-256 hash of its JSON content
  • On upsert, the tool checks if the hash has changed
  • Only updates the database if content actually changed
  • Tracks first_seen and last_seen timestamps
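
A minimal sketch of that check, assuming the data_sets layout shown under Database Schema (illustrative only, not the tool's exact code):

import hashlib
import json
import sqlite3
from datetime import datetime, timezone

def upsert_item(conn: sqlite3.Connection, item: dict) -> None:
    """Insert new items; update existing ones only when their content hash changed."""
    raw = json.dumps(item, sort_keys=True)
    digest = hashlib.sha256(raw.encode()).hexdigest()
    now = datetime.now(timezone.utc).isoformat()
    row = conn.execute("SELECT hash FROM data_sets WHERE uuid = ?",
                       (item["uuid"],)).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO data_sets (uuid, raw_json, hash, first_seen, last_seen)"
            " VALUES (?, ?, ?, ?, ?)",
            (item["uuid"], raw, digest, now, now))
    elif row[0] != digest:
        conn.execute(
            "UPDATE data_sets SET raw_json = ?, hash = ?, last_seen = ?"
            " WHERE uuid = ?",
            (raw, digest, now, item["uuid"]))
    else:
        conn.execute("UPDATE data_sets SET last_seen = ? WHERE uuid = ?",
                     (now, item["uuid"]))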

Result:

  • ✅ No duplicate data
  • ✅ Detects and updates changed records
  • ✅ Safe to interrupt and resume
  • ✅ Incremental updates work perfectly

Example Use Cases:

  1. Interrupted runs: Press Ctrl+C, then re-run the same command
  2. Network failures: Automatic retry handles transient errors, re-run handles persistent failures
  3. Incremental updates: Run weekly to capture new/changed records
  4. Testing: Use --limit 100 first, then run without limit to complete

Multi-Endpoint Parallelization

For even faster ingestion, you can run multiple endpoints in parallel using separate processes:

# Terminal 1
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --paths /data-sets

# Terminal 2
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --paths /research-outputs

# Terminal 3
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --paths /activities

SQLite's WAL mode lets multiple processes share one database file safely: readers never block the writer, and writes from separate processes are serialized, with each writer waiting its turn.
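
When sharing one database file this way, enable WAL and a busy timeout on every connection so concurrent writers wait for the lock instead of failing; a sketch:

import sqlite3

conn = sqlite3.connect("pure.sqlite", timeout=30)   # wait up to 30s for locks
conn.execute("PRAGMA journal_mode=WAL;")    # readers no longer block the writer
conn.execute("PRAGMA busy_timeout=30000;")  # writers retry instead of erroring
conn.execute("PRAGMA foreign_keys=ON;")     # enforce the cascade deletes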

Configuration

Adjustable Parameters

  • max_depth (default: 5) - Maximum nesting depth for recursive unpacking
  • prefix_separator (default: "_") - Separator for nested field names
  • array_table_separator (default: "__") - Separator for child table names
  • skip_empty_arrays (default: True) - Don't insert rows for empty arrays
  • create_indexes (default: True) - Create indexes on foreign keys

Troubleshooting

No Data in Child Tables

Check if the field is being detected as an array in the schema:

-- Should show tables like parent__fieldname
SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;

Foreign Key Errors

Ensure foreign keys are enabled:

PRAGMA foreign_keys=ON;

Missing Columns

The tool only creates columns for fields that exist in the OpenAPI schema. If a field appears in the JSON but not in the database:

  1. Check if the field is defined in the schema
  2. Verify the schema resolver is properly handling allOf/oneOf
  3. Use --verbose to see which fields are being detected

Performance Issues

For large datasets:

  • Increase --page-size (up to 1000)
  • Adjust --rps based on your API rate limits
  • Use --limit for testing before full ingestion
  • Consider --no-indexes during bulk load, add indexes after

Technical Details

Recursion Safety

  • Maximum depth limit prevents infinite recursion
  • Cycle detection breaks circular schema references
  • Stack tracking prevents duplicate resolution

Schema Resolution

  • Caches resolved schemas to avoid recomputation
  • Handles $ref, allOf, oneOf, anyOf
  • Expands discriminator mappings
  • Merges inherited properties from base schemas

Type Detection

  • Auto-detects dates, enums, references
  • Recognizes locale dictionaries by key pattern (see the sketch after this list)
  • Distinguishes scalars vs objects vs arrays
  • Maps OpenAPI types to SQLite types
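
For instance, locale-dictionary recognition can be as simple as a key-pattern check; a hypothetical heuristic along those lines:

import re

LOCALE_KEY = re.compile(r"^[a-z]{2}_[A-Z]{2}$")    # e.g. en_GB, da_DK

def is_locale_dict(value) -> bool:
    """True for non-empty dicts whose keys all look like locale codes."""
    return (isinstance(value, dict) and bool(value)
            and all(isinstance(k, str) and LOCALE_KEY.match(k) for k in value))

print(is_locale_dict({"en_GB": "Dataset Title", "da_DK": "Datasæt Titel"}))  # True
print(is_locale_dict({"uri": "/path", "term": "x"}))                         # False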

License

This tool is provided under the MIT License. See LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Support

For issues or questions:

  1. Check the SQLite database structure matches expectations
  2. Use --verbose for detailed execution logs
  3. Test with --limit 10 before full ingestion
  4. Review the OpenAPI schema for field definitions
  5. Check existing issues on GitHub before creating new ones

Acknowledgments

This tool was developed for use with Elsevier Pure API systems. It is not officially affiliated with or endorsed by Elsevier.
