A schema-driven tool that automatically ingests data from the Elsevier Pure API into a queryable SQLite database, with full support for nested data structures, discriminated unions, and localized content.
- Zero hardcoded assumptions - Everything driven by your OpenAPI specification
- Works with any Pure endpoint - No per-endpoint customization needed
- Future-proof - Adapts automatically to API changes
- Recursive unpacking to arbitrary depth (configurable)
- Handles complex nested structures like `datasets → contributors → organizations`
- Creates proper foreign key relationships for efficient JOINs
- Preserves array ordering with `ord` columns
- Automatically detects and unpacks localized strings (`FormattedLocalizedString`)
- Creates separate columns for each locale (e.g., `title_en_gb`, `title_da_dk`)
- Handles locale dictionaries, locale arrays, and nested locale values
- Fully resolves OpenAPI `oneOf`/`anyOf`/`allOf` schemas
- Handles discriminator mappings (OpenAPI 3.0 style)
- Merges all variant properties to capture complete data model
- Examples: Internal vs External contributors, prize receivers, etc.
- Stores content hash to detect changes
- Updates only when data changes
- Tracks `first_seen` and `last_seen` timestamps
- Preserves raw JSON for a complete audit trail
- Install the dependencies: `pip install pyyaml requests`
- Download your Pure OpenAPI specification (usually `openapi.yaml`)
- Create a `.env` file with your credentials:
```
# Required settings
PURE_BASE_URL=https://your-institution.pure.elsevier.com/ws/api
PURE_API_KEY=your_api_key_here

# Optional settings (defaults shown)
PAGE_SIZE=100       # Number of items per API request
RATE_LIMIT_RPS=3    # Requests per second (adjust based on your API limits)
TIMEOUT_SECONDS=60  # HTTP request timeout
RETRY_MAX=5         # Maximum retries on 5xx/429 errors
```

Note: The tool also accepts these settings as command-line arguments, which override `.env` values:
- `--page-size 200` overrides `PAGE_SIZE`
- `--rps 5.0` overrides `RATE_LIMIT_RPS`
- `--timeout 120` overrides `TIMEOUT_SECONDS`
- `--retry-max 10` overrides `RETRY_MAX`
```bash
# Discover and ingest every list endpoint in the spec
python pure_openapi_ingest.py \
  --openapi ./openapi.yaml \
  --db ./pure.sqlite \
  --discover
```

```bash
# Ingest only specific endpoints
python pure_openapi_ingest.py \
  --openapi ./openapi.yaml \
  --db ./pure.sqlite \
  --paths /data-sets /research-outputs /prizes
```

```bash
# Small test run against a separate database
python pure_openapi_ingest.py \
  --openapi ./openapi.yaml \
  --db ./pure_test.sqlite \
  --paths /data-sets \
  --limit 100
```

This will ingest only the first 100 datasets, perfect for testing!
```bash
python pure_openapi_ingest.py \
  --openapi ./openapi.yaml \
  --db ./pure.sqlite \
  --discover \
  --max-depth 10 \
  --page-size 200 \
  --rps 5.0 \
  --limit 1000 \
  --verbose
```

- `--max-depth 10`: allow deeper nesting (default: 5)
- `--page-size 200`: larger pages (default: 100)
- `--rps 5.0`: higher rate limit (default: 3.0)
- `--limit 1000`: ingest only 1000 items per endpoint
- `--verbose`: show detailed progress and also log to the console

Every ingestion run creates a timestamped log file: `pure_ingest_YYYYMMDD_HHMMSS.log`
The log file contains:
- Configuration details (base URL, database path, rate limits)
- Table creation events
- Pagination progress (logged every 30 seconds)
- Individual item errors with stack traces
- Final statistics summary
Example log output:
```
2025-10-24 10:15:32 - INFO - Pure API Ingestion Started
2025-10-24 10:15:32 - INFO - Base URL: https://your-institution.pure.elsevier.com/ws/api
2025-10-24 10:15:32 - INFO - Database: ./pure.sqlite
2025-10-24 10:15:33 - INFO - Found 24 list endpoints
2025-10-24 10:15:35 - INFO - Created parent table: data_sets
2025-10-24 10:15:35 - INFO - Created child table: data_sets__contributors
2025-10-24 10:16:05 - INFO - Page 30: Processed 100 items in 2.1s | Total: 3000/50000 (6.0%) | ETA: 45.2m
2025-10-24 11:00:47 - INFO - ✓ Completed /data-sets: 50000 items ingested, 47 tables created, 0 errors, 2712.3s
```
During ingestion, the tool displays real-time progress:
```
Page 42: Processed 100 items in 1.8s | Total: 4200/50000 (8.4%) | ETA: 38.6m
```
This shows:
- Page number: Current API page being processed
- Items processed: Items in this page (and failed count if any)
- Duration: Time to process this page
- Total progress: Items processed / total items (percentage)
- ETA: Estimated time remaining (in minutes or hours)
- Automatic retries: Failed HTTP requests retry up to 5 times with exponential backoff (see the sketch after this list)
- Per-item error isolation: If one item fails, others in the same page continue processing
- Error logging: All errors logged with context (item ID, error message, stack trace)
- Partial success: Endpoints that partially complete will save all successfully processed items
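The retry behaviour described above could look roughly like the following. This is a minimal sketch using `requests` and the standard library; the exact status codes, delay schedule, and function name (`get_with_retries`) are assumptions, not the tool's actual implementation:

```python
import time
import requests

def get_with_retries(url, headers, params=None, retry_max=5, timeout=60):
    """Fetch a URL, retrying transient failures with exponential backoff."""
    for attempt in range(retry_max + 1):
        try:
            resp = requests.get(url, headers=headers, params=params, timeout=timeout)
            # Retry on rate limiting (429) and server errors (5xx); fail fast otherwise.
            if resp.status_code != 429 and resp.status_code < 500:
                resp.raise_for_status()
                return resp.json()
        except (requests.ConnectionError, requests.Timeout):
            pass  # transient network error: fall through to the backoff below
        if attempt < retry_max:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError(f"giving up on {url} after {retry_max} retries")
```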
At the end of each run:
```
============================================================
INGESTION COMPLETE
============================================================
Total duration: 3245.2s (54.1m)
Endpoints processed: 24/24
Total items ingested: 1,234,567
Total pages fetched: 12,346
Total tables created: 387
Average rate: 380.5 items/second
============================================================
Log file: pure_ingest_20251024_101532.log
```
The tool reads your openapi.yaml and:
- Discovers all list endpoints (GET methods with pagination)
- Resolves schema references (`$ref`)
- Merges discriminated unions (`oneOf`, `anyOf`, `allOf`)
- Flattens nested object structures
- Identifies array fields that need child tables
For each endpoint, creates:
- Parent table with scalar fields and localized columns
- Child tables for each array field (with `__` separator)
- Nested child tables for arrays within arrays (recursive)
Example: /data-sets endpoint creates:
```
data_sets (main table)
├── data_sets__contributors (person associations)
│   ├── data_sets__contributors__organizations (internal orgs)
│   └── data_sets__contributors__external_organizations (external orgs)
├── data_sets__persons (alternative person field)
│   ├── data_sets__persons__organizations
│   └── data_sets__persons__external_organizations
├── data_sets__descriptions (localized descriptions)
├── data_sets__links (related URLs)
└── ... (40+ tables for complete data model)
```
For each item retrieved from the API, fields are transformed as follows (a sketch of the flattening logic follows these examples):
Localized Strings:
{"title": {"en_GB": "Dataset Title", "da_DK": "Datasæt Titel"}}→ Creates columns: title_en_gb, title_da_dk
Nested Objects:
{"role": {"uri": "/path", "term": {"en_GB": "Recipient"}}}→ Creates columns: role_uri, role_term_en_gb
Person References:
{"person": {"systemName": "Person", "uuid": "abc-123"}}→ Creates column: person_uuid
Arrays:
{"organizations": [
{"uuid": "org-1"},
{"uuid": "org-2"}
]}→ Creates child table rows with foreign keys back to parent
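A rough sketch of the flattening convention shown above; the real tool is schema-driven, so treat the `flatten` function below purely as an illustration of how column names are derived:

```python
def flatten(obj, prefix="", sep="_"):
    """Flatten nested dicts into column-name -> value pairs.

    Locale keys like "en_GB" become lower-cased suffixes (title_en_gb);
    lists are skipped here because they become rows in child tables.
    """
    columns = {}
    for key, value in obj.items():
        name = (f"{prefix}{sep}{key}" if prefix else key).lower()
        if isinstance(value, dict):
            columns.update(flatten(value, name, sep))
        elif isinstance(value, list):
            continue  # handled as a child table, not a column
        else:
            columns[name] = value
    return columns

print(flatten({"title": {"en_GB": "Dataset Title"}, "type": {"uri": "/path"}}))
# {'title_en_gb': 'Dataset Title', 'type_uri': '/path'}
```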
All child tables include:
- Composite foreign keys to parent (handles nested tables)
- `ord` column to preserve array order
- Raw JSON for complex objects (audit trail)
- Automatic cascade deletes when parent is removed
CREATE TABLE "data_sets" (
"uuid" TEXT PRIMARY KEY, -- Primary key (auto-detected)
"raw_json" TEXT, -- Full JSON from API
"hash" TEXT, -- Content hash for change detection
"first_seen" TEXT, -- ISO timestamp
"last_seen" TEXT, -- ISO timestamp
"title_en_gb" TEXT, -- Localized title
"title_da_dk" TEXT,
"type_uri" TEXT, -- Flattened nested object
"type_term_en_gb" TEXT,
"managing_organization_uuid" TEXT, -- Reference extraction
-- ... all other scalar fields
);
```

```sql
CREATE TABLE "data_sets__contributors" (
"data_sets_uuid" TEXT, -- FK to parent
"ord" INTEGER, -- Array position
"name_first_name" TEXT, -- Unpacked fields
"name_last_name" TEXT,
"person_uuid" TEXT,
"role_uri" TEXT,
"role_term_en_gb" TEXT,
"type_discriminator" TEXT, -- Union type identifier
"raw_json" TEXT,
PRIMARY KEY ("data_sets_uuid", "ord"),
FOREIGN KEY ("data_sets_uuid") REFERENCES "data_sets"("uuid") ON DELETE CASCADE
);
```

```sql
CREATE TABLE "data_sets__contributors__organizations" (
"data_sets__contributors_data_sets_uuid" TEXT, -- FK part 1
"data_sets__contributors_ord" TEXT, -- FK part 2
"ord" INTEGER, -- This array's position
"uuid" TEXT, -- Organization UUID
PRIMARY KEY (
"data_sets__contributors_data_sets_uuid",
"data_sets__contributors_ord",
"ord"
),
FOREIGN KEY (
"data_sets__contributors_data_sets_uuid",
"data_sets__contributors_ord"
) REFERENCES "data_sets__contributors"(
"data_sets_uuid",
"ord"
) ON DELETE CASCADE
);
```

```sql
-- Contributors for one dataset, in their original array order
SELECT
ds.title_en_gb,
c.name_first_name || ' ' || c.name_last_name as contributor_name,
c.person_uuid,
c.role_term_en_gb as role
FROM data_sets ds
JOIN data_sets__contributors c ON ds.uuid = c.data_sets_uuid
WHERE ds.uuid = '2342b9ed-44cd-4c82-a1cb-79aa2f2aeba5'
ORDER BY c.ord;
```

```sql
-- Internal and external contributor organizations for one dataset
SELECT
ds.title_en_gb as dataset_title,
c.name_first_name || ' ' || c.name_last_name as contributor_name,
c.type_discriminator,
org.uuid as org_uuid,
'Internal' as org_type
FROM data_sets ds
JOIN data_sets__contributors c
ON ds.uuid = c.data_sets_uuid
LEFT JOIN data_sets__contributors__organizations org
ON org.data_sets__contributors_data_sets_uuid = c.data_sets_uuid
AND org.data_sets__contributors_ord = c.ord
WHERE ds.uuid = '2342b9ed-44cd-4c82-a1cb-79aa2f2aeba5'
UNION ALL
SELECT
ds.title_en_gb,
c.name_first_name || ' ' || c.name_last_name,
c.type_discriminator,
ext_org.uuid,
'External' as org_type
FROM data_sets ds
JOIN data_sets__contributors c
ON ds.uuid = c.data_sets_uuid
LEFT JOIN data_sets__contributors__external_organizations ext_org
ON ext_org.data_sets__contributors_data_sets_uuid = c.data_sets_uuid
AND ext_org.data_sets__contributors_ord = c.ord
WHERE ds.uuid = '2342b9ed-44cd-4c82-a1cb-79aa2f2aeba5'
ORDER BY contributor_name, org_type;
```

```sql
-- Contributor and organization counts for one dataset
SELECT
ds.title_en_gb,
COUNT(DISTINCT c.person_uuid) as num_contributors,
COUNT(DISTINCT org.uuid) as num_internal_orgs,
COUNT(DISTINCT ext_org.uuid) as num_external_orgs
FROM data_sets ds
LEFT JOIN data_sets__contributors c ON ds.uuid = c.data_sets_uuid
LEFT JOIN data_sets__contributors__organizations org
ON org.data_sets__contributors_data_sets_uuid = c.data_sets_uuid
AND org.data_sets__contributors_ord = c.ord
LEFT JOIN data_sets__contributors__external_organizations ext_org
ON ext_org.data_sets__contributors_data_sets_uuid = c.data_sets_uuid
AND ext_org.data_sets__contributors_ord = c.ord
WHERE ds.uuid = '2342b9ed-44cd-4c82-a1cb-79aa2f2aeba5'
GROUP BY ds.uuid, ds.title_en_gb;
```

The tool is fully idempotent - you can safely re-run the same command multiple times:

```bash
# Run 1: Partial completion (network issues, Ctrl+C, etc.)
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --discover
# Run 2: Resume from where you left off - no duplicates!
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --discover
```

Content Hash Tracking (see the sketch after this list):
- Each item gets a SHA-256 hash of its JSON content
- On upsert, the tool checks if the hash has changed
- Only updates the database if content actually changed
- Tracks `first_seen` and `last_seen` timestamps
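A minimal sketch of what this hash-based upsert could look like; the column names (`uuid`, `raw_json`, `hash`, `first_seen`, `last_seen`) match the schema shown earlier, but the exact SQL and function name are assumptions:

```python
import hashlib
import json
import sqlite3
from datetime import datetime, timezone

def upsert_item(conn: sqlite3.Connection, table: str, uuid: str, item: dict) -> None:
    """Insert a new item, or update it only when its content hash changed."""
    raw = json.dumps(item, sort_keys=True)
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    now = datetime.now(timezone.utc).isoformat()

    row = conn.execute(f'SELECT "hash" FROM "{table}" WHERE "uuid" = ?', (uuid,)).fetchone()
    if row is None:
        conn.execute(
            f'INSERT INTO "{table}" ("uuid", "raw_json", "hash", "first_seen", "last_seen") '
            f'VALUES (?, ?, ?, ?, ?)',
            (uuid, raw, digest, now, now),
        )
    elif row[0] != digest:
        conn.execute(
            f'UPDATE "{table}" SET "raw_json" = ?, "hash" = ?, "last_seen" = ? WHERE "uuid" = ?',
            (raw, digest, now, uuid),
        )
    # Unchanged rows are left alone, which is what makes re-runs idempotent.
```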
Result:
- ✅ No duplicate data
- ✅ Detects and updates changed records
- ✅ Safe to interrupt and resume
- ✅ Incremental updates work perfectly
Example Use Cases:
- Interrupted runs: Press Ctrl+C, then re-run the same command
- Network failures: Automatic retry handles transient errors, re-run handles persistent failures
- Incremental updates: Run weekly to capture new/changed records
- Testing: Use `--limit 100` first, then run without the limit to complete
For even faster ingestion, you can run multiple endpoints in parallel using separate processes:
```bash
# Terminal 1
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --paths /data-sets
# Terminal 2
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --paths /research-outputs
# Terminal 3
python pure_openapi_ingest.py --openapi ./openapi.yaml --db ./pure.sqlite --paths /activities
```

SQLite's WAL mode allows multiple processes to write to the same database file; writes are serialized, so concurrent ingest runs simply take turns acquiring the write lock.
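If you want to verify or enable WAL mode on the database yourself before starting parallel runs, here is a minimal sketch (the ingest tool may already configure this; the `busy_timeout` value is just an example):

```python
import sqlite3

conn = sqlite3.connect("./pure.sqlite")
# WAL lets readers and one writer work concurrently; busy_timeout makes a
# blocked writer wait for the lock instead of failing with "database is locked".
print(conn.execute("PRAGMA journal_mode=WAL;").fetchone())  # -> ('wal',)
conn.execute("PRAGMA busy_timeout=30000;")                  # 30 seconds
conn.close()
```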
- `max_depth` (default: 5): maximum nesting depth for recursive unpacking
- `prefix_separator` (default: `"_"`): separator for nested field names
- `array_table_separator` (default: `"__"`): separator for child table names
- `skip_empty_arrays` (default: `True`): don't insert rows for empty arrays
- `create_indexes` (default: `True`): create indexes on foreign keys
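For reference, these options could be grouped into a small config object; the names and defaults below come from the list above, but how the tool actually stores them internally is not specified here:

```python
from dataclasses import dataclass

@dataclass
class UnpackConfig:
    """Illustrative grouping of the unpacking options listed above."""
    max_depth: int = 5
    prefix_separator: str = "_"
    array_table_separator: str = "__"
    skip_empty_arrays: bool = True
    create_indexes: bool = True
```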
Check if the field is being detected as an array in the schema:

```sql
-- Should show tables like parent__fieldname
SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;
```

Ensure foreign keys are enabled:

```sql
PRAGMA foreign_keys=ON;
```

The tool only creates columns for fields that exist in the OpenAPI schema. If a field appears in the JSON but not in the database:
- Check if the field is defined in the schema
- Verify the schema resolver is properly handling `allOf`/`oneOf`
- Use `--verbose` to see which fields are being detected
For large datasets:
- Increase `--page-size` (up to 1000)
- Adjust `--rps` based on your API rate limits
- Use `--limit` for testing before full ingestion
- Consider `--no-indexes` during bulk load, and add indexes afterwards (see the sketch below)
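If you load with `--no-indexes`, foreign-key indexes can be added afterwards. A minimal sketch using the child tables from the earlier examples; the index names are arbitrary, so adapt the statements to your own tables:

```python
import sqlite3

conn = sqlite3.connect("./pure.sqlite")
conn.executescript("""
    CREATE INDEX IF NOT EXISTS idx_ds_contributors_parent
        ON data_sets__contributors (data_sets_uuid);
    CREATE INDEX IF NOT EXISTS idx_ds_contrib_orgs_parent
        ON data_sets__contributors__organizations
           (data_sets__contributors_data_sets_uuid, data_sets__contributors_ord);
""")
conn.close()
```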
- Maximum depth limit prevents infinite recursion
- Cycle detection breaks circular schema references
- Stack tracking prevents duplicate resolution
- Caches resolved schemas to avoid recomputation
- Handles `$ref`, `allOf`, `oneOf`, `anyOf` (a simplified resolver sketch follows this list)
- Expands discriminator mappings
- Merges inherited properties from base schemas
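A simplified sketch of `$ref` resolution with cycle protection and `allOf` merging; the real resolver also handles `oneOf`/`anyOf` variants and discriminator mappings, so this only illustrates the core idea:

```python
def resolve(schema, spec, seen=None):
    """Resolve local $ref pointers and merge allOf, guarding against cycles."""
    seen = set() if seen is None else seen
    if "$ref" in schema:
        ref = schema["$ref"]                      # e.g. "#/components/schemas/DataSet"
        if ref in seen:
            return {}                             # break circular references
        target = spec
        for part in ref.lstrip("#/").split("/"):
            target = target[part]
        return resolve(target, spec, seen | {ref})
    if "allOf" in schema:
        merged = {"type": "object", "properties": {}}
        for variant in schema["allOf"]:
            merged["properties"].update(resolve(variant, spec, seen).get("properties", {}))
        return merged
    return schema
```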
- Auto-detects dates, enums, references
- Recognizes locale dictionaries by pattern
- Distinguishes scalars vs objects vs arrays
- Maps OpenAPI types to SQLite types (see the sketch below)
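The OpenAPI-to-SQLite type mapping might look roughly like the mapping below; SQLite's type affinity is forgiving, so treat the exact choices as assumptions about sensible defaults:

```python
# OpenAPI scalar types -> SQLite column types (illustrative defaults)
OPENAPI_TO_SQLITE = {
    "string": "TEXT",      # date / date-time formats are stored as ISO-8601 text
    "integer": "INTEGER",
    "number": "REAL",
    "boolean": "INTEGER",  # SQLite has no native boolean; stored as 0/1
}

def sqlite_type(openapi_type: str) -> str:
    # Unknown or object-valued fields fall back to TEXT (raw JSON).
    return OPENAPI_TO_SQLITE.get(openapi_type, "TEXT")
```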
This tool is provided under the MIT License. See LICENSE file for details.
Contributions are welcome! Please feel free to submit issues or pull requests.
For issues or questions:
- Check the SQLite database structure matches expectations
- Use `--verbose` for detailed execution logs
- Test with `--limit 10` before full ingestion
- Review the OpenAPI schema for field definitions
- Check existing issues on GitHub before creating new ones
This tool was developed for use with Elsevier Pure API systems. It is not officially affiliated with or endorsed by Elsevier.