shard-c6/dehelpers

Lightweight, production-hardened Python utilities for data engineering pipelines.


What It Does

  • 🌐 Resilient HTTP client for ETL pipelines with bounded retries and exponential backoff.
  • 🗄️ PostgreSQL helper with safe pooling, sessions, and auto-rollback.
  • 📝 Structured JSON logging with automatic deep secret redaction.

Quickstart

pip install dehelpers

A complete pipeline in under 15 lines:

from dehelpers import ResilientClient, DatabaseManager, get_logger

log = get_logger("my_pipeline", job_id="daily-sync")
client = ResilientClient()

# Connects automatically via DATABASE_URL env var
with DatabaseManager() as db, client:
    users = client.get("https://jsonplaceholder.typicode.com/users").json()
    log.info("Fetched users", extra={"count": len(users)})

    with db.session() as session:
        for user in users:
            session.execute(
                "INSERT INTO users (id, name) VALUES (:id, :name) ON CONFLICT DO NOTHING",
                {"id": user["id"], "name": user["name"]}
            )
    log.info("Ingestion complete")

Documentation & Links

  • 📚 Documentation: Installation, Getting Started, and FAQ
  • 📖 API Reference: Full details on every class and function
  • 💡 Examples: Runnable scripts for HTTP, DB, and Logging
  • 📝 Medium Article: The story behind building this library

Architecture & Flow

dehelpers architecture

(For an interactive version of this diagram, see the Architecture Docs)


Boundaries & Capabilities

Here is exactly what this package is and what it is not:

| Category / Layer | What this IS | What this IS NOT |
| --- | --- | --- |
| API / HTTP | A retry-protected wrapper around requests.Session with exponential backoff, jitter, and simple pagination. | An asynchronous network library (like aiohttp or httpx), fully-fledged HTTP client replacement, or GraphQL API wrapper. |
| Database | A thread-safe connection manager for PostgreSQL with pooling configuration, automated transaction commits/rollbacks, and lazy DataFrame output. | An Object-Relational Mapper (ORM) (like SQLModel/SQLAlchemy ORM), schema migration engine (like Alembic), or database administration tool. |
| Logging | A zero-dependency structured JSON formatter on top of standard logging with automatic deep secrets redaction. | A log routing system (like Fluentd/Logstash), file logger, metrics exporter, or complex log management server. |
| Execution Context | Designed for batch execution environments like Airflow tasks, ETL scripts, and containerized Docker runtimes. | Suitable for high-throughput, low-latency, real-time web servers or async microservices. |
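
To make the Logging row concrete, here is a minimal sketch of a JSON formatter built only on the standard logging module. It is not the dehelpers implementation: the names JsonFormatter and build_logger and the output field names are assumptions chosen for illustration, and redaction is left out to keep the example short.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: renders each record as one JSON object per line."""

    # Attributes every LogRecord carries; anything else arrived via `extra`.
    _STANDARD = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
        "message",
        "asctime",
    }

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy user-supplied `extra` fields into the JSON object.
        for key, value in vars(record).items():
            if key not in self._STANDARD:
                payload[key] = value
        # default=str keeps non-JSON types (dates, Decimals) from raising.
        return json.dumps(payload, default=str)


def build_logger(name: str, stream) -> logging.Logger:
    """Attach a single JSON handler writing to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


buffer = io.StringIO()
log = build_logger("sketch", buffer)
log.info("Fetched users", extra={"count": 3})
print(buffer.getvalue(), end="")
```

Because the formatter only touches the record it is handed, routing, file rotation, and metrics stay with whatever handler or collector the runtime already provides.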

Comparison with Standard Setup

How this package compares to a standard DIY setup:

| Feature / Criteria | Standard Setup (requests + logging + psycopg) | dehelpers |
| --- | --- | --- |
| Secret Leakage Protection | Manual / None. Secrets easily print to stdout or appear in exception tracebacks. | Automatic & Deep Recursive: Redacts predefined secrets from nested metadata, logs, and query parameters. |
| Retry & Jitter Strategy | Manual loops or boilerplate urllib3 retry configurations. | Out-of-the-box resilience: Exponential backoff with random jitter and clock-based total_timeout limit. |
| Pagination Handling | Custom pagination loop logic required for every API endpoint. | Next-link strategy Protocol: Yields individual items transparently and safely with validation. |
| Connection Safety | Connection leaks or transaction rollback failures if block managers are missed. | Context-managed Session: Engine-pooled with pre-ping checks, pool timeout, and auto-rollback. |
| Dependency Footprint | Heavy setup if installing frameworks like Loguru, Structlog, or heavy database utilities. | Ultra-lightweight: Base dependencies are minimal. Pandas is entirely optional and lazy-loaded. |
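
The retry row can be illustrated with a short, standard-library-only sketch of bounded retries with exponential backoff, random jitter, and a clock-based overall budget. Only the name total_timeout comes from the table above; retry_with_backoff and its other parameters are hypothetical, and this is not how ResilientClient is necessarily implemented.

```python
import random
import time


def retry_with_backoff(
    call,
    *,
    max_attempts=5,
    base_delay=0.5,
    max_delay=30.0,
    total_timeout=60.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Hypothetical helper: run `call()` until it succeeds or a limit is hit."""
    deadline = clock() + total_timeout
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            # A real client would retry only transient errors (timeouts, 5xx).
            if attempt == max_attempts:
                raise  # bounded retries: give up after the last attempt
            # Exponential growth capped at max_delay, then full jitter.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = random.uniform(0, ceiling)
            if clock() + delay > deadline:
                raise  # sleeping would exceed the overall time budget
            sleep(delay)


attempts = {"n": 0}


def flaky():
    """Fails twice, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(retry_with_backoff(flaky, sleep=lambda seconds: None))
```

Injecting sleep and clock keeps the helper testable without real waiting. The jitter spreads retries from many workers so they do not hit a recovering service at the same instant.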

Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| DATABASE_URL (env var) | (none) | PostgreSQL connection string (fallback when dsn is not passed) |
| pool_size | 5 | Persistent connections in the pool |
| max_overflow | 2 | Extra connections beyond pool_size |
| pool_recycle | 1800 | Seconds before connection recycling |
| pool_pre_ping | True | Health-check connections before use |
| pool_timeout | 30 | Seconds to wait for a pool connection |
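
As a rough illustration of how these settings fit together, the sketch below models the defaults and the DATABASE_URL fallback in plain Python. PoolSettings and resolve_dsn are hypothetical names, not part of dehelpers. The field names match keyword arguments of SQLAlchemy's create_engine, but whether DatabaseManager accepts them in this exact form is an assumption to check against the API Reference.

```python
import os
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PoolSettings:
    """Hypothetical container mirroring the defaults in the table above."""

    pool_size: int = 5          # persistent connections
    max_overflow: int = 2       # extra connections beyond pool_size
    pool_recycle: int = 1800    # seconds before a connection is recycled
    pool_pre_ping: bool = True  # health-check before handing out a connection
    pool_timeout: int = 30      # seconds to wait for a free connection


def resolve_dsn(dsn=None, env=os.environ):
    """Prefer an explicit dsn; otherwise fall back to DATABASE_URL."""
    if dsn:
        return dsn
    try:
        return env["DATABASE_URL"]
    except KeyError:
        raise RuntimeError("No dsn passed and DATABASE_URL is not set") from None


# Override one default and keep the rest.
settings = PoolSettings(pool_size=10)
engine_kwargs = asdict(settings)
print(engine_kwargs)
```

With these defaults, at most pool_size + max_overflow = 7 connections are open at once per pool, which is worth multiplying by the number of worker processes when sizing the PostgreSQL max_connections limit.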

Security

Automatic Redaction

The logger and API client automatically redact values for these keys in log output:

password, secret, token, api_key, authorization, dsn, connection_string, credential, passphrase, private_key, client_secret

Matching is a case-insensitive substring match on the key name, so db_password is redacted because it contains password.

You can extend the redaction list:

from dehelpers._redact import redact_dict

result = redact_dict(
    {"my_custom_secret": "value"},
    extra_sensitive_keys=frozenset({"my_custom_secret"}),
)
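
The matching rule itself (case-insensitive substring match on key names, applied recursively) can be sketched in a few lines. This is an illustrative stand-in, not the code behind redact_dict: the function name redact, the extra_keys parameter, and the mask string are assumptions.

```python
SENSITIVE_KEYS = frozenset({
    "password", "secret", "token", "api_key", "authorization", "dsn",
    "connection_string", "credential", "passphrase", "private_key",
    "client_secret",
})


def redact(value, extra_keys=frozenset(), mask="***REDACTED***"):
    """Hypothetical deep redactor: returns a redacted copy of `value`."""
    keys = SENSITIVE_KEYS | {k.lower() for k in extra_keys}

    def is_sensitive(name):
        # Case-insensitive substring match: "db_password" contains "password".
        lowered = str(name).lower()
        return any(k in lowered for k in keys)

    def walk(node):
        if isinstance(node, dict):
            return {
                k: mask if is_sensitive(k) else walk(v)
                for k, v in node.items()
            }
        if isinstance(node, (list, tuple)):
            return type(node)(walk(item) for item in node)
        return node  # scalars pass through unchanged

    return walk(value)


event = {
    "user": "ana",
    "db_password": "hunter2",
    "meta": {"API_KEY": "k-123", "items": [{"Token": "t-456", "id": 1}]},
}
print(redact(event))
```

Substring matching errs on the side of over-redaction: a harmless key such as token_count would also be masked under this rule, which is usually the safer failure mode for logs.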

⚠️ Never Embed Secrets in URLs

URL query parameter values are redacted, but path segments are not. Never construct URLs like:

https://api.example.com/v1/token/abc123/data  # BAD: token in path

Instead, pass secrets via headers or request body.
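
The reason path segments escape redaction is that a key-based redactor has a parameter name to match in the query string but nothing to match in the path. The sketch below shows this with urllib.parse. redact_url and its shortened key list are hypothetical and only approximate what the client does.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Shortened key list for brevity; an assumption, not the library's full list.
SENSITIVE = ("password", "secret", "token", "api_key")


def redact_url(url, mask="REDACTED"):
    """Hypothetical helper: mask sensitive query values, leave the path alone."""
    parts = urlsplit(url)
    query = [
        (key, mask if any(s in key.lower() for s in SENSITIVE) else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# The query value is masked because its key name matches.
print(redact_url("https://api.example.com/v1/data?api_key=abc123&page=2"))

# The path has no key to match, so the token passes through untouched.
print(redact_url("https://api.example.com/v1/token/abc123/data"))
```

The second call returns the URL unchanged, which is exactly the leak the warning describes.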


Fork Safety (Airflow / Multiprocessing)

If you use DatabaseManager in a forked environment (e.g. Airflow workers, multiprocessing), you must either:

  1. Create the DatabaseManager inside each worker process, or
  2. Call db.dispose() before forking.

SQLAlchemy connection pools are not safe to share across forked processes.
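
One way to follow option 1 without threading a manager through every task is to build it lazily and key it to the current process ID. The sketch below is a generic pattern, not a dehelpers feature: PerProcess is a hypothetical helper, and the factory here returns a plain object where real code would construct a DatabaseManager.

```python
import os


class PerProcess:
    """Hypothetical holder: builds one resource per process, lazily."""

    def __init__(self, factory):
        self._factory = factory
        self._resource = None
        self._owner_pid = None

    def get(self):
        pid = os.getpid()
        if self._resource is None or self._owner_pid != pid:
            # First use in this process, or we are in a forked child:
            # build a fresh resource instead of reusing the parent's.
            self._resource = self._factory()
            self._owner_pid = pid
        return self._resource


# Stand-in factory; real code would pass something like
# `lambda: DatabaseManager()` (assumed usage, not verified here).
holder = PerProcess(object)

first = holder.get()
print(holder.get() is first)  # same process, same resource
```

This covers only the "create it inside each worker" option. It does not clean up whatever the parent created, so calling db.dispose() before forking is still the right choice when the parent has already opened connections.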


Testing

Unit tests (no PostgreSQL required)

pip install -e ".[dev,dataframe]"
pytest -v --tb=short -m "not postgres"

PostgreSQL integration tests

# Start a local PostgreSQL
docker run -d --name pg-test -e POSTGRES_PASSWORD=test -p 5432:5432 postgres:16

# Run integration tests
DATABASE_URL="postgresql+psycopg://postgres:test@localhost:5432/postgres" \
    pytest -m postgres -v

Coverage

pytest --cov=dehelpers --cov-report=term-missing -m "not postgres"

Developer Resources & Standards

To ensure the library remains production-grade, reliable, and easily maintainable, we enforce the following open-source standards:

  • CONTRIBUTING.md: Guidelines for cloning the fork, setting up local editable environments, running unit tests, and opening PRs.
  • CODE_OF_CONDUCT.md: Our pledge to foster an inclusive, welcoming, and harassment-free community.
  • CHANGELOG.md: Structured history of features, bugfixes, and breaking changes.
  • LICENSE: Permissive MIT License.

License

Distributed under the MIT License. See LICENSE for more information.
