Future: Automatic batching - gateway-side file buffering for optimized uploads #12

@crtahlin

Description

Context

This is a future enhancement that builds on #11 (manifest/collection upload support).

Benchmark testing (datafund/provenance-fellowship#22) showed that per-request HTTP overhead (~279 ms) dominates upload latency for small files, while manifest bundling provides a 15-25x throughput improvement.
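As a rough sanity check on those numbers (the 100-file batch size here is an illustrative assumption, and real speedup is lower than this upper bound because payload transfer time is not zero):

```python
# Upper bound on the savings from batching: pay the per-request HTTP
# overhead once instead of once per file. The ~279 ms figure comes from
# the benchmark above; the file count is an illustrative assumption.
HTTP_OVERHEAD_S = 0.279
N_FILES = 100

sequential = N_FILES * HTTP_OVERHEAD_S   # one POST per file
batched = 1 * HTTP_OVERHEAD_S            # one manifest POST for the whole batch

print(f"sequential ~{sequential:.1f}s of overhead vs batched ~{batched:.3f}s")
```

Overhead alone accounts for nearly 28 seconds per 100 small files when uploaded one at a time, which is why the measured 15-25x improvement is plausible.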

However, requiring clients to create TAR archives adds complexity. This feature would allow the gateway to handle batching transparently.

Concept

The gateway accepts sequential file uploads but buffers them internally, uploading the buffered files as a single manifest once a threshold is met:

Client                          Gateway                              Bee
  |                                |                                   |
  |-- POST /data (file1) -------> | [buffer]                          |
  |<-- 202 Accepted, queued ------|                                   |
  |-- POST /data (file2) -------> | [buffer]                          |
  |<-- 202 Accepted, queued ------|                                   |
  |-- POST /data (file3) -------> | [buffer reaches threshold]        |
  |                                |-- POST /bzz (manifest) ---------> |
  |                                |<-- {manifest_hash} --------------|
  |<-- {file1: hash1, ...} -------|                                   |
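A gateway-side buffer along these lines could look like the sketch below. The class and threshold names are hypothetical (this is not existing gateway code); the flush step packs the buffer into the TAR collection format that Bee's /bzz endpoint accepts:

```python
# Hypothetical sketch of gateway-side batching: files accumulate until any
# threshold fires, then the whole buffer is packed into one TAR archive
# suitable for a single POST /bzz collection upload.
import io
import tarfile
import time

class BatchBuffer:
    def __init__(self, max_files=100, max_bytes=1_000_000, max_age_s=5.0):
        self.max_files = max_files
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self._files = []          # list of (name, bytes) awaiting upload
        self._first_add = None    # monotonic time of the oldest queued file

    def add(self, name: str, data: bytes) -> None:
        if self._first_add is None:
            self._first_add = time.monotonic()
        self._files.append((name, data))

    def should_flush(self) -> bool:
        """True if any threshold (count, size, or age) has been reached."""
        if not self._files:
            return False
        total = sum(len(d) for _, d in self._files)
        age = time.monotonic() - self._first_add
        return (len(self._files) >= self.max_files
                or total >= self.max_bytes
                or age >= self.max_age_s)

    def flush(self) -> bytes:
        """Drain the buffer into a TAR archive for one /bzz request."""
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            for name, data in self._files:
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        self._files.clear()
        self._first_add = None
        return buf.getvalue()
```

A manual flush endpoint would simply call `flush()` regardless of `should_flush()`, covering the "manual flush" trigger listed below.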

Possible Trigger Thresholds

| Threshold | Example | Description |
| --- | --- | --- |
| File count | 100 files | Upload when the buffer holds N files |
| Total size | 1 MB | Upload when the buffer reaches N bytes |
| Time-based | 5 seconds | Upload after a timeout even if no other threshold is met |
| Manual flush | POST /flush | Client explicitly triggers upload |

Possible API Design

# Enable batching mode
POST /api/v1/data/?stamp_id={id}&batch=true

# With configurable thresholds
POST /api/v1/data/?stamp_id={id}&batch=true&batch_size=100&batch_timeout=5s

# Get status of pending batch
GET /api/v1/batch/status?stamp_id={id}

# Force flush pending files
POST /api/v1/batch/flush?stamp_id={id}

Response Handling Options

  1. Synchronous: Block until the batch uploads, then return all hashes (simpler but slower)
  2. Async with polling: Return 202; the client polls for results
  3. Async with callback: Return 202; the gateway calls a webhook with the results
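Option 2 could be sketched as follows. The HTTP call is stubbed out behind a `fetch_status` callable so the retry logic is visible on its own; the status payload shape (`state`, `hashes`) is an assumption, since none of these endpoints exist yet:

```python
# Sketch of the async-with-polling option: the client repeatedly checks
# batch status until the gateway reports the upload as complete.
# `fetch_status` stands in for a real GET /api/v1/batch/status call.
import time

def poll_batch(fetch_status, interval_s=0.0, max_attempts=10):
    """Poll until the batch is uploaded, then return the filename->hash map."""
    for _ in range(max_attempts):
        status = fetch_status()
        if status.get("state") == "uploaded":
            return status["hashes"]       # e.g. {"file1": "hash1", ...}
        time.sleep(interval_s)
    raise TimeoutError("batch not uploaded within polling budget")

# Simulated gateway: still buffering on the first two polls, then done.
responses = iter([{"state": "buffering"},
                  {"state": "buffering"},
                  {"state": "uploaded", "hashes": {"file1": "hash1"}}])
print(poll_batch(lambda: next(responses)))  # {'file1': 'hash1'}
```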

Benefits

  • Transparent to client: No TAR creation needed client-side
  • Optimal batching: Gateway decides best batch size based on traffic patterns
  • Reduced complexity: Simple POST per file, gateway handles optimization

Dependencies

Depends on #11 (manifest/collection upload support).
Status

TBD - exact behavior, API design, and thresholds are still to be determined.

This issue tracks a future enhancement. Implementation priority depends on client needs and #11 completion.
