
feat(ingestion/s3,ingestion/abs): add .zip archive support to data lake connectors#17006

Open
acrylJonny wants to merge 2 commits into master from s3-zip-ext

Conversation

@acrylJonny
Collaborator

Summary

Extends the S3 and Azure Blob Storage (ABS) data lake connectors to support .zip archives in addition to the existing .gz and .bz2 compression formats.

Unlike .gz/.bz2 (single-stream, transparently decompressed by smart_open), .zip is a multi-file archive whose central directory lives at the end of the file. Efficient reading therefore requires random access rather than streaming. This PR implements that using HTTP range requests, avoiding the need to download the entire archive before inspecting it.
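The range-request approach can be sketched as a minimal seekable wrapper. Note this is an illustrative sketch, not the PR's code: `SeekableRangeReader` and its `fetch_range` callback are hypothetical stand-ins for the connector's `SeekableS3File`/`SeekableABSFile` and their backend `Range: bytes=X-Y` calls; only the seek/tell/read surface that `zipfile` needs is shown.

```python
import io
import zipfile


class SeekableRangeReader(io.RawIOBase):
    """File-like object backed by byte-range requests (sketch).

    `fetch_range(start, end)` stands in for an HTTP request with
    `Range: bytes=start-end` (end inclusive); `size` is the object's
    total length, e.g. obtained from a HEAD request.
    """

    def __init__(self, fetch_range, size):
        self._fetch = fetch_range
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        # zipfile seeks relative to the end to find the central directory.
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def read(self, n=-1):
        if n < 0:
            n = self._size - self._pos
        if n <= 0 or self._pos >= self._size:
            return b""
        end = min(self._pos + n, self._size) - 1  # Range end is inclusive
        data = self._fetch(self._pos, end)
        self._pos += len(data)
        return data
```

Because `zipfile.ZipFile` only seeks and reads small slices (the end-of-central-directory record, the central directory, then the requested entry), wrapping the object this way avoids fetching the whole archive.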

Changes

Core — schema inference

  • Added SeekableS3File and SeekableABSFile — file-like wrappers that satisfy zipfile.ZipFile's seekable interface by issuing byte-range requests (Range: bytes=X-Y) to S3 / Azure Blob Storage.
  • Added _open_zip_entry() to both S3Source and ABSSource. Opens the first entry with a supported extension (.csv, .json, .parquet, etc.) and returns its bytes as an io.BytesIO along with the inner extension, which is then used for schema inference exactly like any other file format.
  • Archives containing more than one file log a warning and process only the first matching entry ("single-entry" policy).
  • Added "zip" to SUPPORTED_COMPRESSIONS in PathSpec. Validation accepts .csv.zip, .json.zip, .parquet.zip, etc. as valid include patterns.
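The single-entry policy described above can be illustrated with a standalone sketch. `open_zip_entry` and `SUPPORTED_INNER_EXTS` are hypothetical names, and the real connector's extension list and warning wording may differ:

```python
import io
import logging
import zipfile

logger = logging.getLogger(__name__)

# Assumed inner extensions for illustration; the connector's list may differ.
SUPPORTED_INNER_EXTS = (".csv", ".tsv", ".json", ".parquet", ".avro")


def open_zip_entry(fp):
    """Return (BytesIO of first supported entry, inner extension).

    Archives with more than one matching entry log a warning and only
    the first is processed ("single-entry" policy).
    """
    with zipfile.ZipFile(fp) as zf:
        matches = [
            name for name in zf.namelist()
            if name.lower().endswith(SUPPORTED_INNER_EXTS)
        ]
        if not matches:
            raise ValueError("no supported file found inside archive")
        if len(matches) > 1:
            logger.warning(
                "Archive contains %d matching entries; only %s will be processed",
                len(matches), matches[0],
            )
        name = matches[0]
        inner_ext = "." + name.rsplit(".", 1)[1].lower()
        return io.BytesIO(zf.read(name)), inner_ext
```

The returned `BytesIO` plus inner extension can then feed the existing per-format schema inference unchanged.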

Core — Spark profiling (s3/profiling.py)

  • Text formats (.csv.zip, .json.zip, .tsv.zip) are handled transparently by Hadoop's built-in ZipCodec — the original S3 path is passed to Spark unchanged.
  • Binary formats (.parquet.zip, .avro.zip) bypass Hadoop's codec factory, so a new _extract_zip_to_tmp() helper downloads the archive, extracts the inner file to a NamedTemporaryFile, and passes the local path to Spark. The temp file is deleted in a finally block regardless of success or failure.
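A minimal sketch of the extract-to-temp-file path, assuming the archive has already been downloaded locally; `extract_zip_to_tmp` is a hypothetical stand-in for the PR's `_extract_zip_to_tmp()` helper, and the cleanup-in-`finally` contract shown in the comment mirrors the behaviour described above:

```python
import os
import tempfile
import zipfile


def extract_zip_to_tmp(archive_path, inner_ext):
    """Extract the first entry ending in `inner_ext` to a temp file.

    Returns the temp file's path; the caller is responsible for
    deleting it, success or failure:

        local = extract_zip_to_tmp(downloaded, ".parquet")
        try:
            ...  # hand `local` to Spark
        finally:
            os.remove(local)
    """
    with zipfile.ZipFile(archive_path) as zf:
        name = next(
            n for n in zf.namelist() if n.lower().endswith(inner_ext)
        )
        # delete=False so the path outlives this function for Spark to read.
        tmp = tempfile.NamedTemporaryFile(suffix=inner_ext, delete=False)
        with tmp:
            tmp.write(zf.read(name))
        return tmp.name
```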

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Apr 13, 2026
@github-actions
Contributor

Linear: ING-2266

@codecov

codecov bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 76.95853% with 50 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...stion/src/datahub/ingestion/source/s3/profiling.py 50.74% 33 Missing ⚠️
...gestion/src/datahub/ingestion/source/abs/source.py 85.13% 11 Missing ⚠️
...ngestion/src/datahub/ingestion/source/s3/source.py 92.10% 6 Missing ⚠️


@datahub-connector-tests

Connector Tests Results

All connector tests passed for commit a03f20d

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.


Labels

ingestion: PR or Issue related to the ingestion of metadata
needs-review: Label for PRs that need review from a maintainer.
