
feat(ingestion/s3,ingestion/abs): add .zip archive support to data lake connectors#17006

Open
acrylJonny wants to merge 2 commits into master from s3-zip-ext

Conversation

@acrylJonny
Collaborator

Summary

Extends the S3 and Azure Blob Storage (ABS) data lake connectors to support .zip archives in addition to the existing .gz and .bz2 compression formats.

Unlike .gz/.bz2 (single-stream, transparently decompressed by smart_open), .zip is a multi-file archive whose central directory lives at the end of the file. Efficient reading therefore requires random access rather than streaming. This PR implements that using HTTP range requests, avoiding the need to download the entire archive before inspecting it.
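The range-request approach can be sketched as a minimal seekable wrapper. Note this is an illustrative sketch, not the PR's code: `SeekableRangeReader` and its `fetch_range` callback are hypothetical stand-ins for the connector's `SeekableS3File`/`SeekableABSFile` and their backend `Range: bytes=X-Y` calls; only the seek/tell/read surface that `zipfile` needs is shown.

```python
import io
import zipfile


class SeekableRangeReader(io.RawIOBase):
    """File-like object backed by byte-range requests (sketch).

    `fetch_range(start, end)` stands in for an HTTP request with
    `Range: bytes=start-end` (end inclusive); `size` is the object's
    total length, e.g. obtained from a HEAD request.
    """

    def __init__(self, fetch_range, size):
        self._fetch = fetch_range
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        # zipfile seeks relative to the end to find the central directory.
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def read(self, n=-1):
        if n < 0:
            n = self._size - self._pos
        if n <= 0 or self._pos >= self._size:
            return b""
        end = min(self._pos + n, self._size) - 1  # Range end is inclusive
        data = self._fetch(self._pos, end)
        self._pos += len(data)
        return data
```

Because `zipfile.ZipFile` only seeks and reads small slices (the end-of-central-directory record, the central directory, then the requested entry), wrapping the object this way avoids fetching the whole archive.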

Changes

Core — schema inference

  • Added SeekableS3File and SeekableABSFile — file-like wrappers that satisfy zipfile.ZipFile's seekable interface by issuing byte-range requests (Range: bytes=X-Y) to S3 / Azure Blob Storage.
  • Added _open_zip_entry() to both S3Source and ABSSource. Opens the first entry with a supported extension (.csv, .json, .parquet, etc.) and returns its bytes as an io.BytesIO along with the inner extension, which is then used for schema inference exactly like any other file format.
  • Archives containing more than one file log a warning and process only the first matching entry ("single-entry" policy).
  • Added "zip" to SUPPORTED_COMPRESSIONS in PathSpec. Validation accepts .csv.zip, .json.zip, .parquet.zip, etc. as valid include patterns.
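The single-entry policy described above can be illustrated with a standalone sketch. `open_zip_entry` and `SUPPORTED_INNER_EXTS` are hypothetical names, and the real connector's extension list and warning wording may differ:

```python
import io
import logging
import zipfile

logger = logging.getLogger(__name__)

# Assumed inner extensions for illustration; the connector's list may differ.
SUPPORTED_INNER_EXTS = (".csv", ".tsv", ".json", ".parquet", ".avro")


def open_zip_entry(fp):
    """Return (BytesIO of first supported entry, inner extension).

    Archives with more than one matching entry log a warning and only
    the first is processed ("single-entry" policy).
    """
    with zipfile.ZipFile(fp) as zf:
        matches = [
            name for name in zf.namelist()
            if name.lower().endswith(SUPPORTED_INNER_EXTS)
        ]
        if not matches:
            raise ValueError("no supported file found inside archive")
        if len(matches) > 1:
            logger.warning(
                "Archive contains %d matching entries; only %s will be processed",
                len(matches), matches[0],
            )
        name = matches[0]
        inner_ext = "." + name.rsplit(".", 1)[1].lower()
        return io.BytesIO(zf.read(name)), inner_ext
```

The returned `BytesIO` plus inner extension can then feed the existing per-format schema inference unchanged.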

Core — Spark profiling (s3/profiling.py)

  • Text formats (.csv.zip, .json.zip, .tsv.zip) are handled transparently by Hadoop's built-in ZipCodec — the original S3 path is passed to Spark unchanged.
  • Binary formats (.parquet.zip, .avro.zip) bypass Hadoop's codec factory, so a new _extract_zip_to_tmp() helper downloads the archive, extracts the inner file to a NamedTemporaryFile, and passes the local path to Spark. The temp file is deleted in a finally block regardless of success or failure.
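A minimal sketch of the extract-to-temp-file path, assuming the archive has already been downloaded locally; `extract_zip_to_tmp` is a hypothetical stand-in for the PR's `_extract_zip_to_tmp()` helper, and the cleanup-in-`finally` contract shown in the comment mirrors the behaviour described above:

```python
import os
import tempfile
import zipfile


def extract_zip_to_tmp(archive_path, inner_ext):
    """Extract the first entry ending in `inner_ext` to a temp file.

    Returns the temp file's path; the caller is responsible for
    deleting it, success or failure:

        local = extract_zip_to_tmp(downloaded, ".parquet")
        try:
            ...  # hand `local` to Spark
        finally:
            os.remove(local)
    """
    with zipfile.ZipFile(archive_path) as zf:
        name = next(
            n for n in zf.namelist() if n.lower().endswith(inner_ext)
        )
        # delete=False so the path outlives this function for Spark to read.
        tmp = tempfile.NamedTemporaryFile(suffix=inner_ext, delete=False)
        with tmp:
            tmp.write(zf.read(name))
        return tmp.name
```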

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Apr 13, 2026
@github-actions
Contributor

Linear: ING-2266

@codecov

codecov bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 76.95853% with 50 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...stion/src/datahub/ingestion/source/s3/profiling.py 50.74% 33 Missing ⚠️
...gestion/src/datahub/ingestion/source/abs/source.py 85.13% 11 Missing ⚠️
...ngestion/src/datahub/ingestion/source/s3/source.py 92.10% 6 Missing ⚠️


@datahub-connector-tests

Connector Tests Results

All connector tests passed for commit a03f20d

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.


Labels

ingestion: PR or Issue related to the ingestion of metadata
needs-review: Label for PRs that need review from a maintainer.
