
feat(storage-azdls): Add Azure Datalake Storage support #1368

Draft · wants to merge 8 commits into main

Conversation

@DerGut (Author) commented May 22, 2025

Which issue does this PR close?

What changes are included in this PR?

This PR adds an integration for the Azure Datalake Storage service. At its core, it adds parsing logic for configuration properties. The finished config struct is simply passed down to OpenDAL. In addition, it adds logic to parse fully qualified file URIs and match them against the expected (previously configured) values.

It also creates a new Storage::Azdls enum variant based on OpenDAL's existing Scheme::Azdls enum variant. It then fits the parsing logic into the existing framework to build the storage integration from an io::FileIOBuilder.
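 
For illustration, a minimal sketch of how the new variant might be exercised through the builder. I'm assuming iceberg-rust's FileIOBuilder API (new / with_prop / build); the "abfss" scheme string and the "adls.*" property keys are placeholders for this example, not necessarily the constants the PR defines.

use iceberg::io::{FileIO, FileIOBuilder};

fn build_azdls_file_io() -> iceberg::Result<FileIO> {
    // The scheme string selects the Storage::Azdls variant; the props are
    // passed to the parsing logic and, ultimately, to OpenDAL's AzdlsConfig.
    FileIOBuilder::new("abfss")
        .with_prop("adls.account-name", "myaccount")
        .with_prop("adls.account-key", "<account-key>")
        .build()
}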

Note on WASB support

Other Iceberg ADLS integrations (pyiceberg + Java) also support the wasb:// and wasbs:// schemes.
WASB refers to a client-side implementation of hierarchical namespaces on top of Blob Storage. ADLS(v2) on the other hand is a service offered by Azure, also built on top of Blob Storage.
IIUC we can accept both schemes because objects written to Blob Storage via wasb:// will also be accessible via abfs:// (which operates on the same Blob Storage).
Even though the URIs slightly differ in format when they refer to the same object, we can largely reuse existing logic.

-wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
+abfs[s]://<filesystemname>@<accountname>.dfs.core.windows.net/<path>
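 
In other words, translating between the two is mostly a matter of swapping the scheme and rewriting the .blob. host segment to .dfs.. A rough sketch (the helper name and approach are just an illustration, not necessarily what this PR does):

// Rewrites wasb[s]://<container>@<account>.blob.<suffix>/<path>
// into     abfs[s]://<container>@<account>.dfs.<suffix>/<path>.
fn wasb_to_abfs(uri: &str) -> Option<String> {
    let (scheme, rest) = uri.split_once("://")?;
    let scheme = match scheme {
        "wasb" => "abfs",
        "wasbs" => "abfss",
        _ => return None,
    };
    // Only the first ".blob." (the host segment) should be rewritten.
    Some(format!("{scheme}://{}", rest.replacen(".blob.", ".dfs.", 1)))
}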

Are these changes tested?

Unit

I added minor unit tests to validate the configuration property parsing logic.

Integration

I decided not to add integration tests because

  1. ADLS is not S3-compatible, which means we can't reuse our MinIO setup
  2. the Azure-specific alternative for local testing, Azurite, doesn't support ADLS

End-to-end

I have yet to test it in a functioning environment.

DerGut added 2 commits May 22, 2025 12:17
Signed-off-by: Jannik Steinmann <jannik.steinmann@datadoghq.com>
Signed-off-by: Jannik Steinmann <jannik.steinmann@datadoghq.com>
@DerGut changed the title from "Add ADLS storage support" to "feat(storage-azdls): Add Azure Datalake Storage support" on May 22, 2025
let mut cfg = AzdlsConfig::default();

if let Some(_conn_str) = m.remove(ADLS_CONNECTION_STRING) {
    return Err(Error::new(
@DerGut (Author)

When we get connection string parsing into OpenDAL, we should be able to call something like AzdlsConfig::try_from_connection_string() here instead.

Then, we can also mark the ADLS_CONNECTION_STRING constant as public.
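 
Purely as an illustration of the shape I have in mind (AzdlsConfig::try_from_connection_string does not exist in OpenDAL yet):

// Hypothetical follow-up: parse the connection string instead of rejecting it.
if let Some(conn_str) = m.remove(ADLS_CONNECTION_STRING) {
    cfg = AzdlsConfig::try_from_connection_string(&conn_str)?;
}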

@DerGut (Author) commented Jun 2, 2025

Currently, the endpoint is inferred from the fully specified paths passed to FileIO. Other implementations like pyiceberg and Iceberg Java additionally accept a connection string property, but Java for example doesn't allow the user to set the endpoint explicitly.

I'd suggest releasing without connection string parsing because the integration is already usable, and not adding an endpoint property for now so we don't clutter the configuration options.

@DerGut (Author)

I did some reordering here. I felt it was clearer to separate the feature-flagged mod and use statements from the non-feature-flagged ones. The #[cfg(... lines made it easy for me to overlook that something non-integration-related was declared (e.g. pub(crate) mod object_cache;).

I also separated the feature-flagged mod statements from the use statements. This allows my IDE to sort them, which I think makes sense with a growing number of storage integrations.
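 
Roughly the kind of layout I mean. Apart from object_cache (mentioned above) and the storage-azdls feature added by this PR, the module and feature names are just examples:

// Non-feature-flagged declarations first, so they don't get lost between cfg attributes.
pub(crate) mod object_cache;

// Feature-flagged storage modules, grouped so the IDE can keep them sorted.
#[cfg(feature = "storage-azdls")]
mod storage_azdls;
#[cfg(feature = "storage-s3")]
mod storage_s3;

// Feature-flagged re-exports, separated from the mod declarations above.
#[cfg(feature = "storage-azdls")]
use storage_azdls::*;
#[cfg(feature = "storage-s3")]
use storage_s3::*;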

@Xuanwo (Member) commented May 23, 2025

Nice work, @DerGut! Most of the changes in this PR look good to me. We can merge it after the next opendal release, which will include most of what we need here.

The other option is to add AZDLS support first without client_secret settings, and then include them in following PRs.

Signed-off-by: Jannik Steinmann <jannik.steinmann@datadoghq.com>
DerGut added 5 commits June 2, 2025 00:46
Signed-off-by: Jannik Steinmann <jannik.steinmann@datadoghq.com>
Signed-off-by: Jannik Steinmann <jannik.steinmann@datadoghq.com>
The filesystem will always be empty by the time we construct the operator, so there's no point in validating it.

Signed-off-by: Jannik Steinmann <jannik.steinmann@datadoghq.com>
Signed-off-by: Jannik Steinmann <jannik.steinmann@datadoghq.com>
@DerGut (Author) commented Jun 2, 2025

🗞️ Since this was already taking longer than expected, here is an update about what I learned in the past week or so 🎉

Some of these were misconceptions that led me to revamp parts of the PR; others are simply learnings that might be worth sharing or documenting.

Path Format

I wasn't sure which path format to expect and ended up revamping the PR to use fully qualified path notation (e.g. abfss://<myfs>@<myaccount>.dfs.core.windows.net/mydir/myfile.parquet). From what I've found ([1], [2]), it seems to be the most widely used format for specifying Azure Storage objects. We also use it internally. Another option is a shortened format (e.g. abfss://mydir/myfile.parquet), which would be more consistent with other storage path forms (like s3:// or gs://) and seems to be what pyiceberg expects.

Of course, supporting both could also be an option.

Update: While S3 and GCS enforce globally unique bucket names, Azure Storage container names are only unique within a storage account. This means that in addition to the container/filesystem name, we also need to carry the account name and endpoint suffix around. While these could technically be dropped once an operator has been configured with them, I think passing them along is more consistent with other uses of Azure Storage URIs.
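 
For illustration, a minimal sketch of pulling those pieces out of a fully qualified path (the type and function names are illustrative, not the PR's actual API):

// A fully qualified path has the form
// abfs[s]://<filesystem>@<account>.dfs.<endpoint-suffix>/<path>.
struct AzdlsLocation {
    filesystem: String,      // container / filesystem name
    account_name: String,
    endpoint_suffix: String, // e.g. "core.windows.net"
    path: String,
}

fn parse_location(uri: &str) -> Option<AzdlsLocation> {
    let rest = uri
        .strip_prefix("abfss://")
        .or_else(|| uri.strip_prefix("abfs://"))?;
    let (authority, path) = rest.split_once('/').unwrap_or((rest, ""));
    let (filesystem, host) = authority.split_once('@')?;
    let (account_name, endpoint_suffix) = host.split_once(".dfs.")?;
    Some(AzdlsLocation {
        filesystem: filesystem.to_string(),
        account_name: account_name.to_string(),
        endpoint_suffix: endpoint_suffix.to_string(),
        path: path.to_string(),
    })
}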

WASB

ADLSv2, Azure's most recent HDFS-compatible storage service, uses the abfs[s]:// protocol scheme. It's built on Blob Storage and provides HDFS abstractions as a service. IIUC, before ADLSv2 (and even before ADLSv1), WASB implemented similar APIs directly on top of Blob Storage, but client-side.
Iceberg Java ships support for the wasb[s]:// scheme using the same FileIO implementation. Supporting it in pyiceberg is under discussion.

It took me a while to wrap my head around this. But in essence I understand:

  1. wasb can access the same objects because they are ultimately stored in some Blob Storage container
  2. we can treat wasb://*.blob.* paths as abfs://*.dfs.* paths and simply use the server-side ADLSv2 APIs

Azurite

I don't yet fully understand how Azurite (Azure's local development storage emulator) fits into all of this. On the one hand, it seems to support only the Azure Blob Storage API, and the ADLSv2 APIs are missing (Azure/Azurite#553). In fact, I tried using it with OpenDAL's services-azdls and didn't get it to work.

Test setup
docker run -d -p 10000:10000 mcr.microsoft.com/azure-storage/azurite
az storage fs create --name test --connection-string "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
use opendal::{services::Azdls, Operator};

let builder = Azdls::default()
    .filesystem("test") // Created this one above
    .endpoint("http://127.0.0.1:10000/devstoreaccount1")
    .account_name("devstoreaccount1")
    .account_key("Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==");

let op = Operator::new(builder).unwrap().finish();
op.list("/").await.unwrap(); // Fails, but succeeds for the equivalent `Azblob`

On the other hand, both pyiceberg and Iceberg Java seem to be able to use Azurite for their test setups. I've partially understood that pyiceberg's ADLS FileIO implementation ([1], [2]) is built directly on Blob Storage rather than on ADLSv2. Iceberg Java, by contrast, seems to use Azure's ADLS client. I need to dig deeper to understand why the Java implementation is able to use Azurite.

Endpoints

Azure Storage uses different endpoints for different services. For example, Blob Storage uses the https://<account>.blob.<suffix> endpoint while ADLS uses https://<account>.dfs.<suffix>. These are the endpoints the underlying HTTP client sends its requests to.
The current PR implementation expects a fully qualified path to objects/files in ADLS. This means we can construct the endpoint from any path; e.g. abfss://myfs@myaccount.dfs.core.windows.net/dir/file.parquet yields the endpoint https://myaccount.dfs.core.windows.net.

In Azure SDKs, endpoints can be set either through an explicit endpoint configuration option or by passing a connection string. The current PR implementation validates that a configured endpoint matches what's defined in the fully qualified path.
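 
A small sketch of that behaviour, assuming the account name and endpoint suffix have already been parsed out of the path and that the secure abfss scheme maps to https (the function names are mine, not the PR's):

// Builds https://<account>.dfs.<suffix> from the parts of a fully qualified path.
fn endpoint_from_path_parts(account_name: &str, endpoint_suffix: &str) -> String {
    format!("https://{account_name}.dfs.{endpoint_suffix}")
}

// If an endpoint was configured explicitly, it has to agree with the path-derived one.
fn validate_endpoint(configured: Option<&str>, from_path: &str) -> Result<(), String> {
    match configured {
        Some(configured) if configured != from_path => Err(format!(
            "configured endpoint {configured} does not match path-derived endpoint {from_path}"
        )),
        _ => Ok(()),
    }
}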

If we decide to go with the fully qualified path format, I'd suggest keeping the configuration options as they are now, because users aren't required to configure the endpoint explicitly.
If we decide to use the short path notation instead, we could either introduce a new endpoint property (only pyiceberg has one; Java relies on the connection string), or wait for a new OpenDAL version to introduce connection string support.


Also to reply to your earlier comment

The other option is to add AZDLS support first without client_secret settings, and then include them in following PRs.

Since I was taking so long, this is now already included 😬

@DerGut (Author) commented Jun 2, 2025

🏁 My plan to get this PR to the finish line

  • abandon Azurite support (for now): Unfortunately this sacrifices integration tests and easy local development. But for the latter at least, users can still iterate with a different FileIO implementation
  • test it: I'm planning some file operations against a real ADLS account

🙏 And what I need from the community/reviewers

  • Path format: Do we want to settle on the long form, the short one or both? -> long
  • Defensive/typed vs. concise/dynamic-ish code: I've defaulted to a more defensive programming approach when implementing the path and configuration parsing/matching logic. This also helped me understand what I was working with. To be fair, it bloats the code, and we can still simplify it if preferred.

Development

Successfully merging this pull request may close these issues.

FileIO storage support for ADLSv2