feat(storage-azdls): Add Azure Datalake Storage support #1368
Conversation
Signed-off-by: Jannik Steinmann <jannik.steinmann@datadoghq.com>
Signed-off-by: Jannik Steinmann <jannik.steinmann@datadoghq.com>
```rust
let mut cfg = AzdlsConfig::default();

if let Some(_conn_str) = m.remove(ADLS_CONNECTION_STRING) {
    return Err(Error::new(
```
When we get connection string parsing into OpenDAL, we should be able to call something like `AzdlsConfig::try_from_connection_string()` here instead. Then, we can also mark the `ADLS_CONNECTION_STRING` constant as public.
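For illustration, the call site might then look something like the sketch below. `try_from_connection_string` doesn't exist in OpenDAL yet, and the property key name is assumed here:

```rust
use std::collections::HashMap;

use opendal::services::AzdlsConfig;

// Mirrors the private constant introduced in this PR (key name assumed).
const ADLS_CONNECTION_STRING: &str = "adls.connection-string";

// Hypothetical sketch: `try_from_connection_string` is not part of OpenDAL today.
fn azdls_config(mut m: HashMap<String, String>) -> Result<AzdlsConfig, opendal::Error> {
    if let Some(conn_str) = m.remove(ADLS_CONNECTION_STRING) {
        // Would parse account name, key, and endpoint out of the connection string.
        return AzdlsConfig::try_from_connection_string(&conn_str);
    }
    // Fall back to the individual configuration properties otherwise.
    Ok(AzdlsConfig::default())
}
```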
Currently, the endpoint is inferred from the fully specified paths passed to `FileIO`. Other implementations like pyiceberg and Iceberg Java additionally accept a connection string property, but Java, for example, doesn't allow the user to set the endpoint explicitly.

I'd suggest releasing without connection string parsing because it's already usable, and not adding an endpoint property for now, to avoid cluttering the configuration options.
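To illustrate what "inferred from the fully specified paths" means in practice, a minimal sketch — the function name and error handling are made up, and the PR's real parsing is more thorough:

```rust
// Illustrative sketch: infer the endpoint from a fully qualified ADLS path,
// e.g. "abfss://myfs@myaccount.dfs.core.windows.net/warehouse/t1/data.parquet".
fn infer_endpoint(path: &str) -> Option<String> {
    let rest = path
        .strip_prefix("abfss://")
        .or_else(|| path.strip_prefix("abfs://"))?;
    // The host between '@' and the first '/' is the account endpoint.
    let (_filesystem, host_and_path) = rest.split_once('@')?;
    let host = host_and_path.split('/').next()?;
    Some(format!("https://{host}")) // e.g. "https://myaccount.dfs.core.windows.net"
}
```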
crates/iceberg/src/io/mod.rs
I did some reordering here. I felt it was more obvious to separate the feature-flagged `mod` and `use` statements from the non-feature-flagged ones. The `#[cfg(...)]` lines made it easy for me to overlook that something non-integration-related was declared (e.g. `pub(crate) mod object_cache;`).

I also separated the feature-flagged `mod` statements from the feature-flagged `use` statements. This allows my IDE to sort them, which I think makes sense with a growing number of storage integrations.
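A schematic of the resulting layout — the module and feature names besides `object_cache` are illustrative, not necessarily the PR's exact ones:

```rust
// Non-feature-flagged declarations first, so they are hard to overlook.
pub(crate) mod object_cache;

// Feature-flagged `mod` statements, grouped so an IDE can keep them sorted.
#[cfg(feature = "storage-azdls")]
mod storage_azdls;
#[cfg(feature = "storage-s3")]
mod storage_s3;

// Feature-flagged `use` statements, grouped separately.
#[cfg(feature = "storage-azdls")]
pub use storage_azdls::*;
#[cfg(feature = "storage-s3")]
pub use storage_s3::*;
```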
Nice work, @DerGut! Most of the changes in this PR look good to me. We can merge it after the next opendal release, which will include most of what we need here. The other option is to add AZDLS support first without the new opendal features.
Signed-off-by: Jannik Steinmann <jannik.steinmann@datadoghq.com>
Signed-off-by: Jannik Steinmann <jannik.steinmann@datadoghq.com>
Signed-off-by: Jannik Steinmann <jannik.steinmann@datadoghq.com>
The filesystem will always be empty by the time we construct the operator, so there's no point in validating it. Signed-off-by: Jannik Steinmann <jannik.steinmann@datadoghq.com>
Signed-off-by: Jannik Steinmann <jannik.steinmann@datadoghq.com>
🗞️ Since this was already taking longer than expected, here is an update about what I learned in the past week or so 🎉 Some of these were misconceptions that led me to revamp parts of the PR, others are simply learnings that might be worth sharing/documenting.

Path Format

I wasn't sure which path format to expect and ended up revamping the PR to use fully qualified path notation (e.g. `abfss://filesystem@account.dfs.core.windows.net/path`).
Update: While S3 and GCS enforce globally unique bucket names, Azure Storage only enforces container-name uniqueness per storage account. This means that next to the container/filesystem name, we also need to carry the account name and endpoint suffix around. While this information could technically be dropped once an operator has been configured with it, I think it is more consistent with other uses of Azure Storage URIs to pass it along.

WASB

ADLSv2, Azure's most recent HDFS-compatible storage service, uses the `abfs(s)://` scheme, while WASB uses `wasb(s)://`. It took me a while to wrap my head around this. But in essence I understand:

- WASB is a client-side implementation of hierarchical namespaces on top of Blob Storage.
- ADLSv2 is a server-side service offered by Azure, also built on top of Blob Storage.
- Objects written via `wasb://` are also accessible via `abfs(s)://`, because both operate on the same underlying Blob Storage.
Azurite

I don't yet fully understand how Azurite (Azure's local dev storage emulator) fits into all of this. On the one hand, it seems like it only supports the Azure Blob Storage API, and the ADLSv2 APIs are missing (Azure/Azurite#553). In fact, I've tried using it with the OpenDAL `Azdls` service, and the call fails (see the test setup below).

Test setup

```sh
docker run -d -p 10000:10000 mcr.microsoft.com/azure-storage/azurite
az storage fs create --name test --connection-string "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
```

```rust
let builder = Azdls::default()
    .filesystem("test") // Created this one above.
    .endpoint("http://127.0.0.1:10000/devstoreaccount1")
    .account_name("devstoreaccount1")
    .account_key("Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==");
let op = Operator::new(builder).unwrap().finish();
op.list("/").await.unwrap(); // Fails, but succeeds for the equivalent `Azblob` setup.
```

On the other hand, both pyiceberg and Iceberg Java seem to be able to use Azurite for their test setups. Partially, I've understood that pyiceberg's ADLS FileIO implementation ([1], [2]) is built on Blob Storage directly instead of ADLSv2. At the same time, Iceberg Java seems to use Azure's ADLS client instead. I need to dive deeper to understand why they are able to use Azurite in the Java implementation.

Endpoints

Azure Storage uses different endpoints for different services. For example, Blob Storage uses the `<account>.blob.core.windows.net` endpoint, while ADLSv2 uses `<account>.dfs.core.windows.net`.

In Azure SDKs, endpoints can either be set by an explicit endpoint configuration option, or by passing a connection string. The current PR implementation validates that a configured endpoint matches what's defined in a fully qualified path. If we decide to roll with the fully qualified path format, I'd suggest keeping the configuration options as they are now, because users aren't required to configure the endpoint explicitly.

Also, to reply to your earlier comment:
Since I was taking so long, this is now already included 😬
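To make the endpoint-vs-path check described above concrete, a simplified sketch — the function and error handling are illustrative, not the PR's exact code:

```rust
// Illustrative sketch: check that a configured endpoint agrees with the host in a
// fully qualified path like "abfss://myfs@myaccount.dfs.core.windows.net/data.parquet".
fn validate_endpoint(configured: &str, path: &str) -> Result<(), String> {
    let host = path
        .split_once("://")
        .and_then(|(_, rest)| rest.split_once('@'))
        .map(|(_, host_and_path)| host_and_path.split('/').next().unwrap_or(""))
        .ok_or_else(|| format!("not a fully qualified path: {path}"))?;
    // Compare ignoring the scheme prefix of the configured endpoint.
    let configured_host = configured
        .trim_start_matches("https://")
        .trim_start_matches("http://");
    if configured_host == host {
        Ok(())
    } else {
        Err(format!("endpoint {configured} does not match path host {host}"))
    }
}
```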
🏁 My plan to get this PR to the finish line
🙏 And what I need from the community/reviewers
Which issue does this PR close?
What changes are included in this PR?
This PR adds an integration for the Azure Datalake storage service. At its core, it adds parsing logic for configuration properties. The finished config struct is simply passed down to OpenDAL. In addition, it adds logic to parse fully qualified file URIs and match them against expected (previously configured) values.
It also creates a new `Storage::Azdls` enum variant based on OpenDAL's existing `Scheme::Azdls` enum variant. It then fits the parsing logic into the existing framework to build the storage integration from an `io::FileIOBuilder`.
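As a rough sketch of the shape this takes — the field and variant details are illustrative, not the PR's exact code:

```rust
// Illustrative sketch of the new enum variant.
enum Storage {
    #[cfg(feature = "storage-s3")]
    S3 { /* ... */ },
    #[cfg(feature = "storage-azdls")]
    Azdls {
        // Carried along so fully qualified paths can be checked against
        // the configured account name and endpoint.
        config: opendal::services::AzdlsConfig,
    },
}
```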
Note on WASB support
Other Iceberg ADLS integrations (pyiceberg + Java) also support the `wasb://` and `wasbs://` schemes. WASB refers to a client-side implementation of hierarchical namespaces on top of Blob Storage. ADLS(v2), on the other hand, is a service offered by Azure, also built on top of Blob Storage.

IIUC, we can accept both schemes because objects written to Blob Storage via `wasb://` will also be accessible via `abfs(s)://` (which operates on the same Blob Storage). Even though the URIs slightly differ in format when they refer to the same object, we can largely reuse existing logic.
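For reference, the two URI shapes for the same underlying object would look something like this (account, container, and path are made up):

```
wasbs://mycontainer@myaccount.blob.core.windows.net/warehouse/t1/data.parquet
abfss://mycontainer@myaccount.dfs.core.windows.net/warehouse/t1/data.parquet
```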
Are these changes tested?
Unit
I added minor unit tests to validate the configuration property parsing logic.
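As a flavor of what such a test might look like — `azdls_config_from_props` stands in for the PR's actual parsing function, and the `adls.`-prefixed property keys are assumed, following other Iceberg implementations:

```rust
use std::collections::HashMap;

#[test]
fn parses_account_name_property() {
    // Hypothetical property names, for illustration only.
    let props = HashMap::from([
        ("adls.account-name".to_string(), "myaccount".to_string()),
        ("adls.account-key".to_string(), "secret".to_string()),
    ]);
    let cfg = azdls_config_from_props(props).unwrap();
    assert_eq!(cfg.account_name.as_deref(), Some("myaccount"));
}
```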
Integration
I decided not to add integration tests because
End-to-end
I have yet to test it in a functioning environment.