-
Notifications
You must be signed in to change notification settings - Fork 14.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hook for managing directories and files in Azure Data Lake Storage Gen2 #28262
Conversation
39e6f88
to
bfede61
Compare
Relates to #28223 |
5f3b883
to
bfd98df
Compare
@tatiana @luanmorenomaciel Need your review on this PR. |
from airflow.hooks.base import BaseHook | ||
|
||
|
||
class AzureDataLakeStorageV2(BaseHook): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class AzureDataLakeStorageV2(BaseHook): | |
class AzureDataLakeStorageV2Hook(BaseHook): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I changed the Hook name now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this.
eventually v2 will be the standard and on that day it would be redundant to mention v2.
I see that all current v1 services are listed as AzueDataLakeStorage*
WDYT about having the new ones as Adls*
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the current v1 services are listed as AzueDataLakeStorage. Now it is AzureDataLakeStorageClient
can it be AzureDataLakeStorageClientHook
. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we don't have hook for adls it can be anything we want.
i suggested the pattern: Adls*
so:
Current V1 | Suggested V2 |
---|---|
- | AdlsClientHook |
ADLSDeleteOperator | AdlsDeleteOperator |
ADLSListOperator | AdlsListOperator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively we can also do:
microsoft/azure/hooks/adsl_v2.py
microsoft/azure/operators/adsl_v2.py
then class names can have the same names as v1 because they are not in the same file
and once v1 is removed we can change the files names which is easier change then changing classes names.
return DataLakeServiceClient( | ||
account_url=f"https://{conn.login}.dfs.core.windows.net", credential=token_credential, **extra | ||
) | ||
credential = conn.password |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to do something similar to:
https://github.com/apache/airflow/blob/main/airflow/providers/microsoft/azure/hooks/wasb.py#L202?
I wonder how much in common the Wasb & the AzureDataLakeStorageV2 implementation have - and if it would make sense to have a BaseAzureHook
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tatiana I am using WASB connection details to connect to ADLS gen2 storage account, apart the connection details nothing is similar I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a shame, @bharanidharan14 - I didn't realise things were so disconnected in Azure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tatiana ADLS gen2 is using different protocols and APIs and different client-type principles to connect. Currently, I have created separate connection types and connection details.
from airflow.providers.microsoft.azure.hooks.wasb import WasbHook | ||
|
||
|
||
class AzureDataLakeStorageClient(WasbHook): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#28223 description states that wasb is legacy if so what is the motivation to inhert from wsab for adsl gen2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only for connection details, I thought of using these WASB connection details. should I create a new connection ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If wasb is legacy and we expect it to be removed some day then I don't think we utlize it for other services
but I'm not Azure expert - I leave this desicion to people who knows the service.
What i do want to have here is docs and specifically connection docs that explains how to make this work.
(I want to avoid questions raised down the road by users who will ask to refactor because we utlize legacy service)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok let me add new connection, with docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bharanidharan14 @eladkal did we consider having an AzureBaseHook
similar to GoogleBaseHook
https://github.com/apache/airflow/blob/main/airflow/providers/google/common/hooks/base_google.py
This way, both AzureDataLakeStorageHook
and WasbHook
could inherit from it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tatiana and @mateusoliveiraowshq as per our conversation, implementing the AzureBaseHook
would be challenging since Azure unfortunately does not share the same connection client principles (BlobServiceClient & DataLakeServiceClient) between the WASB and ABFS protocol.
Azure Blob Storage - WASB
Azure Data Lake Storage Gen2 - ABFS
What we could do instead of is to have the following hook structure:
airflow.providers.microsoft.azure.hooks.wasb (Already Exists)
airflow.providers.microsoft.azure.hooks.azure_data_lake (Suggested)
Meaning that as per @bharanidharan14 @eladkal conversation we would recommend to inherid the hook from the airflow.providers.microsoft.azure.hooks.azure_data_lake instead of airflow.providers.microsoft.azure.hooks.wasb, that would make more sense since WASB protocol has been marked as legacy hence the, implementation process are different from each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@luanmorenomaciel I don't think inheriting from the airflow.providers.microsoft.azure.hooks.azure_data_lake is also not recommended because this ADLS existing hook is gen1 and it's being getting retired from feb. They may stop supporting for that too as well. so its better to inherit from BaseHook
. WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be fine to inherit from BaseHook
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@luanmorenomaciel Currently I am inheriting from BaseHook. Can you also review this PR
63d5201
to
b742318
Compare
from hooks.base import BaseHook | ||
|
||
|
||
class AdlsClientHook(BaseHook): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense for this to be defined inside airflow/providers/microsoft/azure/hooks/azure_data_lake
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed it
@kaxil Need your review on this PR |
Tests are failing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bharanidharan14 we're very close to merging this work.
There is one comment pending: https://github.com/apache/airflow/pull/28262/files#r1056265305
Rebasing and passing tests also seem to be pending.
232be61
to
84211f2
Compare
197f7a7
to
ce0ea8d
Compare
Fix mypy issues Fix mypy issues Fix mypy issues Fix mypy issues Add adls v2 connection Fix doc Fix doc Moved hook to existing data lake hook file Moved hook test case Moved hook test case Fix doc
ce0ea8d
to
fc15cc6
Compare
ADLS gen2 hook
Fix mypy issues Fix mypy issues Fix mypy issues Fix mypy issues Add adls v2 connection Fix doc Fix doc Moved hook to existing data lake hook file Moved hook test case Moved hook test case Fix doc
fc15cc6
to
e4fe4c4
Compare
74e55b7
to
c260594
Compare
Created hook for supporting ADLS gen2, which uses the WASB connection and connects to ADLS gen2 storage
Relates to #28223