
Conversation

mahmoudbahaa commented Apr 1, 2023

The PR was tested externally (through Java unit tests that use the Hadoop FileSystem API with the abfs:// scheme, and the Azure SDK).

I didn't have enough time to add test cases to the repo itself. I may add some in the future if I have time, or if someone else wants to do that, please go ahead.

Addresses issue #553.

mahmoudbahaa (Author)

@microsoft-github-policy-service agree company="Incorta"

mahmoudbahaa (Author)

Fixed the existing test cases by merging the conflicting APIs into one, keeping XML enabled for everything except ListPaths (done by manually changing isXML of that operation to false in the generated code).

blueww (Member) commented Apr 3, 2023

@mahmoudbahaa

Thanks so much for raising this PR!
It's so great to see a PR for Data Lake Gen2 support.

However, I did a sanity check on it and found some problems that need to be fixed:

  1. The PR doesn't introduce a new dfs endpoint; all dfs requests are handled by the blob endpoint.
    This will cause problems:
  • Blob/dfs request pairs with similar parameters, such as Path_GetProperties and Blob_GetProperties, will be handled by the same code, but in the REST spec they actually have different parameters and responses.

  • This will introduce regressions to the existing blob API, so existing Azurite users of the blob API might be blocked.

  • This will make Azurite hard to scale for upcoming blob/dfs API changes, or will require much effort to avoid regressions on each endpoint.

    So, to make Azurite work like the production Azure service, we should introduce a new dfs endpoint.

  2. Some blob endpoint APIs work differently on HNS-enabled and non-HNS accounts. So, to support both account types, we might need to add a new startup parameter to Azurite that indicates whether the account has HNS enabled. (To avoid regressions, the default should be not enabled.)

  3. Please write documentation on which features/APIs are already supported and which are not; this is very important for customers using the feature in Azurite and for follow-up improvements/troubleshooting. You can refer to https://github.com/Azure/Azurite#support-matrix for how to write it.
    As far as I can see, there are still some features/APIs that are not implemented or not handled correctly in the PR, including but not limited to:

  • Path_SetAccessControl: not implemented
  • Data Lake SAS does not work (including "sr=d", "sdd=")
  • Path_Create: the "x-ms-permissions" and "x-ms-umask" headers do not work well (with "x-ms-permissions: rwxrwxrwx" and "x-ms-umask: 0070", public Azure returns "rwx---rwx", but the current Azurite returns "rwxrwxr--")
  • FileSystem_ListPaths/Path_GetProperties:
    • listing from the filesystem root fails with 404
    • the result is missing many important properties such as permissions/group, and FileSystem_ListPaths has an additional "isDirectory":false which will cause the .NET SDK to fail
  4. We don't allow manually updating auto-generated code, otherwise regenerating the code with a new swagger will overwrite the manual changes. The correct way is to update the swagger and record the changes in Readme.md (preferably with a directive), or to update the code-generation tool.

  5. As you have mentioned, test cases are still not added, but sufficient test coverage is necessary to merge the PR, to ensure quality and avoid regressions.

  6. Some necessary doc changes are missing, such as the Readme, change log, ...

mahmoudbahaa force-pushed the dataLakeStorage-support branch 3 times, most recently from b221b68 to 6283ad0 on April 10, 2023 11:22
mahmoudbahaa (Author)

@blueww

Regarding your comments:

The PR doesn't introduce a new dfs endpoint; all dfs requests are handled by the blob endpoint.

I introduced a new endpoint named dfs on port 10003, which supports both blob and dfs requests.

Blob/dfs request pairs with similar parameters, such as Path_GetProperties and Blob_GetProperties, will be handled by the same code, but in the REST spec they actually have different parameters and responses.

I merged the parameters and responses in these cases into the dfs ones and removed the blob ones.

This will introduce regressions to the existing blob API, so existing Azurite users of the blob API might be blocked.

The changes made to the blob endpoint are now minimal.

Some blob endpoint APIs work differently on HNS-enabled and non-HNS accounts. So, to support both account types, we might need to add a new startup parameter to Azurite that indicates whether the account has HNS enabled. (To avoid regressions, the default should be not enabled.)

For now, HNS is enabled by default in the dfs endpoint, since some operations don't work for non-HNS accounts, and for Data Lake Gen2 it makes sense to have HNS enabled by default.

Please write documentation on which features/APIs are already supported and which are not; this is very important for customers using the feature in Azurite and for follow-up improvements/troubleshooting. You can refer to https://github.com/Azure/Azurite#support-matrix for how to write it.

Done.

Path_SetAccessControl: not implemented

Done, but the recursive variant is not supported yet, as noted in the support matrix.

Data Lake SAS does not work (including "sr=d", "sdd=")

Should be working now.

Path_Create: the "x-ms-permissions" and "x-ms-umask" headers do not work well (with "x-ms-permissions: rwxrwxrwx" and "x-ms-umask: 0070", public Azure returns "rwx---rwx", but the current Azurite returns "rwxrwxr--")

Should be working properly now.
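For reference, here is a minimal sketch of the expected umask semantics (effective mode = permissions & ~umask, which yields "rwx---rwx" for the values quoted above). The helper names are illustrative, not Azurite's actual code:

```typescript
// Apply a POSIX-style umask to a requested permission mode.
function applyUmask(permissions: number, umask: number): number {
  return permissions & ~umask & 0o777;
}

// Render a 9-bit mode as the rwx string used in the "permissions" property.
function toRwxString(mode: number): string {
  const symbols = ["r", "w", "x"];
  let out = "";
  for (let shift = 8; shift >= 0; shift--) {
    out += mode & (1 << shift) ? symbols[(8 - shift) % 3] : "-";
  }
  return out;
}

// "x-ms-permissions: rwxrwxrwx" (0777) with "x-ms-umask: 0070":
// 0o777 & ~0o070 = 0o707, i.e. "rwx---rwx", matching public Azure.
console.log(toRwxString(applyUmask(0o777, 0o070))); // "rwx---rwx"
```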

FileSystem_ListPaths/Path_GetProperties:
listing from the filesystem root fails with 404
the result is missing many important properties such as permissions/group

Permissions/group are now included in both.

FileSystem_ListPaths has an additional "isDirectory":false which will cause the .NET SDK to fail

Actually, according to the spec, each Path in ListPaths should have an isDirectory flag: https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/list#path

We don't allow manually updating auto-generated code, otherwise regenerating the code with a new swagger will overwrite the manual changes. The correct way is to update the swagger and record the changes in Readme.md (preferably with a directive), or to update the code-generation tool.

I made almost all the needed changes to the swagger and documented them in the dfs.md file that generates the swagger. The only remaining changes that are not included in the swagger, but are also documented in dfs.md, are:

  1. isXML is set to false for ListPaths (the generator should respect the Accept property of the request and serialize accordingly) [needed for Hadoop]
  2. req.getHeader("X-HTTP-Method-Override") is checked in dispatch.middleware.ts, since it can sometimes be used instead of req.getHeader("X-HTTP-Method") [also needed for Hadoop]; see the sketch below

As far as I can tell, the actual generator code is private, so I don't have access to it to make those changes.
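For illustration, a minimal sketch of the header-override handling described in point 2, assuming an Express-style middleware; the function name and wiring are hypothetical, not the actual dispatch.middleware.ts change:

```typescript
import { NextFunction, Request, Response } from "express";

// Per the note above, clients (e.g. Hadoop) may send the intended verb in
// X-HTTP-Method-Override instead of X-HTTP-Method.
export function methodOverrideMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
): void {
  const override =
    req.header("X-HTTP-Method-Override") || req.header("X-HTTP-Method");
  if (override) {
    // Dispatch as if the client had sent the overridden verb directly.
    req.method = override.toUpperCase();
  }
  next();
}
```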

As you have mentioned, test cases are still not added, but sufficient test coverage is necessary to merge the PR, to ensure quality and avoid regressions.

Test cases were added specifically for Data Lake under tests/dfs/storage-file-datalake.

Some necessary doc changes are missing, such as the Readme, change log, ...

Added the needed changes to the Readme.

Please check whether other changes are needed.

@mahmoudbahaa mahmoudbahaa force-pushed the dataLakeStorage-support branch 2 times, most recently from 48c797f to 71ab093 Compare April 11, 2023 06:42
sasProtocol: string = "https,http",
requestProtocol: string
): boolean {
  if (sasProtocol.includes(",")) {

Check failure — Code scanning / CodeQL: Type confusion through parameter tampering
Potential type confusion as [this HTTP request parameter](1) may be either an array or a string.
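As a side note, a minimal sketch of the normalization this CodeQL finding asks for, assuming an Express-style query parameter (the function name and example value are illustrative):

```typescript
// Express-style query parameters can be string | string[] | undefined;
// coercing to a single string before calling includes() avoids the
// type-confusion pattern flagged above.
function normalizeQueryParam(value: string | string[] | undefined): string {
  if (Array.isArray(value)) {
    // Take the first value deterministically (or reject the request instead).
    return value[0] ?? "";
  }
  return value ?? "";
}

// A tampered request that repeats the same query parameter would otherwise
// surface it as an array and break string-only logic.
console.log(normalizeQueryParam(["https", "http"]).includes(",")); // false
```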
Comment on lines +196 to +198:

authenticationMiddlewareFactory.createAuthenticationMiddleware(
  authenticators
)

Check failure — Code scanning / CodeQL: Missing rate limiting
This route handler performs [authorization](1), but is not rate-limited.
return this.context.account;
}

public set account(account: string | undefined) {

Check failure — Code scanning / CodeQL: Insecure randomness
This uses a cryptographically insecure random number generated at [Math.random()](1) in a security context.
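For reference, a short sketch of the kind of replacement the insecure-randomness finding points at, using Node's built-in crypto module (illustrative only, not the PR's actual code):

```typescript
import { randomBytes, randomUUID } from "crypto";

// Math.random() is predictable and unsuitable for security-sensitive values
// such as tokens, nonces, or identifiers used in authorization decisions.
const weakId = Math.random().toString(36).slice(2);

// Cryptographically secure alternatives.
const strongHex = randomBytes(16).toString("hex"); // 128 bits of entropy
const strongUuid = randomUUID(); // RFC 4122 version 4 UUID

console.log(weakId, strongHex, strongUuid);
```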
blueww (Member) commented Apr 11, 2023

@mahmoudbahaa

Thanks for the update!

As this is a big PR, we might need more time to review it; we will reply to you later.

Regarding "isDirectory":false: when listing paths, the server does not return it for files. See the following response from the production server in my test.
One important point: returning it for files will make the .NET SDK fail with the error "The requested operation requires an element of type 'String', but the target element has type 'False'."

{"paths":[
{"contentLength":"0","creationTime":"133256687100169757","etag":"0x8DB3A575D040E1D","expiryTime":"0","group":"$superuser","isDirectory":"true","lastModified":"Tue, 11 Apr 2023 06:38:30 GMT","name":"dir1","owner":"$superuser","permissions":"rwx---rwx"},
{"contentLength":"1024","creationTime":"133256687147652690","etag":"0x8DB3A5760706833","expiryTime":"0","group":"$superuser","lastModified":"Tue, 11 Apr 2023 06:38:35 GMT","name":"dir1/__datetime=2020-09-14 09%3A00%3A00","owner":"$superuser","permissions":"rwx---rwx"},
{"contentLength":"1024","creationTime":"133256687195188636","etag":"0x8DB3A5762ADED9C","expiryTime":"0","group":"$superuser","lastModified":"Tue, 11 Apr 2023 06:38:39 GMT","name":"dir1/f%3Aile2","owner":"$superuser","permissions":"rw-r-----"},
{"contentLength":"1024","creationTime":"133256687207598063","etag":"0x8DB3A57636B47EF","expiryTime":"0","group":"$superuser","lastModified":"Tue, 11 Apr 2023 06:38:40 GMT","name":"dir1/testfile_1K_0","owner":"$superuser","permissions":"rw-r-----"},{"contentLength":"0","creationTime":"133256687133364901","etag":"0x8DB3A575EFE92A5","expiryTime":"0","group":"$superuser","isDirectory":"true","lastModified":"Tue, 11 Apr 2023 06:38:33 GMT","name":"dir2","owner":"$superuser","permissions":"rwxr-x---"}]}

mahmoudbahaa force-pushed the dataLakeStorage-support branch from 12f4ce9 to 6999337 on April 15, 2023 23:32
* Full support for SQL for the same features supported by Loki, except for PageBlob
* Added a clearDb option to the configurations in both blob and dfs and set it to true in the test cases
* Added support for using PostgreSQL in the dfs endpoint (needed due to limitations in MySQL) without breaking MySQL
* Added support for permissions save/load, ACLs, setExpiry, owner and group
* Updated the Visual Studio extension file and added the azurite-datalake binary, ...
* Updated the Readme and made swagger definition changes
* Updated the change log
* Added dfs test pipelines to azure-pipelines
mahmoudbahaa force-pushed the dataLakeStorage-support branch from 2c699e3 to 6d78471 on April 16, 2023 00:13
mahmoudbahaa (Author)

@blueww

  • Rebased the PR
  • Squashed the commits
  • Addressed the isDirectory issue by setting it to undefined when it is false (since the default is false per the spec); see the sketch below
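A minimal sketch of that approach, assuming a ListPaths-style serialization step; the type and helper names are illustrative, not the PR's actual model:

```typescript
// Illustrative path entry as serialized in the ListPaths JSON body.
interface PathEntry {
  name: string;
  contentLength: string;
  // The production service only returns this field for directories (as the
  // string "true"); files omit it, so the .NET SDK never sees a false value.
  isDirectory?: string;
}

function toPathEntry(
  name: string,
  contentLength: number,
  isDir: boolean
): PathEntry {
  return {
    name,
    contentLength: contentLength.toString(),
    // undefined properties are dropped by JSON.stringify, so files end up
    // with no isDirectory field at all.
    isDirectory: isDir ? "true" : undefined
  };
}

console.log(JSON.stringify(toPathEntry("dir1", 0, true)));
// {"name":"dir1","contentLength":"0","isDirectory":"true"}
console.log(JSON.stringify(toPathEntry("dir1/file1", 1024, false)));
// {"name":"dir1/file1","contentLength":"1024"}
```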

mahmoudbahaa changed the title from "add support for for datalake api and implemented some methods in it in both Loki and SQL" to "add support for for datalake api" on Apr 16, 2023
mahmoudbahaa (Author)

@blueww I restructured the PR to reduce duplicate code between the blob and dfs endpoints and split it into 2 phases.
