Description
Outline the issue here:
Chunk upload to DLS fails if the chunk names contain the + character. Possibly (likely) other characters may be affected too.
The issue was initially discovered when the Azure CLI (running on Linux) failed to upload large files with + in their remote names, while smaller files with the same names succeeded.
According to initial analysis, the issue occurs due to how the concat() function uses the MSCONCAT operation to concatenate file chunks after upload.
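The encoding side of this can be checked locally with the standard library: `urllib.parse.quote()` percent-encodes `+` (and several other reserved characters that could plausibly trigger the same behaviour). A minimal sketch:

```python
from urllib.parse import quote

# '+' is percent-encoded by quote(); '/' is kept literal by default,
# so whole chunk paths can be encoded in one call.
print(quote("/upload_test/test+001"))  # -> /upload_test/test%2B001

# Other reserved characters that quote() also encodes; any of these
# could in principle be affected in the same way as '+'.
for ch in "+%#?&=;":
    print(ch, "->", quote(ch))
```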
Reproduction Steps
**Enumerate the steps to reproduce the issue here:**
- Download the attached files
- fill in secrets to `secrets.json.tpl` and rename it to `secrets.json`
- fill in DLS account name to the `config.json.*` files
- copy some largeish file (larger than the default chunking limit, 256 MiB) to the same directory as `test_file` and `test+file`
- copy one of the `config.json.*` files as `config.json`
- run `python chunk_upload_example.py`
When uploading a file whose remote name (passed to ADLUploader via the rpath argument) contains a + character, the upload fails, but the failure is not visibly indicated, e.g. by an exception being raised (or propagated to the calling Python code). When the remote name does not contain a +, the upload succeeds even if the local filename contains a +.
When the upload fails, the chunks remain in the DLS. This suggests that the failure occurs somewhere in the chain `multithread.ADLUploader()` -> `multithread.merge_chunks()` -> `core.AzureDLFileSystem.concat()`, which in turn makes the REST API call with operation MSCONCAT.
Merging the file chunks succeeds if the MSCONCAT WebHDFS operation (not documented in the public API documentation, but mentioned in the Swagger file) is called so that the + sign in the chunk names is URL-encoded as `%2B`.
Calling the REST API endpoints directly (e.g. with Postman), assuming that the folder `upload_test` contains files `test+001` and `test+002`:
- HTTP POST to `https://{{dlAccount}}.azuredatalakestore.net/webhdfs/v1/upload_test/test_plus?op=MSCONCAT` with body (type raw) `sources=/upload_test/test+001,/upload_test/test+002`: fails
- HTTP POST to `https://{{dlAccount}}.azuredatalakestore.net/webhdfs/v1/upload_test/test_plus?op=MSCONCAT` with body (type raw) `sources=/upload_test/test%2B001,/upload_test/test%2B002`: succeeds; the file `test_plus` appears in the folder `upload_test` and the two source files disappear
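The successful variant of the call above can be reproduced from Python with only the standard library. In this sketch, `ACCOUNT`, `TOKEN`, and the function names are illustrative placeholders; the essential part is percent-encoding each chunk path before building the `sources=` body:

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

ACCOUNT = "mydlsaccount"      # placeholder DLS account name
TOKEN = "<aad-bearer-token>"  # placeholder AAD access token

def msconcat_body(chunk_paths):
    # Percent-encode each chunk path (keeping '/' literal) so that
    # '+' becomes '%2B' in the request body.
    return "sources=" + ",".join(quote(p, safe="/") for p in chunk_paths)

def msconcat(target, chunk_paths):
    # Build and send the MSCONCAT request for `target` from the
    # already-uploaded chunks in `chunk_paths`.
    url = (f"https://{ACCOUNT}.azuredatalakestore.net"
           f"/webhdfs/v1{target}?op=MSCONCAT")
    req = Request(
        url,
        data=msconcat_body(chunk_paths).encode("ascii"),
        headers={"Authorization": f"Bearer {TOKEN}"},
        method="POST",
    )
    with urlopen(req) as resp:  # raises HTTPError on failure
        return resp.status

# Example invocation (requires a real account and token):
# msconcat("/upload_test/test_plus",
#          ["/upload_test/test+001", "/upload_test/test+002"])
```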
Submitting this as an issue instead of a pull request, as it is hard to say whether the fix belongs in the Python library (e.g. using `urllib.parse.quote()` in `core.py`) or in the DLS REST API implementation. In any case, it would be helpful if all REST API operations invoked by the Azure CLI tools were documented sufficiently, including encoding requirements.
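If the fix lands on the Python side, one possible shape is a wrapper that quotes each chunk path before the list reaches `concat()`. This is purely a hypothetical sketch (the `fs.concat(outfile, filelist)` signature is assumed from the call chain described above, not taken from the actual `core.py`):

```python
from urllib.parse import quote

def safe_concat(fs, outfile, filelist):
    # Hypothetical workaround wrapper: percent-encode each chunk path
    # (keeping '/' literal, so '+' -> '%2B') before handing the list
    # to the filesystem's concat(). Whether the client or the service
    # is the right layer for this fix is exactly the open question.
    return fs.concat(outfile, [quote(str(f), safe="/") for f in filelist])
```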
Environment summary
**SDK Version:** What version of the SDK are you using? (`pip show azure-datalake-store`)
Answer here:
```
Name: azure-datalake-store
Version: 0.0.32
Summary: Azure Data Lake Store Filesystem Client Library for Python
Home-page: https://github.com/Azure/azure-data-lake-store-python
Author: Microsoft Corporation
Author-email: ptvshelp@microsoft.com
License: MIT License
Location: /Users/<redacted>/.pyenv/versions/3.6.6/envs/adls-3.6.6/lib/python3.6/site-packages
Requires: azure-nspkg, adal, cffi
Required-by:
```
**Python Version:** What Python version are you using? Is it 64-bit or 32-bit?
Answer here:
```
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
```
64-bit
**OS Version:** What OS and version are you using?
Answer here:
macOS Sierra, 10.12.6 (16G1510)
**Shell Type:** What shell are you using? (e.g. bash, cmd.exe, Bash on Windows)
Answer here:
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin16)