Merging files after chunked upload with + in the filename fails #245

@solita-tvoipio

Description

Outline the issue here:

Chunked upload to DLS fails if the chunk names contain the + character. Other characters are likely affected too.

The issue was first discovered when the Azure CLI (running on Linux) failed to upload large files with + in their remote names, while smaller files with the same name succeeded.

According to initial analysis, the issue occurs due to how the concat() function uses the MSCONCAT operation to concatenate file chunks after upload.


Reproduction Steps

Enumerate the steps to reproduce the issue here:

  1. Download the attached files
  2. Fill in the secrets in secrets.json.tpl and rename it to secrets.json
  3. Fill in the DLS account name in the config.json.* files
  4. Copy a large file (larger than the default chunking limit, 256 MiB) to the same directory as test_file and test+file
  5. Copy one of the config.json.* files as config.json
  6. Run python chunk_upload_example.py

When uploading a file such that the remote name contains a + character as passed to ADLUploader via the rpath argument, the upload fails, but the failure is not visibly indicated by e.g. an exception being raised (or passed to the calling Python function). When uploading a file such that the remote name does not contain a +, the upload succeeds, even if the local filename contained a +.
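Because the failure is silent, a caller may want to check the result explicitly. The following is a minimal sketch, assuming that a failed merge leaves no file at the target path and that the filesystem object exposes an exists(path) method (as core.AzureDLFileSystem does). The names verify_upload, UploadVerificationError and FakeFS are hypothetical and not part of the library.

```python
class UploadVerificationError(RuntimeError):
    """Hypothetical error raised when the merged remote file is missing."""


def verify_upload(fs, rpath):
    """Raise if rpath does not exist on fs after an upload.

    fs is any object with an exists(path) method; this is an assumption
    about the interface, not a documented contract of the uploader.
    """
    if not fs.exists(rpath):
        raise UploadVerificationError(
            "upload finished without an exception, but %r is missing" % rpath
        )
    return rpath


class FakeFS:
    """In-memory stand-in used only to demonstrate the check."""

    def __init__(self, paths):
        self._paths = set(paths)

    def exists(self, path):
        return path in self._paths


# Demonstration with the stand-in: the merged file is present here.
fs = FakeFS(["/upload_test/test_file"])
verify_upload(fs, "/upload_test/test_file")
```

With a real filesystem object, the same call would be made right after the uploader returns, using the same rpath that was passed to it.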

When the upload fails, the chunks remain in the DLS. This suggests that the failure occurs somewhere in the chain multithread.ADLUploader() -> multithread.merge_chunks() -> core.AzureDLFileSystem.concat(), which in turn makes the REST API call with operation MSCONCAT.

Merging the file chunks succeeds if the MSCONCAT WebHDFS operation (not documented in the public API documentation, but mentioned in the Swagger file) is called with the + sign in the chunk names URL-encoded as %2B.
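One plausible explanation, which is an assumption and not confirmed by anything in this report, is that the service decodes the request body as form data, where a literal + stands for a space. The sketch below uses only the Python standard library to show how the two decoding styles treat the same chunk name.

```python
from urllib.parse import quote, unquote, unquote_plus

chunk = "/upload_test/test+001"

# Form-style decoding treats a literal "+" as an encoded space.
form_decoded = unquote_plus(chunk)
print(form_decoded)      # /upload_test/test 001

# Plain percent-decoding leaves "+" untouched.
plain_decoded = unquote(chunk)
print(plain_decoded)     # /upload_test/test+001

# Percent-encoding "+" as "%2B" first survives both decoders.
encoded = quote(chunk, safe="/")
print(encoded)           # /upload_test/test%2B001
round_trip = unquote_plus(encoded)
print(round_trip)        # /upload_test/test+001
```

If the server behaves like the form-style decoder, it would look for a chunk named "test 001", which does not exist, and the merge would fail.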

Calling the REST API endpoints directly (e.g. with Postman), assuming that the folder upload_test contains files test+001 and test+002:

  • HTTP POST to https://{{dlAccount}}.azuredatalakestore.net/webhdfs/v1/upload_test/test_plus?op=MSCONCAT with body (type raw) sources=/upload_test/test+001,/upload_test/test+002: fails
  • HTTP POST to https://{{dlAccount}}.azuredatalakestore.net/webhdfs/v1/upload_test/test_plus?op=MSCONCAT with body (type raw) sources=/upload_test/test%2B001,/upload_test/test%2B002: succeeds; the file test_plus appears in the folder upload_test and the two source files disappear
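The request body that worked in the second call can be built with the standard library. This is a minimal sketch of the encoding step only, not the library's actual concat() code; build_msconcat_body is a hypothetical helper name.

```python
from urllib.parse import quote


def build_msconcat_body(chunk_paths):
    """Return a 'sources=...' body with each chunk path percent-encoded."""
    # Keep "/" as the path separator and encode every other reserved
    # character, so "+" becomes "%2B". The commas that separate the
    # sources are added after encoding and stay literal.
    return "sources=" + ",".join(quote(p, safe="/") for p in chunk_paths)


body = build_msconcat_body(["/upload_test/test+001", "/upload_test/test+002"])
print(body)  # sources=/upload_test/test%2B001,/upload_test/test%2B002
```

Encoding each path separately, rather than the joined string, keeps the comma separators intact while still escaping a comma that happens to be part of a file name.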

Submitting this as an issue instead of a pull request, as it is not clear whether this should be fixed in the Python library (e.g. using urllib.quote(), which is urllib.parse.quote() in Python 3, in core.py) or in the DLS REST API implementation. In any case, it would be nice if all REST API operations invoked by the Azure CLI tools were documented sufficiently, including encoding requirements.

Environment summary

SDK Version: What version of the SDK are you using? (pip show azure-datalake-store)
Answer here:

Name: azure-datalake-store
Version: 0.0.32
Summary: Azure Data Lake Store Filesystem Client Library for Python
Home-page: https://github.com/Azure/azure-data-lake-store-python
Author: Microsoft Corporation
Author-email: ptvshelp@microsoft.com
License: MIT License
Location: /Users/<redacted>/.pyenv/versions/3.6.6/envs/adls-3.6.6/lib/python3.6/site-packages
Requires: azure-nspkg, adal, cffi
Required-by: 

Python Version: What Python version are you using? Is it 64-bit or 32-bit?
Answer here:

[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin

64-bit

OS Version: What OS and version are you using?
Answer here:

macOS Sierra, 10.12.6 (16G1510)

Shell Type: What shell are you using? (e.g. bash, cmd.exe, Bash on Windows)
Answer here:

GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin16)
