Support for parallel blob downloads #40270

Open
@martinResearch

Is your feature request related to a problem? Please describe.

I have a large list of blobs that I want to download using the Azure SDK. I can loop over the list and call the client's download_blob for each blob sequentially, but this is very slow.

I implemented a class derived from azure.storage.blob.ContainerClient that uses ThreadPoolExecutor to do the downloads in parallel, with a new method with this interface:

    def download_blobs_to_files(
        self,
        blob_filename_pairs: Iterable[Tuple[str, str]],
        concurrency_limit: int = 1000,
        verbose: bool = False,
    ) -> int:
        """Downloads a list of files from an azure blob container.

        Args:
            blob_filename_pairs: Pairs of (blob name, local file path).
            concurrency_limit: Maximum number of threads.
            verbose: Controls the verbosity of the function.
        """

It works, but it is a bit brittle, and it is not clear how to automatically choose the right number of threads (concurrency_limit). Ideally this would be a feature supported by the Azure SDK, as it seems to me to be a frequent user need.
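For reference, here is a minimal sketch of the thread-pool approach described above. It is not my actual class and not an SDK API: the per-blob download is passed in as a hypothetical `download_one(blob_name, local_path)` callable so the sketch runs without Azure credentials. The return value (number of successful downloads) and the default of 16 threads are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, List, Tuple


def download_blobs_to_files(
    download_one: Callable[[str, str], None],
    blob_filename_pairs: Iterable[Tuple[str, str]],
    concurrency_limit: int = 16,  # assumed default, not a tuned value
) -> int:
    """Call download_one(blob_name, local_path) for every pair on a thread pool.

    Returns the number of successful downloads (an assumed meaning for the
    int return value). Raises RuntimeError once all tasks have finished if
    any of them failed.
    """
    pairs = list(blob_filename_pairs)
    if not pairs:
        return 0

    failures: List[Tuple[str, Exception]] = []
    completed = 0
    # Never start more threads than there are blobs to fetch.
    workers = max(1, min(concurrency_limit, len(pairs)))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(download_one, blob, path): blob for blob, path in pairs
        }
        for future in as_completed(futures):
            try:
                future.result()  # re-raises any exception from the worker
                completed += 1
            except Exception as exc:
                failures.append((futures[future], exc))

    if failures:
        raise RuntimeError(
            f"{len(failures)} of {len(pairs)} downloads failed, "
            f"first failures: {failures[:3]}"
        )
    return completed


# Hypothetical wiring with azure-storage-blob (not executed here), where
# container_client is an existing azure.storage.blob.ContainerClient:
#
#     def download_one(blob_name: str, local_path: str) -> None:
#         with open(local_path, "wb") as f:
#             container_client.download_blob(blob_name).readinto(f)
#
#     download_blobs_to_files(download_one, [("a.bin", "/tmp/a.bin")])
```

Collecting failures and raising only at the end means one bad blob does not cancel the rest of the batch. This still leaves the thread-count question open: `ThreadPoolExecutor` itself defaults to `min(32, os.cpu_count() + 4)` workers, which may be a reasonable starting point for I/O-bound downloads, but the best value likely depends on blob sizes and network bandwidth.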

    Labels

    Client: This issue points to a problem in the data-plane of the library.
    Service Attention: Workflow: This issue is responsible by Azure service team.
    Storage: Storage Service (Queues, Blobs, Files)
    customer-reported: Issues that are reported by GitHub users external to the Azure organization.
    feature-request: This issue requires a new behavior in the product in order be resolved.
    needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team
