
Shouldn't the API client pass stream=True to requests when downloading datasets? #754


Description

@i-aki-y

Since the requests package seems to load the whole response body into RAM by default, I think the API should pass stream=True so that downloads can be chunked. The current implementation hardcodes stream=None (equivalent to False), which can make the user's system unstable when downloading large datasets.

settings = self._session.merge_environment_settings(http_request.url, {}, None, None, None)
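For comparison, here is a minimal sketch of how requests itself behaves; the URL, file name, and chunk size are placeholders, not values from the Kaggle client:

import requests

url = 'https://example.com/large-dataset.zip'  # placeholder

# Default behavior (stream not set): the whole body is read into
# memory before get() returns, which is what exhausts RAM.
response = requests.get(url)

# With stream=True only the headers are fetched eagerly; the body can
# then be written to disk chunk by chunk.
with requests.get(url, stream=True) as response:
  with open('large-dataset.zip', 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
      f.write(chunk)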

The download_file method in the KaggleApi class tries to support chunked downloads, but I am not sure this code works as expected, because by the time this loop runs the response body has already been downloaded in full.

for data in response.iter_content(chunk_size):
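As far as I can tell from the requests documentation, when a response was not requested with stream=True, iter_content() only re-slices a body that is already fully buffered, so the loop does not reduce peak memory. Roughly (placeholder URL):

import requests

response = requests.get('https://example.com/large-dataset.zip')  # stream not set

# At this point response.content already holds the full body, so the
# chunked loop below only iterates over an in-memory bytes object.
for data in response.iter_content(chunk_size=1024 * 1024):
  ...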

I also think the current usage of kaggle.http_client() outside of the with self.build_kaggle_client() as kaggle: statement is not recommended, because the resources managed by the kaggle object might already be closed once the with block exits.

with self.build_kaggle_client() as kaggle:
  ...

download_file(..., kaggle.http_client(), ...)
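To illustrate the concern with a generic sketch (this is not the kagglesdk implementation): a handle obtained from a context-managed client outlives the resources the context manager owns.

import requests

class Client:
  def __init__(self):
    self._session = requests.Session()

  def http_client(self):
    return self._session

  def __enter__(self):
    return self

  def __exit__(self, *exc_info):
    self._session.close()  # connection pools released here

with Client() as client:
  pass  # build the request, receive the redirect response, ...

session = client.http_client()
# 'session' still works syntactically, but it was closed above, so
# using it here depends on library internals rather than a guaranteed API.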

For example, the current implementation looks like this:

with self.build_kaggle_client() as kaggle:
  request = ApiDownloadDataFileRequest()
  request.competition_name = competition
  request.file_name = file_name
  response = kaggle.competitions.competition_api_client.download_data_file(request)

url = response.history[0].url
outfile = os.path.join(effective_path, url.split('?')[0].split('/')[-1])
if force or self.download_needed(response, outfile, quiet):
  self.download_file(response, outfile, kaggle.http_client(), quiet, not force)
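One possible restructuring, keeping the existing calls and signatures unchanged and only moving the download inside the with block so that kaggle.http_client() is used while the client is still open (whether stream=True can also be passed through here depends on the kagglesdk internals):

with self.build_kaggle_client() as kaggle:
  request = ApiDownloadDataFileRequest()
  request.competition_name = competition
  request.file_name = file_name
  response = kaggle.competitions.competition_api_client.download_data_file(request)
  url = response.history[0].url
  outfile = os.path.join(effective_path, url.split('?')[0].split('/')[-1])
  if force or self.download_needed(response, outfile, quiet):
    # http_client() is consumed while the kaggle context is still open
    self.download_file(response, outfile, kaggle.http_client(), quiet, not force)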
