[Feature] Read streams by 1MB chunks by default. #817

renaudhartert-db · 2024-11-05T17:14:28Z

What changes are proposed in this pull request?

This PR changes _BaseClient to read streams in 1MB chunks by default. The 1MB size was chosen as a compromise between speed and memory usage (see PR #319).

Note that this is not a new feature per se: it was already possible to configure the chunk size on the returned _StreamResponse before calling its read method. However, that option was hard to discover, and several users ran into memory issues as a result. The new default behavior is more defensive.
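To make the speed versus memory trade-off concrete, here is a minimal sketch of reading a stream in fixed-size chunks instead of all at once. It is not the SDK's implementation: `iter_chunks` and `DEFAULT_CHUNK_SIZE` are hypothetical names, 1MB is assumed to mean 1024 * 1024 bytes, and an in-memory `io.BytesIO` stands in for a response body.

```python
import io
from typing import BinaryIO, Iterator

# Hypothetical constant for illustration; assumes 1MB means 1024 * 1024 bytes.
DEFAULT_CHUNK_SIZE = 1024 * 1024


def iter_chunks(stream: BinaryIO, chunk_size: int = DEFAULT_CHUNK_SIZE) -> Iterator[bytes]:
    """Yield successive chunks of at most chunk_size bytes until end of stream."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            return
        yield chunk


# Stand-in for a response body: 2.5 chunks' worth of zero bytes.
body = io.BytesIO(bytes(2 * DEFAULT_CHUNK_SIZE + DEFAULT_CHUNK_SIZE // 2))
chunk_sizes = [len(chunk) for chunk in iter_chunks(body)]
```

With this pattern the caller holds at most one chunk in memory at a time, at the cost of one extra read call per chunk.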

How is this tested?

Added a few test cases to verify that streams are chunked as expected.
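As an illustration of what such a check could look like, the sketch below wraps random data in a stream that records every requested read size, then verifies that the data is reassembled intact and that every read asked for exactly the configured chunk size. `RecordingStream`, `read_all`, and `check_chunking` are hypothetical helpers written for this example; they are not taken from tests/test_base_client.py.

```python
import io
import os


class RecordingStream(io.BytesIO):
    """In-memory stream that records the size argument of every read call."""

    def __init__(self, data: bytes):
        super().__init__(data)
        self.requested_sizes = []

    def read(self, size=-1):
        self.requested_sizes.append(size)
        return super().read(size)


def read_all(stream, chunk_size):
    """Drain the stream chunk by chunk and return the reassembled bytes."""
    parts = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts.append(chunk)
    return b"".join(parts)


def check_chunking(total_size, chunk_size):
    """Return the number of read calls after verifying content and chunk sizes."""
    data = os.urandom(total_size)  # random payload, so repeated bytes cannot mask bugs
    stream = RecordingStream(data)
    assert read_all(stream, chunk_size) == data
    assert all(size == chunk_size for size in stream.requested_sizes)
    return len(stream.requested_sizes)
```

The returned count includes the final read that hits end of stream, so 10 bytes read in chunks of 4 take four calls (4, 4, 2, then an empty read).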

tests/test_base_client.py

databricks/sdk/_base_client.py

pietern · 2024-11-07T14:02:22Z

Related PR: #319.

@renaudhartert-db Looking at the related PR, it's the responsibility of the caller to enable chunking or not. Was there a specific issue or regression you saw where this didn't happen or do you just want to do the right thing by default here?

renaudhartert-db · 2024-11-07T15:15:01Z

> Related PR: #319.
>
> @renaudhartert-db Looking at the related PR, it's the responsibility of the caller to enable chunking or not. Was there a specific issue or regression you saw where this didn't happen or do you just want to do the right thing by default here?

Thanks for calling this out, I've updated the PR description to clarify the intent.

databricks/sdk/_base_client.py

Signed-off-by: Renaud Hartert <renaud.hartert@databricks.com>

github-actions · 2024-11-08T11:35:17Z

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/sdk-py

Inputs:

PR number: 817
Commit SHA: b9070bc4104572b1267164813f8e534dcc0293a5

Checks will be approved automatically on success.

eng-dev-ecosystem-bot · 2024-11-08T11:35:38Z

Test Details: go/deco-tests/11741381709

### New Features and Improvements

* Read streams by 1MB chunks by default. ([#817](#817)).

### Bug Fixes

* Rewind seekable streams before retrying ([#821](#821)).

### Internal Changes

* Reformat SDK with YAPF 0.43. ([#822](#822)).
* Update Jobs GetRun API to support paginated responses for jobs and ForEach tasks ([#819](#819)).
* Update PR template ([#814](#814)).

### API Changes:

* Added `databricks.sdk.service.apps`, `databricks.sdk.service.billing`, `databricks.sdk.service.catalog`, `databricks.sdk.service.compute`, `databricks.sdk.service.dashboards`, `databricks.sdk.service.files`, `databricks.sdk.service.iam`, `databricks.sdk.service.jobs`, `databricks.sdk.service.marketplace`, `databricks.sdk.service.ml`, `databricks.sdk.service.oauth2`, `databricks.sdk.service.pipelines`, `databricks.sdk.service.provisioning`, `databricks.sdk.service.serving`, `databricks.sdk.service.settings`, `databricks.sdk.service.sharing`, `databricks.sdk.service.sql`, `databricks.sdk.service.vectorsearch` and `databricks.sdk.service.workspace` packages.

OpenAPI SHA: 2035bf5234753adfd080a79bff325dd4a5b90bc2, Date: 2024-11-15

### New Features and Improvements

* Read streams by 1MB chunks by default. ([#817](#817)).

### Bug Fixes

* Rewind seekable streams before retrying ([#821](#821)).
* Properly serialize nested data classes.

### Internal Changes

* Reformat SDK with YAPF 0.43. ([#822](#822)).
* Update Jobs GetRun API to support paginated responses for jobs and ForEach tasks ([#819](#819)).

### API Changes:

* Added `service_principal_client_id` field for `databricks.sdk.service.apps.App`.
* Added `azure_service_principal`, `gcp_service_account_key` and `read_only` fields for `databricks.sdk.service.catalog.CreateCredentialRequest`.
* Added `azure_service_principal`, `read_only` and `used_for_managed_storage` fields for `databricks.sdk.service.catalog.CredentialInfo`.
* Added `omit_username` field for `databricks.sdk.service.catalog.ListTablesRequest`.
* Added `azure_service_principal` and `read_only` fields for `databricks.sdk.service.catalog.UpdateCredentialRequest`.
* Added `external_location_name`, `read_only` and `url` fields for `databricks.sdk.service.catalog.ValidateCredentialRequest`.
* Added `is_dir` field for `databricks.sdk.service.catalog.ValidateCredentialResponse`.
* Added `only` field for `databricks.sdk.service.jobs.RunNow`.
* Added `restart_window` field for `databricks.sdk.service.pipelines.CreatePipeline`.
* Added `restart_window` field for `databricks.sdk.service.pipelines.EditPipeline`.
* Added `restart_window` field for `databricks.sdk.service.pipelines.PipelineSpec`.
* Added `private_access_settings_id` field for `databricks.sdk.service.provisioning.UpdateWorkspaceRequest`.
* Changed `create_credential()` and `generate_temporary_service_credential()` methods for [w.credentials](https://databricks-sdk-py.readthedocs.io/en/latest/workspace/credentials.html) workspace-level service with new required argument order.
* Changed `access_connector_id` field for `databricks.sdk.service.catalog.AzureManagedIdentity` to be required.
* Changed `name` field for `databricks.sdk.service.catalog.CreateCredentialRequest` to be required.
* Changed `credential_name` field for `databricks.sdk.service.catalog.GenerateTemporaryServiceCredentialRequest` to be required.

OpenAPI SHA: f2385add116e3716c8a90a0b68e204deb40f996c, Date: 2024-11-15

renaudhartert-db added 3 commits November 4, 2024 19:57

Update PR template

bf731f4

Small rewording

8b14f51

Read stream per chunk

ff5c54d

renaudhartert-db temporarily deployed to test-trigger-is November 5, 2024 17:14 — with GitHub Actions Inactive

Adjust comments

2ec34c8

renaudhartert-db temporarily deployed to test-trigger-is November 5, 2024 17:24 — with GitHub Actions Inactive

Merge branch 'main' into renaud.hartert/chunk-stream

97d0d88

renaudhartert-db temporarily deployed to test-trigger-is November 5, 2024 17:29 — with GitHub Actions Inactive

Make fmt

78290bc

renaudhartert-db requested a review from pietern November 5, 2024 17:29

Merge branch 'renaud.hartert/chunk-stream' of github.com:databricks/databricks-sdk-py into chunk-stream

c07a3fe

renaudhartert-db temporarily deployed to test-trigger-is November 5, 2024 17:33 — with GitHub Actions Inactive

ksafonov-db reviewed Nov 7, 2024

tests/test_base_client.py (outdated, resolved)

databricks/sdk/_base_client.py (outdated, resolved)

renaudhartert-db changed the title ~~[Fix] Read streams by 1MB chunks by default.~~ [Feature] Read streams by 1MB chunks by default. Nov 7, 2024

Use random test data

0c565c8

renaudhartert-db temporarily deployed to test-trigger-is November 7, 2024 16:34 — with GitHub Actions Inactive

Make streaming_buffer_size required

b8b4285

renaudhartert-db temporarily deployed to test-trigger-is November 8, 2024 10:41 — with GitHub Actions Inactive

renaudhartert-db requested a review from ksafonov-db November 8, 2024 10:41

renaudhartert-db temporarily deployed to test-trigger-is November 8, 2024 10:41 — with GitHub Actions Inactive

pietern approved these changes Nov 8, 2024

databricks/sdk/_base_client.py (outdated, resolved)

Update _base_client.py

b9070bc

Signed-off-by: Renaud Hartert <renaud.hartert@databricks.com>

renaudhartert-db temporarily deployed to test-trigger-is November 8, 2024 11:35 — with GitHub Actions Inactive

renaudhartert-db enabled auto-merge November 8, 2024 11:35

renaudhartert-db added this pull request to the merge queue Nov 8, 2024

Merged via the queue into main with commit 2143e35 Nov 8, 2024
19 checks passed

renaudhartert-db deleted the renaud.hartert/chunk-stream branch November 8, 2024 12:15

This was referenced Nov 18, 2024

[Release] Release v0.38.0 #827

Closed

[Release] Release v0.38.0 #826

Merged

renaudhartert-db mentioned this pull request Nov 18, 2024

[Internal] Bump release number to 0.38.0 #828

Merged
