-
Notifications
You must be signed in to change notification settings - Fork 103
Improve upsert throughput by 3x #334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e248553
to
1b8488a
Compare
Upgrade protobuf from v3 to v4. This adds a number of performance improvements in parsing / serialization as documented at https://protobuf.dev/news/2022-05-06/#python-updates This increases upsert() throughput by 3x (measured by upserting 1M 768 dimension indexes to a pod-based index in batches of 500): Before: 880 vectors/sec After: 2580 vectors/sec As per the documentation, this results in an incompatible change with the _generated_ Python code, so this depends on a related change to pinecone-protos to change the version of protobuf used to generate the Python code there. This patch: - Changes the version of protobuf package from v3 to current latest v4 (v4.25.3). - Updates the generated pb2 files from pinecone-protos to the protobuf v4 generated version. - Switches from grpc-gateway-protoc-gen-openapiv2 to protoc-gen-openapiv2 as the support library needed to pull in the autogeneated openapi annotations dependancy - which changes format between v3 and v4. - protobuf v4 returns less specific exception messages during serialisation in some cases, for those which we are testing for add explicit checks to return simialr messages as v3 did.
1b8488a
to
dc3d946
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pushing this forward. A couple things:
- I left a question below about the new dependency and if it's needed by the user at runtime
- Also, I want to make sure I understand the process you followed to generate this so we don't blow away your changes next time we make an update. We can discuss this over slack.
Except for that, I think this should be ready to merge. I appreciate that you added the nice error message.
googleapis-common-protos = { version = ">=1.53.0", optional = true } | ||
lz4 = { version = ">=3.1.3", optional = true } | ||
protobuf = { version = "~=3.20.0", optional = true } | ||
protobuf = { version = "^4.25", optional = true } | ||
protoc-gen-openapiv2 = {version = "^0.0.1", optional = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This protoc-gen-openapiv2
dependency looks a bit sus having only 1 release, the underlying repo having no stars, and otherwise looking unofficial. How did you come across this package?
Also, is this actually something the user's app depends on at runtime or is it only used during code generation? So essentially a dev-dependency. If it's only used in the codegen process, we don't need to declare it in this deps list which are for the consumers of this package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's pretty much as sus as the package it replaced (grpc-gateway-protoc-gen-openapiv2) ;)
I came across it when searching for a newer package providing protobuf v4 Python generation for google.api.annotations - In essence this is two different people fixing a problem in upstream grpc-gateway where they fail to automatically generate code for dependant .proto files for Python - see grpc-ecosystem/grpc-gateway#1785
The alternative would be to essentially roll what this package does (generate Python bindings from google.api.annotations
etc) into our own package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding @jhamon's question around dev dependency vs. runtime dependency: do we need to move this at all or is it essentially replacing what was there and is still needed for runtime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's replacing the previous package, so we still need it at runtime (see the import on line 18 of pinecone/core/grpc/protos/vector_service_pb2.py):
from protoc_gen_openapiv2.options import annotations_pb2 as protoc__gen__openapiv2_dot_options_dot_annotations__pb2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tracking this one down and adding in some additional improvements. Also pushing up the corresponding changes to the pinecone-protos
repo, thanks!
googleapis-common-protos = { version = ">=1.53.0", optional = true } | ||
lz4 = { version = ">=3.1.3", optional = true } | ||
protobuf = { version = "~=3.20.0", optional = true } | ||
protobuf = { version = "^4.25", optional = true } | ||
protoc-gen-openapiv2 = {version = "^0.0.1", optional = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding @jhamon's question around dev dependency vs. runtime dependency: do we need to move this at all or is it essentially replacing what was there and is still needed for runtime?
* 'main' of github.com:pinecone-io/pinecone-python-client: [skip ci] Bump version to v4.1.0 Bump tqdm from 4.66.1 to 4.66.3 (pinecone-io#344) Bump idna from 3.4 to 3.7 (pinecone-io#345) Bump jinja2 from 3.1.3 to 3.1.4 (pinecone-io#343) Add better error messages for mistaken `from_texts` and `from_documents` (pinecone-io#342) Support proxy_url and ssl_ca_certs options for gRPC (pinecone-io#341) Remove serverless public preview warnings (pinecone-io#340) [skip ci] Bump version to v4.0.0 Improve upsert throughput by 3x (pinecone-io#334) Remove `merge` workflow and update `build-and-publish-docs` workflow to be manually runnable (pinecone-io#335) [skip ci] Bump version to v3.2.2 [Fix] openapi_config deprecation warning incorrectly shown (pinecone-io#327) Add grpc unit test run, expand testing of VectorFactory (pinecone-io#326) [skip ci] Bump version to v3.2.1 Allow clients to tag requests with a source_tag (pinecone-io#324) [skip ci] Bump version to v3.2.0 Revise proxy configuration, add integration testing (pinecone-io#325) [Fix] Configuring SSL proxy via openapi_config object (pinecone-io#321) Update README.md (pinecone-io#323)
Problem
Python SDK upsert throughput is low compared to other SDKs - for example I can achive 880 vector upserts/sec with the Python SDK, compared to 3500 upserts/sec with the Java SDK.
Profiling the Python SDK performing these upserts shows a large percentage of time in gRPC / protobuf serialisation / deserialisation.
Solution
Upgrade protobuf from v3 to v4. This adds a number of performance improvements in parsing / serialization as documented at https://protobuf.dev/news/2022-05-06/#python-updates
This increases upsert() throughput by 3x (measured by upserting 1M 768 dimension indexes to a pod-based index in batches of 500):
As per the documentation, this results in an incompatible change with the generated Python code, so this depends on a related change to pinecone-protos to change the version of protobuf used to generate the Python code there.
Type of Change
Test Plan
Use existing regression tests.