Description
Is your feature request related to a problem? Please describe
While indexing, every field ingested into OpenSearch is also persisted in the _source stored field. If needed, users can exclude individual fields from _source or disable it entirely. But once _source is disabled, a _recovery_source field is added instead (ref), which is only pruned later once it is no longer required for recovery.
So the whole payload is still written as a StoredField, and that impacts indexing time. The impact is especially high when one of the fields is a vector field: in my experiments with a 768D, 1M-vector dataset, removing this stored payload reduced p90 indexing latency by ~50%.
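For context, the mapping below shows how _source can be disabled today; it is exactly this configuration that triggers the _recovery_source fallback described above. Index and field names are illustrative, and the method parameters simply mirror the benchmark setup:

```
# Illustrative mapping: disabling _source is what causes _recovery_source to be stored
PUT /my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "space_type": "l2",
          "parameters": {
            "ef_construction": 256
          }
        }
      }
    }
  }
}
```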
Benchmarking Results
Below are the benchmarking results. Refer to Appendix A for the flame graphs.
Cluster Configuration
| Setting | Value |
|---|---|
| Data Node Type | r5.2xlarge |
| Data Node Count | 3 |
| Dataset | cohere-768D |
| Number of Vectors | 1M |
| Tool | OSB (OpenSearch Benchmark) |
| Indexing Clients | 1 |
| Number of Shards | 3 |
| Index Thread Qty | 1 |
| Engine | nmslib |
| EF Construction | 256 |
| EF Search | 256 |
Baseline Results
| Metric | Task | Value | Unit |
|---|---|---|---|
| Cumulative indexing time of primary shards | | 6.3266 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 1.99598 | min |
| Max cumulative indexing time across primary shards | | 2.27533 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 45.0456 | min |
| Cumulative merge count of primary shards | | 7 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 14.4361 | min |
| Max cumulative merge time across primary shards | | 15.8919 | min |
| Cumulative merge throttle time of primary shards | | 0.168133 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0.0523167 | min |
| Max cumulative merge throttle time across primary shards | | 0.0632667 | min |
| Cumulative refresh time of primary shards | | 6.0824 | min |
| Cumulative refresh count of primary shards | | 144 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 1.90423 | min |
| Max cumulative refresh time across primary shards | | 2.14323 | min |
| Cumulative flush time of primary shards | | 23.2522 | min |
| Cumulative flush count of primary shards | | 32 | |
| Min cumulative flush time across primary shards | | 0 | min |
| Median cumulative flush time across primary shards | | 7.4859 | min |
| Max cumulative flush time across primary shards | | 8.08212 | min |
| Total Young Gen GC time | | 0.927 | s |
| Total Young Gen GC count | | 39 | |
| Total Old Gen GC time | | 0 | s |
| Total Old Gen GC count | | 0 | |
| Store size | | 12.4152 | GB |
| Translog size | | 5.6345e-07 | GB |
| Min Throughput | custom-vector-bulk | 885.98 | docs/s |
| Mean Throughput | custom-vector-bulk | 1017.74 | docs/s |
| Median Throughput | custom-vector-bulk | 1007.87 | docs/s |
| Max Throughput | custom-vector-bulk | 1082.07 | docs/s |
| 50th percentile latency | custom-vector-bulk | 44.5227 | ms |
| 90th percentile latency | custom-vector-bulk | 59.4799 | ms |
| 99th percentile latency | custom-vector-bulk | 185.057 | ms |
| 99.9th percentile latency | custom-vector-bulk | 335.08 | ms |
| 99.99th percentile latency | custom-vector-bulk | 490.821 | ms |
| 100th percentile latency | custom-vector-bulk | 545.323 | ms |
Removing _recovery_source and _source
POC code: navneet1v@6c5896a
| Metric | Task | Value | Unit |
|---|---|---|---|
| Cumulative indexing time of primary shards | | 5.49067 | min |
| Min cumulative indexing time across primary shards | | 1.71528 | min |
| Median cumulative indexing time across primary shards | | 1.77818 | min |
| Max cumulative indexing time across primary shards | | 1.9972 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 0.3878 | min |
| Cumulative merge count of primary shards | | 3 | |
| Min cumulative merge time across primary shards | | 0.118133 | min |
| Median cumulative merge time across primary shards | | 0.133017 | min |
| Max cumulative merge time across primary shards | | 0.13665 | min |
| Cumulative merge throttle time of primary shards | | 0 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0 | min |
| Max cumulative merge throttle time across primary shards | | 0 | min |
| Cumulative refresh time of primary shards | | 13.38 | min |
| Cumulative refresh count of primary shards | | 86 | |
| Min cumulative refresh time across primary shards | | 4.0509 | min |
| Median cumulative refresh time across primary shards | | 4.11963 | min |
| Max cumulative refresh time across primary shards | | 5.20948 | min |
| Cumulative flush time of primary shards | | 31.9997 | min |
| Cumulative flush count of primary shards | | 27 | |
| Min cumulative flush time across primary shards | | 10.0862 | min |
| Median cumulative flush time across primary shards | | 10.2099 | min |
| Max cumulative flush time across primary shards | | 11.7035 | min |
| Store size | | 11.7547 | GB |
| Translog size | | 1.19775 | GB |
| Min Throughput | custom-vector-bulk | 932.63 | docs/s |
| Mean Throughput | custom-vector-bulk | 1133.94 | docs/s |
| Median Throughput | custom-vector-bulk | 1131.69 | docs/s |
| Max Throughput | custom-vector-bulk | 1167.02 | docs/s |
| 50th percentile latency | custom-vector-bulk | 39.0611 | ms |
| 90th percentile latency | custom-vector-bulk | 47.5916 | ms |
| 99th percentile latency | custom-vector-bulk | 83.2863 | ms |
| 99.9th percentile latency | custom-vector-bulk | 177.238 | ms |
| 99.99th percentile latency | custom-vector-bulk | 288.076 | ms |
| 100th percentile latency | custom-vector-bulk | 319.616 | ms |
| error rate | custom-vector-bulk | 0 | % |
Describe the solution you'd like
Just like _source, where we can specify which fields are included in or excluded from _source, or disable _source completely, I would like the same capability for _recovery_source. This will ensure that users can exclude specific fields (such as large vector fields) from _recovery_source if required.
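As a sketch, the mapping could mirror the existing _source includes/excludes syntax. Note that the _recovery_source block below is proposed, hypothetical syntax, not an existing API, and all names are illustrative:

```
# Proposed (hypothetical) syntax: _recovery_source excludes, mirroring _source
PUT /my-vector-index
{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "_recovery_source": {
      "excludes": ["my_vector"]
    },
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768
      }
    }
  }
}
```

Mirroring the _source semantics would keep the mental model consistent for users who already manage _source includes/excludes.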
Related component
Indexing:Performance
Describe alternatives you've considered
There is no alternative: currently _recovery_source cannot be disabled or trimmed per field for an index.
FAQ
Q1: If a user needs the vector field, how can they retrieve it? Also, _recovery_source and _source are used for other purposes such as update-by-query and disaster recovery; how will those be supported?
In the k-NN repo we are working on a PR that ensures the k-NN vector field is read from doc values instead. Ref: opensearch-project/k-NN#1571. For other fields that require such capabilities, support can be added in core in an incremental fashion if needed.
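Assuming the linked k-NN change lands, a vector stored only in doc values could then be fetched at search time via docvalue_fields rather than from _source. This request is a hypothetical sketch; whether knn_vector supports docvalue_fields retrieval depends on that PR:

```
# Assumes doc-values retrieval for knn_vector from opensearch-project/k-NN#1571
GET /my-vector-index/_search
{
  "query": {
    "match_all": {}
  },
  "docvalue_fields": ["my_vector"]
}
```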
Appendix A
Flame graph when _source/_recovery_source is stored during indexing of vector fields: [flame graph image]
Flame graph when _source and _recovery_source are not stored: [flame graph image]