Description
Is your feature request related to a problem? Please describe
While indexing, every field ingested into OpenSearch is also persisted in the _source stored field. If needed, users can exclude individual fields from _source or disable it entirely. But once _source is disabled, a _recovery_source field is added instead (ref), which is only pruned later once it is no longer required for recovery.
So the whole payload is still written as a StoredField, and that impacts indexing time. The impact is especially high when one of the fields is a vector field: in my experiments with a 768D, 1M-vector dataset, removing this stored payload reduced p90 indexing latency by ~50%.
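For context, the mapping below shows how _source can be disabled today; it is exactly this configuration that triggers the _recovery_source fallback described above. Index and field names are illustrative, and the method parameters simply mirror the benchmark setup:

```
# Illustrative mapping: disabling _source is what causes _recovery_source to be stored
PUT /my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "space_type": "l2",
          "parameters": {
            "ef_construction": 256
          }
        }
      }
    }
  }
}
```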
Benchmarking Results
Below are the benchmarking results. Refer to Appendix A for the flame graphs.
Cluster Configuration
| Setting | Value |
|---|---|
| Data Node Type | r5.2xlarge |
| Data Node Count | 3 |
| Dataset | cohere-768D |
| Number of Vectors | 1M |
| Tool | OSB (OpenSearch Benchmark) |
| Indexing Clients | 1 |
| Number of Shards | 3 |
| Index Thread Qty | 1 |
| Engine | nmslib |
| EF Construction | 256 |
| EF Search | 256 |
Baseline Results
| Metric | Task | Value | Unit |
|---|---|---|---|
| Cumulative indexing time of primary shards | | 6.3266 | min |
| Min cumulative indexing time across primary shards | | 0 | min |
| Median cumulative indexing time across primary shards | | 1.99598 | min |
| Max cumulative indexing time across primary shards | | 2.27533 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 45.0456 | min |
| Cumulative merge count of primary shards | | 7 | |
| Min cumulative merge time across primary shards | | 0 | min |
| Median cumulative merge time across primary shards | | 14.4361 | min |
| Max cumulative merge time across primary shards | | 15.8919 | min |
| Cumulative merge throttle time of primary shards | | 0.168133 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0.0523167 | min |
| Max cumulative merge throttle time across primary shards | | 0.0632667 | min |
| Cumulative refresh time of primary shards | | 6.0824 | min |
| Cumulative refresh count of primary shards | | 144 | |
| Min cumulative refresh time across primary shards | | 0 | min |
| Median cumulative refresh time across primary shards | | 1.90423 | min |
| Max cumulative refresh time across primary shards | | 2.14323 | min |
| Cumulative flush time of primary shards | | 23.2522 | min |
| Cumulative flush count of primary shards | | 32 | |
| Min cumulative flush time across primary shards | | 0 | min |
| Median cumulative flush time across primary shards | | 7.4859 | min |
| Max cumulative flush time across primary shards | | 8.08212 | min |
| Total Young Gen GC time | | 0.927 | s |
| Total Young Gen GC count | | 39 | |
| Total Old Gen GC time | | 0 | s |
| Total Old Gen GC count | | 0 | |
| Store size | | 12.4152 | GB |
| Translog size | | 5.6345e-07 | GB |
| Min Throughput | custom-vector-bulk | 885.98 | docs/s |
| Mean Throughput | custom-vector-bulk | 1017.74 | docs/s |
| Median Throughput | custom-vector-bulk | 1007.87 | docs/s |
| Max Throughput | custom-vector-bulk | 1082.07 | docs/s |
| 50th percentile latency | custom-vector-bulk | 44.5227 | ms |
| 90th percentile latency | custom-vector-bulk | 59.4799 | ms |
| 99th percentile latency | custom-vector-bulk | 185.057 | ms |
| 99.9th percentile latency | custom-vector-bulk | 335.08 | ms |
| 99.99th percentile latency | custom-vector-bulk | 490.821 | ms |
| 100th percentile latency | custom-vector-bulk | 545.323 | ms |
Removing _recovery_source and _source
POC code: navneet1v@6c5896a
| Metric | Task | Value | Unit |
|---|---|---|---|
| Cumulative indexing time of primary shards | | 5.49067 | min |
| Min cumulative indexing time across primary shards | | 1.71528 | min |
| Median cumulative indexing time across primary shards | | 1.77818 | min |
| Max cumulative indexing time across primary shards | | 1.9972 | min |
| Cumulative indexing throttle time of primary shards | | 0 | min |
| Min cumulative indexing throttle time across primary shards | | 0 | min |
| Median cumulative indexing throttle time across primary shards | | 0 | min |
| Max cumulative indexing throttle time across primary shards | | 0 | min |
| Cumulative merge time of primary shards | | 0.3878 | min |
| Cumulative merge count of primary shards | | 3 | |
| Min cumulative merge time across primary shards | | 0.118133 | min |
| Median cumulative merge time across primary shards | | 0.133017 | min |
| Max cumulative merge time across primary shards | | 0.13665 | min |
| Cumulative merge throttle time of primary shards | | 0 | min |
| Min cumulative merge throttle time across primary shards | | 0 | min |
| Median cumulative merge throttle time across primary shards | | 0 | min |
| Max cumulative merge throttle time across primary shards | | 0 | min |
| Cumulative refresh time of primary shards | | 13.38 | min |
| Cumulative refresh count of primary shards | | 86 | |
| Min cumulative refresh time across primary shards | | 4.0509 | min |
| Median cumulative refresh time across primary shards | | 4.11963 | min |
| Max cumulative refresh time across primary shards | | 5.20948 | min |
| Cumulative flush time of primary shards | | 31.9997 | min |
| Cumulative flush count of primary shards | | 27 | |
| Min cumulative flush time across primary shards | | 10.0862 | min |
| Median cumulative flush time across primary shards | | 10.2099 | min |
| Max cumulative flush time across primary shards | | 11.7035 | min |
| Store size | | 11.7547 | GB |
| Translog size | | 1.19775 | GB |
| Min Throughput | custom-vector-bulk | 932.63 | docs/s |
| Mean Throughput | custom-vector-bulk | 1133.94 | docs/s |
| Median Throughput | custom-vector-bulk | 1131.69 | docs/s |
| Max Throughput | custom-vector-bulk | 1167.02 | docs/s |
| 50th percentile latency | custom-vector-bulk | 39.0611 | ms |
| 90th percentile latency | custom-vector-bulk | 47.5916 | ms |
| 99th percentile latency | custom-vector-bulk | 83.2863 | ms |
| 99.9th percentile latency | custom-vector-bulk | 177.238 | ms |
| 99.99th percentile latency | custom-vector-bulk | 288.076 | ms |
| 100th percentile latency | custom-vector-bulk | 319.616 | ms |
| error rate | custom-vector-bulk | 0 | % |
Describe the solution you'd like
Just like _source, where we can specify which fields are included in or excluded from _source, or disable _source completely, I would like the same capability for _recovery_source. This will ensure that users can exclude specific fields (such as large vector fields) from _recovery_source if required.
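As a sketch, the mapping could mirror the existing _source includes/excludes syntax. Note that the _recovery_source block below is proposed, hypothetical syntax, not an existing API, and all names are illustrative:

```
# Proposed (hypothetical) syntax: _recovery_source excludes, mirroring _source
PUT /my-vector-index
{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "_recovery_source": {
      "excludes": ["my_vector"]
    },
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768
      }
    }
  }
}
```

Mirroring the _source semantics would keep the mental model consistent for users who already manage _source includes/excludes.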
Related component
Indexing:Performance
Describe alternatives you've considered
There is no alternative: currently _recovery_source cannot be disabled or trimmed per field for an index.
FAQ
Q1: If a user needs the vector field, how can they retrieve it? Also, _recovery_source and _source are used for other purposes such as update-by-query and disaster recovery; how will those be supported?
In the k-NN repo we are working on a PR that ensures the k-NN vector field is read from doc values instead. Ref: opensearch-project/k-NN#1571. For other fields that require such capabilities, support can be added in core in an incremental fashion if needed.
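Assuming the linked k-NN change lands, a vector stored only in doc values could then be fetched at search time via docvalue_fields rather than from _source. This request is a hypothetical sketch; whether knn_vector supports docvalue_fields retrieval depends on that PR:

```
# Assumes doc-values retrieval for knn_vector from opensearch-project/k-NN#1571
GET /my-vector-index/_search
{
  "query": {
    "match_all": {}
  },
  "docvalue_fields": ["my_vector"]
}
```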
Appendix A
Flame graph when _source/_recovery_source is stored during indexing of vector fields: [flame graph image]
Flame graph when _source and _recovery_source are not stored: [flame graph image]