Fix memory overflow caused by cache behavior #2015

Merged 1 commit into opensearch-project:main on Sep 4, 2024

Conversation

kotwanikunal
Member

Description

  • Adds a force-evict mechanism that runs before an entry is added, to prevent native memory overflow within the cache
  • This solution works in tandem with jemalloc enabled. With the default malloc, memory free-up operations can be delayed, leading to similar overflow scenarios
  • The solution has been tested and benchmarked; results are described in the Testing section below

Problem

  • The problem stems from the cache's behavior: entries might not be cleaned up as soon as they become evictable
  • This delay between a new entry and cleanup can, in our case, cause memory to overflow, leading to a node failure
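The overflow window described above can be sketched in a few lines. This is a hypothetical, simplified cache (not the plugin's actual implementation): it inserts the new entry first and only then runs a lazy maintenance pass, so measured usage can briefly exceed the limit between the two steps.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: a size-bounded cache whose eviction runs lazily,
// after the new entry has already been inserted. Between insertion and the
// cleanup pass, total usage can exceed the configured limit.
public final class LazyEvictionCache {
    private final long limitBytes;
    private long usedBytes = 0;
    // key -> entry size, in insertion order (oldest first)
    private final Map<String, Long> entries = new LinkedHashMap<>();

    public LazyEvictionCache(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Inserts first, evicts later; returns the peak usage observed before cleanup. */
    public long put(String key, long sizeBytes) {
        entries.put(key, sizeBytes);
        usedBytes += sizeBytes;
        long peak = usedBytes;   // this is the overflow window
        cleanUp();               // lazy maintenance pass
        return peak;
    }

    private void cleanUp() {
        Iterator<Map.Entry<String, Long>> it = entries.entrySet().iterator();
        while (usedBytes > limitBytes && it.hasNext()) {
            usedBytes -= it.next().getValue();
            it.remove();
        }
    }

    public long usedBytes() {
        return usedBytes;
    }
}
```

With heap memory the window is usually harmless; with large native allocations (graph files), the peak itself can exceed physical memory and take the node down before cleanup runs.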

Solution

  • This solution maintains an additional recency list within the cache manager
  • On every get call:
    • on a cache hit, the key is marked as recently used and the value is returned
    • on a cache miss, the key is added to the recency list, the item is loaded into the cache, and the value is returned
    • on a cache miss with a potential overflow, the eviction process kicks off before the item is loaded into the cache
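The get path above can be sketched as follows. This is a minimal illustration under assumed names (`RecencyCache`, `loader`, sizes passed explicitly) — not the plugin's actual API; the key point is that forced eviction happens before the load, closing the overflow window.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the described get path: a recency list tracks key
// order, and a forced eviction pass runs before loading whenever the new
// entry would push usage over the limit.
public final class RecencyCache {
    private final long limitBytes;
    private long usedBytes = 0;
    private final Map<String, Long> sizes = new HashMap<>();
    private final Map<String, Object> values = new HashMap<>();
    private final Deque<String> recency = new ArrayDeque<>(); // least recent first

    public RecencyCache(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    public Object get(String key, long sizeBytes, Function<String, Object> loader) {
        if (values.containsKey(key)) {
            // cache hit: mark the key as recent and return the value
            recency.remove(key);
            recency.addLast(key);
            return values.get(key);
        }
        // cache miss with a potential overflow: force-evict BEFORE loading
        while (usedBytes + sizeBytes > limitBytes && !recency.isEmpty()) {
            String victim = recency.pollFirst();
            usedBytes -= sizes.remove(victim);
            values.remove(victim);
        }
        // cache miss: record recency, load the item, and return it
        recency.addLast(key);
        sizes.put(key, sizeBytes);
        usedBytes += sizeBytes;
        Object value = loader.apply(key);
        values.put(key, value);
        return value;
    }

    public long usedBytes() {
        return usedBytes;
    }
}
```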

Testing

  1. Case: Enough memory available
  • Scenario: Indexed 1M 768-dimensional vectors on a 3-node cluster with 16 GB memory (8 GB heap, 8 GB for graph files); ran an OSB query workload with and without the fix.
  • TL;DR: Little deviation from the current code in terms of raw numbers. Results below.

| Index Type | Dimension | Number of Vectors | Min Throughput | Mean Throughput | Median Throughput | Max Throughput | p50 Latency | p90 Latency | p99 Latency | p100 Latency | p50 Service Time | p90 Service Time | p99 Service Time | p100 Service Time | Mean recall@k | Mean recall@1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Current Code | 768 | 1,000,000 | 2.46 | 2.68 | 2.69 | 2.74 | 358.268 | 382.036 | 414.636 | 482.332 | 358.268 | 382.036 | 414.636 | 482.332 | 0.98 | 1 |
| Code with Fix | 768 | 1,000,000 | 2.55 | 2.71 | 2.7 | 2.75 | 363.62 | 398.338 | 432.811 | 447.278 | 363.62 | 398.338 | 432.811 | 447.278 | 0.98 | 1 |
  2. Case: Heavy query load
  • Scenario: Created 3 indexes (2 FAISS, 1 NMSLIB), indexed 1M 768-dimensional vectors on a 3-node cluster with 16 GB memory (8 GB heap, 8 GB for graph files), and ran 3 parallel query workloads against all the indexes at once.
  • TL;DR: Without the fix, node crashes were observed. With the fix, there was an uptick in errors due to timeouts and latency increased, but no node crash was observed.

| Index Type | Dimension | Number of Vectors | Min Throughput | Mean Throughput | Median Throughput | Max Throughput | p50 Latency | p90 Latency | p99 Latency | p100 Latency | p50 Service Time | p90 Service Time | p99 Service Time | p100 Service Time | Error Rate (%) | Mean recall@k | Mean recall@1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Index1 with fix | 768 | 1,000,000 | 0.05 | 0.06 | 0.06 | 0.09 | 13228.9 | 19363.3 | 24602.8 | 35710.1 | 13228.9 | 19363.3 | 24602.8 | 35710.1 | 11 | 0.99 | 1 |
| Index2 with fix | 768 | 1,000,000 | 0.1 | 0.25 | 0.14 | 1.02 | 11728.5 | 15634.2 | 23228.2 | 34283.1 | 11728.5 | 15634.2 | 23228.2 | 34283.1 | 0 | 0.98 | 1 |
| Index3 with fix | 768 | 1,000,000 | 0.06 | 0.07 | 0.07 | 0.09 | 12575 | 17641.8 | 25928.4 | 27722.8 | 12575 | 17641.8 | 25928.4 | 27722.8 | 3 | 0.98 | 1 |

Related Issues

Partially resolves #1582 (Case 3)

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@navneet1v added the Enhancements and search-improvements labels Sep 3, 2024
@navneet1v navneet1v requested a review from heemin32 September 3, 2024 00:28
@kotwanikunal
Member Author

kotwanikunal commented Sep 3, 2024


Ran some tests for LinkedHashSet vs the current queue implementation. Results:

| Index Type | Dimension | Number of Vectors | Min Throughput | Mean Throughput | Median Throughput | Max Throughput | p50 Latency | p90 Latency | p99 Latency | p100 Latency | Error Rate (%) | Mean recall@k | Mean recall@1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Single Query Queue | 768 | 1,000,000 | 1.28 | 2.37 | 2.46 | 2.58 | 368.768 | 417.358 | 483.694 | 761.651 | 0 | 0.98 | 0.97 |
| Single Query Sync Set | 768 | 1,000,000 | 1.1 | 1.97 | 2.07 | 2.3 | 407.133 | 516.032 | 658.99 | 889.347 | 0 | 0.98 | 0.97 |
| Parallel Query Queue | 768 | 1,000,000 | 0.07 | 0.07 | 0.07 | 0.09 | 13465.6 | 18455.7 | 31858 | 33136.6 | 0 | 0.99 | 1 |
| Parallel Query Sync Set | 768 | 1,000,000 | 0.06 | 0.07 | 0.06 | 0.08 | 13937 | 18936.5 | 28190.5 | 28804.4 | 8 | 0.99 | 1 |
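The structural difference between the two benchmarked candidates can be illustrated with a small sketch. This is an assumed simplification of the trade-off (names like `RecencyTrackers` and `touch*` are illustrative): a `ConcurrentLinkedQueue` offers lock-free appends but tolerates duplicate entries, while a synchronized `LinkedHashSet` deduplicates but serializes every operation behind one lock.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative comparison of the two recency-tracking candidates.
public final class RecencyTrackers {
    // Candidate 1: lock-free queue. A key may appear more than once, so the
    // eviction path must tolerate (or skip) stale duplicate entries.
    public static Queue<String> queueTracker() {
        return new ConcurrentLinkedQueue<>();
    }

    // Candidate 2: insertion-ordered set behind a single lock. No duplicates,
    // but every touch contends on the same monitor.
    public static Set<String> syncSetTracker() {
        return Collections.synchronizedSet(new LinkedHashSet<>());
    }

    // Marking a key as recent: the queue just appends in O(1) without locking.
    public static void touchQueue(Queue<String> queue, String key) {
        queue.offer(key);
    }

    // The set must remove and re-add to move the key to the tail of the
    // iteration order; both operations take the shared lock.
    public static void touchSet(Set<String> set, String key) {
        set.remove(key);
        set.add(key);
    }
}
```

The lock-free append is consistent with the queue's better throughput and tail latency in the numbers above, at the cost of duplicate entries to skip during eviction.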

Signed-off-by: Kunal Kotwani <kkotwani@amazon.com>
Collaborator

@navneet1v navneet1v left a comment


Approving the PR. I understand that updating an element in the queue is not fully optimized, but since the solution provides more stability in memory-constrained situations, I am good with the approach.

I would suggest turning on the feature flag so that we can see its impact in the nightlies, but I will leave this up to the author.

Collaborator

@heemin32 heemin32 left a comment


Approving it, as this is behind a feature flag that is false by default. We might need a test with a higher number of segments.

@navneet1v navneet1v merged commit d4af93e into opensearch-project:main Sep 4, 2024
28 of 29 checks passed
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

```shell
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-2015-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 d4af93edfb45878fdb0a442f4ac07e091b5ad14b
# Push it to GitHub
git push --set-upstream origin backport/backport-2015-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x
```

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-2015-to-2.x.

kotwanikunal added a commit to kotwanikunal/k-NN that referenced this pull request Sep 4, 2024
Signed-off-by: Kunal Kotwani <kkotwani@amazon.com>
(cherry picked from commit d4af93e)
@kotwanikunal
Member Author

Backport: #2032

naveentatikonda pushed a commit that referenced this pull request Sep 4, 2024
Signed-off-by: Kunal Kotwani <kkotwani@amazon.com>
(cherry picked from commit d4af93e)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Sep 4, 2024
Signed-off-by: Kunal Kotwani <kkotwani@amazon.com>
(cherry picked from commit d4af93e)
(cherry picked from commit 5858d1c)
navneet1v pushed a commit that referenced this pull request Sep 4, 2024
Signed-off-by: Kunal Kotwani <kkotwani@amazon.com>
(cherry picked from commit d4af93e)
(cherry picked from commit 5858d1c)

Co-authored-by: Kunal Kotwani <kkotwani@amazon.com>
akashsha1 pushed a commit to akashsha1/k-NN that referenced this pull request Sep 16, 2024
Signed-off-by: Kunal Kotwani <kkotwani@amazon.com>
Signed-off-by: Akash Shankaran <akash.shankaran@intel.com>
Labels
backport 2.x, Enhancements, search-improvements, v2.17.0
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Investigate rearchitecture of the native memory circuit breaker
4 participants