[BUG] Negative memory causes _cat APIs to fail #4474

Closed
anandpatel9998 opened this issue Sep 9, 2022 · 7 comments · Fixed by #6917
Labels: bug (Something isn't working), distributed framework

Comments

@anandpatel9998

Describe the bug
The _cat/indices API is failing with the error below:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Values less than -1 bytes are not supported: -9223372036853731881b"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Values less than -1 bytes are not supported: -9223372036853731881b"
  },
  "status" : 400
}

Found a similar issue in the Elasticsearch repo: elastic/elasticsearch#55434

To Reproduce
Don't have a clean way to reproduce this.

Expected behavior
_cat APIs should not fail with the error described.

@anandpatel9998 added the bug (Something isn't working) and untriaged labels on Sep 9, 2022
@dbwiddis
Member

The root cause is the calculation here:

public QueryCacheStats getStats(ShardId shard) {
    final Map<ShardId, QueryCacheStats> stats = new HashMap<>();
    for (Map.Entry<ShardId, Stats> entry : shardStats.entrySet()) {
        stats.put(entry.getKey(), entry.getValue().toQueryCacheStats());
    }
    QueryCacheStats shardStats = new QueryCacheStats();
    QueryCacheStats info = stats.get(shard);
    if (info == null) {
        info = new QueryCacheStats();
    }
    shardStats.add(info);
    // We also have some shared ram usage that we try to distribute to
    // proportionally to their number of cache entries of each shard
    long totalSize = 0;
    for (QueryCacheStats s : stats.values()) {
        totalSize += s.getCacheSize();
    }
    final double weight = totalSize == 0 ? 1d / stats.size() : ((double) shardStats.getCacheSize()) / totalSize;
    final long additionalRamBytesUsed = Math.round(weight * sharedRamBytesUsed);
    shardStats.add(new QueryCacheStats(additionalRamBytesUsed, 0, 0, 0, 0));
    return shardStats;
}

In the edge case where shardStats is an empty map, the stats map is also empty (stats.size() == 0). However, this case also leads to a zero value for totalSize, so the weight becomes 1d / 0, i.e. positive infinity, which rounds to the maximum long value for additionalRamBytesUsed for the shard.
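To illustrate the arithmetic, here is a minimal standalone sketch (not OpenSearch code; the variable values are made up): dividing a double by a zero count yields positive infinity, and Math.round clamps an infinite product to Long.MAX_VALUE.

class EmptyStatsWeightDemo {
    public static void main(String[] args) {
        int statsSize = 0;             // stats.size() when the shardStats map is empty
        long sharedRamBytesUsed = 100; // any positive shared RAM value shows the problem

        double weight = 1d / statsSize;                                        // Infinity
        long additionalRamBytesUsed = Math.round(weight * sharedRamBytesUsed);

        System.out.println(weight);                 // Infinity
        System.out.println(additionalRamBytesUsed); // 9223372036854775807 (Long.MAX_VALUE)
    }
}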

Then later, when calculating total memory, that maximum long value (line 509) is added to the other memory fields, and the long addition overflows to the negative byte count seen in the error.

public ByteSizeValue getTotalMemory() {
    long size = 0;
    if (this.getFieldData() != null) {
        size += this.getFieldData().getMemorySizeInBytes();
    }
    if (this.getQueryCache() != null) {
        size += this.getQueryCache().getMemorySizeInBytes();
    }
    if (this.getSegments() != null) {
        size += this.getSegments().getIndexWriterMemoryInBytes() + this.getSegments().getVersionMapMemoryInBytes();
    }
    return new ByteSizeValue(size);
}
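For illustration, a standalone sketch of that overflow (the second operand here is chosen so the sum matches the value in the error message; the shard's actual fielddata/segment memory usage is unknown):

class TotalMemoryOverflowDemo {
    public static void main(String[] args) {
        long queryCacheBytes = Long.MAX_VALUE; // the broken additionalRamBytesUsed stat
        long otherMemoryBytes = 1_043_928L;    // hypothetical fielddata + segment memory

        long size = queryCacheBytes + otherMemoryBytes; // long addition silently wraps around
        System.out.println(size);                       // -9223372036853731881, the negative "byte" count from the error
    }
}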

Based on the comment about distributing shared ram usage, I suspect that the value of additionalRamBytesUsed should be zero in the case the stats map is empty.
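A minimal sketch of that idea, replacing the weight calculation at the end of getStats() shown above (this only spells out the suspicion; it is not necessarily the fix that landed in #6917):

final long additionalRamBytesUsed;
if (stats.isEmpty()) {
    // Nothing is cached for any shard, so there is no shared RAM to distribute.
    additionalRamBytesUsed = 0L;
} else {
    final double weight = totalSize == 0 ? 1d / stats.size() : ((double) shardStats.getCacheSize()) / totalSize;
    additionalRamBytesUsed = Math.round(weight * sharedRamBytesUsed);
}
shardStats.add(new QueryCacheStats(additionalRamBytesUsed, 0, 0, 0, 0));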

@dblock
Member

dblock commented Sep 13, 2022

Good debugging @dbwiddis :) let’s see a fix?

@dbwiddis
Member

let’s see a fix?

Alas, while I can backtrace call hierarchies like a pro, I'm not at all familiar with this section of the code. Happy to submit a PR if someone can tell me the best resolution. Always 0? Max out the weight as 1? Some other solution? Is the empty map an indication of a deeper bug elsewhere?

@reta
Collaborator

reta commented Sep 16, 2022

@dbwiddis I think your last note about a deeper bug elsewhere is likely the cause. From what I can tell, in this particular flow there are shards present (hence the call to getStats(ShardId shard)), yet from the perspective of the shardStats cache it looks as if there are no shards at all (it is empty).

@dbwiddis
Member

While fixing this I believe I isolated the root cause:

// This cache stores two things: filters, and doc id sets. Calling
// clear only removes the doc id sets, but if we reach the situation
// that the cache does not contain any DocIdSet anymore, then it
// probably means that the user wanted to remove everything.
if (cache.getCacheSize() == 0) {
    cache.clear();
}

We are in a situation where all documents have been removed from the shard, but the cache still retains cached filters from previous queries. This comment (or a variant of it) appears in clearIndex() and close(), which both clear the cache; but if a caller uses remove() and brings the cache to an empty state, some memory remains assigned. To release that memory and clear the (now empty) cache, one of those two methods has to be called.
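A hedged sketch of that idea, using a hypothetical hook name around the quoted guard (the real IndicesQueryCache wiring differs; only cache.getCacheSize() and cache.clear() are taken from the snippet above): if the same emptiness check also ran after a removal, the retained filter memory would be released without waiting for clearIndex() or close().

// Hypothetical removal hook, for illustration only.
void afterDocIdSetRemoval() {
    if (cache.getCacheSize() == 0) {
        // Same rule as in clearIndex()/close(): an empty doc id set cache means the
        // cached filters (and the shared RAM they account for) should be dropped too.
        cache.clear();
    }
}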

@dbwiddis
Member

dbwiddis commented Apr 2, 2023

To Reproduce
Don't have a clean way to reproduce this.

A reproducing test case already existed in o.o.i.IndicesQueryCacheTests.testBasics(), but total memory was not being asserted on. It has a value of Long.MAX_VALUE after the last portion of the test (after onClose(), before close()).
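A hedged sketch of the kind of assertion that would have caught it at that point in the test (the variable names are assumptions about testBasics(), not quotes from it):

// After the onClose() step but before close(): the per-shard query cache stats
// should never report Long.MAX_VALUE of memory, which later overflows into a negative total.
QueryCacheStats stats = cache.getStats(shard);
assertTrue(
    "query cache memory should stay sane",
    stats.getMemorySizeInBytes() >= 0 && stats.getMemorySizeInBytes() < Long.MAX_VALUE
);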

@mhoffm-aiven

mhoffm-aiven commented Aug 3, 2023

Just encountered this on a cluster running three OpenSearch 1.3.11 nodes and one 1.3.8 node (in the middle of a failover to the newer version); clearing caches fixed the error.
