Description
What is the bug?
When using the shrink action, you always receive the error "There are no available nodes to move to to execute a shrink. Delaying until node becomes available." This happens whether I run the action on a cluster with different dedicated node types or on a simple four-node cluster. All of the nodes have ample free disk space to perform a shrink and pass all of the prerequisite shrink checks described on this plugin's GitHub page.
How can one reproduce the bug?
Set up ISM and create a policy that includes "shrink" as one of its actions.
What is the expected behavior?
Indices (in this case, data streams) should be shrunk, but instead the action always fails with the error above.
What is your host/environment?
- OS: Linux (Ubuntu)
- Version: 2.1.0
- Index Management (ISM)
Do you have any screenshots?
No
Do you have any additional context?
I looked through the plugin code and am pretty sure I've found where the bug occurs. When the plugin determines whether nodes are eligible to perform a shrink, the code computes how much free "space" there is on each node, but this value is taken from the available RAM instead of the available disk space on the host. Details are as follows:
The below code should be using the fs stats instead of the OS stats to determine the free disk space.
https://github.com/opensearch-project/index-management/blob/main/src/main/kotlin/org/opensearch/indexmanagement/indexstatemanagement/util/StepUtils.kt#L169
https://github.com/opensearch-project/index-management/blob/main/src/main/kotlin/org/opensearch/indexmanagement/indexstatemanagement/util/StepUtils.kt#L170
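To illustrate why feeding RAM figures into this check could reject every node, here is a minimal, self-contained sketch. It is not the plugin's code: `freeBytesAfterShrink`, the 10% free-space threshold, and all of the sizes are assumptions chosen for the example. It applies the same "2 * index size plus high-watermark headroom" rule to memory-sized and disk-sized inputs.

```kotlin
// Hypothetical stand-in for the plugin's eligibility check.
// The function name, the 10% threshold, and the sizes below are assumptions.
const val GIB = 1024L * 1024L * 1024L

/** Returns the bytes left after reserving room for the shrink, or -1 if the node is too full. */
fun freeBytesAfterShrink(
    freeBytes: Long,
    totalBytes: Long,
    indexSizeBytes: Long,
    highWatermarkFreePercent: Int = 10, // assumed: node must keep 10% free
): Long {
    val thresholdBytes = totalBytes * highWatermarkFreePercent / 100
    val requiredBytes = 2 * indexSizeBytes + thresholdBytes
    return if (freeBytes > requiredBytes) freeBytes - requiredBytes else -1L
}

fun main() {
    val indexSize = 50 * GIB

    // Inputs taken from OS memory stats: 64 GiB of RAM, 8 GiB free.
    val fromRam = freeBytesAfterShrink(8 * GIB, 64 * GIB, indexSize)

    // Inputs taken from filesystem stats: 2000 GiB of disk, 1500 GiB free.
    val fromDisk = freeBytesAfterShrink(1500 * GIB, 2000 * GIB, indexSize)

    println("RAM-based result: $fromRam")                    // -1, node rejected
    println("Disk-based result: ${fromDisk / GIB} GiB free")  // 1200 GiB free
}
```

With these example numbers, a 50 GiB index can never fit into the free RAM of the node, so the memory-based check rejects the node even though the disk has plenty of room.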
I've made the necessary modifications to the StepUtils.kt, AttemptMoveShardsStep.kt, and AttemptShrinkStep.kt files, but for the life of me I can't get the tests to pass when doing a gradle build (the failures are below). I'm sure I'm missing something simple to push this over the finish line, but I'm not a Kotlin developer and would appreciate any assistance in getting this bug fixed. Below are the code changes I've made to the above files:
StepUtils.kt: Lines 166-179
// Use disk stats instead of OS RAM stats to determine if there is enough space.
fun getNodeFreeMemoryAfterShrink(node: NodeStats, indexSizeInBytes: Long, clusterSettings: ClusterSettings): Long {
    val fsStats = node.fs
    if (fsStats != null) {
        val diskSpaceLeftInNode = fsStats.total.free.bytes
        val totalNodeDisk = fsStats.total.total.bytes
        val freeBytesThresholdHigh = getFreeBytesThresholdHigh(clusterSettings, totalNodeDisk)
        // We require that a node has enough space to be below the high watermark disk level with an additional 2 * the index size free
        val requiredBytes = (2 * indexSizeInBytes) + freeBytesThresholdHigh
        if (diskSpaceLeftInNode > requiredBytes) {
            return diskSpaceLeftInNode - requiredBytes
        }
    }
    return -1L
}
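The function above returns -1 for any node that lacks enough room. As a rough sketch of how a caller might use that sentinel, the snippet below filters out ineligible nodes and picks the one with the most headroom. `pickShrinkTarget`, the node names, and the selection rule are hypothetical and are not taken from the plugin; the point is only that when every node reports -1, no target exists, which would be consistent with the "no available nodes" message.

```kotlin
// Hypothetical, simplified model of choosing a shrink target.
// The function name, node names, and selection rule are assumptions.
fun pickShrinkTarget(freeAfterShrinkByNode: Map<String, Long>): String? =
    freeAfterShrinkByNode
        .filterValues { it >= 0L }   // -1 marks a node without enough room
        .maxByOrNull { it.value }    // prefer the node with the most headroom
        ?.key

fun main() {
    // One node has room: it is selected.
    println(pickShrinkTarget(mapOf("node-a" to -1L, "node-b" to 1_000L, "node-c" to 5_000L)))
    // Every node reports -1: there is nothing to select.
    println(pickShrinkTarget(mapOf("node-a" to -1L, "node-b" to -1L)))
}
```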
AttemptMoveShardsStep.kt: Added the following line below line 396
const val FS_METRIC = "fs"
AttemptShrinkStep.kt: Line 64
// Include FS stats in the NodeStats object.
val nodesStatsReq = NodesStatsRequest().addMetrics(AttemptMoveShardsStep.OS_METRIC, AttemptMoveShardsStep.FS_METRIC)
Build errors when attempting to do a gradle build with the above code changes:
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT.test basic workflow number of shards" -Dtests.seed=F971E0BC5598C247 -Dtests.security.manager=false -Dtests.locale=be -Dtests.timezone=America/Manaus -Druntime.java=18
org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT > test basic workflow number of shards FAILED
java.lang.NullPointerException
at __randomizedtesting.SeedInfo.seed([F971E0BC5598C247:40C658EE50CB266E]:0)
at org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT$test basic workflow number of shards$2.invoke(ShrinkActionIT.kt:112)
at org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT$test basic workflow number of shards$2.invoke(ShrinkActionIT.kt:111)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
at org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT.test basic workflow number of shards(ShrinkActionIT.kt:111)
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT.test retries from first step" -Dtests.seed=F971E0BC5598C247 -Dtests.security.manager=false -Dtests.locale=be -Dtests.timezone=America/Manaus -Druntime.java=18
org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT > test retries from first step FAILED
java.lang.NullPointerException
at __randomizedtesting.SeedInfo.seed([F971E0BC5598C247:5EAE08ADD0943CE6]:0)
at org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT$test retries from first step$2.invoke(ShrinkActionIT.kt:623)
at org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT$test retries from first step$2.invoke(ShrinkActionIT.kt:622)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
at org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT.test retries from first step(ShrinkActionIT.kt:622)
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT.test basic workflow percentage to decrease to" -Dtests.seed=F971E0BC5598C247 -Dtests.security.manager=false -Dtests.locale=be -Dtests.timezone=America/Manaus -Druntime.java=18
org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT > test basic workflow percentage to decrease to FAILED
java.lang.NullPointerException
at __randomizedtesting.SeedInfo.seed([F971E0BC5598C247:9F1262B750C9F591]:0)
at org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT$test basic workflow percentage to decrease to$2.invoke(ShrinkActionIT.kt:293)
at org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT$test basic workflow percentage to decrease to$2.invoke(ShrinkActionIT.kt:292)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
at org.opensearch.indexmanagement.indexstatemanagement.action.ShrinkActionIT.test basic workflow percentage to decrease to(ShrinkActionIT.kt:292)