[CASSANDRA-20457] Limit number of OOM heap dumps to avoid full disk issue #4166


Open: gowa wants to merge 2 commits into trunk from CASSANDRA-20457-trunk

Conversation

@gowa commented May 19, 2025:

New env vars are introduced:

CASSANDRA_HEAPDUMP_CLEAN
  1 - clean (default)
  any other value - do not clean

CASSANDRA_HEAPDUMP_KEEP_LAST_N_FILES
  -1 - no heap dump files will be removed
  0 - all heap dump files will be removed
  N > 0 - up to N newest files will be kept
  default: 2

CASSANDRA_HEAPDUMP_KEEP_FIRST_N_FILES
  -1 - no heap dump files will be removed
  0 - all heap dump files will be removed
  N > 0 - up to N oldest files will be kept
  default: 1
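
For illustration, here is a minimal sketch of the semantics these variables describe (an approximation only; the actual implementation is the clean_heap_dump_files function that the patch adds to cassandra-env.sh):

# Sketch only: list dumps newest-first, keep the newest KEEP_LAST_N and the
# oldest KEEP_FIRST_N, and remove everything in between; -1 disables removal.
clean_heap_dump_files_sketch()
{
  [ "$CASSANDRA_HEAPDUMP_CLEAN" = "1" ] || return 0
  keep_new="${CASSANDRA_HEAPDUMP_KEEP_LAST_N_FILES:-2}"
  keep_old="${CASSANDRA_HEAPDUMP_KEEP_FIRST_N_FILES:-1}"
  [ "$keep_new" = "-1" ] && return 0
  [ "$keep_old" = "-1" ] && return 0
  (
    cd "$CASSANDRA_HEAPDUMP_DIR" || exit 1
    # ls -pt1: newest first, one entry per line, directories marked with "/",
    # so the grep pattern only matches regular heap dump files.
    ls -pt1 | grep '^cassandra-.*-pid.*\.hprof$' | \
      awk -v new="$keep_new" -v old="$keep_old" \
        '{ f[NR] = $0 } END { for (i = new + 1; i <= NR - old; i++) print f[i] }' | \
      while IFS= read -r name; do rm -f "$name"; done
  )
}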

Test script:

#!/bin/sh

export CASSANDRA_HOME="."
export CASSANDRA_HEAPDUMP_DIR=test_heapdumps

# Create 1000 matching dump files plus three decoy entries that cleanup must
# leave alone: a file with a wrong suffix, a file with a wrong prefix, and a
# directory whose name matches the .hprof pattern.
create_hprof_files()
{
  rm -rf test_heapdumps
  mkdir -p test_heapdumps
  if [ ! -d test_heapdumps ]; then
      echo "was not able to create dir test_heapdumps"
  fi

  for i in $(seq 1 1000); do echo 1 > "test_heapdumps/cassandra-$i-pid$i.hprof"; sleep 0.005; done;
  echo 1 > "test_heapdumps/cassandra-1-pid1.hprofX"
  echo 1 > "test_heapdumps/Xcassandra-1-pid1.hprof"
  mkdir -p "test_heapdumps/cassandra-1001-pid1001.hprof"
  if [ ! -d "test_heapdumps/cassandra-1001-pid1001.hprof" ]; then
      echo "was not able to create dir test_heapdumps/cassandra-1001-pid1001.hprof"
  fi
  if ! ls -pt1 test_heapdumps | wc -l | awk '{if ($0 != 1003) exit 1}'; then
    echo "wrong content created under test_heapdumps"
  fi
}

# Test 1: defaults (keep 2 newest + 1 oldest); 3 dumps plus the 3 decoy entries remain.
(
  create_hprof_files
  . ./cassandra-env.sh
  clean_heap_dump_files
  if ! ls -pt1 test_heapdumps | wc -l | awk '{if ($0 != 3+3) { print $0; exit 1; }}'; then
    echo "ERROR: 3+3 files/dirs should have left under test_heapdumps"
  fi
  ls -pt1 test_heapdumps
)

# Test 2: CASSANDRA_HEAPDUMP_CLEAN=0 disables cleanup; all 1003 entries remain.
(
  create_hprof_files
  CASSANDRA_HEAPDUMP_CLEAN=0
  . ./cassandra-env.sh
  clean_heap_dump_files
  if ! ls -pt1 test_heapdumps | wc -l | awk '{if ($0 != 1003) { print $0; exit 1; }}'; then
    echo "ERROR: 1003 files/dirs should have left under test_heapdumps"
  fi
  ls -pt1 test_heapdumps
)

# Test 3: keep 3 newest + 3 oldest; 6 dumps plus the 3 decoy entries remain.
(
  create_hprof_files
  CASSANDRA_HEAPDUMP_CLEAN=1
  CASSANDRA_HEAPDUMP_KEEP_LAST_N_FILES=3
  CASSANDRA_HEAPDUMP_KEEP_FIRST_N_FILES=3
  . ./cassandra-env.sh
  clean_heap_dump_files
  if ! ls -pt1 test_heapdumps | wc -l | awk '{if ($0 != 3+6) { print $0; exit 1; }}'; then
    echo "ERROR: 3+6 files/dirs should have left under test_heapdumps"
  fi
  ls -pt1 test_heapdumps
)

# Test 4: in a fresh subshell where cassandra-env.sh was never sourced, the
# guard variable must stay unset, so the cleanup function is never called.
(
  if [ "x$clean_heap_dump_files_defined" = "x1" ]; then
    echo "error: calling clean_heap_dump_files, not expected"
    clean_heap_dump_files
  fi
)

rm -rf test_heapdumps
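
Run the script from the directory containing cassandra-env.sh: it sets CASSANDRA_HOME to the current directory and sources ./cassandra-env.sh, so the cleanup function under test must be resolvable from there.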

The commit message for the force-pushed revision adds one note: currently only 100 heap dump files are searched for, so setting CASSANDRA_HEAPDUMP_KEEP_LAST_N_FILES or CASSANDRA_HEAPDUMP_KEEP_FIRST_N_FILES above 100 has no effect.
@gowa force-pushed the CASSANDRA-20457-trunk branch from 056692e to af5b7d3 on May 20, 2025
@gowa changed the title from "WIP [CASSANDRA-20457] Limit number of OOM heap dumps to avoid full disk issue" to "[CASSANDRA-20457] Limit number of OOM heap dumps to avoid full disk issue" on May 21, 2025
find "$CASSANDRA_HEAPDUMP_DIR" -name "cassandra-*-pid*.hprof" -type f | \
head -n 100 | \
xargs ls -t1 2>/dev/null | \
awk "BEGIN{ f=0; }{ files[f]=\$0; f+=1; }END{
@smiklosovic (Contributor) commented:
I believe there is a simpler way to do this. You can use tail -n and head -n or similar.

@gowa (Author) replied:
Hi @smiklosovic! Thanks for your review; I will address your comments.

I decided to rely on awk because it is already used in this script, and because the head/tail solution is not portable: at least ChatGPT reported that head's negative line counts cannot be used on macOS, and the switch on uname earlier in the script suggests portability matters here. So I went with awk, using only its basics.
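
To illustrate the portability concern (an aside, not code from the patch): GNU head accepts a negative count to drop the trailing N lines, but BSD/macOS head does not, while a plain awk filter works everywhere:

# GNU coreutils extension: print all but the last 2 lines; fails on BSD/macOS head.
ls -t1 | head -n -2

# Portable alternative: buffer in awk and print all but the last 2 lines.
ls -t1 | awk '{ f[NR] = $0 } END { for (i = 1; i <= NR - 2; i++) print f[i] }'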

@gowa (Author) added:
Kept awk; removed find, xargs, and head -n. Now using ls -pt1 as suggested.

@smiklosovic (Contributor) commented on Jun 2, 2025:
If we simplify this patch to just one property as suggested above, that is, "keep the newest N files and remove the rest", isn't it true that awk will no longer be necessary and we can manage with head -n N, where N is the number of newest files to keep? head -n is portable.
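
A sketch of that single-property variant (hypothetical; tail -n +K, which is POSIX, selects everything past the newest N entries that head -n N would keep):

# Hypothetical one-property cleanup: keep the newest N dumps, remove the rest.
# ls -pt1 lists newest first and marks directories with a trailing "/", so the
# grep pattern skips them; tail -n +K prints from line K onward.
keep_n="${CASSANDRA_HEAPDUMP_KEEP_LAST_N_FILES:-2}"
(
  cd "$CASSANDRA_HEAPDUMP_DIR" || exit 1
  ls -pt1 | grep '^cassandra-.*-pid.*\.hprof$' | tail -n +"$((keep_n + 1))" | \
    while IFS= read -r name; do rm -f "$name"; done
)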

CASSANDRA_HEAPDUMP_KEEP_LAST_N_FILES=2
fi
if [ "x$CASSANDRA_HEAPDUMP_KEEP_FIRST_N_FILES" = "x" ]; then
CASSANDRA_HEAPDUMP_KEEP_FIRST_N_FILES=1
@smiklosovic (Contributor) commented on Jun 2, 2025:
@gowa Why do we even have this property? "Keep the newest N files" makes sense to me; why would we want to keep the oldest N files? Under what circumstances would an operator want to keep the oldest around and remove the newest? Don't you think that keeping just the latest one (or a few latest ones) makes more sense? This complicates the logic unnecessarily.

I would also rename the properties to have "oldest" / "newest" in them. I am not sure "last" and "first" are understood correctly here: does "last" mean newest or oldest? It would be good to know how the properties work without consulting the documentation above.

@gowa (Author) replied:
@smiklosovic, in my experience it was often important to see and analyze the first occurrence of an issue: you know the time, so you can think about what changed in the process and led to the OOM, while reoccurrences often happen 'automatically'. It is a heuristic anyway; I suggested this property based on my gut feeling that having the very first heap dump in a series would be beneficial.

It can be removed, and awk can be eliminated along with it. I suggested it as an option; if it does not look right, I will drop it.

Agreed about last/first.

@smiklosovic (Contributor) commented:
@netudima, have you thought about having two properties, to keep both the newest and the oldest files?

@@ -188,6 +188,10 @@ launch_service()
cassandra_parms="$cassandra_parms -Dcassandra-pidfile=$pidpath"
fi

if [ "x$clean_heap_dump_files_defined" = "x1" ]; then
@smiklosovic (Contributor) commented on Jun 2, 2025:
@gowa why don't you just remove the x from both sides? I have never seen this construct. I would also rename the variable to should_clean_heap_dump_files instead of using the _defined suffix.
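
For context, the leading x is a defensive idiom from pre-POSIX shells, where an empty operand or one beginning with a dash could be misparsed by test; in any POSIX shell, quoting both operands is sufficient:

# Old defensive idiom: "x" guarantees a non-empty operand even if the
# variable is unset or its value starts with "-".
if [ "x$clean_heap_dump_files_defined" = "x1" ]; then echo clean; fi

# POSIX-safe equivalent: quoting alone is enough.
if [ "$clean_heap_dump_files_defined" = "1" ]; then echo clean; fi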
