Skip to content

Add automatic crash detection and restarts#257

Closed
michael2847 wants to merge 3 commits intoMinecraftServerControl:mainfrom
michael2847:main
Closed

Add automatic crash detection and restarts#257
michael2847 wants to merge 3 commits intoMinecraftServerControl:mainfrom
michael2847:main

Conversation

@michael2847
Copy link
Member

Automatic crash detection and restarts

; Enable the option to restart the server after a crash is detected.
; 0 - Restart after crash (default).
; 1 - Do not restart after crash.
# mscs-enable-restart-after-crash=0

This is a preliminary PR for #225 that enables the option to restart worlds after a crash is detected using the mscs-enable-restart-after-crash flag (via mscs.defaults). I set the default to be 0 (true), but we can change the default to be 1 (false) if you think that's better. Additionally, we could add the functionality for it to be available on a per world basis via mscs.properties if you think that's a good idea as well.

How it works:

  1. When a world is started, the startServerMonitor() function is called. This checks if the mscs-enable-restart-after-crash flag is enabled. If so, it checks if there is another instance of the server monitor already running (which would occur if the server was restarted after a crash). To do this check, it uses flock. I tried for several hours to use the createLockFile() function, but it didn't work for me for whatever reason--not sure if it was because I was doing something wrong or if it was just an issue with the function itself. @sandain If you're able to get the createLockFile() function working instead of using flock, that'd be good so we don't have to use flock as a dependency.

  2. If the monitor is not running, the startServerMonitor() function essentially calls the nohup command on the serverMonitor() function and runs it in the background, so you can close the terminal and the crash monitor will keep running. nohup doesn't allow you to pass functions, but I managed to recreate the functionality through some research. The PID of the monitor is stored in $WORLD_DIR/monitor.pid.

  3. The serverMonitor() function is essentially a loop that detects if the server has crashed by a combination of the a) the existence of the $WORLD.pid file and b) whether the server is running via the serverRunning() function. If the PID file exists but the server isn't running, then this means the server probably crashed (since MSCS hasn't removed the PID file).
    Additionally, I had to modify the serverRunning() function to use echo instead of return to return values so I could call the function in the test condition of the until loop and check the return value using $() (I didn't think I could do this the way the function was, but maybe I'm wrong). I subsequently modified all of the references in the script to call serverRunning using $().
    There is also a log for the monitor written to $WORLD_DIR/logs/mscs.monitor.log. This log is deleted after $LOG_DURATION to make sure it doesn't get too large. This check occurs in the startServerMonitor() function. There is also a log created called last-start-status.log that stores a message to be displayed later in the mscs status command.

  4. When the mscs stop command is called, it sends a request to kill the server monitor using the stopServerMonitor() function. At this point the monitor should be killed.

When a user runs the mscs status for the first time after a server has crashed, they will see a notice that the world has crashed:

$ mscs status
Minecraft Server Status:
  alex: running version 1.16.2 (0 of 20 users online).
    Port: 25567.
    Memory used: 1.04GB (2GB allocated).
    Server PID: 3351.
    Crash Monitor PID: 19569
    alex automatically restarted from a crash (or in-game stop command)
    on 2020-08-15_22-49-13. See
    /opt/mscs/worlds/alex/logs/mscs.monitor.log and
    /opt/mscs/worlds/alex/crash_reports/ and
    /opt/mscs/worlds/alex/logs/ for more information.

This notice goes away after the first time. The code behind this is simply a check if whether or not the last-start-status.log file exists and is not empty. If this is true, the message is displayed and the file is deleted afterwards (so the message is only displayed once).

P.S.: I tested crashes on the code by doing the /stop command in-game, I believe it should replicate what happens with an actual crash.

Please let me know of any questions or recommendations you have or bugs you find. Thanks!!

Summary of TODO:

  • Remove flock dependency
  • Test more
  • Question: add functionality to enable it on a per-world basis?
  • Question: make the default setting true or false?

Add automatic crash detection and restarts. Need to remove flock dependency
update wording
@zanix
Copy link
Member

zanix commented Aug 16, 2020

0 should be disabled and 1 should be enabled

@sandain
Copy link
Member

sandain commented Sep 29, 2020

I agree with @zanix, 0 should be disabled and 1 enabled. It would be confusing otherwise. A per world option in addition to the global option would be best. Default should be disabled I think.

Sorry about the slow review. I've been busy.

echo 0
else
return 1
echo 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this change is necessary. It forces a lot of other changes that make the code less readable.

E.g.

-  while serverRunning $1; do
+  while [ $(serverRunning $1) -eq 0 ]; do

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually 0 means false, and 1 means true. UNIX land returns 0 when a command successfully ran and a value >0 for error states. This return trick was used here to make the code more readable, but the values are inverted from what you mean when changed to echo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying this. It's confusing because, as you said, in UNIX land 0 is returned when the command ran successfully. In my head, I associate success with being "TRUE", so it makes me think it should return 1 based on the 0=false and 1=true convention, but it doesn't.... But anyway, I think I understand now, thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add comments to the serverRunning method to note this UNIX oddity.

serverRunning() {
  # Try to determine if the world is running.
  if [ $(getJavaPID "$1") -gt 0 ]; then
-    return 0
+    return 0  # Return a true value.
  else
-    return 1
+    return 1  # Return a false value.
  fi
}

Copy link
Member Author

@michael2847 michael2847 Dec 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sandain do you know why if [ serverRunning $1 ] doesn't work? it only works when i omit the brackets and I can't figure out how to do it if I want to negate the expression

Copy link
Member

@sandain sandain Dec 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need spaces next to the square brackets.

if [ serverRunning $1 ]; then

to negate:

if [ ! serverRunning $1 ]; then

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although, I think the brackets are optional.

if ! serverRunning $1; then


printf "[$(timestamp)] [INFO]: Server monitoring started for $1. Server PID: $(getJavaPID $1). Monitor PID: $MONITOR_PID.\n"
# Run monitor until the server is stopped and the PID file is removed (i.e. clean shutdown).
until [ $(serverRunning $1) -eq 1 ] && [ ! -f "$WORLDS_LOCATION/$1.pid" ]; do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line would become:

until [ ! serverRunning $1 ] && [ ! -f "$WORLDS_LOCATION/$1.pid" ]; do

Which I think is more readable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, agreed

@michael2847
Copy link
Member Author

I'm going to close this PR and re-open one that is up-to-date with the current code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants