Skip to content

Commit

Permalink
fix(signals): handle shutdown signals cleanly on failed startup (#44)
Browse files Browse the repository at this point in the history
* fix(signals): handle shutdown signals cleanly on failed startup

* add config property to replace magic number

* guard against null http which can occur in startup failure

* add Runtime shutdown hook to help ensure clean shutdown

* fix null value in log message

* fix formatting of comment

* move two config properties under 'webclient' subsection

* extract a magic constant to config

* fixup! move two config properties under 'webclient' subsection

* extract fixed constant to config property

* extract max upload timeout to config property

* DI object mapper

* correct reported default value

* extract list of handled signals to config property

* set safe ssl defaults

* update README

* use extract variable ref
  • Loading branch information
andrewazores authored Jan 30, 2023
1 parent 82f9df8 commit ff6c4cd
Show file tree
Hide file tree
Showing 8 changed files with 295 additions and 93 deletions.
11 changes: 8 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,16 +37,21 @@ and how it advertises itself to a Cryostat server instance. Required properties
- [ ] `cryostat.agent.hostname` [`String`]: the hostname for this application instance. This will be used for the published JMX connection URL. If not provided then the default is to attempt to resolve the localhost hostname.
- [ ] `cryostat.agent.realm` [`String`]: the Cryostat Discovery API "realm" that this agent belongs to. This should be unique per agent instance. The default includes the `cryostat.agent.app.name` and a random UUID.
- [ ] `cryostat.agent.authorization` [`String`]: Authorization header value to include with API requests to the Cryostat server, ex. `Bearer abcd1234`. Default `None`.
- [ ] `cryostat.agent.ssl.trust-all` [`boolean`]: Control whether the agent trusts all certificates presented by the Cryostat server. Default `true`.
- [ ] `cryostat.agent.ssl.verify-hostname` [`boolean`]: Control whether the agent verifies hostnames on certificates presented by the Cryostat server. Default `false`.
- [ ] `cryostat.agent.webclient.ssl.trust-all` [`boolean`]: Control whether the agent trusts all certificates presented by the Cryostat server. Default `false`. This should only be overridden for development and testing purposes, never in production.
- [ ] `cryostat.agent.webclient.ssl.verify-hostname` [`boolean`]: Control whether the agent verifies hostnames on certificates presented by the Cryostat server. Default `true`. This should only be overridden for development and testing purposes, never in production.
- [ ] `cryostat.agent.webclient.connect.timeout-ms` [`long`]: the duration in milliseconds to wait for HTTP requests to the Cryostat server to connect. Default `1000`.
- [ ] `cryostat.agent.webclient.response.timeout-ms` [`long`]: the duration in milliseconds to wait for HTTP requests to the Cryostat server to respond. Default `1000`.
- [ ] `cryostat.agent.webserver.host` [`String`]: the internal hostname or IP address for the embedded webserver to bind to. Default `0.0.0.0`.
- [ ] `cryostat.agent.webserver.port` [`int`]: the internal port number for the embedded webserver to bind to. Default `9977`.
- [ ] `cryostat.agent.app.name` [`String`]: a human-friendly name for this application. Default `cryostat-agent`.
- [ ] `cryostat.agent.app.jmx.port` [`int`]: the JMX RMI port that the application is listening on. The default is to attempt to determine this from the `com.sun.management.jmxremote.port` system property.
- [ ] `cryostat.agent.registration.retry-ms` [`long`]: the duration in milliseconds between attempts to register with the Cryostat server. Default `5000`.
- [ ] `cryostat.agent.exit.signals` [`[String]`]: a comma-separated list of signals that the agent should handle. When any of these signals is caught the agent initiates an orderly shutdown, deregistering from the Cryostat server and potentially uploading the latest harvested JFR data. Default `INT,TERM`.
- [ ] `cryostat.agent.exit.deregistration.timeout-ms` [`long`]: the duration in milliseconds to wait for a response from the Cryostat server when attempting to deregister at shutdown time . Default `3s`.
- [ ] `cryostat.agent.harvester.period-ms` [`long`]: the length of time between JFR collections and pushes by the harvester. This also controls the maximum age of data stored in the buffer for the harvester's managed Flight Recording. Every `period-ms` the harvester will upload a JFR binary file to the `cryostat.agent.baseuri` archives. Default `-1`, which indicates no harvesting will be performed.
- [ ] `cryostat.agent.harvester.template` [`String`]: the name of the `.jfc` event template configuration to use for the harvester's managed Flight Recording. Default `default`, the continuous monitoring event template.
- [ ] `cryostat.agent.harvester.max-files` [`String`]: the maximum number of pushed files that Cryostat will keep over the network from the agent. This is supplied to the harvester's push requests which instructs Cryostat to prune, in a FIFO manner, the oldest JFR files within the attached JVM target's storage, while the number of stored recordings is greater than this configuration's maximum file limit. Default `10`.
- [ ] `cryostat.agent.harvester.max-files` [`String`]: the maximum number of pushed files that Cryostat will keep over the network from the agent. This is supplied to the harvester's push requests which instructs Cryostat to prune, in a FIFO manner, the oldest JFR files within the attached JVM target's storage, while the number of stored recordings is greater than this configuration's maximum file limit. Default `2147483647` (`Integer.MAX_VALUE`).
- [ ] `cryostat.agent.harvester.upload.timeout-ms` [`long`]: the duration in milliseconds to wait for HTTP upload requests to the Cryostat server to complete and respond. Default `30000`.
- [ ] `cryostat.agent.harvester.exit.max-age-ms` [`long`]: the JFR `maxage` setting, specified in milliseconds, to apply to recording data uploaded to the Cryostat server when the JVM this Agent instance is attached to exits. This ensures that tail-end data is captured between the last periodic push and the application exit. Exit uploads only occur when the application receives `SIGINT`/`SIGTERM` from the operating system or container platform.
- [ ] `cryostat.agent.harvester.exit.max-size-b` [`long`]: the JFR `maxsize` setting, specified in bytes, to apply to exit uploads as described above.

Expand Down
227 changes: 167 additions & 60 deletions src/main/java/io/cryostat/agent/Agent.java
Original file line number Diff line number Diff line change
Expand Up @@ -37,11 +37,17 @@
*/
package io.cryostat.agent;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

import javax.inject.Named;
import javax.inject.Singleton;

import dagger.Component;
Expand All @@ -53,73 +59,86 @@
public class Agent {

private static Logger log = LoggerFactory.getLogger(Agent.class);
private static final AtomicBoolean needsCleanup = new AtomicBoolean(true);

public static void main(String[] args) {
final Client client = DaggerAgent_Client.builder().build();

List.of(new Signal("INT"), new Signal("TERM"))
.forEach(
signal -> {
SignalHandler oldHandler = Signal.handle(signal, s -> {});
SignalHandler handler =
s -> {
log.info("Caught SIG{}({})", s.getName(), s.getNumber());
try {
client.harvester().exitUpload().get();
} catch (InterruptedException | ExecutionException e) {
log.error("Exit upload failed", e);
} finally {
client.registration()
.deregister()
.orTimeout(1, TimeUnit.SECONDS)
.thenRunAsync(
() -> {
try {
log.info("Shutting down...");
client.webServer().stop();
client.registration().stop();
client.executor().shutdown();
} catch (Exception e) {
log.warn(
"Exception during"
+ " shutdown",
e);
} finally {
log.info("Shutdown complete");
oldHandler.handle(s);
}
},
client.executor());
}
};
Signal.handle(signal, handler);
});

AgentExitHandler agentExitHandler = null;
try {
client.registration()
.addRegistrationListener(
evt -> {
switch (evt.state) {
case REGISTERED:
client.harvester().start();
break;
case UNREGISTERED:
client.harvester().stop();
break;
case REFRESHED:
break;
default:
log.error("Unknown registration state: {}", evt.state);
break;
final Client client = DaggerAgent_Client.builder().build();
Registration registration = client.registration();
Harvester harvester = client.harvester();
WebServer webServer = client.webServer();
ExecutorService executor = client.executor();
List<String> exitSignals = client.exitSignals();
long exitDeregistrationTimeout = client.exitDeregistrationTimeout();

agentExitHandler =
installSignalHandlers(
exitSignals,
registration,
harvester,
webServer,
executor,
exitDeregistrationTimeout);
final AgentExitHandler fHandler = agentExitHandler;
Thread t =
new Thread(
() -> {
if (needsCleanup.getAndSet(false)) {
fHandler.performCleanup(null);
}
});
client.registration().start();
client.webServer().start();
t.setName("cryostat-agent-shutdown");
t.setDaemon(false);
Runtime.getRuntime().addShutdownHook(t);

registration.addRegistrationListener(
evt -> {
switch (evt.state) {
case REGISTERED:
harvester.start();
break;
case UNREGISTERED:
harvester.stop();
break;
case REFRESHED:
break;
default:
log.error("Unknown registration state: {}", evt.state);
break;
}
});
registration.start();
webServer.start();
log.info("Startup complete");
} catch (Exception e) {
log.error(Agent.class.getSimpleName() + " startup failure", e);
return;
if (agentExitHandler != null) {
agentExitHandler.reset();
}
}
}

private static AgentExitHandler installSignalHandlers(
List<String> exitSignals,
Registration registration,
Harvester harvester,
WebServer webServer,
ExecutorService executor,
long exitDeregistrationTimeout) {
AgentExitHandler agentExitHandler =
new AgentExitHandler(
registration, harvester, webServer, executor, exitDeregistrationTimeout);
for (String s : exitSignals) {
Signal signal = new Signal(s);
try {
SignalHandler oldHandler = Signal.handle(signal, agentExitHandler);
agentExitHandler.setOldHandler(signal, oldHandler);
} catch (IllegalArgumentException iae) {
log.warn("Unable to register signal handler for SIG" + s, iae);
}
}
log.info("Startup complete");
return agentExitHandler;
}

public static void agentmain(String args) {
Expand All @@ -130,7 +149,7 @@ public static void agentmain(String args) {
main(args == null ? new String[0] : args.split("\\s"));
});
t.setDaemon(true);
t.setName("cryostat-agent");
t.setName("cryostat-agent-main");
t.start();
}

Expand All @@ -149,9 +168,97 @@ interface Client {

ScheduledExecutorService executor();

@Named(ConfigModule.CRYOSTAT_AGENT_EXIT_SIGNALS)
List<String> exitSignals();

@Named(ConfigModule.CRYOSTAT_AGENT_EXIT_DEREGISTRATION_TIMEOUT_MS)
long exitDeregistrationTimeout();

@Component.Builder
interface Builder {
Client build();
}
}

private static class AgentExitHandler implements SignalHandler {

private static Logger log = LoggerFactory.getLogger(Agent.class);

private final Map<Signal, SignalHandler> oldHandlers = new HashMap<>();
private final Registration registration;
private final Harvester harvester;
private final WebServer webServer;
private final ExecutorService executor;
private final long exitDeregistrationTimeout;

private AgentExitHandler(
Registration registration,
Harvester harvester,
WebServer webServer,
ExecutorService executor,
long exitDeregistrationTimeout) {
this.registration = Objects.requireNonNull(registration);
this.harvester = Objects.requireNonNull(harvester);
this.webServer = Objects.requireNonNull(webServer);
this.executor = Objects.requireNonNull(executor);
this.exitDeregistrationTimeout = exitDeregistrationTimeout;
}

void setOldHandler(Signal signal, SignalHandler oldHandler) {
this.oldHandlers.put(signal, oldHandler);
}

@Override
public void handle(Signal sig) {
log.info("Caught SIG{}({})", sig.getName(), sig.getNumber());
if (needsCleanup.getAndSet(false)) {
performCleanup(sig);
}
}

void performCleanup(Signal sig) {
log.info("Performing cleanup...");
try {
harvester.exitUpload().get();
} catch (InterruptedException | ExecutionException e) {
log.error("Exit upload failed", e);
} finally {
registration
.deregister()
.orTimeout(exitDeregistrationTimeout, TimeUnit.MILLISECONDS)
.handleAsync(
(v, t) -> {
try {
log.info("Shutting down...");
safeCall(webServer::stop);
safeCall(registration::stop);
safeCall(executor::shutdown);
} finally {
log.info("Shutdown complete");
if (sig != null) {
// pass signal on to whatever would have handled it had
// this Agent not been installed, so host application
// has a chance to perform a graceful shutdown
oldHandlers.get(sig).handle(sig);
}
}
return null;
},
executor);
}
}

void reset() {
this.oldHandlers.forEach(Signal::handle);
this.oldHandlers.clear();
}

private void safeCall(Runnable r) {
try {
r.run();
} catch (Exception e) {
log.warn("Exception during shutdown", e);
}
}
}
}
Loading

0 comments on commit ff6c4cd

Please sign in to comment.