Description
Motivation
Task and query failures in Druid are often difficult to analyze due to missing, incomplete or vague error messages.
A unified error reporting mechanism would improve the experience of a Druid user through:
- Easier debugging and RCA (without looking at server logs)
- Richer error messages detailing what went wrong and possible actions for mitigation
- Homogeneous error reporting across different Druid services, modules and extensions
- Specifying the severity of errors and other potential side effects
- Hiding implementation and other sensitive details from the end user
Scope
The approach discussed here:
- Aims to provide an extensible error reporting mechanism that both core Druid and extensions can use to report precise error messages
- IS NOT a re-write of error reporting/handling and only proposes a cleaner method to provide a better end-user experience through the points listed above
- Builds on existing error reporting mechanisms in Druid (e.g. Query execution failures).
Summary of Changes
Flow
- Core Druid and extensions register their respective error types with the Overlord on startup (extensions that are not loaded on the Overlord are addressed later)
- An in-memory mapping is maintained from each `(moduleName, code)` pair to the respective `ErrorType`
- The persisted `TaskStatus` of any failed task contains `ErrorTypeParams` rather than the full error message
- When the status of a Task is requested, the `ErrorTypeParams` of the `TaskStatus` are used by the `ErrorMessageFormatter` to construct the full error message, which is then sent back in the API response
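The steps above can be sketched end to end as follows. This is a minimal illustration and not the proposed Druid implementation: `FlowSketch`, its `register` and `format` methods, and the integer code `1` are hypothetical names and values chosen for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the registry and formatter described above.
class FlowSketch {
  // moduleName -> (code -> messageFormat); plays the role of the in-memory mapping
  private final Map<String, Map<Integer, String>> registry = new HashMap<>();

  // Step 1: a module registers an error type on startup
  void register(String moduleName, int code, String messageFormat) {
    registry.computeIfAbsent(moduleName, k -> new HashMap<>()).put(code, messageFormat);
  }

  // Step 3: build the full message from the persisted (moduleName, code, args)
  String format(String moduleName, int code, List<String> args) {
    String messageFormat = registry.get(moduleName).get(code);
    return String.format(messageFormat, args.toArray());
  }

  public static void main(String[] args) {
    FlowSketch overlord = new FlowSketch();
    overlord.register("kafka-emitter", 1,
        "The given topic name [%s] is invalid. Please provide a valid topic name.");
    // Step 2: only the module name, code and args would be persisted with the task status
    List<String> persistedArgs = Arrays.asList("test-topic");
    System.out.println(overlord.format("kafka-emitter", 1, persistedArgs));
  }
}
```

The point of the sketch is that the stored record stays small (a module name, a code and a few arguments), while the full text is produced only when a status is requested.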
Advantages
An error-code-based approach has the following advantages:
- Well-formed error codes that are recognized throughout the platform
- Reference containing rich documentation for each error code, thus greatly simplifying debugging
- Concise storage format (in metadata store) for task failures, as we would be required to only store the error code and the message arguments instead of a lengthy detailed error message. For instance, the current limit on the error message in a task failure is 100 characters. This greatly limits the amount of useful information that can be included in a task failure.
- Less verbose logs
- Easy updates to error messages as they would be maintained in a single place
Examples
Some typical examples:
Module Name | Error Type | Module Name + Error Code | Message Format | Message Args | Fully formed Error Message |
---|---|---|---|---|---|
kafka-emitter | Invalid topic name | kafka-emitter.topic.invalid | The given topic name [%s] is invalid. Please provide a valid topic name. | "test-topic" | The given topic name [test-topic] is invalid. Please provide a valid topic name. |
kafka-indexer | Offset out of range | kafka-indexer.offset.outofrange | The offset [%s] for topic [%s] is out of range. Please check your topic offsets. If the issue persists, consider a hard reset of the supervisor but this might cause loss or duplication of data. | "13927608", "daily_transactions" | The offset [13927608] for topic [daily_transactions] is out of range. Please check your topic offsets. If the issue persists, consider a hard reset of the supervisor but this might cause loss or duplication of data. |
druid-compaction | Invalid Segment Spec | druid-compaction.segmentspec.invalid | Compaction of the datasource [%s] for interval [%s] failed because the segments specified in the compaction spec are not the same as the segments currently in use. Some new segments have been published or some segments have been removed. Please consider increasing your "skipOffsetFromLatest". | "daily_transactions", "2021-01-01T00:00:00Z/2021-02-01T00:00:00Z" | Compaction of the datasource [daily_transactions] for interval [2021-01-01T00:00:00Z/2021-02-01T00:00:00Z] failed because the segments specified in the compaction spec are not the same as the segments currently in use. Some new segments have been published or some segments have been removed. Please consider increasing your "skipOffsetFromLatest". |
New Classes
- `ErrorTypeProvider`: Multi-bound interface to be implemented by core Druid as well as any extension that needs to register error types
  - `String getModuleName()`: Namespace denoting the name of the extension (or `"druid"` in case of core Druid). Must be unique across extensions.
  - `List<ErrorType> getErrorTypes()`: List of error types for the extension
- `ErrorType`: Denotes a specific category of error
  - `int code`: Integer code denoting a specific type of error within the namespace. Must be unique within the module.
  - `String messageFormat`: Contains placeholders that can be replaced to get the full error message
  - additional details e.g. severity
- `ErrorTypeParams`: Denotes the occurrence of an error. Contains params to identify and format the actual `ErrorType`
  - `String moduleName`
  - `int code`
  - `List<String> messageArgs`: total length of args is limited (current limit on `TaskStatus.errorMsg` is 100)
- `DruidTypedException`: exception that corresponds to an error type
  - `ErrorTypeParams errorTypeParams`
  - `Throwable cause`: optional
- `ErrorMessageFormatter`: (singleton) class that maintains an in-memory mapping from each `(moduleName, code)` pair to its `ErrorType`
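As an illustration of the limit on `messageArgs`, the following sketch shows one way `ErrorTypeParams` could bound the total argument length. The class name `ErrorTypeParamsSketch`, the constant `MAX_TOTAL_ARGS_LENGTH` and the truncation policy are assumptions made for this example; the proposal only states that the total length is limited.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of ErrorTypeParams; the truncation policy below is an assumption.
final class ErrorTypeParamsSketch {
  // Assumed to match the current 100-character limit on TaskStatus.errorMsg
  static final int MAX_TOTAL_ARGS_LENGTH = 100;

  final String moduleName;
  final int code;
  final List<String> messageArgs;

  private ErrorTypeParamsSketch(String moduleName, int code, List<String> messageArgs) {
    this.moduleName = moduleName;
    this.code = code;
    this.messageArgs = Collections.unmodifiableList(messageArgs);
  }

  // Keeps every arg in order, truncating once the total length budget is used up
  static ErrorTypeParamsSketch of(String moduleName, int code, String... args) {
    List<String> bounded = new ArrayList<>();
    int remaining = MAX_TOTAL_ARGS_LENGTH;
    for (String arg : args) {
      String kept = arg.length() <= remaining ? arg : arg.substring(0, remaining);
      bounded.add(kept);
      remaining -= kept.length();
    }
    return new ErrorTypeParamsSketch(moduleName, code, bounded);
  }
}
```

Truncating instead of dropping arguments keeps the argument count equal to the number of placeholders in the message format, so formatting would still succeed for over-long inputs.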
Code Snippets
Throwing an Exception
For example, in an extension such as `kafka-emitter`, an exception could be thrown as below:
final String topicName = ...;
try {
  // ...
  // Publish to kafka topic here
  // ...
} catch (InvalidTopicException topicEx) {
  throw new DruidTypedException(
      ErrorTypeParams.of(
          KafkaEmitterErrorTypes.MODULE_NAME,   // "kafka-emitter"
          KafkaEmitterErrorTypes.INVALID_TOPIC, // integer error code
          // message arguments
          topicName),
      topicEx
  );
}
TaskStatus class (modified)
public class TaskStatus {
  // Existing field. To be deprecated.
  private final @Nullable String errorMsg;
  // New field. Contains moduleName, errorCode and messageArgs.
  // A TaskStatus with this field as null would fall back
  // to the existing field errorMsg
  private final @Nullable ErrorTypeParams errorTypeParams;
  // ...
}
Creating a TaskStatus for a failed task
TaskStatus status = TaskStatus.failure(
    taskId,
    ErrorTypeParams.of(moduleName, errorCode, messageArgs)
);
Handling exceptions in future callbacks for task execution
public class TaskQueue {
  // ...
  private ListenableFuture<TaskStatus> attachCallbacks(final Task task) {
    // ...
    // statusFuture is the future of the running task, obtained in the elided code above
    Futures.addCallback(statusFuture, new FutureCallback<TaskStatus>() {
      @Override
      public void onSuccess(final TaskStatus successStatus) {
        // persist the successStatus
      }
      @Override
      public void onFailure(final Throwable t) {
        TaskStatus failureStatus;
        if (t instanceof DruidTypedException) {
          failureStatus = TaskStatus.failure(task.getId(), ((DruidTypedException) t).getErrorTypeParams());
        } else {
          // build a generic error message here
          failureStatus = TaskStatus.failure(task.getId(), ...);
        }
        // persist the failureStatus
      }
    });
    return statusFuture;
  }
  // ...
}
Registering Error Types
Some of the snippets below use an extension kafka-emitter as an example.
Binding the ErrorTypeProvider
@Override
public void configure(Binder binder) {
  Multibinder.newSetBinder(binder, ErrorTypeProvider.class)
             .addBinding().to(KafkaEmitterErrorTypeProvider.class);
}
Listing the error types
public class KafkaEmitterErrorTypeProvider implements ErrorTypeProvider {
  @Override
  public String getModuleName() {
    return KafkaEmitterErrorTypes.MODULE_NAME; // "kafka-emitter"
  }
  @Override
  public List<ErrorType> getErrorTypes() {
    // return the list of all error types for this extension
    return Arrays.asList(
        ...
        // example error type for invalid topic
        ErrorType.of(
            KafkaEmitterErrorTypes.INVALID_TOPIC,
            "The given topic name [%s] is invalid. Please provide a valid topic name.")
        ...
    );
  }
}
Mapping error codes to types (provided by core Druid):
public class ErrorMessageFormatter {
  private final Map<String, Map<Integer, ErrorType>> moduleToErrorTypes = ...;
  @Inject
  public ErrorMessageFormatter(
      Set<ErrorTypeProvider> errorTypeProviders) {
    for (ErrorTypeProvider provider : errorTypeProviders) {
      final String moduleName = provider.getModuleName();
      // Ensure that there are no module name clashes
      if (moduleToErrorTypes.containsKey(moduleName)) {
        throw new IllegalStateException("Duplicate module name: " + moduleName);
      }
      // Add all error types of this module to the map
      Map<Integer, ErrorType> errorTypeMap = new ConcurrentHashMap<>();
      for (ErrorType errorType : provider.getErrorTypes()) {
        errorTypeMap.put(errorType.getCode(), errorType);
      }
      moduleToErrorTypes.put(moduleName, errorTypeMap);
    }
  }
}
Building the full error message (to serve UI requests)
public class ErrorMessageFormatter {
  ...
  public String getErrorMessage(ErrorTypeParams errorParams) {
    ErrorType errorType = moduleToErrorTypes
        .get(errorParams.getModuleName())
        .get(errorParams.getCode());
    return String.format(
        errorType.getMessageFormat(),
        errorParams.getMessageArgs().toArray());
  }
  ...
}
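The lookup above would fail with a `NullPointerException` if the module name or code has not been registered, for example when an extension is not loaded on the Overlord. A null-safe variant might look like the sketch below. `SafeLookupSketch`, the plain `String` message formats used in place of `ErrorType`, and the wording of the fallback message are all assumptions made for this example.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper showing a null-safe variant of the lookup.
class SafeLookupSketch {
  static String getErrorMessage(
      Map<String, Map<Integer, String>> moduleToFormats,
      String moduleName,
      int code,
      List<String> messageArgs
  ) {
    String format = moduleToFormats
        .getOrDefault(moduleName, Collections.emptyMap())
        .get(code);
    if (format == null) {
      // Assumed fallback: surface the raw identifiers instead of failing
      return "Unknown error [" + moduleName + ":" + code + "] with args " + messageArgs;
    }
    return String.format(format, messageArgs.toArray());
  }
}
```

With a fallback of this kind, a task status request would still return something actionable even when the error type is missing from the mapping.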
API Changes
- The internal error codes and module names discussed above would not be exposed to the User or any REST API that powers the UI.
- Thus most user-facing APIs that include an `errorMsg` (or equivalent) field need not change. REST API clients can continue to use the same `errorMsg` fields as before, except that those error messages would now carry richer information.
- Internal REST APIs (such as those between the Broker and Historical) would now have an additional field, `errorInfo`, which would contain the error code, module name, etc. The field `errorInfo` deprecates the existing field `errorMsg` (or equivalent). Thus, if `errorInfo` is non-null, it would be used to determine the full error message; otherwise the existing `errorMsg` field would be used.
Operational Impact
The changes would be backward compatible as none of the existing fields would be removed.
Case: Historical is upgraded to a new (error code aware) version but Broker is still on an older version
In such cases, the Broker would not be aware of error codes. Thus the Historical should send a non-null `errorMsg` along with a non-null `errorInfo`. This ensures that a Broker on an old version would be able to use the `errorMsg`, whereas a Broker on a newer version would be able to use the `errorInfo`.
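On the receiving side, the rule described above could be sketched as follows. `ErrorResponseSketch` and `resolveMessage` are hypothetical names, and `errorInfo` is modelled as a plain `String` only to keep the example short; in the proposal it would carry the error code, module name and related details.

```java
import java.util.function.Function;

// Hypothetical shape of an internal error response carrying both fields.
class ErrorResponseSketch {
  final String errorMsg;  // legacy field, understood by older services
  final String errorInfo; // stand-in for the structured error payload; may be null

  ErrorResponseSketch(String errorMsg, String errorInfo) {
    this.errorMsg = errorMsg;
    this.errorInfo = errorInfo;
  }

  // Newer services prefer errorInfo; payloads without it fall back to errorMsg
  String resolveMessage(Function<String, String> formatter) {
    return errorInfo != null ? formatter.apply(errorInfo) : errorMsg;
  }
}
```

An older Broker would simply ignore the unknown `errorInfo` field and read `errorMsg`, which is why the sender populates both during a rolling upgrade.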
Design Concerns
Task Failures
Ingestion and compaction tasks are managed by the Overlord. Thus, the Overlord needs to be aware of the error types to be able to serve task statuses over REST APIs.
Query Failures
Queries (SQL and native) are submitted over HTTP connections and the response can contain the detailed error message in case of failures. Thus the Broker need not be aware of the list of error types as there is no persistence of query status (and hence no requirement of persisting error codes and formatting the error messages when requested).
Extensions that are not loaded on Overlord
There are several extensions in Druid which are not loaded on the Overlord and run only on the Middle Managers/Peons. As these are not loaded on the Overlord, it is not aware of the error types that these extensions can throw.
The approach here can be similar to that in Query Failures above. While communicating with the Overlord, the Middle Manager can send back both the `ErrorType` object (denotes the category of the error) and the `ErrorTypeParams` (denotes a specific error event). The Overlord can then persist the received `ErrorTypeParams` in its task status while also adding an entry to its error type mappings.
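A minimal sketch of how the Overlord might add such entries on the fly is shown below. The class `LearnedErrorTypesSketch`, its method names and the "first registration wins" policy are assumptions for illustration; message formats are stored as plain strings in place of `ErrorType` objects.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical Overlord-side registry that learns error types from task reports.
class LearnedErrorTypesSketch {
  private final Map<String, Map<Integer, String>> moduleToFormats = new ConcurrentHashMap<>();

  // Called when a Middle Manager reports a failure along with its error type.
  // Returns true if the type was new to this registry.
  boolean registerIfAbsent(String moduleName, int code, String messageFormat) {
    Map<Integer, String> formats =
        moduleToFormats.computeIfAbsent(moduleName, k -> new ConcurrentHashMap<>());
    return formats.putIfAbsent(code, messageFormat) == null;
  }

  // Returns the registered message format, or null if the type is unknown
  String formatFor(String moduleName, int code) {
    Map<Integer, String> formats = moduleToFormats.get(moduleName);
    return formats == null ? null : formats.get(code);
  }
}
```

Because such entries live only in memory, they would be lost on an Overlord restart unless the error types are also persisted, which ties into the storage question discussed next.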
Storing the mappings from Error Code to Error Type
In the design discussed above, the error types are maintained in memory (in the Overlord). If extensions register many error codes for rare scenarios, the mapping would occupy memory that could otherwise be put to better use.
An alternative approach could be to persist the error types in the metadata store and access them via a small in-memory cache.
Pros:
- Only the frequently occurring error types would be present in the warmed up cache.
- Central repo for all error types that can be accessed by both Overlord and Coordinator
Cons:
- De-duplication of module names and integer error codes would be more expensive
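The cached alternative could be sketched with a small LRU cache in front of the store. Everything here is hypothetical: `ErrorTypeCacheSketch`, the `"moduleName:code"` key format, and the `Function` standing in for a metadata store read are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical LRU cache in front of a slower error type store.
class ErrorTypeCacheSketch {
  private final Map<String, String> cache;
  private final Function<String, String> store; // stand-in for a metadata store read
  int storeReads = 0; // exposed only to make the caching behaviour visible

  ErrorTypeCacheSketch(final int maxEntries, Function<String, String> store) {
    this.store = store;
    // accessOrder=true keeps the least recently used entry first in iteration order
    this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > maxEntries;
      }
    };
  }

  // key is "moduleName:code"; misses are loaded from the store and then cached
  synchronized String getMessageFormat(String key) {
    String cached = cache.get(key);
    if (cached != null) {
      return cached;
    }
    storeReads++;
    String loaded = store.apply(key);
    if (loaded != null) {
      cache.put(key, loaded);
    }
    return loaded;
  }
}
```

Only frequently requested error types stay resident, at the cost of a store read for each cold lookup.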