[Feature Proposal] Improve error reporting for task and query failures #11165

Open
@kfaraz

Motivation

Task and query failures in Druid are often difficult to analyze due to missing, incomplete or vague error messages.

A unified error reporting mechanism would improve the experience of a Druid user through:

  • Easier debugging and RCA (without looking at server logs)
  • Richer error messages detailing what went wrong and possible actions for mitigation
  • Homogeneous error reporting across different Druid services, modules and extensions
  • Specifying the severity of errors and other potential side effects
  • Hiding implementation and other sensitive details from the end user

Scope

The approach discussed here:

  • Aims to provide an extensible error reporting mechanism that both core Druid and extensions can use to report precise error messages
  • IS NOT a re-write of error reporting/handling and only proposes a cleaner method to provide a better end-user experience through the points listed above
  • Builds on existing error reporting mechanisms in Druid (e.g. Query execution failures).

Summary of Changes

Flow

  • Core Druid and extensions register their respective error types with the Overlord on startup (extensions that are not loaded on the Overlord are addressed later)
  • An in-memory mapping is maintained from (moduleName, code) pair to the respective ErrorType
  • The persisted TaskStatus of any failed task contains ErrorTypeParams rather than the full error message
  • When the status of a Task is requested, the ErrorTypeParams of the TaskStatus are used by the ErrorMessageFormatter to construct the full error message, which is then sent back in the API response

Advantages

An error-code-based approach has the following advantages:

  • Well-formed error codes that are recognized throughout the platform
  • Reference containing rich documentation for each error code, thus greatly simplifying debugging
  • Concise storage format (in the metadata store) for task failures, since only the error code and the message arguments need to be stored instead of a lengthy detailed error message. For instance, the current limit on the error message in a task failure is 100 characters, which greatly limits the amount of useful information that can be included in a task failure.
  • Less verbose logs
  • Easy updates to error messages as they would be maintained in a single place

Examples

Some typical examples:

Module Name: kafka-emitter

  • Error Type: Invalid topic name
  • Module Name + Error Code: kafka-emitter.topic.invalid
  • Message Format: The given topic name [%s] is invalid. Please provide a valid topic name.
  • Message Args: "test-topic"
  • Fully formed Error Message: The given topic name [test-topic] is invalid. Please provide a valid topic name.

Module Name: kafka-indexer

  • Error Type: Offset out of range
  • Module Name + Error Code: kafka-indexer.offset.outofrange
  • Message Format: The offset [%s] for topic [%s] is out of range. Please check your topic offsets. If the issue persists, consider a hard reset of the supervisor but this might cause loss or duplication of data.
  • Message Args: "13927608","daily_transactions"
  • Fully formed Error Message: The offset [13927608] for topic [daily_transactions] is out of range. Please check your topic offsets. If the issue persists, consider a hard reset of the supervisor but this might cause loss or duplication of data.

Module Name: druid-compaction

  • Error Type: Invalid Segment Spec
  • Module Name + Error Code: druid-compaction.segmentspec.invalid
  • Message Format: Compaction of the datasource [%s] for interval [%s] failed because the segments specified in the compaction spec are not the same as the segments currently in use. Some new segments have been published or some segments have been removed. Please consider increasing your "skipOffsetFromLatest".
  • Message Args: "daily_transactions","2021-01-01T00:00:00Z/2021-02-01T00:00:00Z"
  • Fully formed Error Message: Compaction of the datasource [daily_transactions] for interval [2021-01-01T00:00:00Z/2021-02-01T00:00:00Z] failed because the segments specified in the compaction spec are not the same as the segments currently in use. Some new segments have been published or some segments have been removed. Please consider increasing your "skipOffsetFromLatest".

New Classes

  • ErrorTypeProvider: Multi-bound interface to be implemented by core Druid as well as any extension that needs to register error types

    • String getModuleName(): Namespace denoting name of the extension (or "druid" in case of core Druid). Must be unique across extensions.
    • List<ErrorType> getErrorTypes(): List of error types for the extension
  • ErrorType: Denotes a specific category of an error

    • int code: Integer code denoting a specific type of error within the namespace. Must be unique within the module.
    • String messageFormat: Contains placeholders that can be replaced to get the full error message
    • additional details e.g. severity
  • ErrorTypeParams: Denotes the occurrence of an error. Contains params to identify and format the actual ErrorType

    • String moduleName
    • int code
    • List<String> messageArgs: total length of args is limited (current limit on TaskStatus.errorMsg is 100)
  • DruidTypedException: exception that corresponds to an error type

    • ErrorTypeParams errorTypeParams
    • Throwable cause: optional
  • ErrorMessageFormatter: (singleton) class that maintains an in-memory mapping from (moduleName, code) pair to ErrorType
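
The class descriptions above could be sketched in Java roughly as follows. This is a minimal sketch, not the proposed implementation: the factory methods mirror the ErrorType.of and ErrorTypeParams.of calls used elsewhere in this proposal, while the MAX_TOTAL_ARGS_LENGTH constant and the choice to reject (rather than truncate) oversized arguments are assumptions made purely for illustration. Additional details such as severity are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Category of an error; the code must be unique within its module.
final class ErrorType {
  private final int code;
  private final String messageFormat;

  private ErrorType(int code, String messageFormat) {
    this.code = code;
    this.messageFormat = messageFormat;
  }

  static ErrorType of(int code, String messageFormat) {
    return new ErrorType(code, messageFormat);
  }

  int getCode() { return code; }

  String getMessageFormat() { return messageFormat; }
}

// One occurrence of an error: identifies the ErrorType and carries the format arguments.
final class ErrorTypeParams {
  // Assumption: reuse the current 100-character limit of TaskStatus.errorMsg.
  static final int MAX_TOTAL_ARGS_LENGTH = 100;

  private final String moduleName;
  private final int code;
  private final List<String> messageArgs;

  private ErrorTypeParams(String moduleName, int code, List<String> messageArgs) {
    this.moduleName = moduleName;
    this.code = code;
    this.messageArgs = messageArgs;
  }

  static ErrorTypeParams of(String moduleName, int code, String... messageArgs) {
    int totalLength = 0;
    for (String arg : messageArgs) {
      totalLength += arg.length();
    }
    if (totalLength > MAX_TOTAL_ARGS_LENGTH) {
      // Assumption: reject oversized arguments; truncating them is another option.
      throw new IllegalArgumentException(
          "Message args exceed " + MAX_TOTAL_ARGS_LENGTH + " characters");
    }
    // Defensive, read-only copy of the arguments.
    List<String> copy =
        Collections.unmodifiableList(new ArrayList<>(Arrays.asList(messageArgs)));
    return new ErrorTypeParams(moduleName, code, copy);
  }

  String getModuleName() { return moduleName; }

  int getCode() { return code; }

  List<String> getMessageArgs() { return messageArgs; }
}

// Implemented by core Druid and by each extension that registers error types.
interface ErrorTypeProvider {
  String getModuleName();

  List<ErrorType> getErrorTypes();
}
```

Keeping ErrorTypeParams immutable and small is what makes it suitable for persisting in place of the full message; the message format itself lives only in the ErrorType.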

Code Snippets

Throwing an Exception

For example, an extension such as kafka-emitter could throw an exception as follows:

final String topicName = ...;
try {
	// ...
	// Publish to kafka topic here
	// ...
} catch (InvalidTopicException topicEx) {
	throw new DruidTypedException(
		ErrorTypeParams.of(
			KafkaEmitterErrorTypes.MODULE_NAME, // "kafka-emitter"
			KafkaEmitterErrorTypes.INVALID_TOPIC, // integer error code
			// message arguments
			topicName),
		topicEx
	);
}

TaskStatus class (modified)

public class TaskStatus {

	// Existing field. To be deprecated.
	private final @Nullable String errorMsg;

	// New field. Contains moduleName, errorCode and messageArgs.
	// A TaskStatus with this field as null would fall back
	// to the existing field errorMsg.
	private final @Nullable ErrorTypeParams errorTypeParams;

	// ...
}

Creating a TaskStatus for a failed task

	TaskStatus status = TaskStatus.failure(
		taskId,
		ErrorTypeParams.of(moduleName, errorCode, messageArgs)
	);

Handling exceptions in future callbacks for task execution

public class TaskQueue {

	// ...
	private ListenableFuture<TaskStatus> attachCallbacks(final Task task) {
		// ...
		// Future that completes when the task finishes (how it is obtained is elided here)
		final ListenableFuture<TaskStatus> statusFuture = ...;
		Futures.addCallback(statusFuture, new FutureCallback<TaskStatus>() {

			@Override
			public void onSuccess(final TaskStatus successStatus) {
				// persist the successStatus
			}

			@Override
			public void onFailure(final Throwable t) {
				final TaskStatus failureStatus;
				if (t instanceof DruidTypedException) {
					// typed exception: reuse its error code and message arguments
					failureStatus = TaskStatus.failure(task.getId(), ((DruidTypedException) t).getErrorTypeParams());
				} else {
					// build a generic error message here
					failureStatus = TaskStatus.failure(task.getId(), ...);
				}
				// persist the failureStatus
			}
		});
		return statusFuture;
	}

	// ...

}

Registering Error Types

Some of the snippets below use an extension kafka-emitter as an example.

Binding the ErrorTypeProvider

@Override
public void configure(Binder binder) {
	Multibinder.newSetBinder(binder, ErrorTypeProvider.class)
	    .addBinding().to(KafkaEmitterErrorTypeProvider.class);
}

Listing the error types

public class KafkaEmitterErrorTypeProvider implements ErrorTypeProvider {

	@Override
	public String getModuleName() {
		return KafkaEmitterErrorTypes.MODULE_NAME; // "kafka-emitter"
	}

	@Override
	public List<ErrorType> getErrorTypes() {
		// return the list of all error types for this extension
		return Arrays.asList(
			// ... other error types ...
			// example error type for invalid topic
			ErrorType.of(
				KafkaEmitterErrorTypes.INVALID_TOPIC,
				"The given topic name [%s] is invalid. Please provide a valid topic name.")
			// ... other error types ...
		);
	}

}

Mapping error codes to types (provided by core Druid):

public class ErrorMessageFormatter {

	private final Map<String, Map<Integer, ErrorType>> moduleToErrorTypes = ...;

	@Inject
	public ErrorMessageFormatter(
		Set<ErrorTypeProvider> errorTypeProviders) {

		for (ErrorTypeProvider provider : errorTypeProviders) {
			final String moduleName = provider.getModuleName();

			// Add all error types of this module to a map keyed by error code
			Map<Integer, ErrorType> errorTypeMap = new ConcurrentHashMap<>();
			for (ErrorType errorType : provider.getErrorTypes()) {
				errorTypeMap.put(errorType.getCode(), errorType);
			}

			// Register the module, ensuring that there are no module name clashes
			if (moduleToErrorTypes.putIfAbsent(moduleName, errorTypeMap) != null) {
				throw new IllegalStateException("Duplicate module name: " + moduleName);
			}
		}
	}
}

Building the full error message (to serve UI requests)

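public class ErrorMessageFormatter {

	...

	public String getErrorMessage(ErrorTypeParams errorParams) {
		// Look up the module first so that an unknown module does not cause a NullPointerException
		Map<Integer, ErrorType> errorTypes = moduleToErrorTypes
		    .get(errorParams.getModuleName());
		ErrorType errorType = errorTypes == null
		    ? null
		    : errorTypes.get(errorParams.getCode());

		if (errorType == null) {
			// Fallback for unregistered error types; the exact message is illustrative
			return "Unknown error: module [" + errorParams.getModuleName()
			    + "], code [" + errorParams.getCode() + "]";
		}

		return String.format(
			errorType.getMessageFormat(),
			errorParams.getMessageArgs().toArray());
	}

	...

}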

API Changes

  • The internal error codes and module names discussed above would not be exposed to the user or to any REST API that powers the UI.
  • Thus most user-facing APIs that include an errorMsg (or equivalent) field need not change. REST API clients can continue to use the same errorMsg fields as before, except that those error messages would now carry richer information.
  • Internal REST APIs (such as those between the Broker and Historical) would now have an additional field, errorInfo, which would contain the error code, module name, etc. The errorInfo field deprecates the existing errorMsg field (or equivalent). Thus, if errorInfo is non-null, it would be used to determine the full error message; otherwise the existing errorMsg field would be used.

Operational Impact

The changes would be backward compatible as none of the existing fields would be removed.

Case: Historical is upgraded to a new (error code aware) version but Broker is still on an older version
In such cases, the Broker would not be aware of error codes. Thus the Historical should send a non-null errorMsg along with a non-null errorInfo. This ensures that a Broker on an old version would be able to use the errorMsg whereas a Broker on a newer version would be able to use the errorInfo.
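
The fallback between the two fields could look roughly like the sketch below. The field names errorMsg and errorInfo come from this proposal; the ErrorInfo, ErrorResponse and ErrorResponseReader classes, and the idea of passing the formatter in as a function, are hypothetical and only serve to illustrate the compatibility rule.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical structured error details sent between services.
final class ErrorInfo {
  final String moduleName;
  final int code;
  final List<String> messageArgs;

  ErrorInfo(String moduleName, int code, List<String> messageArgs) {
    this.moduleName = moduleName;
    this.code = code;
    this.messageArgs = messageArgs;
  }
}

// Hypothetical error payload of an internal REST API.
final class ErrorResponse {
  final String errorMsg;     // existing field, always populated for old receivers
  final ErrorInfo errorInfo; // new field, null when sent by an older server

  ErrorResponse(String errorMsg, ErrorInfo errorInfo) {
    this.errorMsg = errorMsg;
    this.errorInfo = errorInfo;
  }
}

final class ErrorResponseReader {
  // Prefers errorInfo when present and resolvable, otherwise falls back to errorMsg.
  static String resolveMessage(ErrorResponse response, Function<ErrorInfo, String> formatter) {
    if (response.errorInfo != null) {
      String formatted = formatter.apply(response.errorInfo);
      if (formatted != null) {
        return formatted;
      }
    }
    return response.errorMsg;
  }
}
```

An upgraded sender would populate both fields, so an older receiver that only reads errorMsg keeps working unchanged.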

Design Concerns

Task Failures

Ingestion and compaction tasks are managed by the Overlord. Thus, the Overlord needs to be aware of the error types to be able to serve task statuses over REST APIs.

Query Failures

Queries (SQL and native) are submitted over HTTP connections and the response can contain the detailed error message in case of failures. Thus the Broker need not be aware of the list of error types as there is no persistence of query status (and hence no requirement of persisting error codes and formatting the error messages when requested).

Extensions that are not loaded on Overlord

There are several extensions in Druid which are not loaded on the Overlord and run only on the Middle Managers/Peons. As these are not loaded on the Overlord, it is not aware of the error types that these extensions can throw.

The approach here can be similar to that in Query Failures above. While communicating with the Overlord, the Middle Manager can send back both the ErrorType object (which denotes the category of the error) and the ErrorTypeParams (which denote a specific error event). The Overlord can then persist the received ErrorTypeParams in its task status while also adding an entry to its error type mappings.
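
A minimal sketch of how the Overlord might learn error types reported by Middle Managers is given below. The class name OverlordErrorTypeRegistry and its methods are hypothetical, and storing only the message format per code (instead of a full ErrorType object) is a simplification to keep the example self-contained.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical Overlord-side registry that learns error types reported by Middle Managers.
final class OverlordErrorTypeRegistry {
  // moduleName -> (code -> messageFormat)
  private final Map<String, Map<Integer, String>> moduleToFormats = new ConcurrentHashMap<>();

  // Called when a Middle Manager reports a failure together with its error type.
  // Returns true if the error type was not known before.
  boolean registerIfAbsent(String moduleName, int code, String messageFormat) {
    return moduleToFormats
        .computeIfAbsent(moduleName, key -> new ConcurrentHashMap<>())
        .putIfAbsent(code, messageFormat) == null;
  }

  // Builds the full message for a persisted (moduleName, code, args) triple,
  // or returns null if the error type has never been reported.
  String format(String moduleName, int code, List<String> messageArgs) {
    Map<Integer, String> formats = moduleToFormats.get(moduleName);
    String messageFormat = formats == null ? null : formats.get(code);
    return messageFormat == null
        ? null
        : String.format(messageFormat, messageArgs.toArray());
  }
}
```

One consequence of this approach is that mappings learned at runtime live only in memory, so unless they are also persisted, a restarted Overlord would not be able to format older task failures from such extensions until the same error type is reported again.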

Storing the mappings from Error Code to Error Type

In the design discussed above, the error types are maintained in memory (in the Overlord). If extensions register a large number of error codes for rare scenarios, the mapping would occupy memory that could otherwise be put to better use.

An alternative approach could be to persist the error types in the metadata store accessed via a small in-memory cache.

Pros:

  • Only the frequently occurring error types would be present in the warmed up cache.
  • Central repo for all error types that can be accessed by both Overlord and Coordinator

Cons:

  • De-duplication of module names and integer error codes would be more expensive
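
The alternative above could be sketched as a small LRU cache in front of the metadata store. In this sketch the metadata store is stood in for by a plain lookup function, and the class name CachedErrorTypeStore, the "module:code" key format and the cache size are all assumptions for illustration. The cache relies on the access-order mode of java.util.LinkedHashMap together with removeEldestEntry.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical read-through LRU cache of message formats backed by the metadata store.
final class CachedErrorTypeStore {
  private final Function<String, String> metadataStoreLookup;
  private final Map<String, String> cache;

  CachedErrorTypeStore(final int maxEntries, Function<String, String> metadataStoreLookup) {
    this.metadataStoreLookup = metadataStoreLookup;
    // accessOrder = true keeps the least recently used entry at the head of the map
    this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > maxEntries;
      }
    };
  }

  // Returns the message format for the given error type, loading it on a cache miss.
  // Returns null if the metadata store has no such error type.
  synchronized String getMessageFormat(String moduleName, int code) {
    String key = moduleName + ":" + code; // assumed key format
    String messageFormat = cache.get(key);
    if (messageFormat == null) {
      messageFormat = metadataStoreLookup.apply(key);
      if (messageFormat != null) {
        cache.put(key, messageFormat);
      }
    }
    return messageFormat;
  }

  synchronized int cachedEntries() {
    return cache.size();
  }
}
```

With a bounded cache like this, rarely seen error types cost a metadata store read when they occur but do not stay resident in memory.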
