[Feature Proposal] Improve error reporting for task and query failures #11165

## Motivation

Task and query failures in Druid are often difficult to analyze due to missing, incomplete or vague error messages.

A unified error reporting mechanism would improve the experience of a Druid user through:
- Easier debugging and RCA (without looking at server logs)
- Richer error messages detailing what went wrong and possible actions for mitigation
- Homogeneous error reporting across different Druid services, modules and extensions
- Specifying the severity of errors and other potential side effects
- Hiding implementation and other sensitive details from the end user

## Scope
The approach discussed here:
- Aims to provide an extensible error reporting mechanism that both core Druid and extensions can use to report precise error messages
- IS NOT a re-write of error reporting/handling and only proposes a cleaner method to provide a better end-user experience through the points listed above
- Builds on existing error reporting mechanisms in Druid (e.g. [Query execution failures](http://druid.apache.org/docs/latest/querying/querying.html#query-execution-failures)).

## Summary of Changes

### Flow

- Core Druid and extensions register their respective error types with the Overlord on startup (extensions that are not loaded on the Overlord are addressed under Design Concerns)
- An in-memory mapping is maintained from `(moduleName, code)` pair to the respective `ErrorType`
- The persisted `TaskStatus` of any failed task contains `ErrorTypeParams` rather than the full error message
- When the status of a Task is requested, the `ErrorTypeParams` of the `TaskStatus` are used by the `ErrorMessageFormatter` to construct the full error message, which is then sent back in the API response
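
The four steps above can be sketched end to end as follows. This is a minimal illustration, not the proposed implementation: `FlowSketch`, its nested `Params` class, the in-memory map standing in for the metadata store, and the integer code `1` for the invalid-topic error are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins for the classes proposed in this issue.
final class FlowSketch {

    // moduleName -> (code -> message format), populated on startup.
    private final Map<String, Map<Integer, String>> registry = new HashMap<>();

    // Stand-in for the metadata store: taskId -> persisted error params.
    private final Map<String, Params> persistedFailures = new HashMap<>();

    // What is persisted for a failed task: identifiers and args, no message text.
    static final class Params {
        final String moduleName;
        final int code;
        final List<String> messageArgs;

        Params(String moduleName, int code, List<String> messageArgs) {
            this.moduleName = moduleName;
            this.code = code;
            this.messageArgs = messageArgs;
        }
    }

    // Steps 1 and 2: a module registers an error type, which lands in the in-memory mapping.
    void register(String moduleName, int code, String messageFormat) {
        registry.computeIfAbsent(moduleName, m -> new HashMap<>()).put(code, messageFormat);
    }

    // Step 3: a task failure is persisted as params only.
    void recordFailure(String taskId, Params params) {
        persistedFailures.put(taskId, params);
    }

    // Step 4: the full message is built only when the status is requested.
    String getErrorMessage(String taskId) {
        Params p = persistedFailures.get(taskId);
        String format = registry.get(p.moduleName).get(p.code);
        return String.format(format, p.messageArgs.toArray());
    }

    public static void main(String[] args) {
        FlowSketch overlord = new FlowSketch();
        overlord.register(
            "kafka-emitter", 1,
            "The given topic name [%s] is invalid. Please provide a valid topic name.");
        overlord.recordFailure(
            "task-1", new Params("kafka-emitter", 1, Arrays.asList("test-topic")));
        System.out.println(overlord.getErrorMessage("task-1"));
    }
}
```

The sketch assumes the error type is always registered before a status is requested; unknown modules or codes are not handled here.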

### Advantages

An error-code-based approach has the following advantages:
- Well-formed error codes that are recognized throughout the platform
- Reference containing rich documentation for each error code, thus greatly simplifying debugging
- Concise storage format (in metadata store) for task failures, as we would be required to only store the error code and the message arguments instead of a lengthy detailed error message. For instance, the current limit on the error message in a task failure is 100 characters. This greatly limits the amount of useful information that can be included in a task failure.
- Less verbose logs
- Easy updates to error messages as they would be maintained in a single place

### Examples

Some typical examples are shown below:

Module Name  | Error Type | Module Name + Error Code | Message Format | Message Args | Fully formed Error Message
-------------|------------|------------|----------------|--------------|----------------------------
kafka-emitter | Invalid topic name | kafka-emitter.topic.invalid  | The given topic name [%s] is invalid. Please provide a valid topic name. | `"test-topic"` | The given topic name [test-topic] is invalid. Please provide a valid topic name.
kafka-indexer | Offset out of range | kafka-indexer.offset.outofrange | The offset [%s] for topic [%s] is out of range. Please check your topic offsets. If the issue persists, consider a hard reset of the supervisor but this might cause loss or duplication of data. | `"13927608"`,`"daily_transactions"` | The offset [13927608] for topic [daily_transactions] is out of range. Please check your topic offsets. If the issue persists, consider a hard reset of the supervisor but this might cause loss or duplication of data.
druid-compaction | Invalid Segment Spec | druid-compaction.segmentspec.invalid | Compaction of the datasource [%s] for interval [%s] failed because the segments specified in the compaction spec are not the same as the segments currently in use. Some new segments have been published or some segments have been removed. Please consider increasing your "skipOffsetFromLatest". | `"daily_transactions"`,`"2021-01-01T00:00:00Z/2021-02-01T00:00:00Z"` | Compaction of the datasource [daily_transactions] for interval [2021-01-01T00:00:00Z/2021-02-01T00:00:00Z] failed because the segments specified in the compaction spec are not the same as the segments currently in use. Some new segments have been published or some segments have been removed. Please consider increasing your "skipOffsetFromLatest".
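
The step from "Message Format" plus "Message Args" to "Fully formed Error Message" in the table is plain `String.format` substitution. The sketch below uses an abbreviated version of the `kafka-indexer` format; the class and method names are hypothetical. It also shows that supplying fewer args than placeholders fails with a `MissingFormatArgumentException` rather than producing a partial message.

```java
import java.util.MissingFormatArgumentException;

final class MessageFormatSketch {

    // Abbreviated form of the kafka-indexer message format from the table.
    static final String OFFSET_OUT_OF_RANGE_FORMAT =
        "The offset [%s] for topic [%s] is out of range.";

    // Substitutes the message args into the %s placeholders, in order.
    static String format(String messageFormat, String... messageArgs) {
        return String.format(messageFormat, (Object[]) messageArgs);
    }

    public static void main(String[] args) {
        System.out.println(
            format(OFFSET_OUT_OF_RANGE_FORMAT, "13927608", "daily_transactions"));

        try {
            // One argument short: String.format throws instead of leaving a placeholder.
            format(OFFSET_OUT_OF_RANGE_FORMAT, "13927608");
        } catch (MissingFormatArgumentException e) {
            System.out.println("Too few message args: " + e.getMessage());
        }
    }
}
```

Because the format string is interpreted by `java.util.Formatter`, a literal percent sign in a message format would need to be written as `%%`; percent signs inside the args themselves are not interpreted.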


### New Classes

- `ErrorTypeProvider`: Multi-bound interface to be implemented by core Druid as well as any extension that needs to register error types
  - `String getModuleName()`: Namespace denoting name of the extension (or `"druid"` in case of core Druid). Must be unique across extensions.
  - `List<ErrorType> getErrorTypes()`: List of error types for the extension

- `ErrorType`: Denotes a specific category of an error
  - `int code`: Integer code denoting a specific type of error within the namespace. Must be unique within the module.
  - `String messageFormat`: Contains placeholders that can be replaced to get the full error message 
  - additional details e.g. severity

- `ErrorTypeParams`: Denotes the occurrence of an error. Contains params to identify and format the actual `ErrorType` 
  - `String moduleName`
  - `int code`
  - `List<String> messageArgs`: total length of args is limited (current limit on `TaskStatus.errorMsg` is 100)

- `DruidTypedException`: exception that corresponds to an error type
  - `ErrorTypeParams errorTypeParams`
  - `Throwable cause`: optional

- `ErrorMessageFormatter`: (singleton) class that maintains an in-memory mapping from `(moduleName, code)` pair to `ErrorType`
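
The proposal limits the total length of `messageArgs` but does not say how the limit is enforced. One possible policy, shown purely as an assumption, is to truncate args once a shared budget is exhausted while keeping the number of args unchanged so that the placeholders in the message format still line up. `MessageArgsLimiter` and its constant are hypothetical names, and the sketch assumes non-null args.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MessageArgsLimiter {

    // Assumed budget, borrowed from the current 100-character limit on TaskStatus.errorMsg.
    static final int MAX_TOTAL_ARGS_LENGTH = 100;

    // Returns a copy of the args whose combined length fits within the budget.
    // Earlier args are kept whole where possible; later ones are cut or emptied.
    // The list size never changes, so every %s placeholder still has an argument.
    static List<String> limit(List<String> messageArgs, int maxTotalLength) {
        List<String> limited = new ArrayList<>(messageArgs.size());
        int remaining = maxTotalLength;
        for (String arg : messageArgs) {
            String kept = arg.length() <= remaining ? arg : arg.substring(0, remaining);
            limited.add(kept);
            remaining -= kept.length();
        }
        return limited;
    }

    public static void main(String[] args) {
        List<String> original =
            Arrays.asList("daily_transactions", "2021-01-01T00:00:00Z/2021-02-01T00:00:00Z");
        System.out.println(limit(original, MAX_TOTAL_ARGS_LENGTH));
        System.out.println(limit(original, 30));
    }
}
```

Rejecting oversized args at construction time would be an equally reasonable policy; truncation is chosen here only because it never turns an error report into a second failure.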

## Code Snippets

### Throwing an Exception

For example, in an extension such as `kafka-emitter`, an exception could be thrown as follows:
```java
final String topicName = ...;
try {
	// ...
	// Publish to kafka topic here
	// ...
} catch (InvalidTopicException topicEx) {
	throw new DruidTypedException(
		ErrorTypeParams.of(
			KafkaEmitterErrorTypes.MODULE_NAME, // "kafka-emitter"
			KafkaEmitterErrorTypes.INVALID_TOPIC, // integer error code
			// message arguments
			topicName),
		topicEx
	);
}
```

### TaskStatus class (modified)

```java
public class TaskStatus {

	// Existing field, to be deprecated.
	private final @Nullable String errorMsg;

	// New field. Contains moduleName, errorCode and messageArgs.
	// A TaskStatus with this field as null would fall back
	// to the existing field errorMsg.
	private final @Nullable ErrorTypeParams errorTypeParams;

	// ...
}
```

### Creating a TaskStatus for a failed task

```java
	TaskStatus status = TaskStatus.failure(
		taskId,
		ErrorTypeParams.of(moduleName, errorCode, messageArgs)
	);
```

### Handling exceptions in future callbacks for task execution

```java
public class TaskQueue {

	// ...
	private ListenableFuture<TaskStatus> attachCallbacks(final Task task) {
		// ...
		// statusFuture (illustrative name): the ListenableFuture<TaskStatus>
		// of the running task, obtained in the elided code above
		Futures.addCallback(statusFuture, new FutureCallback<TaskStatus>() {

			@Override
			public void onSuccess(final TaskStatus successStatus) {
				// persist the successStatus
			}

			@Override
			public void onFailure(final Throwable t) {
				TaskStatus failureStatus;
				if (t instanceof DruidTypedException) {
					failureStatus = TaskStatus.failure(task.getId(), ((DruidTypedException) t).getErrorTypeParams());
				} else {
					// build a generic error message here
					failureStatus = TaskStatus.failure(task.getId(), ...);
				}
				// persist the failureStatus
			}
		});
		return statusFuture;
	}

	// ...

}
```

### Registering Error Types

Some of the snippets below use the `kafka-emitter` extension as an example.

Binding the `ErrorTypeProvider`:
```java
@Override
public void configure(Binder binder) {
	Multibinder.newSetBinder(binder, ErrorTypeProvider.class)
	    .addBinding().to(KafkaEmitterErrorTypeProvider.class);
}
```

Listing the error types:
```java
public class KafkaEmitterErrorTypeProvider implements ErrorTypeProvider {

	@Override
	public String getModuleName() {
		return KafkaEmitterErrorTypes.MODULE_NAME; // "kafka-emitter"
	}

	@Override
	public List<ErrorType> getErrorTypes() {
		// return the list of all error types for this extension
		return Arrays.asList(
			// ... other error types
			// example error type for invalid topic
			ErrorType.of(
				KafkaEmitterErrorTypes.INVALID_TOPIC,
				"The given topic name [%s] is invalid. Please provide a valid topic name.")
			// ... other error types
		);
	}

}
```

Mapping error codes to types (provided by core Druid):
```java
public class ErrorMessageFormatter {

	private final Map<String, Map<Integer, ErrorType>> moduleToErrorTypes = ...;

	@Inject
	public ErrorMessageFormatter(Set<ErrorTypeProvider> errorTypeProviders) {
		for (ErrorTypeProvider provider : errorTypeProviders) {
			// Ensure that there are no module name clashes
			// (the exception type used here is illustrative)
			final String moduleName = provider.getModuleName();
			if (moduleToErrorTypes.containsKey(moduleName)) {
				throw new IllegalStateException("Duplicate module name: " + moduleName);
			}

			// Add all error types of this module to the map
			Map<Integer, ErrorType> errorTypeMap = new ConcurrentHashMap<>();
			for (ErrorType errorType : provider.getErrorTypes()) {
				errorTypeMap.put(errorType.getCode(), errorType);
			}
			moduleToErrorTypes.put(moduleName, errorTypeMap);
		}
	}
}
```

### Building the full error message (to serve UI requests)

```java
public class ErrorMessageFormatter {

	// ...

	public String getErrorMessage(ErrorTypeParams errorParams) {
		// Look up the error type registered for this (moduleName, code) pair
		ErrorType errorType = moduleToErrorTypes
		    .get(errorParams.getModuleName())
		    .get(errorParams.getCode());

		// Substitute the message args into the placeholders of the format
		return String.format(
			errorType.getMessageFormat(),
			errorParams.getMessageArgs().toArray());
	}

	// ...

}
```
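
The lookup above assumes that the module and code are always registered and that the number of args matches the format. A more defensive variant might fall back to a generic message instead of throwing. The following is only a sketch under that assumption: `SafeErrorMessageFormatter`, the simplified `register` method, and the wording of the fallback text are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.IllegalFormatException;
import java.util.List;
import java.util.Map;

final class SafeErrorMessageFormatter {

    // moduleName -> (code -> message format)
    private final Map<String, Map<Integer, String>> moduleToFormats = new HashMap<>();

    void register(String moduleName, int code, String messageFormat) {
        moduleToFormats.computeIfAbsent(moduleName, m -> new HashMap<>()).put(code, messageFormat);
    }

    String getErrorMessage(String moduleName, int code, List<String> messageArgs) {
        Map<Integer, String> formats = moduleToFormats.get(moduleName);
        String format = formats == null ? null : formats.get(code);
        if (format == null) {
            // Unknown module or code: report what is known instead of throwing.
            return fallback(moduleName, code, messageArgs);
        }
        try {
            return String.format(format, messageArgs.toArray());
        } catch (IllegalFormatException e) {
            // For example, fewer args than placeholders.
            return fallback(moduleName, code, messageArgs);
        }
    }

    private static String fallback(String moduleName, int code, List<String> messageArgs) {
        return "Error [" + moduleName + ":" + code + "] with arguments " + messageArgs;
    }

    public static void main(String[] args) {
        SafeErrorMessageFormatter formatter = new SafeErrorMessageFormatter();
        formatter.register("kafka-emitter", 1, "The given topic name [%s] is invalid.");
        System.out.println(formatter.getErrorMessage("kafka-emitter", 1, Arrays.asList("test-topic")));
        System.out.println(formatter.getErrorMessage("unknown-module", 7, Arrays.asList("x")));
        System.out.println(formatter.getErrorMessage("kafka-emitter", 1, Collections.<String>emptyList()));
    }
}
```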

## API Changes
- The internal error codes and module names discussed above would not be exposed to the user or to any REST API that powers the UI.
- Thus, most user-facing APIs that include an `errorMsg` (or equivalent) field need not change. REST API clients can continue to use the same `errorMsg` fields as before, except that those error messages would now carry richer information.
- Internal REST APIs (such as those between the Broker and Historical) would have an additional field, `errorInfo`, containing the error code, module name, etc. The field `errorInfo` deprecates the existing field `errorMsg` (or equivalent). Thus, if `errorInfo` is non-null, it would be used to determine the full error message; otherwise the existing `errorMsg` field would be used.

## Operational Impact
The changes would be backward compatible as none of the existing fields would be removed.

Case: The Historical is upgraded to a new (error-code-aware) version but the Broker is still on an older version.

In such cases, the Broker would not be aware of error codes. Thus the Historical should send a non-null `errorMsg` along with a non-null `errorInfo`. This ensures that a Broker on an old version can use the `errorMsg`, whereas a Broker on a newer version can use the `errorInfo`.
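
The fallback rule can be sketched as below. The payload shape is an assumption: `ErrorResponse`, `ErrorInfo`, and both `resolve` methods are hypothetical, and the formatter is passed in as a plain function rather than the proposed `ErrorMessageFormatter`.

```java
import java.util.function.Function;

final class ErrorFieldFallbackSketch {

    // Hypothetical contents of the new errorInfo field.
    static final class ErrorInfo {
        final String moduleName;
        final int code;

        ErrorInfo(String moduleName, int code) {
            this.moduleName = moduleName;
            this.code = code;
        }
    }

    // Hypothetical error payload exchanged between services.
    static final class ErrorResponse {
        final String errorMsg;      // existing field, still populated by new senders
        final ErrorInfo errorInfo;  // new field, null when sent by an older service

        ErrorResponse(String errorMsg, ErrorInfo errorInfo) {
            this.errorMsg = errorMsg;
            this.errorInfo = errorInfo;
        }
    }

    // Reader on a new version: prefer errorInfo, fall back to errorMsg.
    static String resolve(ErrorResponse response, Function<ErrorInfo, String> formatter) {
        if (response.errorInfo != null) {
            return formatter.apply(response.errorInfo);
        }
        return response.errorMsg;
    }

    // Reader on an old version: knows nothing about errorInfo.
    static String resolveLegacy(ErrorResponse response) {
        return response.errorMsg;
    }

    public static void main(String[] args) {
        Function<ErrorInfo, String> formatter =
            info -> "formatted message for " + info.moduleName + ":" + info.code;

        ErrorResponse fromNewHistorical =
            new ErrorResponse("plain message", new ErrorInfo("druid", 42));
        ErrorResponse fromOldHistorical = new ErrorResponse("plain message", null);

        System.out.println(resolve(fromNewHistorical, formatter));
        System.out.println(resolve(fromOldHistorical, formatter));
        System.out.println(resolveLegacy(fromNewHistorical));
    }
}
```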

## Design Concerns

### Task Failures
Ingestion and compaction tasks are managed by the Overlord. Thus, the Overlord needs to be aware of the error types to be able to serve task statuses over REST APIs.

### Query Failures
Queries (SQL and native) are submitted over HTTP connections and the response can contain the detailed error message in case of failures. Thus the Broker need not be aware of the list of error types as there is no persistence of query status (and hence no requirement of persisting error codes and formatting the error messages when requested).

### Extensions that are not loaded on Overlord
There are several extensions in Druid which are not loaded on the Overlord and run only on the Middle Managers/Peons. As these are not loaded on the Overlord, it is not aware of the error types that these extensions can throw.

The approach here can be similar to that in Query Failures above. When reporting a failure to the Overlord, the Middle Manager can send both the `ErrorType` object (which denotes the category of the error) and the `ErrorTypeParams` (which denotes a specific error event). The Overlord can then persist the received `ErrorTypeParams` in its task status while also adding an entry to its error type mappings.
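
A rough sketch of how the Overlord side might add such entries on the fly is given below. The class name and method signatures are hypothetical, and the error type is reduced to its message format for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class LearnedErrorTypeRegistry {

    // moduleName -> (code -> message format)
    private final Map<String, Map<Integer, String>> moduleToFormats = new ConcurrentHashMap<>();

    // Called when a Middle Manager reports a failure together with its error type.
    // Returns true if the error type was not known before. An already known
    // entry is left untouched.
    boolean learn(String moduleName, int code, String messageFormat) {
        return moduleToFormats
            .computeIfAbsent(moduleName, m -> new ConcurrentHashMap<>())
            .putIfAbsent(code, messageFormat) == null;
    }

    // Returns the message format, or null if the error type is unknown.
    String getMessageFormat(String moduleName, int code) {
        Map<Integer, String> formats = moduleToFormats.get(moduleName);
        return formats == null ? null : formats.get(code);
    }

    public static void main(String[] args) {
        LearnedErrorTypeRegistry registry = new LearnedErrorTypeRegistry();
        System.out.println(registry.learn("kafka-indexer", 1, "The offset [%s] for topic [%s] is out of range."));
        System.out.println(registry.learn("kafka-indexer", 1, "The offset [%s] for topic [%s] is out of range."));
        System.out.println(registry.getMessageFormat("kafka-indexer", 1));
    }
}
```

Under this sketch a learned entry lives only in memory, so it would have to be learned again or persisted to remain available after an Overlord restart.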

### Storing the mappings from Error Code to Error Type
In the design discussed above, the error types are maintained in memory (in the Overlord). If extensions register a large number of error codes for rare scenarios, the mapping would occupy memory that is seldom put to use.

An alternative approach could be to persist the error types in the metadata store and access them through a small in-memory cache.

Pros:
- Only the frequently occurring error types would be present in the warmed up cache.
- Central repository for all error types that can be accessed by both the Overlord and the Coordinator.

Cons:
- De-duplication of module names and integer error codes would be more expensive.
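
As an illustration of the cached alternative, the sketch below keeps a bounded least-recently-used cache in front of a lookup function that stands in for the metadata store. All names are hypothetical, the error type is reduced to its message format, and the `moduleName:code` key encoding is an assumption.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class ErrorTypeCache {

    private final int maxEntries;
    private final Function<String, String> metadataStoreLookup;
    private final Map<String, String> cache;
    private int storeReads = 0;

    ErrorTypeCache(int maxEntries, Function<String, String> metadataStoreLookup) {
        this.maxEntries = maxEntries;
        this.metadataStoreLookup = metadataStoreLookup;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Evict the least recently used entry once the bound is exceeded.
                return size() > ErrorTypeCache.this.maxEntries;
            }
        };
    }

    // Returns the message format, reading from the store only on a cache miss.
    synchronized String getMessageFormat(String moduleName, int code) {
        String key = moduleName + ":" + code;
        String format = cache.get(key);
        if (format == null) {
            storeReads++;
            format = metadataStoreLookup.apply(key);
            if (format != null) {
                cache.put(key, format);
            }
        }
        return format;
    }

    synchronized int getStoreReads() {
        return storeReads;
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        store.put("kafka-emitter:1", "The given topic name [%s] is invalid.");
        ErrorTypeCache cache = new ErrorTypeCache(100, store::get);
        cache.getMessageFormat("kafka-emitter", 1);
        cache.getMessageFormat("kafka-emitter", 1);
        System.out.println("Store reads: " + cache.getStoreReads());
    }
}
```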

