Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Record column name for exceptions while writing frames in RowBasedFrameWriter #16130

Merged

Conversation

gargvishesh
Copy link
Contributor

@gargvishesh gargvishesh commented Mar 15, 2024

Current Runtime Exceptions generated while writing frames only include the exception itself without including the name of the column they were encountered in, for e.g.

java.lang.RuntimeException: java.lang.ClassCastException: class java.lang.Long cannot be cast to class java.lang.String (java.lang.Long and java.lang.String are in module java.base of loader 'bootstrap')

Start including additional details for easier debugging.

New Exception:
[InvalidFieldError](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#error_InvalidFieldError): Error[class java.lang.Long cannot be cast to class java.lang.String (java.lang.Long and java.lang.String are in module java.base of loader 'bootstrap')] while writing field for source[external input source: LocalInputSource{baseDir="null", filter=null, files=[foo]}], rowNumber[1], column[arrayString].

Tested the generated exception by raising it explicitly in a custom build.

Copy link
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

  • I have added a few comments about the relevance of the new error class.
  • Can you please add a test case where the exception gets generated?
  • Also, what does it mean for MSQ, should MSQ retry if it encounters such exceptions or not? It seems like retrying shouldn't help here, but per my understanding, MSQ will still treat errors like these as UnknownFaults, and will retry. We should update that logic in MSQ (check how we do it for InvalidNullByte)

Copy link
Contributor

@LakshSingla LakshSingla Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt that we need this class anywhere. Can't we use a generic RuntimeException with the proper error message? Or better yet, use DruidException, so that we can analyze the message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make a MSQ fault out of this. That way we donot have to adjust any retry logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converting FrameFieldWriterException now to an MSQFault type

@@ -53,17 +50,16 @@ private InvalidNullByteException(
@Nullable final Integer position
)
{
super(StringUtils.format(
super(column, StringUtils.format(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels weird that we have this extra layer of indirection, to what will essentially call the RuntimeException. If we don't extract the relevant fields (.getColumn()) from the FrameFieldWriterException anywhere, or use check for it's instanceof, then I suppose we can get rid of it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converting to MSQFault now for which the specific exception class is required. InvalidNullByteException now extends RuntimeException directly.

@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Mar 21, 2024
@gargvishesh
Copy link
Contributor Author

  • Can you please add a test case where the exception gets generated?

Tried multiple routes incl insert via external inline tables and insert via joins, but an error isn't generated. Therefore not adding the test case.

  • Also, what does it mean for MSQ, should MSQ retry if it encounters such exceptions or not? It seems like retrying shouldn't help here, but per my understanding, MSQ will still treat errors like these as UnknownFaults, and will retry. We should update that logic in MSQ (check how we do it for InvalidNullByte)

The conversion to FrameFieldWriterFault now puts it in the category of non-retryable errors.

Copy link
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment, overall the changes LGTM

Comment on lines 2993 to 2998
FrameFieldWriterException.builder()
.source(ffre.getSource())
.rowNumber(ffre.getRowNumber())
.column(ffre.getColumn())
.errorMsg(ffre.getErrorMsg())
.build(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use the normal constructor since we are using all the parameters in a single place. Invalid null byte exception has the builder because it populates different information at different places up the call stack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the ask is to remove the builder from FrameFieldWriterException. But we do build it progressively in ScanQueryFrameProcessor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add this class to MSQFaultsSerdeTest for completeness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I've also added the description of the fault in reference.md and removed the term frame from the fault code and log message since that is user-facing.

@JsonTypeName(FrameFieldWriterFault.CODE)
public class FrameFieldWriterFault extends BaseMSQFault
{
static final String CODE = "FieldWriterError";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't match the class name, can you please update the class as well? Also, wrt the naming convention of other faults, we don't have "Error" in the suffix. I think something like "InvalidFieldFault" might be more suitable, and the name could be "InvalidField". I am open to other naming conventions as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with your suggestion. We do have UnknownFault->UnknownError though, but InvalidField sounded fine.

@@ -49,7 +51,7 @@ public FrameFieldWriterFault(
super(
CODE,
StringUtils.format(
"Error[%s] while writing frame for source[%s], rowNumber[%d] column[%s]",
"Error[%s] while writing a field for source[%s], rowNumber[%d], column[%s].",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Duplicate error message as FrameFieldWriterException , can extract the format string out. It will make the updates easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now propagate the RE's message directly to the Fault class.

.source(ffre.getSource())
.rowNumber(ffre.getRowNumber())
.column(ffre.getColumn())
.errorMsg(ffre.getErrorMsg())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add the problematic field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field value isn't available at the place the exception starts -- which is RowBasedFrameWriter#writeDataUsingFieldWriters

@@ -299,11 +300,15 @@
.column(signature.getColumnName(i))
.build();
}
catch (ParseException pe) {
throw Throwables.propagate(pe);

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation Note

Invoking
Throwables.propagate
should be avoided because it has been deprecated.
@gargvishesh gargvishesh requested a review from LakshSingla April 5, 2024 12:11
Copy link
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Failures are code coverage checks - but there isn't a known dataset that will trigger this. Since the changes are isolated, they shouldn't affect the current code.

@LakshSingla LakshSingla merged commit 9a4fb58 into apache:master Apr 9, 2024
75 of 85 checks passed
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - Documentation Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants