Skip to content

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Jul 2, 2024

Currently, if a child request fails, we automatically trigger cancellation for ES|QL requests. This can result in TaskCancelledException being collected by the RefCountingListener first, which then returns that exception to the caller. For example, if we encounter a CircuitBreakingException (429), we might incorrectly return a TaskCancelledException (400) instead.

This change introduces the ComputeListener, a variant of RefCountingListener, which ignores unimportant exceptions and selects the most appropriate exception to return to the caller. I also integrated the following features into ComputeListener to simplify ComputeService:

  1. Automatic cancellation of sub-tasks on failure.
  2. Collection of profiles from sub-tasks.
  3. Collection of response headers from sub-tasks.

@dnhatn dnhatn force-pushed the compute-listener branch 5 times, most recently from fb4657b to b560b5a Compare July 3, 2024 01:41
@elasticsearchmachine
Copy link
Collaborator

Hi @dnhatn, I've created a changelog YAML for you.

@dnhatn dnhatn requested review from nik9000 and ChrisHegarty July 3, 2024 03:24
@dnhatn dnhatn marked this pull request as ready for review July 3, 2024 03:25
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jul 3, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on collecting the listener behavior into a named class.

Do you think it's appropriate to pull that class into it's own top level file? Do you think it's appropriate to make a test case specifically for it?

I know this has been quite the distraction and I'm proposing more distractions, but the listener behavior is pretty complex and a test might help other people to hold it in their head.

import java.util.concurrent.atomic.AtomicReference;

/**
* Collects exceptions occurred in compute, specifically ignores trivial exceptions such as {@link TaskCancelledException},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment isn't quite right I think - if there aren't any exceptions captured it'll capture the first "trivial exception". And it'll replace that with the first non-trivial exception it sees.

if (ExceptionsHelper.unwrapCause(first) != ExceptionsHelper.unwrapCause(e)) {
if (exceptionPermits.tryAcquire()) {
first.addSuppressed(e);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getAndUpdate is documented as requiring a side-effect-free function because it'll be retried sometimes. I think this is safe, but it's not really side effect free. Maybe it'd be more readable if we used an atomicarray or something. Either way, I spent a while reading this code and thinking about it, but I don't think the performance is all that critical here because it's deep into exception path.

@dnhatn
Copy link
Member Author

dnhatn commented Jul 5, 2024

@nik9000 Thank you for the feedback. I have made the ComputeLister its own class and added tests for it. I also updated the javadoc and implementation of FailureCollector to use concurrent queues. This should be ready for review again. Could you please take another look?

@dnhatn
Copy link
Member Author

dnhatn commented Jul 5, 2024

Thanks Nik!

@dnhatn dnhatn merged commit 81f95b9 into elastic:main Jul 5, 2024
@dnhatn dnhatn deleted the compute-listener branch July 5, 2024 19:55
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jul 5, 2024
Currently, if a child request fails, we automatically trigger  cancellation
for ES|QL requests. This can result in TaskCancelledException being
collected by the RefCountingListener first, which then returns that
exception to the caller. For example, if we encounter a
CircuitBreakingException (429), we might incorrectly return a
TaskCancelledException (400) instead.

This change introduces the ComputeListener, a variant of
RefCountingListener, which selects the most appropriate exception to return
to the caller. I also integrated the following features into ComputeListener to
simplify ComputeService:

- Automatic cancellation of sub-tasks on failure.
- Collection of profiles from sub-tasks.
- Collection of response headers from sub-tasks.
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jul 5, 2024
Currently, if a child request fails, we automatically trigger  cancellation
for ES|QL requests. This can result in TaskCancelledException being
collected by the RefCountingListener first, which then returns that
exception to the caller. For example, if we encounter a
CircuitBreakingException (429), we might incorrectly return a
TaskCancelledException (400) instead.

This change introduces the ComputeListener, a variant of
RefCountingListener, which selects the most appropriate exception to return
to the caller. I also integrated the following features into ComputeListener to
simplify ComputeService:

- Automatic cancellation of sub-tasks on failure.
- Collection of profiles from sub-tasks.
- Collection of response headers from sub-tasks.
elasticsearchmachine pushed a commit that referenced this pull request Jul 5, 2024
Currently, if a child request fails, we automatically trigger  cancellation
for ES|QL requests. This can result in TaskCancelledException being
collected by the RefCountingListener first, which then returns that
exception to the caller. For example, if we encounter a
CircuitBreakingException (429), we might incorrectly return a
TaskCancelledException (400) instead.

This change introduces the ComputeListener, a variant of
RefCountingListener, which selects the most appropriate exception to return
to the caller. I also integrated the following features into ComputeListener to
simplify ComputeService:

- Automatic cancellation of sub-tasks on failure.
- Collection of profiles from sub-tasks.
- Collection of response headers from sub-tasks.
elasticsearchmachine pushed a commit that referenced this pull request Jul 6, 2024
Currently, if a child request fails, we automatically trigger  cancellation
for ES|QL requests. This can result in TaskCancelledException being
collected by the RefCountingListener first, which then returns that
exception to the caller. For example, if we encounter a
CircuitBreakingException (429), we might incorrectly return a
TaskCancelledException (400) instead.

This change introduces the ComputeListener, a variant of
RefCountingListener, which selects the most appropriate exception to return
to the caller. I also integrated the following features into ComputeListener to
simplify ComputeService:

- Automatic cancellation of sub-tasks on failure.
- Collection of profiles from sub-tasks.
- Collection of response headers from sub-tasks.
@elastic elastic deleted a comment from elasticsearchmachine Jul 7, 2024
elasticsearchmachine pushed a commit that referenced this pull request Jul 8, 2024
ComputeResponse from old nodes may have a null value instead of an empty
list for profiles.

Relates #110400 Closes #110591
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jul 9, 2024
ComputeResponse from old nodes may have a null value instead of an empty
list for profiles.

Relates elastic#110400 Closes elastic#110591
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jul 9, 2024
ComputeResponse from old nodes may have a null value instead of an empty
list for profiles.

Relates elastic#110400 Closes elastic#110591
elasticsearchmachine pushed a commit that referenced this pull request Jul 9, 2024
ComputeResponse from old nodes may have a null value instead of an empty
list for profiles.

Relates #110400 Closes #110591
elasticsearchmachine pushed a commit that referenced this pull request Jul 9, 2024
ComputeResponse from old nodes may have a null value instead of an empty
list for profiles.

Relates #110400 Closes #110591
tvernum pushed a commit that referenced this pull request Feb 25, 2025
Currently, if a child request fails, we automatically trigger  cancellation
for ES|QL requests. This can result in TaskCancelledException being
collected by the RefCountingListener first, which then returns that 
exception to the caller. For example, if we encounter a 
CircuitBreakingException (429), we might incorrectly return a 
TaskCancelledException (400) instead.

This change introduces the ComputeListener, a variant of 
RefCountingListener, which selects the most appropriate exception to return
to the caller. I also integrated the following features into ComputeListener to
simplify ComputeService:

- Automatic cancellation of sub-tasks on failure.
- Collection of profiles from sub-tasks.
- Collection of response headers from sub-tasks.
tvernum pushed a commit that referenced this pull request Feb 25, 2025
ComputeResponse from old nodes may have a null value instead of an empty
list for profiles.

Relates #110400 Closes #110591
tvernum pushed a commit that referenced this pull request Feb 25, 2025
Currently, if a child request fails, we automatically trigger  cancellation
for ES|QL requests. This can result in TaskCancelledException being
collected by the RefCountingListener first, which then returns that 
exception to the caller. For example, if we encounter a 
CircuitBreakingException (429), we might incorrectly return a 
TaskCancelledException (400) instead.

This change introduces the ComputeListener, a variant of 
RefCountingListener, which selects the most appropriate exception to return
to the caller. I also integrated the following features into ComputeListener to
simplify ComputeService:

- Automatic cancellation of sub-tasks on failure.
- Collection of profiles from sub-tasks.
- Collection of response headers from sub-tasks.
tvernum pushed a commit that referenced this pull request Feb 25, 2025
ComputeResponse from old nodes may have a null value instead of an empty
list for profiles.

Relates #110400 Closes #110591
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.14.3 v8.15.1 v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants