Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce compute listener #110400

Merged
merged 5 commits into from
Jul 5, 2024
Merged

Introduce compute listener #110400

merged 5 commits into from
Jul 5, 2024

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Jul 2, 2024

Currently, if a child request fails, we automatically trigger cancellation for ES|QL requests. This can result in TaskCancelledException being collected by the RefCountingListener first, which then returns that exception to the caller. For example, if we encounter a CircuitBreakingException (429), we might incorrectly return a TaskCancelledException (400) instead.

This change introduces the ComputeListener, a variant of RefCountingListener, which ignores unimportant exceptions and selects the most appropriate exception to return to the caller. I also integrated the following features into ComputeListener to simplify ComputeService:

  1. Automatic cancellation of sub-tasks on failure.
  2. Collection of profiles from sub-tasks.
  3. Collection of response headers from sub-tasks.

@dnhatn dnhatn force-pushed the compute-listener branch 5 times, most recently from fb4657b to b560b5a Compare July 3, 2024 01:41
@elasticsearchmachine
Copy link
Collaborator

Hi @dnhatn, I've created a changelog YAML for you.

@dnhatn dnhatn marked this pull request as ready for review July 3, 2024 03:25
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jul 3, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on collecting the listener behavior into a named class.

Do you think it's appropriate to pull that class into it's own top level file? Do you think it's appropriate to make a test case specifically for it?

I know this has been quite the distraction and I'm proposing more distractions, but the listener behavior is pretty complex and a test might help other people to hold it in their head.

import java.util.concurrent.atomic.AtomicReference;

/**
* Collects exceptions occurred in compute, specifically ignores trivial exceptions such as {@link TaskCancelledException},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment isn't quite right I think - if there aren't any exceptions captured it'll capture the first "trivial exception". And it'll replace that with the first non-trivial exception it sees.

if (ExceptionsHelper.unwrapCause(first) != ExceptionsHelper.unwrapCause(e)) {
if (exceptionPermits.tryAcquire()) {
first.addSuppressed(e);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getAndUpdate is documented as requiring a side-effect-free function because it'll be retried sometimes. I think this is safe, but it's not really side effect free. Maybe it'd be more readable if we used an atomicarray or something. Either way, I spent a while reading this code and thinking about it, but I don't think the performance is all that critical here because it's deep into exception path.

@dnhatn
Copy link
Member Author

dnhatn commented Jul 5, 2024

@nik9000 Thank you for the feedback. I have made the ComputeLister its own class and added tests for it. I also updated the javadoc and implementation of FailureCollector to use concurrent queues. This should be ready for review again. Could you please take another look?

@dnhatn
Copy link
Member Author

dnhatn commented Jul 5, 2024

Thanks Nik!

@dnhatn dnhatn merged commit 81f95b9 into elastic:main Jul 5, 2024
15 checks passed
@dnhatn dnhatn deleted the compute-listener branch July 5, 2024 19:55
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jul 5, 2024
Currently, if a child request fails, we automatically trigger  cancellation
for ES|QL requests. This can result in TaskCancelledException being
collected by the RefCountingListener first, which then returns that
exception to the caller. For example, if we encounter a
CircuitBreakingException (429), we might incorrectly return a
TaskCancelledException (400) instead.

This change introduces the ComputeListener, a variant of
RefCountingListener, which selects the most appropriate exception to return
to the caller. I also integrated the following features into ComputeListener to
simplify ComputeService:

- Automatic cancellation of sub-tasks on failure.
- Collection of profiles from sub-tasks.
- Collection of response headers from sub-tasks.
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jul 5, 2024
Currently, if a child request fails, we automatically trigger  cancellation
for ES|QL requests. This can result in TaskCancelledException being
collected by the RefCountingListener first, which then returns that
exception to the caller. For example, if we encounter a
CircuitBreakingException (429), we might incorrectly return a
TaskCancelledException (400) instead.

This change introduces the ComputeListener, a variant of
RefCountingListener, which selects the most appropriate exception to return
to the caller. I also integrated the following features into ComputeListener to
simplify ComputeService:

- Automatic cancellation of sub-tasks on failure.
- Collection of profiles from sub-tasks.
- Collection of response headers from sub-tasks.
elasticsearchmachine pushed a commit that referenced this pull request Jul 5, 2024
Currently, if a child request fails, we automatically trigger  cancellation
for ES|QL requests. This can result in TaskCancelledException being
collected by the RefCountingListener first, which then returns that
exception to the caller. For example, if we encounter a
CircuitBreakingException (429), we might incorrectly return a
TaskCancelledException (400) instead.

This change introduces the ComputeListener, a variant of
RefCountingListener, which selects the most appropriate exception to return
to the caller. I also integrated the following features into ComputeListener to
simplify ComputeService:

- Automatic cancellation of sub-tasks on failure.
- Collection of profiles from sub-tasks.
- Collection of response headers from sub-tasks.
elasticsearchmachine pushed a commit that referenced this pull request Jul 6, 2024
Currently, if a child request fails, we automatically trigger  cancellation
for ES|QL requests. This can result in TaskCancelledException being
collected by the RefCountingListener first, which then returns that
exception to the caller. For example, if we encounter a
CircuitBreakingException (429), we might incorrectly return a
TaskCancelledException (400) instead.

This change introduces the ComputeListener, a variant of
RefCountingListener, which selects the most appropriate exception to return
to the caller. I also integrated the following features into ComputeListener to
simplify ComputeService:

- Automatic cancellation of sub-tasks on failure.
- Collection of profiles from sub-tasks.
- Collection of response headers from sub-tasks.
@elastic elastic deleted a comment from elasticsearchmachine Jul 7, 2024
elasticsearchmachine pushed a commit that referenced this pull request Jul 8, 2024
ComputeResponse from old nodes may have a null value instead of an empty
list for profiles.

Relates #110400 Closes #110591
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jul 9, 2024
ComputeResponse from old nodes may have a null value instead of an empty
list for profiles.

Relates elastic#110400 Closes elastic#110591
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jul 9, 2024
ComputeResponse from old nodes may have a null value instead of an empty
list for profiles.

Relates elastic#110400 Closes elastic#110591
elasticsearchmachine pushed a commit that referenced this pull request Jul 9, 2024
ComputeResponse from old nodes may have a null value instead of an empty
list for profiles.

Relates #110400 Closes #110591
elasticsearchmachine pushed a commit that referenced this pull request Jul 9, 2024
ComputeResponse from old nodes may have a null value instead of an empty
list for profiles.

Relates #110400 Closes #110591
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.14.3 v8.15.1 v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants