Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change terms aggregation default execution mode for tsdb #101619

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

martijnvg
Copy link
Member

An tsdb backing index when current time is before index.time_series.end_time will receive a lot of writes. Building global ordinals doesn't make sense, because it is expensive and the next search can't reuse it.

This change changes the default execution mode for tsdb backing indices where this is the case. The default will then be MAP instead of GLOBAL_ORDINALS.

This should improve query latency.

An tsdb backing index when current time is before  index.time_series.end_time will receive a lot of writes. Building global ordinals doesn't make sense, because it is expensive and the next search can't reuse it.

This change changes the default execution mode for tsdb backing indices where this is the case. The default will then be MAP instead of GLOBAL_ORDINALS.

This should improve query latency.
@elasticsearchmachine
Copy link
Collaborator

Hi @martijnvg, I've created a changelog YAML for you.

@martijnvg martijnvg marked this pull request as ready for review November 3, 2023 06:47
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Nov 3, 2023
@jpountz
Copy link
Contributor

jpountz commented Nov 3, 2023

As discussed today, it would be nice to have another execution mode that maps values to bucket ordinals like the MAP execution mode, but also takes advantage of the TSDB index sort by caching the bucket ordinal of the previous document. Then if the current document has the same value ordinal as the previous document, we know it has the same bucket ordinal as well and we can skip an expensive lookup in a BytesRefHash.

@martijnvg
Copy link
Member Author

I updated the PR to the suggested change, but instead implemented it as a CollectorSource so that much of the MapStringTermsAggregator can be reused. Unfortunately I'm seeing mixed results in the tsdb k8s rally track. I think this has various reasons:

  • There is a bug in the data set. The kubernetes.pod.uid field was is a dimension field contains a unique value per document. But in reality there should only 9932 unique pods in this data set (the cardinality of the pod name field is 9932). The suggested optimisation that was added then doesn't help.
  • The result of the queries are noisy. Indexing and searching happens at the same time (which is realistic for many use cases and that is the focus of that track). However because of that the query latency can fluctuate. This makes it difficult to see certain improvements.

The trend looks like that the all the last 15 minute searches benefit from this change. While the last 24 hours searches don't benefit from this change and actually report a worse query latency with this change. I suspect having a heuristic is estimate to match with X percent of the documents would make be helpful here.

I will rerun the rally track when the track's data set is regenerated.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants