-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Profile the fetch phase #77064
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Profile the fetch phase #77064
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented this by having the fetch phase optionally profile itself, attaching the profile results to the per-shard fetch results. I then merge those fetch results into the "search phase" profiling that we have been collecting for years. The rendered json of the profile result will only include a fetch
section on shards that performed a fetch.
...gin/src/main/java/org/elasticsearch/plugin/noop/action/search/TransportNoopSearchAction.java
Outdated
Show resolved
Hide resolved
"aggregations": [], | ||
"fetch": { | ||
"type": "fetch", | ||
"description": "fetch", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type
and description
come from my using a raw ProfileResult
for the entire fetch phase. I'm not happy about them, but I wasn't sure if it was worth writing a whole top level class for the result or working around them some other way.
"fetch": { | ||
"type": "fetch", | ||
"description": "fetch", | ||
"time_in_nanos": 660555, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the actual nanoTime
for the entire fetch phase rather than the sum of the breakdown. Because we can do that in fetch. We couldn't really do that for aggs or queries.
"load_stored_fields_count": 5 | ||
}, | ||
"debug": { | ||
"stored_fields": ["_id", "_routing", "_source"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit right here is important to me. Some folks want to only fetch doc values and avoid stored fields entirely. With this PR you can profile the query and check that this array is empty. If it is then we skipped loading doc values. Unless something tricky is happening in the nested case that I didn't dig into.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the debug section too!
"children": [ | ||
{ | ||
"type": "source", | ||
"description": "load _source", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We get one of these per fetch sub phase we run.
@Override | ||
public String name() { | ||
return "percolator_highlight"; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These power the type
and description
fields on the children
field in the fetch profile.
return; | ||
} | ||
|
||
Profiler profiler = context.getProfilers() == null ? Profiler.NOOP : context.getProfilers().getFetchProfiler(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't super happy with how invasive the change was here. I'm happy that the sub-phases for the most part don't have to change which is good, but this bit is bigger than I'd like. I'd love any suggestions if anyone has any.
@@ -428,4 +475,57 @@ private static void fillDocAndMetaFields(SearchContext context, FieldsVisitor fi | |||
static boolean hasSequentialDocs(DocIdToIndex[] docs) { | |||
return docs.length > 0 && docs[docs.length-1].docId - docs[0].docId == docs.length - 1; | |||
} | |||
|
|||
interface Profiler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with the internal interface and noop implementation to save from a bunch of null checks that'd have made the whole thing even more invasive. I'd be more than happy to get rid of them if anyone has any good ideas.
@SuppressWarnings("unchecked") | ||
private void hitExecute(FetchSourceContext fetchSourceContext, HitContext hitContext) { | ||
@SuppressWarnings("unchecked") | ||
private void hitExecute(FetchSourceContext fetchSourceContext, HitContext hitContext) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pulled into the phase object so we could collect debugging counters.
|
||
@Override | ||
public FetchSubPhaseProcessor getProcessor(FetchContext context) { | ||
if (context.scriptFields() == null) { | ||
if (context.scriptFields() == null || context.scriptFields().fields().isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you don't ask for any script fields we tend to get an empty list here, not null. That empty list was causing us to always return the phase and run it. That's not a big deal at all because it just iterates an empty list, but it was clogging up the profile output.
faf97a9
to
2c4803d
Compare
This adds profiling to the fetch phase so we can tell when fetching is slower than we'd like and we can tell which portion of the fetch is slow. The output includes which stored fields were loaded, how long it took to load stored fields, which fetch sub-phases were run, and how long those fetch sub-phases took. Closes elastic#75892
I can't reproduce this failure. Its in the CCS dueling tests. Both look like they fetch the same 8 documents but they fetch them from different shards: https://gist.github.com/nik9000/809fa7865e3d5b1c7dd50e96b89e3452 |
Nope. I was on the wrong branch. I can reproduce. Checking. |
OK! It's because the test assumed that the entire search result would be the same when searching in minimize round trip mode vs non-minimized mode. If you minimize the round trips you will perform more fetches so, of course, there will be more fetches. |
So folks don't have to dig through the docs, this is what the fetch profile looks like:
|
I've found a few more unit tests I'd like to write. Incoming. |
500 lines of tests..... The profile stuff didn't have any round trip unit tests so I added a bunch. Those are useful to catch serialization issues super early. |
Pinging @elastic/es-search (Team:Search) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is so close! I had one last comment.
import static org.hamcrest.Matchers.matchesPattern; | ||
import static org.hamcrest.Matchers.nullValue; | ||
|
||
public class SearchProfileQueryPhaseResultsTests extends ESTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be named SearchProfileResultsBuilderTests
now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
* Profile results for the query phase run on all shards. | ||
*/ | ||
public class SearchProfileResultsBuilder { | ||
public static SearchProfileResultsBuilder build(Map<String, SearchProfileQueryPhaseResult> queryPhaseResults) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little confused about this part -- probably related to naming. Could we just have a constructor new SearchProfileResultsBuilder(Map<String, SearchProfileQueryPhaseResult>)
instead of this static method? I'm not sure why we need the special NOT_PROFILING
object instead of just passing null
in the original place in SearchPhaseController
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to get the null check out of search phase controller so it was easy to unit test without all of SearchPhaseController. This way there isn't a null check. I liked that the old code didn't allocate anything if we didn't profile and wanted to keep that too, but have it here.
Another idea is to make the build
method static and have it take a nullable builder param. Or just revert the whole change and keep the null check in SearchPhaseController. I just, like, didn't like it there. I'm happy anytime I can get things out of SearchPhaseController.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this change has so many "moving parts" already, I'd be in favor of keeping the old strategy to keep it well-scoped. I also find it easier to understand. If we want we could follow-up with a refactor to simplify SearchPhaseController
a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I think it was clearer that way but I'm not the only one that has to read it. I've changed it the way you asked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong about that being the last comment. I have a few more small ones.
/** | ||
* Profiles from child-phases. | ||
*/ | ||
default List<ProfileResult> childProfiles() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No current subprocessor overrides this, could we remove until a follow-up where we implement profiling for inner_hits subphases?
try { | ||
List<FetchSubPhaseProcessor> processors = new ArrayList<>(); | ||
for (FetchSubPhase fsp : fetchSubPhases) { | ||
FetchSubPhaseProcessor processor = fsp.getProcessor(context); | ||
if (processor != null) { | ||
processors.add(processor); | ||
String type = fsp.getClass().getSimpleName().replaceAll("^Fetch", "").replaceAll("(FetchSub)?Phase$", ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I think it's nice to simply include the class name without replacements. It's still very readable and makes it easier to find the corresponding code for the phase.
* Profile results for the query phase run on all shards. | ||
*/ | ||
public class SearchProfileResultsBuilder { | ||
public static SearchProfileResultsBuilder build(Map<String, SearchProfileQueryPhaseResult> queryPhaseResults) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this change has so many "moving parts" already, I'd be in favor of keeping the old strategy to keep it well-scoped. I also find it easier to understand. If we want we could follow-up with a refactor to simplify SearchPhaseController
a bit.
Sure! That's cool. I'll go back to the null check method, remove the name
munging, and blast the leftover for child phases. I just missed it last
time.
…On Fri, Sep 10, 2021, 13:30 Julie Tibshirani ***@***.***> wrote:
***@***.**** commented on this pull request.
I was wrong about that being the last comment. I have a few more small
ones.
------------------------------
In
server/src/main/java/org/elasticsearch/search/fetch/FetchSubPhaseProcessor.java
<#77064 (comment)>
:
> @@ -28,4 +31,18 @@
*/
void process(HitContext hitContext) throws IOException;
+ /**
+ * Called when profiling after processing all documents to get any extra
+ * debug information the phase collected.
+ */
+ default Map<String, Object> getDebugInfo() {
+ return null;
+ }
+
+ /**
+ * Profiles from child-phases.
+ */
+ default List<ProfileResult> childProfiles() {
No current subprocessor overrides this, could we remove until a follow-up
where we implement profiling for inner_hits subphases?
------------------------------
In server/src/main/java/org/elasticsearch/search/fetch/FetchPhase.java
<#77064 (comment)>
:
> try {
List<FetchSubPhaseProcessor> processors = new ArrayList<>();
for (FetchSubPhase fsp : fetchSubPhases) {
FetchSubPhaseProcessor processor = fsp.getProcessor(context);
if (processor != null) {
- processors.add(processor);
+ String type = fsp.getClass().getSimpleName().replaceAll("^Fetch", "").replaceAll("(FetchSub)?Phase$", "");
Personally I think it's nice to simply include the class name without
replacements. It's still very readable and makes it easier to find the
corresponding code for the phase.
------------------------------
In
server/src/main/java/org/elasticsearch/search/profile/SearchProfileResultsBuilder.java
<#77064 (comment)>
:
> +
+package org.elasticsearch.search.profile;
+
+import org.elasticsearch.search.SearchPhaseResult;
+import org.elasticsearch.search.fetch.FetchSearchResult;
+
+import java.util.Collection;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Profile results for the query phase run on all shards.
+ */
+public class SearchProfileResultsBuilder {
+ public static SearchProfileResultsBuilder build(Map<String, SearchProfileQueryPhaseResult> queryPhaseResults) {
Since this change has so many "moving parts" already, I'd be in favor of
keeping the old strategy to keep it well-scoped. I also find it easier to
understand. If we want we could follow-up with a refactor to simplify
SearchPhaseController a bit.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#77064 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABUXIU5N3HB3ODBAS2RE7TUBI6D5ANCNFSM5DEDX3OQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
OK! Back to you @jtibshirani. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
This adds profiling to the fetch phase so we can tell when fetching is slower than we'd like and we can tell which portion of the fetch is slow. The output includes which stored fields were loaded, how long it took to load stored fields, which fetch sub-phases were run, and how long those fetch sub-phases took. Closes elastic#75892
In elastic#77064 I added some more tests for profiling. It fails when a `rarely()` block hits - roughly 2% of the time. This should fix it.
In elastic#77064 I added some more tests for profiling. It fails when a `rarely()` block hits - roughly 2% of the time. This should fix it.
In #77064 I added some more tests for profiling. It fails when a `rarely()` block hits - roughly 2% of the time. This should fix it.
This adds profiling to the fetch phase so we can tell when fetching is slower than we'd like and we can tell which portion of the fetch is slow. The output includes which stored fields were loaded, how long it took to load stored fields, which fetch sub-phases were run, and how long those fetch sub-phases took. Closes #75892
Now that elastic#77064 is backported we can run the bwc tests against it. And we can speak to 7.16 nodes using the wire protocol addedin elastic#77064.
In and backported! Exciting! |
Heya, shall we update the limitations from the profile docs page now that fetch can be profiled too? https://www.elastic.co/guide/en/elasticsearch/reference/current/search-profile.html#profile-limitations |
I have no idea when, but someone's removed that "limitation". |
This adds profiling to the fetch phase so we can tell when fetching is
slower than we'd like and we can tell which portion of the fetch is
slow. The output includes which stored fields were loaded, how long it
took to load stored fields, which fetch sub-phases were run, and how
long those fetch sub-phases took.
Closes #75892