Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add broker API to run a query on both query engines and compare results #13746

Merged
merged 4 commits into from
Sep 30, 2024

Conversation

yashmayya
Copy link
Collaborator

  • This patch adds a broker API to run a query on both the single stage and the multi stage query engines and compare the results. It complements the migration metric added in Automatically detect whether a v1 query could have run on the v2 query engine #13628 in order to assist users that want to migrate from the v1 query engine to the v2 query engine.
  • This tool is a more "active" way of checking queries than the "passive" metric that only checks whether the v1 queries being run are compilable in v2.
  • Here, the query is actually run on both the query engines and the results are compared. Any differences such as a mismatch in the number of result rows, types of result columns etc. are explicitly called out to the user.

@yashmayya yashmayya added multi-stage Related to the multi-stage query engine rest-api labels Aug 5, 2024
@codecov-commenter
Copy link

codecov-commenter commented Aug 5, 2024

Codecov Report

Attention: Patch coverage is 64.19753% with 29 lines in your changes missing coverage. Please review.

Project coverage is 64.83%. Comparing base (59551e4) to head (f40583d).
Report is 1081 commits behind head on master.

Files with missing lines Patch % Lines
...pinot/broker/api/resources/PinotClientRequest.java 64.19% 18 Missing and 11 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13746      +/-   ##
============================================
+ Coverage     61.75%   64.83%   +3.08%     
- Complexity      207     1534    +1327     
============================================
  Files          2436     2579     +143     
  Lines        133233   141334    +8101     
  Branches      20636    21655    +1019     
============================================
+ Hits          82274    91635    +9361     
+ Misses        44911    42937    -1974     
- Partials       6048     6762     +714     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 64.78% <64.19%> (+3.07%) ⬆️
java-21 64.71% <64.19%> (+3.09%) ⬆️
skip-bytebuffers-false 64.80% <64.19%> (+3.06%) ⬆️
skip-bytebuffers-true 64.70% <64.19%> (+36.97%) ⬆️
temurin 64.83% <64.19%> (+3.08%) ⬆️
unittests 64.83% <64.19%> (+3.08%) ⬆️
unittests1 56.33% <ø> (+9.44%) ⬆️
unittests2 34.93% <64.19%> (+7.20%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yashmayya yashmayya marked this pull request as ready for review August 27, 2024 11:55
Copy link
Collaborator

@vrajat vrajat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will you be adding documentation in docs or as javadocs ? It will be useful to document the gaps or conditions the comparator does not check for. Caveats are useful as the pass/fail result doesnt guarantee that the queries are similar.

List<String> differences = new ArrayList<>();

if (v1Response.getExceptionsSize() != 0 || v2Response.getExceptionsSize() != 0) {
differences.add("Exception encountered while running the query on one or both query engines");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Add exception messages to help out the user ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query response itself contains the entire v1 and v2 responses which will have the exceptions. It seems redundant to duplicate that here?

return differences;
}

DataSchema.ColumnDataType[] v1ResponseTypes = v1Response.getResultTable().getDataSchema().getColumnDataTypes();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to recognise cases where column ordering is different? I dont know if there are cases where the engines may return the result in different column orders.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not aware of any such cases where the column ordering differs in the two engines. However, we do check for column data type mismatches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIR any simple select query without group by can return different results, even using the same engine. This should be more frequent if there are several segments involved in the query. In fact we would like to verify order if the query is order by and do not do that in the other case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column data type check will identify it. IME, if there is a query with 20 columns, there will be 20 messages in the differences array while the real problem is just that columns are ordered differently. Anyway - it is unclear if this issue exists and can be detected as an improvement in the future.

}
}

// TODO: Compare response row values if it makes sense for the query type. Handle edge cases with group trimming,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: document as a javadoc?

@@ -359,6 +359,8 @@ public static class Broker {

public static class Request {
public static final String SQL = "sql";
public static final String V1SQL = "v1sql";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use official names ? SSQE and MSQE. Though I vote to drop the Q in the abbrevations.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The official docs do use the v1 and v2 terminology as well - https://docs.pinot.apache.org/reference/single-stage-engine, https://docs.pinot.apache.org/developers/advanced/v2-multi-stage-query-engine and it seems more succinct and clear to say v1sql than something like ssqeSql.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we should always try to use single-stage query engine and multi-stage query engine instead of v1 and v2, but we are so used to the latest that sometimes we leak these terms.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or use only sse and mse as the semantics of the fields are obvious and will be documented. Sql prefix is not required.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My order of preference in descending order is:

  1. sqlV1 and sqlV2 so they start with the same prefix of the default sql.
  2. v1Sql and v2Sql
  3. sse and mse sound a bit strange.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've changed it to sqlV1 / sqlV2.

Copy link
Collaborator Author

@yashmayya yashmayya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will you be adding documentation in docs or as javadocs ? It will be useful to document the gaps or conditions the comparator does not check for. Caveats are useful as the pass/fail result doesnt guarantee that the queries are similar.

Agreed, I plan to add those details to the official documentation at https://docs.pinot.apache.org/.

List<String> differences = new ArrayList<>();

if (v1Response.getExceptionsSize() != 0 || v2Response.getExceptionsSize() != 0) {
differences.add("Exception encountered while running the query on one or both query engines");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query response itself contains the entire v1 and v2 responses which will have the exceptions. It seems redundant to duplicate that here?

return differences;
}

DataSchema.ColumnDataType[] v1ResponseTypes = v1Response.getResultTable().getDataSchema().getColumnDataTypes();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not aware of any such cases where the column ordering differs in the two engines. However, we do check for column data type mismatches.

@@ -359,6 +359,8 @@ public static class Broker {

public static class Request {
public static final String SQL = "sql";
public static final String V1SQL = "v1sql";
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The official docs do use the v1 and v2 terminology as well - https://docs.pinot.apache.org/reference/single-stage-engine, https://docs.pinot.apache.org/developers/advanced/v2-multi-stage-query-engine and it seems more succinct and clear to say v1sql than something like ssqeSql.

@POST
@Produces(MediaType.APPLICATION_JSON)
@Path("query/compare")
@ApiOperation(value = "Query Pinot using both the single stage query engine and the multi stage query engine and "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using swagger APIs quite a bit today. Setting sql vs v1Sql/v2Sql maybe confusing. Majority of users will only use the doc in the swagger page to use the API. There are a couple of options to reduce confusion:

  • Only provide v1Sql/v2Sql. Its not that hard to copy paste twice.
  • Change the one-line documentation to add more info. There is precedence to multi-line description in swagger page. For example:

Query Pinot using both the single stage query engine and the multi stage query engine and compare the results. Set sql field to run the same query in both engines. Set v1Sql & v2Sql instead if query text is different.

Copy link
Collaborator Author

@yashmayya yashmayya Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I've gone with your option 2 since ideally we'd want most single-stage engine queries to work as is on the multi-stage query engine.

Comment on lines 265 to 323
if (requestJson.has(Request.SQL)) {
v1Query = requestJson.get(Request.SQL).asText();
v2Query = v1Query;
} else if (requestJson.has(Request.V1SQL) && requestJson.has(Request.V2SQL)) {
v1Query = requestJson.get(Request.V1SQL).asText();
v2Query = requestJson.get(Request.V2SQL).asText();
} else {
throw new IllegalStateException("Payload should either contain the query string field '" + Request.SQL + "' "
+ "or both of '" + Request.V1SQL + "' and '" + Request.V2SQL + "'");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we support sql by default? I mean to still support sql when either sqlV1 or sqlV2 are provided.

So the algorithm I suggest is:

v1Query = requestJson.has(Request.V1SQL) ? requestJson.get(Request.V1SQL) else requestJson.get(Request.SQL)

and same for v2Query

This is similar to what we do in resource based query tests where there is a sql field we use by default but if there is a h2Sql we use that one for h2.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I've added this fallback although I haven't updated the API doc in order to avoid making it too confusing.

@gortiz gortiz merged commit 591f193 into apache:master Sep 30, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multi-stage Related to the multi-stage query engine rest-api
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants