-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add broker API to run a query on both query engines and compare results #13746
Conversation
yashmayya
commented
Aug 5, 2024
- This patch adds a broker API to run a query on both the single stage and the multi stage query engines and compare the results. It complements the migration metric added in Automatically detect whether a v1 query could have run on the v2 query engine #13628 in order to assist users that want to migrate from the v1 query engine to the v2 query engine.
- This tool is a more "active" way of checking queries than the "passive" metric that only checks whether the v1 queries being run are compilable in v2.
- Here, the query is actually run on both the query engines and the results are compared. Any differences such as a mismatch in the number of result rows, types of result columns etc. are explicitly called out to the user.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #13746 +/- ##
============================================
+ Coverage 61.75% 64.83% +3.08%
- Complexity 207 1534 +1327
============================================
Files 2436 2579 +143
Lines 133233 141334 +8101
Branches 20636 21655 +1019
============================================
+ Hits 82274 91635 +9361
+ Misses 44911 42937 -1974
- Partials 6048 6762 +714
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
0d58ced
to
485c9dd
Compare
485c9dd
to
f00bfae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will you be adding documentation in docs or as javadocs ? It will be useful to document the gaps or conditions the comparator does not check for. Caveats are useful as the pass/fail result doesnt guarantee that the queries are similar.
List<String> differences = new ArrayList<>(); | ||
|
||
if (v1Response.getExceptionsSize() != 0 || v2Response.getExceptionsSize() != 0) { | ||
differences.add("Exception encountered while running the query on one or both query engines"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Add exception messages to help out the user ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The query response itself contains the entire v1 and v2 responses which will have the exceptions. It seems redundant to duplicate that here?
return differences; | ||
} | ||
|
||
DataSchema.ColumnDataType[] v1ResponseTypes = v1Response.getResultTable().getDataSchema().getColumnDataTypes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to recognise cases where column ordering is different? I dont know if there are cases where the engines may return the result in different column orders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not aware of any such cases where the column ordering differs in the two engines. However, we do check for column data type mismatches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIR any simple select query without group by can return different results, even using the same engine. This should be more frequent if there are several segments involved in the query. In fact we would like to verify order if the query is order by and do not do that in the other case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Column data type check will identify it. IME, if there is a query with 20 columns, there will be 20 messages in the differences
array while the real problem is just that columns are ordered differently. Anyway - it is unclear if this issue exists and can be detected as an improvement in the future.
} | ||
} | ||
|
||
// TODO: Compare response row values if it makes sense for the query type. Handle edge cases with group trimming, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: document as a javadoc?
@@ -359,6 +359,8 @@ public static class Broker { | |||
|
|||
public static class Request { | |||
public static final String SQL = "sql"; | |||
public static final String V1SQL = "v1sql"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use official names ? SSQE
and MSQE
. Though I vote to drop the Q
in the abbrevations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The official docs do use the v1 and v2 terminology as well - https://docs.pinot.apache.org/reference/single-stage-engine, https://docs.pinot.apache.org/developers/advanced/v2-multi-stage-query-engine and it seems more succinct and clear to say v1sql
than something like ssqeSql
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we should always try to use single-stage query engine and multi-stage query engine instead of v1 and v2, but we are so used to the latest that sometimes we leak these terms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or use only sse
and mse
as the semantics of the fields are obvious and will be documented. Sql
prefix is not required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My order of preference in descending order is:
sqlV1
andsqlV2
so they start with the same prefix of the defaultsql
.v1Sql
andv2Sql
sse
andmse
sound a bit strange.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I've changed it to sqlV1
/ sqlV2
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will you be adding documentation in docs or as javadocs ? It will be useful to document the gaps or conditions the comparator does not check for. Caveats are useful as the pass/fail result doesnt guarantee that the queries are similar.
Agreed, I plan to add those details to the official documentation at https://docs.pinot.apache.org/.
List<String> differences = new ArrayList<>(); | ||
|
||
if (v1Response.getExceptionsSize() != 0 || v2Response.getExceptionsSize() != 0) { | ||
differences.add("Exception encountered while running the query on one or both query engines"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The query response itself contains the entire v1 and v2 responses which will have the exceptions. It seems redundant to duplicate that here?
return differences; | ||
} | ||
|
||
DataSchema.ColumnDataType[] v1ResponseTypes = v1Response.getResultTable().getDataSchema().getColumnDataTypes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not aware of any such cases where the column ordering differs in the two engines. However, we do check for column data type mismatches.
@@ -359,6 +359,8 @@ public static class Broker { | |||
|
|||
public static class Request { | |||
public static final String SQL = "sql"; | |||
public static final String V1SQL = "v1sql"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The official docs do use the v1 and v2 terminology as well - https://docs.pinot.apache.org/reference/single-stage-engine, https://docs.pinot.apache.org/developers/advanced/v2-multi-stage-query-engine and it seems more succinct and clear to say v1sql
than something like ssqeSql
.
@POST | ||
@Produces(MediaType.APPLICATION_JSON) | ||
@Path("query/compare") | ||
@ApiOperation(value = "Query Pinot using both the single stage query engine and the multi stage query engine and " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am using swagger APIs quite a bit today. Setting sql
vs v1Sql/v2Sql
maybe confusing. Majority of users will only use the doc in the swagger page to use the API. There are a couple of options to reduce confusion:
- Only provide
v1Sql
/v2Sql
. Its not that hard to copy paste twice. - Change the one-line documentation to add more info. There is precedence to multi-line description in swagger page. For example:
Query Pinot using both the single stage query engine and the multi stage query engine and compare the results. Set sql field to run the same query in both engines. Set v1Sql & v2Sql instead if query text is different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I've gone with your option 2 since ideally we'd want most single-stage engine queries to work as is on the multi-stage query engine.
f00bfae
to
8f21e2a
Compare
if (requestJson.has(Request.SQL)) { | ||
v1Query = requestJson.get(Request.SQL).asText(); | ||
v2Query = v1Query; | ||
} else if (requestJson.has(Request.V1SQL) && requestJson.has(Request.V2SQL)) { | ||
v1Query = requestJson.get(Request.V1SQL).asText(); | ||
v2Query = requestJson.get(Request.V2SQL).asText(); | ||
} else { | ||
throw new IllegalStateException("Payload should either contain the query string field '" + Request.SQL + "' " | ||
+ "or both of '" + Request.V1SQL + "' and '" + Request.V2SQL + "'"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we support sql
by default? I mean to still support sql
when either sqlV1
or sqlV2
are provided.
So the algorithm I suggest is:
v1Query = requestJson.has(Request.V1SQL) ? requestJson.get(Request.V1SQL) else requestJson.get(Request.SQL)
and same for v2Query
This is similar to what we do in resource based query tests where there is a sql
field we use by default but if there is a h2Sql
we use that one for h2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I've added this fallback although I haven't updated the API doc in order to avoid making it too confusing.
df41491
to
cf5bafa
Compare
cf5bafa
to
269f934
Compare
269f934
to
f40583d
Compare