-
Notifications
You must be signed in to change notification settings - Fork 181
Support parsing documents with flattened value #3577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support parsing documents with flattened value #3577
Conversation
Signed-off-by: Heng Qian <qianheng@amazon.com>
|
|
||
| /** Implements mergeTo by merging deeply */ | ||
| @Override | ||
| public ExprTupleValue mergeTo(ExprValue base) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add UT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * | ||
| * <p>If there is existing vale for the JsonPath, we need to merge the new value to the old. | ||
| */ | ||
| private void populateValueRecursive(ExprTupleValue result, JsonPath path, ExprValue value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add UT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| @Override | ||
| public void init() throws Exception { | ||
| super.init(); | ||
| putDocument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yamlTest is more readable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or add this test data to resource folder and load it. It is not as simple as test index, adding the mapping and data to resource could be readable and reusable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@penghuo If we want to test this case w/wo Calcite enabled, in yamlTest, is there a convenient way to implement this with less code duplication?
In Java code, we can use inheritance to make it simple like the current code. But in yaml, we may need to duplicate the process with extra settings.
I've thought of using yamlTest but didn't have much information whether it can do such thing conveniently, so I keep using IT to do the verification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a yaml test with only Calcite enabled.
| } | ||
| } | ||
|
|
||
| protected void putDocument(String index, int id, String json) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yamlTest support index any doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be deleted now?
| * @param base the target value to merge | ||
| * @return The merged value | ||
| */ | ||
| default ExprValue mergeTo(ExprValue base) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if ExprIpValue mergeTo ExprIntValue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems it is replace. Could you comment in code what's the purpose of adding this mergeTo interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As said in the doc, it should override the base under the current implementation. Such case will only happen when there is duplication value for a field.
- In normal cases, only ExprTupleValue::mergeTo should be invoked. Cases like:
{"log.json.time": 100, "log.json": { "status": "SUCCESS" }}
For "log.json.time": 100, it will be finally populated into an ExprValue ExprTupleValue(Map.of(\"log\": ExprTupleValue(Map.of({\"json\": ExprTupleValue(Map.of({\"time\": 100}))}))
For "log.json": { "status": "SUCCESS" }, it will be parsed into path: log.json, value: ExprTupleValue(Map.of({\"status\": SUCCESS}))
So after populating the first field, there is key conflict when populating the second field. In this case, we need to invoke ExprTupleValue::mergeTo` to merge 2 ExprTupleValue deeply.
- In some edge cases where there is conflict value in customer's document, like:
{"log.json.status": "FAILED", "log.json": { "status": "SUCCESS" }}
There is value conflict for the field log.json.status, so I want the value added most recently to override any previously added values, and ExprValue::megeTo will be invoked.
But, in json, since the sequence of the fields in the same level are unstable actually, we cannot assert which value will be put firstly. Maybe throw exception to tell customer there is incompatible value conflict will also be an option. What's your opinion on this edge case? @LantaoJin @penghuo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe throw exception to tell customer there is incompatible value conflict will also be an option.
Do you mean throw exception in ppl querying or data ingesting? Data ingesting seems out of the scope here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, if there are conflict fields. Display the any of them would be all fine as long as the field populated out is fixed. For example:
{"id": 1, "log.json.status": "FAILED", "log.json": { "status": "SUCCESS" }}
{"id": 2, "log.json.status": "FAILED", "log.json": { "status": "SUCCESS" }}
{"id": 3, "log.json.status": "FAILED", "log.json": { "status": "SUCCESS" }}
PPL source=t | fields id, log.json.status return
1, FAILED
2, FAILED
3, FAILED
or
1, SUCCESS
2, SUCCESS
3, SUCCESS
are all acceptable. This could be treated as data quality issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test for key conflict case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is value conflict case will flakey because fields order are not ensured in json. We cannot assert which value will be put in the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in the UT of testPopulateValueRecursive, where I can control the populate order
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe throw exception to tell customer there is incompatible value conflict will also be an option. What's your opinion on this edge case?
do not throw exception, current approach looks good. OpenSearch also return the LATEST field.
###
POST {{baseUrl}}/test00001/_doc
Content-Type: application/x-ndjson
{
"log.json.status": "failed1",
"log.json": {
"status": "failed2"
},
"log": {
"json": {
"status": "failed3"
}
},
}
###
GET {{baseUrl}}/test00001/_search
Content-Type: application/x-ndjson
{
"fields": [
"log.json.status"
],
"_source": false
}
### Return
"fields": {
"log.json.status": [
"failed3"
]
}
| // Update the current ExprValue by using mergeTo if exists | ||
| result | ||
| .tupleValue() | ||
| .computeIfPresent(path.getRootPath(), (key, oldValue) -> value.mergeTo(oldValue)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what case is coverred by mergeTo, same field with different value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the comment [here](#3577 (comment))
...earch/src/main/java/org/opensearch/sql/opensearch/data/value/OpenSearchExprValueFactory.java
Show resolved
Hide resolved
| @Override | ||
| public void init() throws Exception { | ||
| super.init(); | ||
| putDocument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or add this test data to resource folder and load it. It is not as simple as test index, adding the mapping and data to resource could be readable and reusable.
| * @param base the target value to merge | ||
| * @return The merged value | ||
| */ | ||
| default ExprValue mergeTo(ExprValue base) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems it is replace. Could you comment in code what's the purpose of adding this mergeTo interface?
...earch/src/main/java/org/opensearch/sql/opensearch/data/value/OpenSearchExprValueFactory.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Heng Qian <qianheng@amazon.com>
Signed-off-by: Heng Qian <qianheng@amazon.com>
Signed-off-by: Heng Qian <qianheng@amazon.com>
Signed-off-by: Heng Qian <qianheng@amazon.com>
Signed-off-by: Heng Qian <qianheng@amazon.com>
Signed-off-by: Heng Qian <qianheng@amazon.com>
Signed-off-by: Heng Qian <qianheng@amazon.com>
Signed-off-by: Heng Qian <qianheng@amazon.com>
|
@qianheng-aws could you change this log level from |
Signed-off-by: Heng Qian <qianheng@amazon.com>
| * @param base the target value to merge | ||
| * @return The merged value | ||
| */ | ||
| default ExprValue mergeTo(ExprValue base) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe throw exception to tell customer there is incompatible value conflict will also be an option. What's your opinion on this edge case?
do not throw exception, current approach looks good. OpenSearch also return the LATEST field.
###
POST {{baseUrl}}/test00001/_doc
Content-Type: application/x-ndjson
{
"log.json.status": "failed1",
"log.json": {
"status": "failed2"
},
"log": {
"json": {
"status": "failed3"
}
},
}
###
GET {{baseUrl}}/test00001/_search
Content-Type: application/x-ndjson
{
"fields": [
"log.json.status"
],
"_source": false
}
### Return
"fields": {
"log.json.status": [
"failed3"
]
}
* Support parsing documents with flattened value Signed-off-by: Heng Qian <qianheng@amazon.com> * Add UT and yamlTest Signed-off-by: Heng Qian <qianheng@amazon.com> * Refine code Signed-off-by: Heng Qian <qianheng@amazon.com> * Refine code Signed-off-by: Heng Qian <qianheng@amazon.com> * Refine code Signed-off-by: Heng Qian <qianheng@amazon.com> * Fix UT by using LinkedHashMap to ensure order Signed-off-by: Heng Qian <qianheng@amazon.com> * Address comments Signed-off-by: Heng Qian <qianheng@amazon.com> * Fix UT Signed-off-by: Heng Qian <qianheng@amazon.com> * Fix IT Signed-off-by: Heng Qian <qianheng@amazon.com> * Address comments Signed-off-by: Heng Qian <qianheng@amazon.com> * Address comments Signed-off-by: Heng Qian <qianheng@amazon.com> * Fix log level for JdbcOpenSearchDataTypeConvertor Signed-off-by: Heng Qian <qianheng@amazon.com> --------- Signed-off-by: Heng Qian <qianheng@amazon.com> (cherry picked from commit 8d531e0) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Support parsing documents with flattened value * Add UT and yamlTest * Refine code * Refine code * Refine code * Fix UT by using LinkedHashMap to ensure order * Address comments * Fix UT * Fix IT * Address comments * Address comments * Fix log level for JdbcOpenSearchDataTypeConvertor --------- (cherry picked from commit 8d531e0) Signed-off-by: Heng Qian <qianheng@amazon.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Support parsing documents with flattened value Signed-off-by: Heng Qian <qianheng@amazon.com> (cherry picked from commit 8d531e0)
* Support parsing documents with flattened value Signed-off-by: Heng Qian <qianheng@amazon.com> (cherry picked from commit 8d531e0)
* Support parsing documents with flattened value Signed-off-by: Heng Qian <qianheng@amazon.com> Signed-off-by: Kai Huang <ahkcs@amazon.com> (cherry picked from commit 8d531e0)
* Support parsing documents with flattened value Signed-off-by: Heng Qian <qianheng@amazon.com> * Add UT and yamlTest Signed-off-by: Heng Qian <qianheng@amazon.com> * Refine code Signed-off-by: Heng Qian <qianheng@amazon.com> * Refine code Signed-off-by: Heng Qian <qianheng@amazon.com> * Refine code Signed-off-by: Heng Qian <qianheng@amazon.com> * Fix UT by using LinkedHashMap to ensure order Signed-off-by: Heng Qian <qianheng@amazon.com> * Address comments Signed-off-by: Heng Qian <qianheng@amazon.com> * Fix UT Signed-off-by: Heng Qian <qianheng@amazon.com> * Fix IT Signed-off-by: Heng Qian <qianheng@amazon.com> * Address comments Signed-off-by: Heng Qian <qianheng@amazon.com> * Address comments Signed-off-by: Heng Qian <qianheng@amazon.com> * Fix log level for JdbcOpenSearchDataTypeConvertor Signed-off-by: Heng Qian <qianheng@amazon.com> --------- Signed-off-by: Heng Qian <qianheng@amazon.com> Signed-off-by: xinyual <xinyual@amazon.com>
Description
It supports parsing flattened document value by populating the ExprTupleValue recursively, so we can access the nested fields as well as the root fields as before.
Related Issues
Resolves #3477
Check List
--signoff.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.