Skip to content

Add customizable parser module#8484

Merged
xiangfu0 merged 12 commits intoapache:masterfrom
walterddr:add_custom_parser
Apr 13, 2022
Merged

Add customizable parser module#8484
xiangfu0 merged 12 commits intoapache:masterfrom
walterddr:add_custom_parser

Conversation

@walterddr
Copy link
Contributor

@walterddr walterddr commented Apr 7, 2022

This PR added a customizable parser syntax extension.

Details:

  • created fmpp-maven-plugin that supports multi scope config template.
    • supported overwrite config.fmpp over Calcite's default.
  • supported customizable syntax via extending parserImpls.ftl
  • incorporated BABEL syntax Pinot used
    • need to reduce the amount of nonReservedKeywords but we kept the full list for backward compatibility
  • introduced a new SqlInsertFromTable node.

Comment on lines +46 to +49
/**
* INSERT INTO [db_name.]table_name
* FROM [ FILE | ARCHIVE ] 'file_uri' [, [ FILE | ARCHIVE ] 'file_uri' ]
*/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddharthteotia
Copy link
Contributor

Don't we need to write new handlers (java code) for the new grammar ?

@xiangfu0
Copy link
Contributor

xiangfu0 commented Apr 7, 2022

Don't we need to write new handlers (java code) for the new grammar ?

Yes, I'm on the java part.

We plan to add the query syntax to insert raw data files into pinot using the query console. See the examples here: #8465 (comment)

@walterddr is extending the Calcite dialect to support new syntax then we can hook this with parser logic to this new ad-hoc minion task endpoint to execute tasks.

@walterddr
Copy link
Contributor Author

this is still in POC phase but we were able to incorporate a new syntax and a new SqlCall node during parsing phase.

caveat. no longer using BABEL parser factory. let's see how many tests failed b/c of this change

Copy link
Contributor Author

@walterddr walterddr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated PR:

@codecov-commenter
Copy link

codecov-commenter commented Apr 9, 2022

Codecov Report

Merging #8484 (0c1ba30) into master (52e5a2c) will decrease coverage by 3.77%.
The diff coverage is 63.83%.

❗ Current head 0c1ba30 differs from pull request most recent head ae4d2ec. Consider uploading reports for the commit ae4d2ec to get more accurate results

@@             Coverage Diff              @@
##             master    #8484      +/-   ##
============================================
- Coverage     70.77%   67.00%   -3.78%     
+ Complexity     4281     4202      -79     
============================================
  Files          1664     1274     -390     
  Lines         87356    64076   -23280     
  Branches      13229    10053    -3176     
============================================
- Hits          61826    42931   -18895     
+ Misses        21234    18062    -3172     
+ Partials       4296     3083    -1213     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 67.00% <63.83%> (-0.04%) ⬇️
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...pache/pinot/common/config/provider/TableCache.java 0.00% <0.00%> (-76.20%) ⬇️
...inot/common/function/scalar/DateTimeFunctions.java 95.23% <ø> (ø)
.../apache/pinot/sql/parsers/parser/UnparseUtils.java 0.00% <0.00%> (ø)
.../org/apache/pinot/core/common/MinionConstants.java 0.00% <ø> (ø)
...in/java/org/apache/pinot/core/common/Operator.java 0.00% <ø> (ø)
...inot/core/data/manager/offline/DimensionTable.java 90.90% <0.00%> (-9.10%) ⬇️
...ata/manager/offline/DimensionTableDataManager.java 84.61% <0.00%> (-1.66%) ⬇️
...operator/AcquireReleaseColumnsSegmentOperator.java 0.00% <ø> (ø)
...he/pinot/core/operator/BitmapDocIdSetOperator.java 71.42% <ø> (+3.24%) ⬆️
...g/apache/pinot/core/operator/DocIdSetOperator.java 96.29% <ø> (+3.43%) ⬆️
... and 716 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 52e5a2c...ae4d2ec. Read the comment docs.

</java>
</configuration>
</plugin>
<plugin>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to sync the pom with pinot-common-jdk8 module

@xiangfu0
Copy link
Contributor

xiangfu0 commented Apr 9, 2022

updated PR:

we can move fmpp to org: https://github.com/pinot-contrib ?

@walterddr walterddr changed the title adding custome parser adding customized parser Apr 10, 2022
@walterddr walterddr changed the title adding customized parser Add customizable parser module Apr 10, 2022
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level question: Do we introduce this custom parser to accelerate the parsing but keep the same preserved words as BABEL? Can we get the same performance if switching to DEFAULT? I feel it might not worth the maintenance overhead if it is just for some extra preserved words.
If it is a must have in order to support more custom query syntax support, then go for it

+ " ELSE 'The quantity is under 30'\n"
+ "END AS QuantityText\n"
+ "FROM OrderDetails");
"SELECT OrderID, Quantity,\n" + "CASE\n" + " WHEN Quantity > 30 THEN 'The quantity is greater than 30'\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(format) Please configure the formatter to honor formatter comments (Code Style -> Formatter -> check Turn formatter on/off with markers in code comments)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@walterddr walterddr marked this pull request as ready for review April 12, 2022 00:27
@xiangfu0
Copy link
Contributor

High level question: Do we introduce this custom parser to accelerate the parsing but keep the same preserved words as BABEL? Can we get the same performance if switching to DEFAULT? I feel it might not worth the maintenance overhead if it is just for some extra preserved words. If it is a must have in order to support more custom query syntax support, then go for it

The main goal is to introduce new SQL syntax for INSERT INTO ...

@walterddr
Copy link
Contributor Author

walterddr commented Apr 12, 2022

High level question: Do we introduce this custom parser to accelerate the parsing but keep the same preserved words as BABEL?

the goal is not accelerate parsing but to get precise control on our SQL syntax

  • to support customizable syntax this is required.
  • BABEL enabled lots of unnecessary non-reserved keyword that may or may not be intended (and so far we only tested 10~20ish such non-reserved keyword usages).

yes, we copied the whole list for backward compatibility but ideally speaking we should only kept the list of non-reserved keywords that we support in CalciteSqlCompilerTest:testReservedKeywords()

Can we get the same performance if switching to DEFAULT?

Switching to DEFAULT or MYSQL_5 standard is desired if possible (see my TODO comment), but we still need to customized the list of non-reserved keyword + maintain backward compatibility until we can warn users that certain reserved keywords (in BABEL) will be removed in the future (if we decided to do so)

I feel it might not worth the maintenance overhead if it is just for some extra preserved words.

there shouldn't be any additional maintenance overhead comparing with using BABEL. As the syntax codegen rarely changed. and it will remain functional unless we change calcite version (which will causes maintenance overhead anyway if use BABEL)

}
return report;
}
} No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: new line

+ " ELSE 'The quantity is under 30'\n"
+ "END AS QuantityText\n"
+ "FROM OrderDetails");
"SELECT OrderID, Quantity,\n" + "CASE\n" + " WHEN Quantity > 30 THEN 'The quantity is greater than 30'\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@Test
public void testParserExtensionImpl() {
SqlNode sqlNode = testSqlWithCustomSqlParser("INSERT INTO db.tbl "
+ "FROM FILE 'file:///tmp/file1', FILE 'file:///tmp/file2'");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe also add a test for INSERT with OPTIONS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you give an example? (i am not sure that's supported in my parser syntax)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are referring to the option regex. that is not part of the change and will be removed from the SQL string before entering the parser anyway

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm referring to something like:

INSERT INTO "baseballStats"
FROM FILE s3://<my-bucket>/baseballStats/rawdata/
OPTION(taskName=myTask-s3)
OPTION(input.fs.className=org.apache.pinot.plugin.filesystem.S3PinotFS)
OPTION(input.fs.prop.accessKey=<my-key>)
OPTION(input.fs.prop.secretKey=<my-secret>)
OPTION(input.fs.prop.region=us-west-2)

If it's already handled before then no problem from here.

Copy link
Contributor

@siddharthteotia siddharthteotia Apr 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think we are still using the BABEL conformance but removing the MYSQL lex analyzer ? There won't be any impact of this on existing queries that are running fine or are otherwise unsupported ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lex is different from conformance. Lex basically is equivalent to the line

    sqlParser.switchTo(SqlAbstractParserImpl.LexicalState.DQID);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. My concern was mainly around ifLex.MYSQL_ANSI will give same behavior as SqlAbstractParserImpl.LexicalState.DQID ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes they are. If you check the config impl behind the scene it uses the LEX enum to see a whole bunch of settings I did here as well

Copy link
Contributor

@siddharthteotia siddharthteotia Apr 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the current code SqlParser sqlParser = SqlParser.create(expression, PARSER_CONFIG); will return SqlBabelParserImpl (created from SqlBabelParserImplFactory).

In this PR, I think we are going to use SqlParserImpl (SqlParserImplFactory) instead of SqlBabelParserImpl and that is going to give the same behavior as today because SqlConformance is still set to BABEL ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partially, also because we copied the BABEL config.fmpp content over.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) brief javadoc please

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) javadoc please ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all cases covered by the new test in CalciteSqlCompilerTest.java ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. (actually there's one only, no alternative/branch)

@xiangfu0 xiangfu0 merged commit cb8bcc4 into apache:master Apr 13, 2022
@walterddr walterddr deleted the add_custom_parser branch December 6, 2023 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants