
Adding DML definition and parse SQL InsertFile #8557

Merged
xiangfu0 merged 3 commits into apache:master from insert_into_dml on Apr 30, 2022

Conversation

xiangfu0
Contributor

@xiangfu0 xiangfu0 commented Apr 16, 2022

Adding DML definition and support for the INSERT INTO table FROM inputDirURI statement:

  1. Parse the query to extract the table name and the directory URI, along with a list of options for the ingestion job.
  2. Call the controller minion task execution API endpoint to schedule the task on a minion.
  3. The response contains the table name and the task job id.

Syntax:

INSERT INTO [database.]table FROM FILE dataDirURI OPTION ( k=v ) [, OPTION (k=v)]*

E.g.

INSERT INTO "baseballStats"
FROM FILE 's3://my-bucket/public_data_set/baseballStats/rawdata/'
OPTION(taskName=myTask-s3)
OPTION(input.fs.className=org.apache.pinot.plugin.filesystem.S3PinotFS)
OPTION(input.fs.prop.accessKey=my-key)
OPTION(input.fs.prop.secretKey=my-secret)
OPTION(input.fs.prop.region=us-west-2)

Sample: (screenshot of the query console response omitted)
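
For illustration, here is a minimal sketch of submitting such a statement programmatically. The controller address, the POST /sql endpoint, and the {"sql": ...} payload shape are assumptions made for this example, not something stated in this PR description:

    // Hedged sketch only: submit the INSERT INTO ... FROM FILE statement over HTTP.
    // The endpoint path ("/sql"), payload field ("sql"), and port 9000 are assumptions.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class InsertFromFileExample {
      public static void main(String[] args) throws Exception {
        String statement = "INSERT INTO \"baseballStats\" "
            + "FROM FILE 's3://my-bucket/public_data_set/baseballStats/rawdata/' "
            + "OPTION(taskName=myTask-s3) "
            + "OPTION(input.fs.className=org.apache.pinot.plugin.filesystem.S3PinotFS)";
        String payload = "{\"sql\": \"" + statement.replace("\"", "\\\"") + "\"}";
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9000/sql"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(payload))
            .build();
        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        // Per the description above, the response is expected to carry the table name and the task job id.
        System.out.println(response.body());
      }
    }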

@xiangfu0 xiangfu0 marked this pull request as draft April 16, 2022 10:39
@codecov-commenter

codecov-commenter commented Apr 16, 2022

Codecov Report

Merging #8557 (0b57e82) into master (fc12dce) will increase coverage by 2.85%.
The diff coverage is 42.24%.

@@             Coverage Diff              @@
##             master    #8557      +/-   ##
============================================
+ Coverage     63.60%   66.45%   +2.85%     
+ Complexity     4322     4235      -87     
============================================
  Files          1648     1287     -361     
  Lines         86917    64697   -22220     
  Branches      13265    10139    -3126     
============================================
- Hits          55281    42996   -12285     
+ Misses        27595    18575    -9020     
+ Partials       4041     3126     -915     
Flag Coverage Δ
unittests1 66.45% <42.24%> (-0.06%) ⬇️
unittests2 ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...a/org/apache/pinot/common/minion/MinionClient.java 60.37% <0.00%> (-17.68%) ⬇️
...e/pinot/common/minion/MinionRequestURLBuilder.java 34.78% <0.00%> (-1.59%) ⬇️
...l/parsers/dml/DataManipulationStatementParser.java 0.00% <0.00%> (ø)
...inot/core/query/executor/sql/SqlQueryExecutor.java 0.00% <0.00%> (ø)
...va/org/apache/pinot/spi/utils/CommonConstants.java 21.31% <ø> (ø)
...org/apache/pinot/sql/parsers/CalciteSqlParser.java 85.77% <63.15%> (-1.08%) ⬇️
...g/apache/pinot/sql/parsers/dml/InsertIntoFile.java 85.18% <85.18%> (ø)
...ava/org/apache/pinot/sql/parsers/PinotSqlType.java 100.00% <100.00%> (ø)
...rg/apache/pinot/sql/parsers/SqlNodeAndOptions.java 100.00% <100.00%> (ø)
...not/sql/parsers/dml/DataManipulationStatement.java 100.00% <100.00%> (ø)
... and 372 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc12dce...0b57e82.

@xiangfu0 xiangfu0 force-pushed the insert_into_dml branch 4 times, most recently from f955d3b to 0cee37f on April 16, 2022 21:26
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

High-level question: should we reuse the current SQL endpoint (HTTP GET), or introduce a new one? How do other systems handle it?

httpHeaders.getRequestHeaders().entrySet().stream().filter(entry -> !entry.getValue().isEmpty())
.map(entry -> Pair.of(entry.getKey(), entry.getValue().get(0)))
.collect(Collectors.toMap(Pair::getKey, Pair::getValue));
return _sqlQueryExecutor.executeDMLStatement(sqlNodeAndOptions, headers).toJsonString();
Contributor

Since this is executed on the controller, we can skip sending the request again.

Contributor Author

DML and DDL statements are not latency sensitive, so it might be OK to have an extra route but keep the code path clean.

@xiangfu0 xiangfu0 added the release-notes (Referenced by PRs that need attention when compiling the next release notes) and feature labels Apr 16, 2022
@xiangfu0
Contributor Author

High-level question: should we reuse the current SQL endpoint (HTTP GET), or introduce a new one? How do other systems handle it?

For most databases, there is just one endpoint for executing all statements, so I feel we should just extend the existing SQL endpoint.

@xiangfu0 xiangfu0 force-pushed the insert_into_dml branch 5 times, most recently from 6bf3bd1 to e457243 on April 18, 2022 08:07
Contributor

@walterddr walterddr left a comment

Looks good, although I am not sure it is a good idea to encode the AWS key/token in plain text. Any chance we can do it differently?

Comment on lines +219 to +225
case DQL:
return _requestHandler.handleRequest(sqlRequestJson, httpRequesterIdentity, new RequestStatistics());
Contributor

This means all the parsing done so far is not used; instead we send the request to the broker and parse it again?

Contributor Author

Yes. Reusing the SqlNode would require refactoring the request handler's interface, which I don't want to change for now.

sql = removeTerminatingSemicolon(sql);

// Extract OPTION statements from sql as Calcite Parser doesn't parse it.
List<String> options = extractOptionsFromSql(sql);
Contributor

This can be cleaned up by extending parser.jj to allow the OPTION keyword, but that can be done later.

Contributor Author

Yeah, since we already have the OPTIONS handler and are able to parse the options out, I won't complain.
Eventually, this should all move into the parser.
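
As a rough, illustrative sketch of the option-stripping approach being discussed (the class and method names below are made up for illustration; the actual implementation in this PR may differ in details such as quoting):

    // Illustrative only: pull OPTION(k=v) clauses out of a SQL string with a regex,
    // since the Calcite parser does not understand the OPTION keyword.
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class OptionExtractionSketch {
      private static final Pattern OPTION_PATTERN =
          Pattern.compile("OPTION\\s*\\(([^)]+)\\)", Pattern.CASE_INSENSITIVE);

      public static Map<String, String> extractOptions(String sql) {
        Map<String, String> options = new LinkedHashMap<>();
        Matcher matcher = OPTION_PATTERN.matcher(sql);
        while (matcher.find()) {
          String[] kv = matcher.group(1).split("=", 2);
          if (kv.length == 2) {
            options.put(kv[0].trim(), kv[1].trim());
          }
        }
        return options;
      }
    }

For the INSERT INTO "baseballStats" example above, this would yield entries such as taskName=myTask-s3 and input.fs.prop.region=us-west-2.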

@xiangfu0
Contributor Author

Looks good, although I am not sure it is a good idea to encode the AWS key/token in plain text. Any chance we can do it differently?

Users will definitely type it in plain text and send it over using the query console. We should mask those internally, similar to the current recurring minion tasks.

@xiangfu0 xiangfu0 marked this pull request as ready for review April 19, 2022 23:12
@amrishlal
Contributor

Looks good, but it would be useful to have a few unit and/or integration tests. Also, the AccessControl.hasAccess(...) methods are currently being called in BaseBrokerRequestHandler:

Line 189:
    // First-stage access control to prevent unauthenticated requests from using up resources. Secondary table-level
    // check comes later.
    boolean hasAccess = _accessControlFactory.create().hasAccess(requesterIdentity);

Line 295:
    // Second-stage table-level access control
    boolean hasTableAccess = _accessControlFactory.create().hasAccess(requesterIdentity, serverBrokerRequest);

I am wondering if something similar can be done for all DML/DDL statements as well, to avoid either accidental or malicious use cases where a user who normally runs SELECT statements ends up running an INSERT statement with valid S3 credentials to insert bad data into a Pinot table.
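
For what it's worth, a hedged sketch of what such a pre-execution guard could look like for DML. Only AccessControlFactory.create().hasAccess(requesterIdentity) is taken from the existing handler quoted above; the method itself and the exception wiring are hypothetical, and the javax.ws.rs imports are elided:

    // Hypothetical guard mirroring the broker's first-stage check, applied before executing DML.
    void checkDmlAccess(AccessControlFactory accessControlFactory, RequesterIdentity requesterIdentity) {
      if (!accessControlFactory.create().hasAccess(requesterIdentity)) {
        throw new WebApplicationException("Permission denied for DML statements",
            Response.Status.FORBIDDEN);
      }
    }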

@jackjlli
Member

I wonder if LOAD DATA makes more sense, as INSERT is often followed by a list of concrete values.

Here is the reference from MySQL:
https://dev.mysql.com/doc/refman/8.0/en/load-data.html
https://dev.mysql.com/doc/refman/8.0/en/insert.html

@xiangfu0
Contributor Author

xiangfu0 commented Apr 20, 2022

I wonder if LOAD DATA makes more sense, as INSERT is often followed by a list of concrete values.

Here is the reference from MySQL: https://dev.mysql.com/doc/refman/8.0/en/load-data.html https://dev.mysql.com/doc/refman/8.0/en/insert.html

Agreed, using LOAD DATA makes more sense.

I was trying to be on par with other technologies (https://clickhouse.com/docs/en/sql-reference/statements/insert-into/#inserting-data-from-a-file). We can also support similar semantics, but LOAD DATA should be the right syntax to support and document.

cc: @walterddr

@xiangfu0
Contributor Author

I wonder if LOAD DATA makes more sense, as INSERT is often followed by a list of concrete values.

Here is the reference from MySQL: https://dev.mysql.com/doc/refman/8.0/en/load-data.html https://dev.mysql.com/doc/refman/8.0/en/insert.html

Will update the PR to change to LOAD DATA semantics.

@walterddr
Contributor

walterddr commented Apr 20, 2022

I think INSERT INTO makes more sense for several reasons:

  1. There's no ANSI standard for loading data from static objects; it is a DML statement specific to each database system.
  2. INSERT INTO ... FROM xx builds on the default SqlInsert syntactic extension of the INSERT statement, which is an extensible statement type in the same way that CREATE xxx is an extensible DDL statement; I don't think there's a statement extension based on a LOAD xxx keyword.
  3. INSERT INTO ... (SELECT * FROM MYTABLE) is actually valid syntax in many databases.

[edit] Just to clarify: I wasn't opposing adding LOAD FILE; it can very much be added to the DML syntax as an alternative way to write SQL that achieves the same thing, as long as the syntax is modeled the same way as in this PR.

@walterddr
Contributor

To add to it: LOAD FILE seems to be a MySQL-only feature.
For example, PostgreSQL uses COPY:

COPY table_name [ ( column_name [, ...] ) ]
    FROM { 'filename' | PROGRAM 'command' | STDIN }
    [ [ WITH ] ( option [, ...] ) ]
    [ WHERE condition ]

where the FROM { 'filename' } part represents an external table from a file.

@xiangfu0 xiangfu0 force-pushed the insert_into_dml branch 3 times, most recently from c11373c to 4db99f1 on April 25, 2022 03:50
@xiangfu0
Contributor Author

I wonder if LOAD DATA makes more sense, as INSERT is often followed by a list of concrete values.

Here is the reference from MySQL: https://dev.mysql.com/doc/refman/8.0/en/load-data.html https://dev.mysql.com/doc/refman/8.0/en/insert.html

Talked with @walterddr offline and we feel we can keep the current syntax for now. We can support the MySQL syntax later.

@xiangfu0 xiangfu0 force-pushed the insert_into_dml branch 2 times, most recently from 7321e78 to 7db9c13 on April 25, 2022 08:15
@Jackie-Jiang
Contributor

High-level question: should we reuse the current SQL endpoint (HTTP GET), or introduce a new one? How do other systems handle it?

For most databases, there is just one endpoint for executing all statements, so I feel we should just extend the existing SQL endpoint.

Agree that we should use one single endpoint to handle all statements. My concern is that the single SQL endpoint should probably be a POST request instead of the current GET request. I'd suggest adding a new POST endpoint for all statements, and keeping the current GET endpoint, which can only be used to query data.

@xiangfu0
Contributor Author

High-level question: should we reuse the current SQL endpoint (HTTP GET), or introduce a new one? How do other systems handle it?

For most databases, there is just one endpoint for executing all statements, so I feel we should just extend the existing SQL endpoint.

Agree that we should use one single endpoint to handle all statements. My concern is that the single SQL endpoint should probably be a POST request instead of the current GET request. I'd suggest adding a new POST endpoint for all statements, and keeping the current GET endpoint, which can only be used to query data.

The GET endpoint is not used much; so far I think only the pinot-admin Query command is still using the GET API, and all the other clients use POST.

From an endpoint perspective, do you mean we should create a new POST endpoint, say /sql/v2, for all SQL semantics and keep the existing endpoint just for DQL?

@Jackie-Jiang
Contributor

High-level question: should we reuse the current SQL endpoint (HTTP GET), or introduce a new one? How do other systems handle it?

For most databases, there is just one endpoint for executing all statements, so I feel we should just extend the existing SQL endpoint.

Agree that we should use one single endpoint to handle all statements. My concern is that the single SQL endpoint should probably be a POST request instead of the current GET request. I'd suggest adding a new POST endpoint for all statements, and keeping the current GET endpoint, which can only be used to query data.

The GET endpoint is not used much; so far I think only the pinot-admin Query command is still using the GET API, and all the other clients use POST.

From an endpoint perspective, do you mean we should create a new POST endpoint, say /sql/v2, for all SQL semantics and keep the existing endpoint just for DQL?

Oh, I just realized we already have the POST endpoint. What I meant is to make POST (reusing the existing one) accept all SQL semantics, and keep GET accepting only DQL. We may add a check in the GET endpoint to prevent users from using GET to change the table.

@xiangfu0 xiangfu0 force-pushed the insert_into_dml branch 3 times, most recently from 6c6e9ee to 34987e5 on April 28, 2022 19:45
@xiangfu0
Contributor Author

High-level question: should we reuse the current SQL endpoint (HTTP GET), or introduce a new one? How do other systems handle it?

For most databases, there is just one endpoint for executing all statements, so I feel we should just extend the existing SQL endpoint.

Agree that we should use one single endpoint to handle all statements. My concern is that the single SQL endpoint should probably be a POST request instead of the current GET request. I'd suggest adding a new POST endpoint for all statements, and keeping the current GET endpoint, which can only be used to query data.

The GET endpoint is not used much; so far I think only the pinot-admin Query command is still using the GET API, and all the other clients use POST.
From an endpoint perspective, do you mean we should create a new POST endpoint, say /sql/v2, for all SQL semantics and keep the existing endpoint just for DQL?

Oh, I just realized we already have the POST endpoint. What I meant is to make POST (reusing the existing one) accept all SQL semantics, and keep GET accepting only DQL. We may add a check in the GET endpoint to prevent users from using GET to change the table.

Done; the GET API now throws an exception when the statement is not DQL.
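
For reference, the kind of guard described here might look roughly like the following (the accessor name on SqlNodeAndOptions is assumed for illustration, not quoted from the PR):

    // Illustrative sketch of restricting the GET query API to DQL; exact accessor names may differ.
    PinotSqlType sqlType = sqlNodeAndOptions.getSqlType();
    if (sqlType != PinotSqlType.DQL) {
      throw new UnsupportedOperationException(
          "Unsupported SQL type: " + sqlType + ", the GET query API only supports DQL statements");
    }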

@xiangfu0 xiangfu0 force-pushed the insert_into_dml branch 4 times, most recently from 0b57e82 to 4ec4d4e on April 29, 2022 23:30
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

LGTM

@xiangfu0 xiangfu0 force-pushed the insert_into_dml branch 3 times, most recently from 1b81234 to 682f25a on April 30, 2022 02:44
@xiangfu0 xiangfu0 merged commit b5690a7 into apache:master Apr 30, 2022
@xiangfu0 xiangfu0 deleted the insert_into_dml branch April 30, 2022 10:57
WangCHX pushed a commit to WangCHX/pinot that referenced this pull request May 5, 2022
* Adding DML definition and parse sql

* Use MinionClient for execute async task

* Add integration tests
Labels: feature, release-notes
6 participants