
[SPARK-31102][SQL] Spark-sql fails to parse when contains comment. #27920


Closed
wants to merge 5 commits

Conversation

javierivanov
Contributor

@javierivanov commented Mar 16, 2020

What changes were proposed in this pull request?

This PR resets the insideComment flag to false when a newline is reached, fixing the issue introduced by SPARK-30049.

Why are the changes needed?

Previously on SPARK-30049 a comment containing an unclosed quote produced the following issue:

spark-sql> SELECT 1 -- someone's comment here
         > ;
Error in query: 
extraneous input ';' expecting <EOF>(line 2, pos 0)

== SQL ==
SELECT 1 -- someone's comment here
;
^^^

This was caused because the splitSemiColon method had no flag marking comment sections, so quotes inside comments were not ignored. SPARK-30049 added that flag and fixed the issue, but introduced the following problem:

spark-sql> select
         >   1,
         >   -- two
         >   2;
Error in query:
mismatched input '<EOF>' expecting {'(', 'ADD', 'AFTER', 'ALL', 'ALTER', ...}(line 3, pos 2)
== SQL ==
select
  1,
--^^^

This issue is caused by the insideComment flag not being turned off when a newline is reached.
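A minimal sketch of the idea behind the fix (simplified and illustrative, not the actual SparkSQLCLIDriver.splitSemiColon code; it only tracks single quotes and `--` comments):

```scala
// Simplified splitter: split a buffer of SQL text on ';', ignoring semicolons
// and quotes that appear inside single-quoted strings or "--" comments.
// The key case is '\n': it turns insideComment back off, so an unclosed quote
// inside a comment cannot leak past the end of that line.
def splitSemiColon(line: String): Seq[String] = {
  val statements = scala.collection.mutable.ArrayBuffer.empty[String]
  val current = new StringBuilder
  var insideSingleQuote = false
  var insideComment = false
  var i = 0
  while (i < line.length) {
    val c = line.charAt(i)
    if (c == '\'' && !insideComment) {
      insideSingleQuote = !insideSingleQuote
      current.append(c)
    } else if (c == '-' && !insideSingleQuote && !insideComment &&
               i + 1 < line.length && line.charAt(i + 1) == '-') {
      insideComment = true
      current.append(c)
    } else if (c == '\n') {
      insideComment = false // the fix: a "--" comment ends at the newline
      current.append(c)
    } else if (c == ';' && !insideSingleQuote && !insideComment) {
      statements += current.toString
      current.clear()
    } else {
      current.append(c)
    }
    i += 1
  }
  if (current.nonEmpty) statements += current.toString
  statements.toSeq
}

splitSemiColon("SELECT 1 -- someone's comment here\n;") // Seq("SELECT 1 -- someone's comment here\n")
splitSemiColon("select\n  1,\n  -- two\n  2;")          // Seq("select\n  1,\n  -- two\n  2")
```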

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • For the existing tests that use line continuation (\), a line-continuation rule was added to the SqlBase.g4 file so that the SQL parser supports it.
  • A new test for inline comments was added.

@maropu
Member

maropu commented Mar 16, 2020

ok to test

@SparkQA

SparkQA commented Mar 16, 2020

Test build #119825 has finished for PR 27920 at commit d69d271.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"""SELECT concat('test', 'comment') -- someone's comment here \\
| comment continues here with single ' quote \\
| extra ' \\
| ;""".stripMargin -> "testcomment"
Member

Why did you remove the existing tests instead of adding new ones?

Contributor Author

Hey @maropu !
The SQL parser does not recognize line-continuity per se.

scala> sql(s"""SELECT concat('test', 'comment') -- someone's comment here \\\ncomment continues here with single ' quote \\\nextra ' \\""")
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'continues' expecting {<EOF>, ',', 'CLUSTER', 'DISTRIBUTE', 'EXCEPT', 'FROM', 'GROUP', 'HAVING', 'INTERSECT', 'LATERAL', 'LIMIT', 'ORDER', 'MINUS', 'SORT', 'UNION', 'WHERE', 'WINDOW', '-'}(line 2, pos 8)

== SQL ==
SELECT concat('test', 'comment') -- someone's comment here \
comment continues here with single ' quote \
--------^^^
extra ' \

It works just fine for inline comments that include a backslash:

scala> sql(s"""SELECT concat('test', 'comment') -- someone's comment here \\\n,2""") show
+---------------------+---+
|concat(test, comment)|  2|
+---------------------+---+
|          testcomment|  2|
+---------------------+---+

But it does not work when the backslash is outside the inline comment:

 sql(s"""SELECT concat('test', 'comment') -- someone's comment here \n,2\\\n""")
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '\' expecting <EOF>(line 2, pos 2)

== SQL ==
SELECT concat('test', 'comment') -- someone's comment here
,2\
--^^^

Previously this worked only because of this very bug: the insideComment flag ignored everything until the end of the string. The Spark SQL parser itself does not recognize the backslashes. Line continuation could be added to the CLI, but I think that feature should be added directly to the SQL parser to avoid confusion.

Let me know your thoughts 👍

Member

@maropu Mar 17, 2020

If we can, the fix in SqlBase.g4 (SIMPLE_COMMENT) looks fine to me, and I think the queries above should work in Spark SQL: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L1811 Could you try?

Contributor Author

@maropu I have added the fix. Let me know what you think :)

@maropu
Member

maropu commented Mar 16, 2020

cc: @wangyum

@maropu
Member

maropu commented Apr 9, 2020

@javierivanov Any update?

@javierivanov
Contributor Author

@maropu I am extremely sorry, I will commit soon :)

@SparkQA

SparkQA commented Apr 13, 2020

Test build #121162 has finished for PR 27920 at commit 440dcbd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Apr 13, 2020

retest this please

@SparkQA

SparkQA commented Apr 13, 2020

Test build #121181 has finished for PR 27920 at commit 440dcbd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 13, 2020

Test build #121211 has finished for PR 27920 at commit 0571f21.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Apr 14, 2020

retest this please

@maropu
Member

maropu commented Apr 14, 2020

Could you check this? @wangyum

@SparkQA

SparkQA commented Apr 14, 2020

Test build #121243 has finished for PR 27920 at commit 0571f21.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Apr 14, 2020

retest this please

@SparkQA

SparkQA commented Apr 14, 2020

Test build #121260 has finished for PR 27920 at commit 0571f21.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@javierivanov
Contributor Author

ping @wangyum @maropu :)

@@ -1814,7 +1814,7 @@ fragment LETTER
;

SIMPLE_COMMENT
- : '--' ~[\r\n]* '\r'? '\n'? -> channel(HIDDEN)
+ : '--' ('\\\n' | ~[\r\n])* '\r'? '\n'? -> channel(HIDDEN)
Member

Ur, one more comment; could you add tests in sql-tests/inputs/comments.sql, too?
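For reference, roughly what the updated SIMPLE_COMMENT rule allows, assuming a local SparkSession (the session setup below is illustrative, not part of this PR):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; any existing SparkSession behaves the same way.
val spark = SparkSession.builder().master("local[1]").appName("comment-continuation").getOrCreate()

// In an ordinary Scala string, "\\\n" is a literal backslash followed by a newline,
// so with the new rule the "--" comment continues onto the second line and the
// parser only sees `SELECT 1`.
spark.sql("SELECT 1 -- comment continues \\\nonto the next line").show()
// prints a single row with the value 1

spark.stop()
```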

@maropu
Member

maropu commented May 6, 2020

retest this please

@maropu
Member

maropu commented May 6, 2020

cc: @cloud-fan

runCliWithin(1.minute)(
"""SELECT concat('test', 'comment') -- someone's comment here \\
Contributor

so double-slash doesn't work any more?

Contributor Author

That was a mistake in the earlier test: Scala multi-line (triple-quoted) strings do not process escape sequences, so the \\ there was two literal backslashes rather than a single escaped backslash.
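A quick REPL check of the escaping behavior (illustrative, not part of the patch):

```scala
// Triple-quoted strings do not process escape sequences, so "\\" stays as two
// literal backslash characters; in an ordinary string it collapses to one.
"""a \\ b""".length // 6: 'a', ' ', '\', '\', ' ', 'b'
"a \\ b".length     // 5: 'a', ' ', '\', ' ', 'b'
```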

val plan = table("a").select(star())
assertEqual("-- single comment\nSELECT * FROM a", plan)
}

test("single comment case two") {
val plan = table("a").select(star())
assertEqual("-- single comment\\\nwith line continuity\nSELECT * FROM a", plan)
Contributor

How should \\\n be interpreted? As an escaped backslash and a newline character?

Contributor Author

That's correct. In an ordinary (non-triple-quoted) string literal the backslash needs to be escaped, so \\\n is an escaped backslash followed by a newline.
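For example (illustrative REPL check, not part of the patch):

```scala
// "\\\n" in an ordinary string literal is two characters: a backslash and a newline.
"\\\n".toList // List('\\', '\n')

// So the test input above is a comment ending in a backslash, a continuation line,
// and then the actual query:
"-- single comment\\\nwith line continuity\nSELECT * FROM a".split("\n").toList
// List("-- single comment\", "with line continuity", "SELECT * FROM a")
```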

@SparkQA

SparkQA commented May 7, 2020

Test build #122383 has finished for PR 27920 at commit 0571f21.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

cloud-fan commented May 12, 2020

thanks, merging to master!

@cloud-fan cloud-fan closed this in 178ca96 May 12, 2020
@cloud-fan
Contributor

It conflicts with 3.0. @javierivanov, can you open a new PR for 3.0? Thanks!

@maropu
Member

maropu commented May 17, 2020

@javierivanov kindly ping: #27920 (comment)

javierivanov added a commit to javierivanov/spark that referenced this pull request May 18, 2020
This PR resets the insideComment flag to false when a newline is reached, fixing the issue introduced by SPARK-30049.


Closes apache#27920 from javierivanov/SPARK-31102.

Authored-by: Javier Fuentes <j.fuentes.m@icloud.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
maropu pushed a commit that referenced this pull request May 18, 2020
This PR resets the insideComment flag to false when a newline is reached, fixing the issue introduced by SPARK-30049.

Backport to 3.0 from #27920


Closes #27920 from javierivanov/SPARK-31102.

Authored-by: Javier Fuentes <j.fuentes.m@icloud.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>


Closes #28565 from javierivanov/SPARK-3.0-31102.

Authored-by: Javier Fuentes <j.fuentes.m@icloud.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
turboFei pushed a commit to turboFei/spark that referenced this pull request Jan 5, 2021
This PR resets the insideComment flag to false when a newline is reached, fixing the issue introduced by SPARK-30049.


Closes apache#27920 from javierivanov/SPARK-31102.

Authored-by: Javier Fuentes <j.fuentes.m@icloud.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>