-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Java: Improve Regex flag parsing #15244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Fixes: - Flag `d` not being recognized - Syntax for disabling flags (`-`) not being recognized - Non-capturing group with flags erroneously containing `:` as literal
@@ -930,13 +941,13 @@ class Regex extends RegexString { | |||
|
|||
/** | |||
* Gets a mode (if any) of this regular expression. Can be any of: | |||
* DEBUG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this DEBUG
because it does not seem to be a possible value.
@@ -510,9 +513,15 @@ abstract class RegexString extends StringLiteral { | |||
* ``` | |||
*/ | |||
private predicate flag(string c) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably out of scope, but this predicate might not behave as desired when the flags of a group don't affect the whole pattern, e.g. for a(?i:b)
it would have the flag i
as result even though this only affects the b
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution! This LGTM overall, and it doesn't seem to cause significant alert result changes so it's a pretty safe change to make.
I have a question though. The inclusion of the d
flag and the handling of :
in capturing groups with flags are useful additions, but why include the handling of -
? As I see it, its addition to modeCharacter
only makes it necessary to exclude it in the flag
predicate, but does nothing else. Can you explain if I'm missing something?
Without it parsing does not consider the --- expected
+++ actual
@@ -1,4 +1,70 @@
parseFailures
+| Test.java:22:9:22:85 | "(?idmsuxU-idmsuxU)a+(?-idmsuxU)b+(?idmsuxU:c)d+(?-idmsuxU:e)f+(?idmsuxU:)g+" | 0 |
+ ... (at lot more parse failures for this line)
@@ -209,25 +275,54 @@
| Test.java:21:62:21:62 | b | [RegExpConstant,RegExpNormalChar] |
| Test.java:21:64:21:64 | b | [RegExpConstant,RegExpNormalChar] |
| Test.java:21:66:21:66 | b | [RegExpConstant,RegExpNormalChar] |
-| Test.java:22:10:22:27 | (?idmsuxU-idmsuxU) | [RegExpZeroWidthMatch] |
-| Test.java:22:10:22:84 | (?idmsuxU-idmsuxU)a+(?-idmsuxU)b+(?idmsuxU:c)d+(?-idmsuxU:e)f+(?idmsuxU:)g+ | [RegExpSequence] |
+| Test.java:22:10:22:27 | (?idmsuxU-idmsuxU) | [RegExpGroup] |
+| Test.java:22:19:22:19 | - | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:19:22:26 | -idmsuxU | [RegExpSequence] |
+| Test.java:22:20:22:20 | i | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:21:22:21 | d | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:22:22:22 | m | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:23:22:23 | s | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:24:22:24 | u | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:25:22:25 | x | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:26:22:26 | U | [RegExpConstant,RegExpNormalChar] |
| Test.java:22:28:22:28 | a | [RegExpConstant,RegExpNormalChar] |
| Test.java:22:28:22:29 | a+ | [RegExpPlus] |
-| Test.java:22:30:22:40 | (?-idmsuxU) | [RegExpZeroWidthMatch] |
+| Test.java:22:32:22:32 | - | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:33:22:33 | i | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:34:22:34 | d | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:35:22:35 | m | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:36:22:36 | s | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:37:22:37 | u | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:38:22:38 | x | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:39:22:39 | U | [RegExpConstant,RegExpNormalChar] |
| Test.java:22:41:22:41 | b | [RegExpConstant,RegExpNormalChar] |
... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. I assumed everything until the end of the group would get ignored, but apparently not. Thanks!
Oh, and about this:
Most of the regex parsing logic should go into a shared library eventually, so hopefully all languages will get synced (including these fixes) when that happens. |
I had a look at the Python implementation now nonetheless since the Java CodeQL implementation seems to be based on it. And I have created similar changes there, #15345. I hope this is ok for now. |
Fixes:
d
not being recognized-
) not being recognized:
as literalAny feedback is appreciated! Especially if I overlooked or failed to consider something.
I am not sure if this change is also relevant for the regex libraries of the other languages.