Skip to content

Java: Improve Regex flag parsing #15244

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 16, 2024

Conversation

Marcono1234
Copy link
Contributor

Fixes:

  • Flag d not being recognized
  • Syntax for disabling flags (-) not being recognized
  • Non-capturing group with flags erroneously containing : as literal

Any feedback is appreciated! Especially if I overlooked or failed to consider something.

I am not sure if this change is also relevant for the regex libraries of the other languages.

Fixes:
- Flag `d` not being recognized
- Syntax for disabling flags (`-`) not being recognized
- Non-capturing group with flags erroneously containing `:` as literal
@@ -930,13 +941,13 @@ class Regex extends RegexString {

/**
* Gets a mode (if any) of this regular expression. Can be any of:
* DEBUG
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this DEBUG because it does not seem to be a possible value.

@@ -510,9 +513,15 @@ abstract class RegexString extends StringLiteral {
* ```
*/
private predicate flag(string c) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably out of scope, but this predicate might not behave as desired when the flags of a group don't affect the whole pattern, e.g. for a(?i:b) it would have the flag i as result even though this only affects the b.

Copy link
Contributor

@atorralba atorralba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! This LGTM overall, and it doesn't seem to cause significant alert result changes so it's a pretty safe change to make.

I have a question though. The inclusion of the d flag and the handling of : in capturing groups with flags are useful additions, but why include the handling of -? As I see it, its addition to modeCharacter only makes it necessary to exclude it in the flag predicate, but does nothing else. Can you explain if I'm missing something?

@Marcono1234
Copy link
Contributor Author

but why include the handling of -?

Without it parsing does not consider the - and any subsequent flag chars to be flags, but instead treats them as literal chars. For example when removing - handling for flagGroupStartNoModes and modeCharacter you get this incorrect test output diff for RegexParseTests.ql:

--- expected
+++ actual
@@ -1,4 +1,70 @@
 parseFailures
+| Test.java:22:9:22:85 | "(?idmsuxU-idmsuxU)a+(?-idmsuxU)b+(?idmsuxU:c)d+(?-idmsuxU:e)f+(?idmsuxU:)g+" | 0 |
+ ... (at lot more parse failures for this line)
@@ -209,25 +275,54 @@
 | Test.java:21:62:21:62 | b | [RegExpConstant,RegExpNormalChar] |
 | Test.java:21:64:21:64 | b | [RegExpConstant,RegExpNormalChar] |
 | Test.java:21:66:21:66 | b | [RegExpConstant,RegExpNormalChar] |
-| Test.java:22:10:22:27 | (?idmsuxU-idmsuxU) | [RegExpZeroWidthMatch] |
-| Test.java:22:10:22:84 | (?idmsuxU-idmsuxU)a+(?-idmsuxU)b+(?idmsuxU:c)d+(?-idmsuxU:e)f+(?idmsuxU:)g+ | [RegExpSequence] |
+| Test.java:22:10:22:27 | (?idmsuxU-idmsuxU) | [RegExpGroup] |
+| Test.java:22:19:22:19 | - | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:19:22:26 | -idmsuxU | [RegExpSequence] |
+| Test.java:22:20:22:20 | i | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:21:22:21 | d | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:22:22:22 | m | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:23:22:23 | s | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:24:22:24 | u | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:25:22:25 | x | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:26:22:26 | U | [RegExpConstant,RegExpNormalChar] |
 | Test.java:22:28:22:28 | a | [RegExpConstant,RegExpNormalChar] |
 | Test.java:22:28:22:29 | a+ | [RegExpPlus] |
-| Test.java:22:30:22:40 | (?-idmsuxU) | [RegExpZeroWidthMatch] |
+| Test.java:22:32:22:32 | - | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:33:22:33 | i | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:34:22:34 | d | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:35:22:35 | m | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:36:22:36 | s | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:37:22:37 | u | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:38:22:38 | x | [RegExpConstant,RegExpNormalChar] |
+| Test.java:22:39:22:39 | U | [RegExpConstant,RegExpNormalChar] |
 | Test.java:22:41:22:41 | b | [RegExpConstant,RegExpNormalChar] |
...

Copy link
Contributor

@atorralba atorralba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I assumed everything until the end of the group would get ignored, but apparently not. Thanks!

@atorralba
Copy link
Contributor

atorralba commented Jan 16, 2024

Oh, and about this:

I am not sure if this change is also relevant for the regex libraries of the other languages.

Most of the regex parsing logic should go into a shared library eventually, so hopefully all languages will get synced (including these fixes) when that happens.

@atorralba atorralba merged commit 2246c96 into github:main Jan 16, 2024
@Marcono1234 Marcono1234 deleted the marcono1234/regex-flags branch January 16, 2024 21:03
@Marcono1234
Copy link
Contributor Author

Most of the regex parsing logic should go into a shared library eventually, so hopefully all languages will get synced (including these fixes) when that happens.

I had a look at the Python implementation now nonetheless since the Java CodeQL implementation seems to be based on it. And I have created similar changes there, #15345. I hope this is ok for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants