Add support for character classes `[...]` #250

StefanosChaliasos · 2025-03-06T09:48:35Z

Fixes #249

It also makes the quantifiers more expressive: they now support {,4}, {4}, {1,3}, and {1,} instead of just {1,3} and {1,}.
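For readers unfamiliar with the four quantifier forms, here is a minimal sketch of how their bounds could be read. It is not the PR's implementation: `parse_quantifier` is a hypothetical helper, and it assumes the same semantics as Python's `re` module, where an omitted lower bound means 0 and an omitted upper bound means unbounded.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper, not part of the automata library.
_QUANTIFIER = re.compile(r"\{(\d*)(?:(,)(\d*))?\}")


def parse_quantifier(text: str) -> Tuple[int, Optional[int]]:
    """Return (lower, upper) for a quantifier; upper is None when unbounded."""
    match = _QUANTIFIER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a quantifier: {text!r}")
    lower_str, comma, upper_str = match.groups()
    if comma is None:
        # Form {n}: exactly n repetitions.
        if not lower_str:
            raise ValueError(f"empty quantifier: {text!r}")
        count = int(lower_str)
        return count, count
    # Forms {m,n}, {m,} and {,n}: a missing bound falls back to its default.
    lower = int(lower_str) if lower_str else 0
    upper = int(upper_str) if upper_str else None
    if upper is not None and lower > upper:
        raise ValueError(f"lower bound exceeds upper bound: {text!r}")
    return lower, upper
```

Under these assumptions, `{,4}` reads as "zero to four" and `{1,}` as "one or more".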

eliotwrobson · 2025-03-06T16:38:08Z

@StefanosChaliasos thanks for this contribution! A bit busy today but I'll try to review this at some point in the evening.

coveralls · 2025-03-06T16:40:12Z

coverage: 98.008% (-1.6%) from 99.613%
when pulling df53edc on StefanosChaliasos:add-support-for-charclass
into 1bdf9b7 on caleb531:develop.


eliotwrobson

It looks like some dead code was included accidentally? Otherwise the logic itself seems fine.

eliotwrobson · 2025-03-06T19:00:48Z

automata/regex/parser.py

+        if is_negated:
+            expanded_content = expanded_content[1:]  # Remove ^ from the content
+
+        return cls(match.group(), expanded_content, is_negated)


It looks like there are some missing fields that are used in the constructor? Shouldn't this line throw an exception?

EDIT: Based on the coverage report, this line isn't being hit at all.

eliotwrobson · 2025-03-06T19:05:00Z

tests/test_regex.py

+        greek_nfa = NFA.from_regex("[Ͱ-Ͽ]+", input_symbols=input_symbols)
+        cyrillic_nfa = NFA.from_regex("[Ѐ-ӿ]+", input_symbols=input_symbols)
+
+        latin_samples = ["¡", "£", "Ā", "ŕ", "ƿ"]


Can we use the \u... notation? This will make these tests easier to maintain (albeit less elegant in the editor).
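A minimal sketch of what the `\u...` notation could look like for these ranges. It uses the standard-library `re` module rather than `NFA.from_regex` so that it runs on its own, and it assumes the literal endpoints in the diff are the Greek (U+0370 to U+03FF) and Cyrillic (U+0400 to U+04FF) block boundaries. The names below are illustrative only.

```python
import re

# Escaped equivalents of the literal ranges used in the test (assumed endpoints).
GREEK_PATTERN = re.compile("[\u0370-\u03ff]+")
CYRILLIC_PATTERN = re.compile("[\u0400-\u04ff]+")

# Latin samples written as escapes instead of literal characters.
LATIN_SAMPLES = ["\u00a1", "\u00a3", "\u0100", "\u0155", "\u01bf"]


def is_greek(text: str) -> bool:
    """Return True if every character of text falls in the Greek block."""
    return GREEK_PATTERN.fullmatch(text) is not None


def is_cyrillic(text: str) -> bool:
    """Return True if every character of text falls in the Cyrillic block."""
    return CYRILLIC_PATTERN.fullmatch(text) is not None
```

The escapes make the code points visible in any editor and survive fonts that lack these glyphs.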

eliotwrobson · 2025-03-06T19:06:24Z

automata/regex/parser.py

+        self.counter = counter
+
+    @classmethod
+    def from_match(cls: Type[Self], match: re.Match) -> Self:


It seems like the logic here heavily overlaps with the process_char_class function. Could one of these be made to call the other?
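One way the overlap could be removed is to have `from_match` delegate to `process_char_class`. The sketch below is an assumption about the shape of such a refactor, not the PR's code: `CharClassToken` is a hypothetical class, and the `process_char_class` shown here is a simplified stand-in that handles only negation and literal characters.

```python
import re
from typing import Set, Tuple


def process_char_class(class_str: str) -> Tuple[bool, Set[str]]:
    """Simplified stand-in: parse '[...]' into (is_negated, characters)."""
    content = class_str[1:-1]  # Strip the surrounding brackets
    is_negated = content.startswith("^")
    if is_negated:
        content = content[1:]  # Remove ^ from the content
    return is_negated, set(content)


class CharClassToken:
    """Hypothetical token type holding a parsed character class."""

    def __init__(self, text: str, chars: Set[str], is_negated: bool) -> None:
        self.text = text
        self.chars = chars
        self.is_negated = is_negated

    @classmethod
    def from_match(cls, match: "re.Match[str]") -> "CharClassToken":
        # All parsing lives in process_char_class; from_match only adapts it.
        is_negated, chars = process_char_class(match.group())
        return cls(match.group(), chars, is_negated)
```

With this layout the parsing rules exist in one place, so a fix to range or escape handling would apply to both entry points.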

eliotwrobson · 2025-03-06T19:09:12Z

automata/regex/parser.py

+        )
+
+    lexer.register_token(
+        character_class_factory,


Would personally prefer to use the from_match syntax the way the other token types are registered, but either syntax is fine. But it seems like the from_match in the new token class isn't being called at all.

eliotwrobson · 2025-03-06T19:10:07Z

automata/regex/parser.py

@@ -577,3 +691,50 @@ def parse_regex(regexstr: str, input_symbols: AbstractSet[str]) -> NFARegexBuild
    postfix = tokens_to_postfix(tokens_with_concats)

    return parse_postfix_tokens(postfix)
+
+
+def process_char_class(class_str: str) -> Tuple[bool, Set[str]]:


Nit: Might be good to have a couple of small test cases for this function independently to aid in debugging later, but won't make any hard requests for this.
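The kind of small, independent test cases suggested here might look like the sketch below. The `process_char_class` body is a simplified assumption written only so the tests can run on their own (negation, literals, and `a-z` style ranges); the real function in the PR also deals with escapes and may behave differently.

```python
from typing import Set, Tuple


def process_char_class(class_str: str) -> Tuple[bool, Set[str]]:
    """Simplified stand-in supporting negation, literals and ranges."""
    content = class_str[1:-1]
    is_negated = content.startswith("^")
    if is_negated:
        content = content[1:]
    chars: Set[str] = set()
    pos = 0
    while pos < len(content):
        # A range needs a start character, a '-', and an end character.
        if pos + 2 < len(content) and content[pos + 1] == "-":
            start, end = ord(content[pos]), ord(content[pos + 2])
            if start > end:
                raise ValueError(f"invalid range in {class_str!r}")
            chars.update(chr(code) for code in range(start, end + 1))
            pos += 3
        else:
            chars.add(content[pos])
            pos += 1
    return is_negated, chars


def test_literals() -> None:
    assert process_char_class("[abc]") == (False, {"a", "b", "c"})


def test_range() -> None:
    assert process_char_class("[a-c]") == (False, {"a", "b", "c"})


def test_negated_mixed() -> None:
    assert process_char_class("[^0-2x]") == (True, {"0", "1", "2", "x"})


def test_trailing_dash_is_literal() -> None:
    assert process_char_class("[a-]") == (False, {"a", "-"})
```

Each test pins down one rule, which keeps a failure easy to trace back to a single branch of the parser.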


StefanosChaliasos · 2025-03-06T19:21:33Z

Will go over everything tomorrow. Thanks a lot for the feedback.


StefanosChaliasos · 2025-03-07T08:54:24Z

I made some more changes; can you review the new ones? Basically, I added support for shorthands (e.g., '\d') and tokenized whitespace. I still need to add more tests and polish the code, so I'll mark the PR as a draft until that's done.

caleb531 · 2025-03-14T19:53:54Z

automata/fa/nfa.py

+        if "\\s" in regex:
+            additional_symbols.update(WHITESPACE_CHARS)
+        if "\\S" in regex:
+            additional_symbols.update(NON_WHITESPACE_CHARS)
+        if "\\d" in regex:
+            additional_symbols.update(DIGIT_CHARS)
+        if "\\D" in regex:
+            additional_symbols.update(NON_DIGIT_CHARS)
+        if "\\w" in regex:
+            additional_symbols.update(WORD_CHARS)
+        if "\\W" in regex:
+            additional_symbols.update(NON_WORD_CHARS)


@StefanosChaliasos Can you please refactor this to use a dict-based lookup table? That would make this much less repetitive.

cc @eliotwrobson
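A dict-based lookup table for the shorthand checks could be sketched as follows. The constant definitions are assumptions made so the example is self-contained (the PR imports its own sets from `automata.regex.parser`), and `symbols_for_shorthands` is a hypothetical helper name. The complement sets here are taken relative to `string.printable`, which may not match the PR.

```python
import string
from typing import Dict, FrozenSet, Set

# Assumed definitions; the PR defines its own versions of these sets.
PRINTABLE = frozenset(string.printable)
WHITESPACE_CHARS = frozenset(string.whitespace)
DIGIT_CHARS = frozenset(string.digits)
WORD_CHARS = frozenset(string.ascii_letters + string.digits + "_")

# One entry per shorthand replaces the chain of near-identical if statements.
SHORTHAND_SYMBOLS: Dict[str, FrozenSet[str]] = {
    "\\s": WHITESPACE_CHARS,
    "\\S": PRINTABLE - WHITESPACE_CHARS,
    "\\d": DIGIT_CHARS,
    "\\D": PRINTABLE - DIGIT_CHARS,
    "\\w": WORD_CHARS,
    "\\W": PRINTABLE - WORD_CHARS,
}


def symbols_for_shorthands(regex: str) -> Set[str]:
    """Collect the symbols implied by every shorthand present in regex."""
    additional_symbols: Set[str] = set()
    for shorthand, chars in SHORTHAND_SYMBOLS.items():
        if shorthand in regex:
            additional_symbols.update(chars)
    return additional_symbols
```

Like the original chain of `if` statements, this uses a plain substring check, so it inherits the same limitation: an escaped backslash followed by `d` would also be treated as `\d`.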

caleb531 · 2025-03-14T19:54:43Z

automata/fa/nfa.py

+        from automata.regex.parser import (
+            DIGIT_CHARS,
+            NON_DIGIT_CHARS,
+            NON_WHITESPACE_CHARS,
+            NON_WORD_CHARS,
+            WHITESPACE_CHARS,
+            WORD_CHARS,
+        )


@StefanosChaliasos Can you please keep all imports at the top of the file? There's no particular need for the tighter scoping here, IMO.

cc @eliotwrobson

caleb531 · 2025-03-14T19:58:45Z

automata/fa/nfa.py

+                        additional_symbols.update(WORD_CHARS)
+                        pos += 2
+                        continue
+                    elif class_content[pos + 1] in "S":


@StefanosChaliasos What is the intention of using `in` here as opposed to `==`? If the right-hand side is just a single character, the only difference that seems to make is permitting class_content[pos + 1] to be the empty string (in addition to the character itself). In other words:

"S" in "S"  # True
"" in "S"  # True
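The difference between the two operators can be checked with a small runnable sketch. The function names are illustrative only. Note also that indexing a Python `str` yields a one-character string or raises `IndexError`, so the empty-string case would not actually arise from `class_content[pos + 1]`.

```python
def matches_with_in(ch: str) -> bool:
    """Substring test: also true for the empty string."""
    return ch in "S"


def matches_with_eq(ch: str) -> bool:
    """Equality test: true only for exactly 'S'."""
    return ch == "S"
```

Since the two only differ on the empty string, `==` states the intent more directly for a single-character comparison.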

caleb531 · 2025-03-14T19:59:30Z

automata/fa/nfa.py

+                        continue
+
+                    # Handle escape sequence in character class
+                    from automata.regex.parser import _handle_escape_sequences


@StefanosChaliasos Can you also please move this import to the top of the file?

caleb531 · 2025-03-14T20:02:09Z

automata/regex/parser.py

@@ -24,10 +25,21 @@
    validate_tokens,
 )

+# Add these at the top of the file to define our shorthand character sets
+ASCII_PRINTABLE_CHARS = frozenset(string.printable)


@eliotwrobson The implication here is that only ASCII characters are deemed as printable characters, but how would that work given that #233 just added support for Unicode characters?
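The concern can be illustrated with the standard library alone: `string.printable` contains only ASCII characters, so a set built from it excludes every non-ASCII symbol. The helper name below is hypothetical, and `str.isprintable()` is shown only as a point of comparison, not as a proposed replacement.

```python
import string

ASCII_PRINTABLE_CHARS = frozenset(string.printable)


def in_ascii_printable(ch: str) -> bool:
    """Return True if ch is in the ASCII-only printable set."""
    return ch in ASCII_PRINTABLE_CHARS


# A Greek letter is printable by Unicode rules but absent from the ASCII set.
GREEK_ALPHA = "\u03b1"
ALPHA_IN_ASCII_SET = in_ascii_printable(GREEK_ALPHA)
ALPHA_IS_PRINTABLE = GREEK_ALPHA.isprintable()
```

Any negated shorthand computed as a complement of this set would therefore never include Unicode input symbols.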

caleb531 · 2025-03-14T20:05:16Z

Hey, @StefanosChaliasos! I left some additional comments on the PR—apologies if they seem nitpicky, but just wanting to maintain solid code quality and consistency for this project.

StefanosChaliasos · 2025-03-14T21:04:29Z

Thanks for the review. I will address the comments once I find some time.

StefanosChaliasos added 4 commits March 6, 2025 11:46

Add support for character classes [...]

7d8c4e1

Also, make the quantifiers more expressive. I.e., now it supports: {,4}, {4}, {1,3}, {1,} instead of just {1,3} and {1,}

Support class characters when no input_symbols are given

7c291b6

Lint fixes and one more test

db5b6eb

Fix issue with reserved characters inside character class

298f559

eliotwrobson changed the base branch from main to develop March 6, 2025 18:52

StefanosChaliasos and others added 2 commits March 6, 2025 21:08

Add support for escaped characters and properly handle special chars like ('\')

ce0a5ac

Add missing annotation

00ac161

eliotwrobson requested changes Mar 6, 2025


Merge branch 'add-support-for-charclass' of https://github.com/StefanosChaliasos/automata into pr/250

227acca

StefanosChaliasos added 2 commits March 7, 2025 08:54

Add support for shorthands

f400c34

Allow reserved chars in input symbols and tokenize spaces

df53edc

We also added more complex tests

StefanosChaliasos marked this pull request as draft March 7, 2025 08:54

caleb531 reviewed Mar 14, 2025
