Add support for character classes [...]
#250
base: develop
Conversation
Also, make the quantifiers more expressive. I.e., it now supports {,4}, {4}, {1,3}, and {1,} instead of just {1,3} and {1,}.
@StefanosChaliasos thanks for this contribution! A bit busy today, but I'll try to review this at some point in the evening.
It looks like some dead code was included accidentally? Otherwise the logic itself seems fine.
if is_negated:
    expanded_content = expanded_content[1:]  # Remove ^ from the content

return cls(match.group(), expanded_content, is_negated)
It looks like there are some missing fields that are used in the constructor? Shouldn't this line throw an exception?
EDIT: Based on the coverage report, this line isn't being hit at all.
greek_nfa = NFA.from_regex("[Ͱ-Ͽ]+", input_symbols=input_symbols)
cyrillic_nfa = NFA.from_regex("[Ѐ-ӿ]+", input_symbols=input_symbols)

latin_samples = ["¡", "£", "Ā", "ŕ", "ƿ"]
Can we use the \u... notation? This will make these tests easier to maintain (albeit less elegant in the editor).
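For example, the existing lines could read something like this (the escapes below correspond to the code points of the literal characters currently in the tests):

greek_nfa = NFA.from_regex("[\u0370-\u03FF]+", input_symbols=input_symbols)  # Ͱ-Ͽ
cyrillic_nfa = NFA.from_regex("[\u0400-\u04FF]+", input_symbols=input_symbols)  # Ѐ-ӿ

latin_samples = ["\u00A1", "\u00A3", "\u0100", "\u0155", "\u01BF"]  # ¡, £, Ā, ŕ, ƿ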
        self.counter = counter

    @classmethod
    def from_match(cls: Type[Self], match: re.Match) -> Self:
It seems like the logic here heavily overlaps with the process_char_class function. Could one of these be made to call the other?
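For example, roughly (a sketch only; the constructor arguments follow the snippet above and may not match the final class exactly):

    @classmethod
    def from_match(cls: Type[Self], match: re.Match) -> Self:
        # Reuse process_char_class instead of duplicating the parsing logic here.
        is_negated, expanded_content = process_char_class(match.group())
        return cls(match.group(), expanded_content, is_negated)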
)

lexer.register_token(
    character_class_factory,
Would personally prefer to use the from_match syntax the way the other token types are registered, though either syntax is fine. That said, it seems like the from_match in the new token class isn't being called at all.
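I.e., something along these lines (the class name and pattern here are placeholders, not necessarily the names used in this PR):

lexer.register_token(CharClassToken.from_match, r"\[[^\]]*\]")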
@@ -577,3 +691,50 @@ def parse_regex(regexstr: str, input_symbols: AbstractSet[str]) -> NFARegexBuild
    postfix = tokens_to_postfix(tokens_with_concats)

    return parse_postfix_tokens(postfix)


def process_char_class(class_str: str) -> Tuple[bool, Set[str]]:
Nit: Might be good to have a couple of small test cases for this function independently to aid in debugging later, but won't make any hard requests for this.
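Something along these lines would be enough (illustrative only; the exact input format and expected sets depend on how process_char_class is implemented):

def test_process_char_class() -> None:
    # A plain class should expand its range and not be negated.
    is_negated, chars = process_char_class("[a-c]")
    assert not is_negated
    assert chars == {"a", "b", "c"}

    # A leading ^ should mark the class as negated.
    is_negated, _ = process_char_class("[^a-c]")
    assert is_negated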
Will go over everything tomorrow. Thanks a lot for the feedback.
We also added more complex tests.
I did some more changes, can you review the new ones? Basically I added support for shorthand classes (e.g., '\d') and I tokenised whitespace. I need to add more tests and polish the code. I'll change the PR to a draft until it's done.
if "\\s" in regex: | ||
additional_symbols.update(WHITESPACE_CHARS) | ||
if "\\S" in regex: | ||
additional_symbols.update(NON_WHITESPACE_CHARS) | ||
if "\\d" in regex: | ||
additional_symbols.update(DIGIT_CHARS) | ||
if "\\D" in regex: | ||
additional_symbols.update(NON_DIGIT_CHARS) | ||
if "\\w" in regex: | ||
additional_symbols.update(WORD_CHARS) | ||
if "\\W" in regex: | ||
additional_symbols.update(NON_WORD_CHARS) |
@StefanosChaliasos Can you please refactor this to use a dict-based lookup table? That would make this much less repetitive.
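Something like this, for example (a sketch using the constant names from the snippet above):

SHORTHAND_SYMBOL_SETS = {
    "\\s": WHITESPACE_CHARS,
    "\\S": NON_WHITESPACE_CHARS,
    "\\d": DIGIT_CHARS,
    "\\D": NON_DIGIT_CHARS,
    "\\w": WORD_CHARS,
    "\\W": NON_WORD_CHARS,
}

for shorthand, chars in SHORTHAND_SYMBOL_SETS.items():
    if shorthand in regex:
        additional_symbols.update(chars)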
from automata.regex.parser import (
    DIGIT_CHARS,
    NON_DIGIT_CHARS,
    NON_WHITESPACE_CHARS,
    NON_WORD_CHARS,
    WHITESPACE_CHARS,
    WORD_CHARS,
)
@StefanosChaliasos Can you please keep all imports at the top of the file? There's no particular need for the tighter scoping here, IMO.
        additional_symbols.update(WORD_CHARS)
        pos += 2
        continue
    elif class_content[pos + 1] in "S":
@StefanosChaliasos What is the intention of using in here as opposed to ==? If the right-hand side is just a single character, the only difference that seems to make is permitting class_content[pos + 1] to be the empty string (in addition to the character itself). In other words:

"S" in "S"  # True
"" in "S"   # True
continue

# Handle escape sequence in character class
from automata.regex.parser import _handle_escape_sequences
@StefanosChaliasos Can you also please move this import to the top of the file?
@@ -24,10 +25,21 @@
    validate_tokens,
)

# Add these at the top of the file to define our shorthand character sets
ASCII_PRINTABLE_CHARS = frozenset(string.printable)
@eliotwrobson The implication here is that only ASCII characters are treated as printable, but how would that work given that #233 just added support for Unicode characters?
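For instance, just to illustrate the concern:

import string

"a" in string.printable  # True
"Ͱ" in string.printable  # False: string.printable only covers ASCII characters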
Hey, @StefanosChaliasos! I left some additional comments on the PR. Apologies if they seem nitpicky, but I just want to maintain solid code quality and consistency for this project.
Thanks for the review, I will address the comments once I find some time.
Fixes #249
It also makes the quantifiers more expressive: it now supports {,4}, {4}, {1,3}, and {1,} instead of just {1,3} and {1,}.
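For example, usage along these lines should work with this change (illustrative; assumes {,4} means "at most four repetitions"):

from automata.fa.nfa import NFA

nfa = NFA.from_regex("a{,4}", input_symbols={"a"})
nfa.accepts_input("aaa")    # True
nfa.accepts_input("aaaaa")  # False: more than four repetitions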