Capture regular expression groups when lexing. #27
Conversation
Thanks for contributing! Is the primary motivation here performance? I'm not sure it's possible to make this compatible with the RPython version, so that will require some thinking before this can land.
I've been investigating the translation failure. My understanding was that container types (lists, tuples, etc.) must be internally type-homogeneous (groups are always tuples of strings), and that None was an allowed exception (since there may be no groups, None is a possible value instead of a tuple). I may be missing something, though, as I'm very new to RPython.

The primary motivation is de-duplication of work. The regex used during tokenization is already capturing the groups, but the tokenizer just throws that information away. In the string-parsing example, the key elements needed (the Python-style string flags and the contents of the quoted string) would have to be re-extracted in the parser.
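To illustrate the duplication being described, here is a minimal sketch (the pattern and flag set are hypothetical, not the project's actual lexer rules) showing that the match object produced during tokenization already holds the captured groups:

```python
import re

# Hypothetical lexer rule for Python-style quoted strings:
# group 1 captures the prefix flags, group 2 the string contents.
QUOTED = re.compile(r'([rbu]*)"([^"]*)"')

m = QUOTED.match('rb"data"')
assert m is not None
# The match already holds the captured groups; a lexer that keeps
# only m.group(0) (the whole token text) throws this away.
print(m.groups())  # ('rb', 'data')
```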
Lists need to be homogeneous internally; tuples are allowed to be heterogeneous.
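A small sketch of the typing rule being referenced (plain Python will happily run all of this; it is RPython's annotator, at translation time, that enforces the restriction):

```python
# List members must share one type for RPython's annotator;
# tuple members may differ, since tuples are fixed-shape.
homogeneous = ["rb", "data"]   # list of str: fine in RPython
heterogeneous = (1, "data")    # tuple mixing int and str: also fine
# mixed = [1, "data"]          # a mixed-type list would be rejected
#                              # by the RPython annotator
print(homogeneous, heterogeneous)
```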
Indeed, lists would make more sense now that I know tuples are even weirder than I expected. ;) Let me patch and see if this fixes the test failure locally.
So, I've converted the regex group storage to use a list; however, this has not corrected the somewhat mystifying translation error I'm getting:
I think the answer is that the code in the
As certain token constructs represent wrapped elements, such as text enclosed in quotes, the parser step would need to pre-process the token to strip the quotes and identify flags (in the case of Python-style prefixed strings, anyway). Why do the work twice?
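The "work twice" point can be sketched as follows; the pattern and function names are hypothetical, but the contrast is the one being argued: without the lexer's groups, the parser must re-run the regex (or hand-strip quotes) to recover the same pieces.

```python
import re

# Hypothetical quoted-string rule: flags in group 1, contents in group 2.
QUOTED = re.compile(r'([rbu]*)"([^"]*)"')

def parse_without_groups(token_value):
    # The parser repeats the match the lexer already performed.
    m = QUOTED.match(token_value)
    return m.group(1), m.group(2)

def parse_with_groups(groups):
    # With the lexer's captured groups preserved, the parser
    # simply reads them back.
    return groups[0], groups[1]

value = 'r"hello"'
groups = list(QUOTED.match(value).groups())
assert parse_without_groups(value) == parse_with_groups(groups)
```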
The attached changes add slots and update `__repr__` implementations where needed, and include a test for the "quoted string" case, demonstrating use. Documentation is also updated to clearly demonstrate the "quoted string" use case and to update the presented object `repr` output.
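As a rough sketch of the shape being described (the class and attribute names here are assumptions, not the PR's actual code), a token that carries its captured groups, with `__slots__` and a matching `__repr__`, might look like:

```python
class Token:
    # __slots__ keeps token instances lightweight, mirroring the
    # PR's addition of slots; "groups" stores the captured regex
    # groups as a list, per the RPython homogeneity discussion.
    __slots__ = ("name", "value", "groups")

    def __init__(self, name, value, groups=None):
        self.name = name
        self.value = value
        self.groups = groups if groups is not None else []

    def __repr__(self):
        return "Token(%r, %r, %r)" % (self.name, self.value, self.groups)

tok = Token("STRING", 'r"hello"', ["r", "hello"])
print(repr(tok))  # Token('STRING', 'r"hello"', ['r', 'hello'])
```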