-
Notifications
You must be signed in to change notification settings - Fork 48
Remove zero-width space characters when pasting text into CPO. #512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Re: parsing, the CodeMirror mode is often a good “fuzzy” parser that can be used for some of these cases. Anchor has a nice function for detecting if a block is complete when pressing enter by checking the mode's state. I think that checking For example: https://github.com/brownplt/pyret-lang/blob/anchor/ide/src/utils.ts#L79 (without reconstructing all the details in a comment, this is trying to determine if the current cursor position is e.g. right after an |
|
(I guess in particular checking for being in a string is a tokenization problem rather than a parsing problem, and the CodeMirror mode is an excellent fuzzy tokenizer) |
|
You can also run the actual Pyret tokenizer pretty easily, and if it produces an UNKNOWN token we should probably strip that character out... |
|
Relevance-through-adjacency: CodeMirror 6 has a much more powerful parser, which produces actual ASTs (complete with an API!). These ASTs are also fuzzy, but with the extra context available we may be able to make smarter decisions about what to do. (It also opens up more options for ways to style text, provide hints, etc) |
|
It also opens up incompatibilities with Pyret's actual parser and lexer, since we'd now have two grammars, two lexers and two parsers, etc. I've been hesitant about adding CM6's parsing for a while, for maintenance reasons. I don't know that Pyret's grammar is unambiguous enough for TreeSitter to handle. I do know that Pyret's delimiter syntax is unambiguous enough for our indenter to handle it, and I don't know that CM6's parse-error recovery matches our current behavior. I agree that having a better parser would be nice for better highlighting (it's a longstanding annoyance that type annotations don't get highlighted consistently/correctly), but it's a large code shift. |
|
(There's also a technical compatibility issue I was never able to figure out -- Pyret's tokenization is ever so slightly whitespace-dependent: |
|
@blerner I spent some time last night reading Marijn's blog post about lezer, and poking around the Tree-Sitter website. I understood maybe 5% of what's there, but it does seem like Tree Sitter can handle an extraordinary number of language grammars. The section on error-recovery is especially interesting, and Marijn talks a lot about using tricks from GLR parsing to deal with ambiguity, recovery, and more. I dimly recall you talking about GLR as well, years ago. I know you've got far higher priorities than engaging in hypotheticals, but at some point I'd love to have you dive into this a bit and see if it's even possible. |
|
I think this could go in as-is, it's a low risk, minimal change. |
Resolves #508.
Testing
I constructed a sample paste payload that included a valid function definition plus some ZWS characters. Here is what I got:
Before:

After:

Discussion
I think this change is net positive right away, but there's a larger possible discussion here: