Remove zero-width space characters when pasting text into CPO. #512

asolove · 2024-03-01T12:45:28Z

Resolves #508.

Testing

I constructed a sample paste payload that included a valid function definition plus some ZWS characters. Here is what I got:

Before:

After:

Discussion

I think this change is net positive right away, but there's a larger possible discussion here:

There are a lot of formatting-only characters in Unicode, and eventually we'll run into others we also want to strip out.
But, those characters have meaning, or are even required, when used in human-readable text, so a smarter version of this would ideally leave them in comments and string literals that appear in the source. (Of course, that would require being able to parse the program text, which our parser currently can't do when these characters are included, so ...)

jpolitz · 2024-03-01T17:33:48Z

Re: parsing, the CodeMirror mode is often a good “fuzzy” parser that can be used for some of these cases. Anchor has a nice function for detecting if a block is complete when pressing enter by checking the mode's state. I think that checking getTokenAt and trusting CM's answer on if it's in a string or not could be a reasonable way to handle this. Since it's already on a syntactically-invalid file some fuzziness is fine.

For example: https://github.com/brownplt/pyret-lang/blob/anchor/ide/src/utils.ts#L79 (without reconstructing all the details in a comment, this is trying to determine if the current cursor position is e.g. right after an end that is fully outdented.)

jpolitz · 2024-03-01T17:34:23Z

(I guess in particular checking for being in a string is a tokenization problem rather than a parsing problem, and the CodeMirror mode is an excellent fuzzy tokenizer)

blerner · 2024-03-01T21:56:56Z

You can also run the actual Pyret tokenizer pretty easily, and if it produces an UNKNOWN token we should probably strip that character out...

schanzer · 2024-03-02T12:28:21Z

Relevance-through-adjacency: CodeMirror 6 has a much more powerful parser, which produces actual ASTs (complete with an API!). These ASTs are also fuzzy, but with the extra context available we may be able to make smarter decisions about what to do.

(It also opens up more options for ways to style text, provide hints, etc)

blerner · 2024-03-02T13:00:20Z

It also opens up incompatibilities with Pyret's actual parser and lexer, since we'd now have two grammars, two lexers and two parsers, etc. I've been hesitant about adding CM6's parsing for a while, for maintenance reasons. I don't know that Pyret's grammar is unambiguous enough for TreeSitter to handle. I do know that Pyret's delimiter syntax is unambiguous enough for our indenter to handle it, and I don't know that CM6's parse-error recovery matches our current behavior. I agree that having a better parser would be nice for better highlighting (it's a longstanding annoyance that type annotations don't get highlighted consistently/correctly), but it's a large code shift.

blerner · 2024-03-02T13:21:51Z

(There's also a technical compatibility issue I was never able to figure out -- Pyret's tokenization is ever so slightly whitespace-dependent: ( is tokenized two ways, depending on if it's preceded by whitespace, so that we can parse f (3 + 4) and f(3 + 4) differently, so that we can warn that the former is ill-formed with two expressions on one line. I don't think I can accomplish the same thing with CM6 and TreeSitter.)

schanzer · 2024-03-04T14:59:35Z

@blerner I spent some time last night reading Marijn's blog post about lezer, and poking around the Tree-Sitter website. I understood maybe 5% of what's there, but it does seem like Tree Sitter can handle an extraordinary number of language grammars.

The section on error-recovery is especially interesting, and Marijn talks a lot about using tricks from GLR parsing to deal with ambiguity, recovery, and more. I dimly recall you talking about GLR as well, years ago.

I know you've got far higher priorities than engaging in hypotheticals, but at some point I'd love to have you dive into this a bit and see if it's even possible.

schanzer · 2024-03-12T14:03:28Z

@blerner and @jpolitz anything we can do to get this PR merged sooner rather than later? There's an active need for it from the Algebra 2 crew

blerner · 2024-03-12T14:13:12Z

I think this could go in as-is, it's a low risk, minimal change.

schanzer · 2024-03-24T21:48:39Z

@jpolitz ane @blerner poking again about getting this merged and up on CPO

Remove zero-width space characters when pasting text into CPO.

f4fddab

jpolitz merged commit ef0d4da into brownplt:horizon Mar 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Remove zero-width space characters when pasting text into CPO. #512

Remove zero-width space characters when pasting text into CPO. #512

Uh oh!

asolove commented Mar 1, 2024 •

edited

Loading

Uh oh!

jpolitz commented Mar 1, 2024

Uh oh!

jpolitz commented Mar 1, 2024

Uh oh!

blerner commented Mar 1, 2024

Uh oh!

schanzer commented Mar 2, 2024

Uh oh!

blerner commented Mar 2, 2024

Uh oh!

blerner commented Mar 2, 2024

Uh oh!

schanzer commented Mar 4, 2024

Uh oh!

schanzer commented Mar 12, 2024

Uh oh!

blerner commented Mar 12, 2024

Uh oh!

schanzer commented Mar 24, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Remove zero-width space characters when pasting text into CPO. #512

Remove zero-width space characters when pasting text into CPO. #512

Uh oh!

Conversation

asolove commented Mar 1, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Testing

Discussion

Uh oh!

jpolitz commented Mar 1, 2024

Uh oh!

jpolitz commented Mar 1, 2024

Uh oh!

blerner commented Mar 1, 2024

Uh oh!

schanzer commented Mar 2, 2024

Uh oh!

blerner commented Mar 2, 2024

Uh oh!

blerner commented Mar 2, 2024

Uh oh!

schanzer commented Mar 4, 2024

Uh oh!

schanzer commented Mar 12, 2024

Uh oh!

blerner commented Mar 12, 2024

Uh oh!

schanzer commented Mar 24, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

asolove commented Mar 1, 2024 •

edited

Loading