Skip to content

Conversation

@asolove
Copy link
Contributor

@asolove asolove commented Mar 1, 2024

Resolves #508.

Testing

I constructed a sample paste payload that included a valid function definition plus some ZWS characters. Here is what I got:

Before:
image

After:
image

Discussion

I think this change is net positive right away, but there's a larger possible discussion here:

  • There are a lot of formatting-only characters in Unicode, and eventually we'll run into others we also want to strip out.
  • But, those characters have meaning, or are even required, when used in human-readable text, so a smarter version of this would ideally leave them in comments and string literals that appear in the source. (Of course, that would require being able to parse the program text, which our parser currently can't do when these characters are included, so ...)

@jpolitz
Copy link
Member

jpolitz commented Mar 1, 2024

Re: parsing, the CodeMirror mode is often a good “fuzzy” parser that can be used for some of these cases. Anchor has a nice function for detecting if a block is complete when pressing enter by checking the mode's state. I think that checking getTokenAt and trusting CM's answer on if it's in a string or not could be a reasonable way to handle this. Since it's already on a syntactically-invalid file some fuzziness is fine.

For example: https://github.com/brownplt/pyret-lang/blob/anchor/ide/src/utils.ts#L79 (without reconstructing all the details in a comment, this is trying to determine if the current cursor position is e.g. right after an end that is fully outdented.)

@jpolitz
Copy link
Member

jpolitz commented Mar 1, 2024

(I guess in particular checking for being in a string is a tokenization problem rather than a parsing problem, and the CodeMirror mode is an excellent fuzzy tokenizer)

@blerner
Copy link
Member

blerner commented Mar 1, 2024

You can also run the actual Pyret tokenizer pretty easily, and if it produces an UNKNOWN token we should probably strip that character out...

@schanzer
Copy link
Contributor

schanzer commented Mar 2, 2024

Relevance-through-adjacency: CodeMirror 6 has a much more powerful parser, which produces actual ASTs (complete with an API!). These ASTs are also fuzzy, but with the extra context available we may be able to make smarter decisions about what to do.

(It also opens up more options for ways to style text, provide hints, etc)

@blerner
Copy link
Member

blerner commented Mar 2, 2024

It also opens up incompatibilities with Pyret's actual parser and lexer, since we'd now have two grammars, two lexers and two parsers, etc. I've been hesitant about adding CM6's parsing for a while, for maintenance reasons. I don't know that Pyret's grammar is unambiguous enough for TreeSitter to handle. I do know that Pyret's delimiter syntax is unambiguous enough for our indenter to handle it, and I don't know that CM6's parse-error recovery matches our current behavior. I agree that having a better parser would be nice for better highlighting (it's a longstanding annoyance that type annotations don't get highlighted consistently/correctly), but it's a large code shift.

@blerner
Copy link
Member

blerner commented Mar 2, 2024

(There's also a technical compatibility issue I was never able to figure out -- Pyret's tokenization is ever so slightly whitespace-dependent: ( is tokenized two ways, depending on if it's preceded by whitespace, so that we can parse f (3 + 4) and f(3 + 4) differently, so that we can warn that the former is ill-formed with two expressions on one line. I don't think I can accomplish the same thing with CM6 and TreeSitter.)

@schanzer
Copy link
Contributor

schanzer commented Mar 4, 2024

@blerner I spent some time last night reading Marijn's blog post about lezer, and poking around the Tree-Sitter website. I understood maybe 5% of what's there, but it does seem like Tree Sitter can handle an extraordinary number of language grammars.

The section on error-recovery is especially interesting, and Marijn talks a lot about using tricks from GLR parsing to deal with ambiguity, recovery, and more. I dimly recall you talking about GLR as well, years ago.

I know you've got far higher priorities than engaging in hypotheticals, but at some point I'd love to have you dive into this a bit and see if it's even possible.

@schanzer
Copy link
Contributor

@blerner and @jpolitz anything we can do to get this PR merged sooner rather than later? There's an active need for it from the Algebra 2 crew

@blerner
Copy link
Member

blerner commented Mar 12, 2024

I think this could go in as-is, it's a low risk, minimal change.

@schanzer
Copy link
Contributor

@jpolitz ane @blerner poking again about getting this merged and up on CPO

@jpolitz jpolitz merged commit ef0d4da into brownplt:horizon Mar 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Can HTML entities also be replaced?

4 participants