Disable ropey `unicode_lines` feature #50

dhbrojas · 2023-12-02T09:06:50Z

Hi! First of all, thanks a lot for open-sourcing this.

I was working on an internal fork of the project and noticed a potential issue. I have little experience with the LSP and Ropey so this might not be relevant. Anyway, here's the issue:

With the current configuration, Ropey recognises more EOL sequences than the Language Server Protocol. This mismatch can lead to errors when maintaining a mirror of the user's documents, as llm-ls's representation might end up with more lines than the client's.

See: https://docs.rs/ropey/1.6.0/ropey/index.html#a-note-about-line-breaks
See: https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
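
To make the mismatch concrete, here is a minimal std-only sketch (it does not use Ropey itself). `count_lines` is a hypothetical helper written for illustration. It assumes the LSP treats only `\n`, `\r\n` and `\r` as line endings, and that Ropey's default `unicode_lines` behaviour additionally breaks on VT, FF, NEL, LS and PS, as the two pages linked above describe.

```rust
/// Hypothetical helper: counts lines the way a client or a rope might.
/// With `unicode_lines == false`, only "\n", "\r\n" and "\r" end a line (the LSP set).
/// With `unicode_lines == true`, VT, FF, NEL, LS and PS also end a line.
fn count_lines(text: &str, unicode_lines: bool) -> usize {
    let mut lines = 1;
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // "\r\n" counts as a single line break.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                lines += 1;
            }
            '\n' => lines += 1,
            '\u{000B}' | '\u{000C}' | '\u{0085}' | '\u{2028}' | '\u{2029}' if unicode_lines => {
                lines += 1
            }
            _ => {}
        }
    }
    lines
}
```

For a document such as `"a\u{2028}b\nc"`, `count_lines(text, false)` returns 2 while `count_lines(text, true)` returns 3, so every LSP range after the U+2028 character would point one line too early in the server's copy.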


McPatate · 2023-12-02T11:39:11Z

I think the fix here should rather be to attempt to set the encoding to utf8 in the initialization call where server and client exchange capabilities.
We could issue a warning to the client saying there might be some offsetting errors when utf-16 is the only supported format.
It could look something like this:

    async fn initialize(&self, params: InitializeParams) -> LspResult<InitializeResult> {
        *self.workspace_folders.write().await = params.workspace_folders;
+       // Prefer UTF-8 when the client offers it; `None` leaves the LSP default (UTF-16).
+       let position_encoding = params
+           .capabilities
+           .general
+           .and_then(|general_cap| general_cap.position_encodings)
+           .filter(|encodings| encodings.contains(&PositionEncodingKind::UTF8))
+           .map(|_| PositionEncodingKind::UTF8);
+       // The warning is sent here because `.await` is not allowed inside a non-async closure.
+       if position_encoding.is_none() {
+           self.client
+               .show_message(MessageType::WARNING, "utf8 is not supported, defaulting to utf16 which may result in offset errors")
+               .await;
+       }
        Ok(InitializeResult {
            server_info: Some(ServerInfo {
                name: "llm-ls".to_owned(),
                version: Some(VERSION.to_owned()),
            }),
            capabilities: ServerCapabilities {
                text_document_sync: Some(TextDocumentSyncCapability::Kind(
                    TextDocumentSyncKind::INCREMENTAL,
                )),
+               position_encoding,
                ..Default::default()
            },
        })
    }

dhbrojas · 2023-12-03T02:21:26Z

Hi! Thanks for the quick response.

Ropey Line Breaks

I think this is a separate issue from position encoding negotiation. Enforcing UTF8 position encodings will not prevent Ropey's line count from diverging from Tree Sitter's or VSCode's, rendering lsp::Ranges out-of-sync with the Document.

Issues with Encoding

Regarding the PositionEncodingKind, we had to issue a fix for this as well in our fork.

After looking at llm-ls's implementation, we noticed the mirror of the user's workspace goes out of sync and/or the server crashes when the user's document contains certain graphemes as the current implementation doesn't translate well between:

LSP UTF-16 character offsets
TreeSitter's byte offsets
Ropey's character offsets (i.e. Unicode code points)
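
As a rough illustration of the translation involved, the sketch below maps a UTF-16 column within a single line to the matching code point offset and byte offset using only the standard library. `utf16_col_to_char_and_byte` is a hypothetical name and not part of llm-ls, Ropey or Tree-sitter.

```rust
/// Hypothetical helper: converts a UTF-16 column within `line` into
/// `(char_offset, byte_offset)`. Returns `None` when the column falls inside
/// a surrogate pair or past the end of the line.
fn utf16_col_to_char_and_byte(line: &str, utf16_col: usize) -> Option<(usize, usize)> {
    let mut units = 0; // UTF-16 code units consumed so far
    for (char_idx, (byte_idx, ch)) in line.char_indices().enumerate() {
        if units == utf16_col {
            return Some((char_idx, byte_idx));
        }
        if units > utf16_col {
            // The requested column points into the middle of a surrogate pair.
            return None;
        }
        units += ch.len_utf16();
    }
    // The column may also point just past the last character of the line.
    if units == utf16_col {
        Some((line.chars().count(), line.len()))
    } else {
        None
    }
}
```

On the first line of the test case below, `"let a = '🥸 你好';"`, UTF-16 column 14 corresponds to code point offset 13 and byte offset 20. Using 14 directly as a code point offset lands one character too far, which is the stray `'` visible in the failing assertion.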

Here's a video showcasing 1) a crash and 2) llm-ls' mirror going out of sync due to some unicode characters.

LLM-LS.Bug.Report.mov

Here's a test case that illustrates this:

Long Test Case for `Document::change`

Case:

mod test {
    use tower_lsp::lsp_types::Position;
    use tree_sitter::Node;

    #[allow(unused_imports)]
    use super::*;

    #[tokio::test]
    async fn test_document_change_tree_consistency_medium() {
        let a = "let a = '🥸 你好';\rfunction helloWorld() { return '🤲🏿'; }\nlet b = 'Hi, 😊';";

        let mut document = Document::open("javascript", a).await.unwrap();

        document
            .change(Range::new(Position::new(0, 14), Position::new(2, 13)), "，")
            .await
            .unwrap();

        let b = "let a = '🥸 你好，😊';";

        assert_eq!(document.text.to_string(), b);

        let mut parser = Parser::new();

        parser
            .set_language(tree_sitter_javascript::language())
            .unwrap();

        let b_tree = parser.parse(b, None).unwrap();

        assert!(nodes_are_equal_recursive(
            &document.tree.unwrap().root_node(),
            &b_tree.root_node()
        ));
    }

    #[allow(dead_code)]
    fn nodes_are_equal_recursive(node1: &Node, node2: &Node) -> bool {
        if node1.kind() != node2.kind() {
            return false;
        }

        if node1.start_byte() != node2.start_byte() {
            return false;
        }

        if node1.end_byte() != node2.end_byte() {
            return false;
        }

        if node1.start_position() != node2.start_position() {
            return false;
        }

        if node1.end_position() != node2.end_position() {
            return false;
        }

        if node1.child_count() != node2.child_count() {
            return false;
        }

        for i in 0..node1.child_count() {
            let child1 = node1.child(i).unwrap();
            let child2 = node2.child(i).unwrap();

            if !nodes_are_equal_recursive(&child1, &child2) {
                return false;
            }
        }

        true
    }
}

Output:

running 1 test
test document::test::test_document_change_tree_consistency_medium ... FAILED

failures:

---- document::test::test_document_change_tree_consistency_medium stdout ----
thread 'document::test::test_document_change_tree_consistency_medium' panicked at 'assertion failed: `(left == right)`
  left: `"let a = '🥸 你好'，😊';"`,
 right: `"let a = '🥸 你好，😊';"`', crates/llm-ls/src/document.rs:293:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Enforcing UTF-8 Position Encoding

> I think the fix here should rather be to attempt to set the encoding to utf8 in the initialization call where server and client exchange capabilities.

This is by far the most convenient option, and we tried to go that route as well. Unfortunately, position encoding kind negotiation is a relatively new feature of the Language Server Protocol. It was introduced in 3.17.0, in 2022. As of writing, UTF-16 is the default, mandatory encoding of the protocol and must always be supported by servers.

I thought it was still worth a try but it turns out that the VSCode on my machine, the latest version, does not even support anything other than UTF-16.

[Info  - 09:40:14] Available position encodings: Some([PositionEncodingKind("utf-16")])
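
The fallback rule can be sketched without any LSP crate. This is an illustration only: `choose_position_encoding` is a hypothetical function, and it assumes the 3.17 behaviour described above, where a server may pick any encoding the client lists and UTF-16 applies otherwise.

```rust
/// Hypothetical helper: picks the position encoding a server would use.
/// `client_encodings` is the list sent by the client, or `None` when the
/// client does not advertise the capability at all.
fn choose_position_encoding(client_encodings: Option<&[&str]>) -> &'static str {
    match client_encodings {
        // UTF-8 is only usable when the client explicitly offers it.
        Some(encodings) if encodings.contains(&"utf-8") => "utf-8",
        // Otherwise fall back to the mandatory default.
        _ => "utf-16",
    }
}
```

With the log line above as input (`Some(&["utf-16"])`), the result is `"utf-16"`, so the server still has to translate UTF-16 offsets itself.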

In our case, we deemed it was necessary to stick to UTF-16 for the following reasons:

We put all this effort into maintaining a perfect replica of the user's workspace to get accurate syntax trees to provide relevant completions at key points of the user's file. If we allow the user's file and our representation to drift out of sync, we lose this ability and would be better off using regexes in the end.
Looking at VSCode's language client code, it doesn't seem to support UTF-8, which matches the experience I had on my machine. We didn't want to fight against the protocol and assumed many other clients would lack support for UTF-8 too.

If you managed to successfully negotiate UTF-8 position encodings, we'd love to hear more!

Once again, I might be wrong about some of the details of this. It's based off my current understanding of LSP and library implementations which is quite modest.

Anyway, all this Unicode handling caused quite the headaches on our side so we'd be happy to upstream our tests and implementation which handles the translation between the different encodings if you want. We're also looking at other approaches to simplify this complexity and if you're interested, I'm happy to collaborate on this together!

Bon Dimanche 🤗

McPatate · 2023-12-03T13:02:34Z

Thanks for the detailed response. I'd be happy to take a look at your code and more than happy to have you contribute to llm-ls.

My main concern is that the rope crate only supports utf8, so I'll have to check if there are other rope crates that do support utf16.

I'm also not sure how other editors fare regarding Unicode encoding, I'll take a look.

For now let's merge the current PR.

dhbrojas · 2023-12-04T03:44:52Z

Awesome! I'll kickstart a PR today or tomorrow.

I see the CI is not passing, it seems it's failing on main too, is that an issue?

McPatate · 2023-12-04T09:38:42Z

> I see the CI is not passing, it seems it's failing on main too, is that an issue?

I'm not sure where it's coming from, just updated the CI's secret, let's see if that was the issue.

dhbrojas · 2023-12-08T04:33:33Z

Hi @McPatate, hope you're doing good!

I was working on upstreaming our document syncing implementation but two things happened:

1. Didn't get around to thoroughly test it

I'm pretty sure our current implementation is correct but I also wouldn't bet my hand on it. Currently we have a few unit tests that ensure the documents are kept in sync by simulating different kinds of edits, but it's hard to know whether it will stand the test of thousands of user files/actions once in production.

I'd feel bad about sharing under-tested code hence I'm currently looking around the web to try to find suitable edit traces that could help us simulate real-world, long-lived, complex editing sessions in our tests and ensure that everything matches in the end.
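
One way such simulations could be checked is against a deliberately simple reference model: apply each LSP edit to a plain `String` and compare the result with the rope-backed document afterwards. The sketch below is an assumption about how that might look, not the code from our fork. `lsp_position_to_byte` and `apply_lsp_edit` are hypothetical names, and out-of-range columns are rejected rather than clamped.

```rust
/// Hypothetical helper: byte offset of an LSP position (zero-based line,
/// UTF-16 column) in `text`, treating "\n", "\r\n" and "\r" as line endings.
fn lsp_position_to_byte(text: &str, line: usize, utf16_col: usize) -> Option<usize> {
    let mut current_line = 0;
    let mut col = 0; // UTF-16 column, tracked only on the target line
    let mut chars = text.char_indices().peekable();
    while let Some((byte_idx, ch)) = chars.next() {
        if current_line == line && col == utf16_col {
            return Some(byte_idx);
        }
        match ch {
            '\r' | '\n' => {
                if current_line == line {
                    return None; // column is past the end of the target line
                }
                if ch == '\r' && matches!(chars.peek(), Some((_, '\n'))) {
                    chars.next(); // "\r\n" is a single line break
                }
                current_line += 1;
            }
            _ if current_line == line => col += ch.len_utf16(),
            _ => {}
        }
    }
    (current_line == line && col == utf16_col).then_some(text.len())
}

/// Hypothetical reference model: replaces the LSP range `start..end`
/// (each a `(line, utf16_col)` pair) with `new_text`.
fn apply_lsp_edit(
    text: &mut String,
    start: (usize, usize),
    end: (usize, usize),
    new_text: &str,
) -> Option<()> {
    let start_byte = lsp_position_to_byte(text.as_str(), start.0, start.1)?;
    let end_byte = lsp_position_to_byte(text.as_str(), end.0, end.1)?;
    if start_byte > end_byte {
        return None;
    }
    // Both offsets come from `char_indices`, so they lie on char boundaries.
    text.replace_range(start_byte..end_byte, new_text);
    Some(())
}
```

Replaying the edit from the test case earlier in this thread (range `(0, 14)` to `(2, 13)`, replacement `"，"`) through this model yields `"let a = '🥸 你好，😊';"`, the string the test expects.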

2. Uncovered additional issues this time relating to `tower-lsp`

There are other inherent design decisions in tower-lsp that could lead, in our case and yours, to out-of-sync documents or completions based on outdated context¹ in both rare and less rare cases. That ate up quite a bit of our time, as we are looking to mitigate that too ☹️

Given all this, I've created a Gist and just dumped our document.rs implementation for reference.

If it looks good to you, I'll make a PR to integrate it.
If you feel like it would benefit from more testing, I'll circle back once we have found a scalable way to test it.

LMK. Cheers!

¹ See tower-lsp#284 and deno#10437.

McPatate · 2023-12-15T15:09:45Z

Hey @rojas-diego, thanks for the detailed message.

I'm going to merge the PR, just ran testbed locally and lgtm. Don't hesitate to run testbed yourself in future PRs, this is the way I test llm-ls.

Regarding 1., I think the tests you provided should cover most use cases. Maybe try to find Unicode characters encoded with different sizes if that's not already the case. People will report bugs when they find them, and I would assume that most code out there is ASCII only, with some occasional exceptions.

Regarding 2., I wouldn't worry too much about it for now. I do recall seeing strange behaviour that may be linked to such a thing, but cannot confirm as we have an encoding issue currently :)

Feel free to create a new PR and we can discuss the issues in more detail over there.

Thanks again for the effort!

McPatate · 2024-02-13T08:39:27Z

Hey @rojas-diego, do you still want to contribute your gist to the project?

McPatate approved these changes Dec 15, 2023

McPatate merged commit 2a433cd into huggingface:main Dec 15, 2023
1 of 14 checks passed

McPatate mentioned this pull request Feb 13, 2024

refactor: adaptor list should be an enum #69

Closed
