match.gsub() does not capture patterns after unicode characters #3982

Open
Up05 opened this issue Jul 26, 2024 · 1 comment

Comments


Up05 commented Jul 26, 2024

Context

Odin:    dev-2024-07
OS:      Windows 10 Professional (version: 22H2), build 19045.3570
CPU:     AMD Ryzen 5 3600 6-Core Processor
RAM:     16286 MiB
Backend: LLVM 17.0.1

Expected Behavior

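I expect match.gsub_allocator to replace every match of the pattern, including matches that occur after non-ASCII (multi-byte UTF-8) characters, but matches after such characters are left unchanged.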

Failure Information (for bugs)

gsub example by @Kelimion:

package bug

import "core:fmt"
import "core:text/match"

Pattern :: struct{
    haystack: string,
    needle:   string,
    replace:  string,
    expected: string,
}

Patterns :: []Pattern{
    {"aeaaea 5", "5", "4", "aeaaea 4"},
    {"aeaąęą 5", "5", "4", "aeaąęą 4"},
}

main :: proc() {
    for pattern in Patterns {
        res := match.gsub(pattern.haystack, pattern.needle, pattern.replace)
        fmt.printfln("Expected %q, got %q", pattern.expected, res)
    }
}

Output:

Expected "aeaaea 4", got "aeaaea 4"
Expected "aea\u0105\u0119\u0105 4", got "aea\u0105\u0119\u0105 5"

Steps to Reproduce

Minimal example:

package main

import "core:c/libc"
import "core:fmt"
import "core:text/match"

main :: proc() {
    libc.system("chcp 65001") // Windows only: switch the console to the UTF-8 code page
    fmt.println(match.gsub_allocator("5 aeaaea 5", "5", "4")) // prints: 4 aeaaea 4
    fmt.println(match.gsub_allocator("5 aeaąęą 5", "5", "4")) // prints: 4 aeaąęą 5
}
Collaborator

laytan commented Jul 28, 2024

core:text/match is a port of Lua's pattern-matching API and therefore shares its behaviour of working bytewise rather than on UTF-8 code points. This wiki page discusses Lua and Unicode.

I will keep this open: we are obviously not Lua, Odin is built on/with/for UTF-8, and there may be ways to improve this.
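
For readers unfamiliar with the bytewise versus rune-wise distinction mentioned above, here is a minimal sketch using the second haystack from the report. It only shows that byte offsets and rune counts diverge once a string contains multi-byte characters. It is not a diagnosis of where gsub goes wrong, and it assumes the source file is saved as UTF-8.

```odin
package main

import "core:fmt"
import "core:strings"
import "core:unicode/utf8"

main :: proc() {
    s := "aeaąęą 5"

    // len on a string counts bytes; ą and ę each take two bytes in UTF-8.
    fmt.println(len(s))                       // 11

    // Number of runes (code points) in the same string.
    fmt.println(utf8.rune_count_in_string(s)) // 8

    // strings.index returns a byte offset, not a rune index.
    fmt.println(strings.index(s, "5"))        // 10
}
```

The "5" is the eighth rune (rune index 7) but sits at byte offset 10, so any code that mixes the two kinds of index will land on the wrong position in strings like this one.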
