match.gsub() does not capture patterns after unicode characters #3982

Up05 · 2024-07-26T12:29:33Z

Context

Odin:    dev-2024-07
OS:      Windows 10 Professional (version: 22H2), build 19045.3570
CPU:     AMD Ryzen 5 3600 6-Core Processor
RAM:     16286 MiB
Backend: LLVM 17.0.1

Expected Behavior

I expect match.gsub_allocator to substitute strings after unicode characters, but it does not.

Failure Information (for bugs)

gsub example by @Kelimion:

package bug

import "core:fmt"
import "core:text/match"

Pattern :: struct{
    haystack: string,
    needle:   string,
    replace:  string,
    expected: string,
}

Patterns :: []Pattern{
    {"aeaaea 5", "5", "4", "aeaaea 4"},
    {"aeaąęą 5", "5", "4", "aeaąęą 4"},
}

main :: proc() {
    for pattern in Patterns {
        res := match.gsub(pattern.haystack, pattern.needle, pattern.replace)
        fmt.printfln("Expected %q, got %q", pattern.expected, res)
    }
}

Expected "aeaaea 4", got "aeaaea 4"
Expected "aea\u0105\u0119\u0105 4", got "aea\u0105\u0119\u0105 5"

Steps to Reproduce

Minimal example:

package main
import "core:c/libc"
import "core:fmt"
import "core:text/match"
main :: proc(){
    libc.system("chcp 65001") // windows only
    fmt.println(match.gsub_allocator("5 aeaaea 5", "5", "4")) // prints: 4 aeaaea 4
    fmt.println(match.gsub_allocator("5 aeaąęą 5", "5", "4")) // prints: 4 aeaąęą 5
}

laytan · 2024-07-28T12:33:03Z

core:text/match is a port of Lua's pattern APIs and therefore has the same behaviour: it works bytewise instead of supporting UTF-8. This wiki page talks about Lua and Unicode.

I will keep this open because we are obviously not Lua, Odin is built on/with/for UTF-8, and there may be ways to improve it.
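
For readers following along, here is a minimal sketch (not from the thread) of the byte-versus-rune distinction mentioned above. It only shows that the two haystacks from the report have different byte lengths while holding the same number of code points; it does not claim to pinpoint where gsub goes wrong. It assumes `utf8.rune_count_in_string` from `core:unicode/utf8`.

```odin
package main

import "core:fmt"
import "core:unicode/utf8"

main :: proc() {
    // The two haystacks from the report above.
    inputs := []string{"aeaaea 5", "aeaąęą 5"}
    for s in inputs {
        // len() on a string counts bytes; rune_count_in_string counts code points.
        // ą and ę each take two bytes in UTF-8, so the counts differ for the second string.
        fmt.println(len(s), utf8.rune_count_in_string(s))
    }
}
```

This prints `8 8` for the ASCII-only string and `11 8` for the second one, so any code that mixes byte offsets with character counts will disagree about where the trailing `5` sits.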

laytan added enhancement core-library labels Jul 28, 2024
