normalizers.Replace able to support regex group capture #1760

Open
@nrv

Is it possible for normalizers.Replace to support capture-group references in the replacement string?

In the following dummy example, I want to insert a space between the letters l and e. This works with the re package but not with normalizers.Replace, which inserts the replacement string literally.

import re
from tokenizers import normalizers, Regex

pattern = r"(l)(e)"
replacement = r"\1 \2"

text = "le travail est totalement pénible"

text1 = normalizers.Replace(Regex(pattern), replacement).normalize_str(text)
text2 = re.sub(pattern, replacement, text)

print(f"{text  = }")
print(f"{text1 = }")
print(f"{text2 = }")

Execution result:

text  = 'le travail est totalement pénible'
text1 = '\\1 \\2 travail est tota\\1 \\2ment pénib\\1 \\2'
text2 = 'l e travail est total ement pénibl e'
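
Until Replace handles group references, one possible workaround is to apply the substitution with re before the text reaches the tokenizer. The sketch below is only an illustration under that assumption: pre_normalize is a hypothetical helper, not part of tokenizers, and because it runs outside the normalizer pipeline, any offsets the tokenizer reports would refer to the modified text rather than the original.

```python
import re

# Hypothetical helper, not part of the tokenizers library: it performs the
# capture-group substitution in plain Python before tokenization.
PATTERN = re.compile(r"(l)(e)")
REPLACEMENT = r"\1 \2"  # \1 and \2 refer to the two captured groups


def pre_normalize(text: str) -> str:
    """Insert a space between every 'l' immediately followed by 'e'."""
    return PATTERN.sub(REPLACEMENT, text)


text = "le travail est totalement pénible"
print(pre_normalize(text))  # l e travail est total ement pénibl e
```

For this particular pattern the captured groups are fixed literals, so passing the literal string "l e" as the replacement for the pattern "le" should give the same result without group references; the general case, where the captured text varies, still needs the feature requested here.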
