This repository was archived by the owner on Apr 26, 2024. It is now read-only.
Figure out what we're doing with ICU tokenisation and locales #15124
Open
Description
The ICU tokenisation rules seem to vary on different platforms.
Is it the ICU version? The locale? (How does ICU even get a default locale? I had a quick spelunk in the source code and couldn't find it!)
We need to figure out:
- what we actually want from the ICU library
- how we get that
- how we get consistent results.
It feels like we want a 'universal' locale, independent of the host's settings, so that Synapse works well with all languages. (This may be a pie-in-the-sky goal!)
What's the best we can do?
This issue was originally dug up in #15079, but e.g. Patrick's machine generates yet another tokenisation. I'm not satisfied with the current solution.
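
To help narrow down whether the ICU version or the locale is responsible, something like the following could be run on each machine and the outputs compared. This is only a rough sketch, assuming the PyICU bindings (`import icu`) are what is in use; the helper names `describe_environment` and `tokenise`, the sample string, and the choice of `"en"` as the pinned locale are all made up for illustration and not a proposal for what the 'universal' locale should be.

```python
import os
import platform

try:
    import icu  # PyICU; assumed to be the binding in use
except ImportError:
    icu = None


def describe_environment():
    """Collect facts that might explain a tokenisation difference between hosts."""
    info = {
        "platform": platform.platform(),
        # Host locale settings are one candidate explanation, so record them.
        "env": {k: os.environ.get(k) for k in ("LC_ALL", "LC_MESSAGES", "LANG")},
        "icu_version": None,
        "unicode_version": None,
        "icu_default_locale": None,
    }
    if icu is not None:
        info["icu_version"] = icu.ICU_VERSION
        info["unicode_version"] = icu.UNICODE_VERSION
        info["icu_default_locale"] = icu.Locale.getDefault().getName()
    return info


def tokenise(text, locale_name=None):
    """Split text at ICU word boundaries.

    With locale_name=None the host's default ICU locale is used; otherwise
    the named locale is pinned explicitly (hypothetical choice, see above).
    """
    if icu is None:
        raise RuntimeError("PyICU is not installed")
    if locale_name is None:
        locale = icu.Locale.getDefault()
    else:
        locale = icu.Locale(locale_name)
    breaker = icu.BreakIterator.createWordInstance(locale)
    breaker.setText(text)
    segments = []
    start = 0
    while True:
        end = breaker.nextBoundary()
        if end < 0:  # no more boundaries
            break
        # Boundaries are UTF-16 offsets; slicing a Python str like this is
        # only safe while the sample stays within the Basic Multilingual Plane.
        segments.append(text[start:end])
        start = end
    return segments


if __name__ == "__main__":
    print(describe_environment())
    if icu is not None:
        sample = "Hello world"  # hypothetical sample, not the string from #15079
        print("default locale:", tokenise(sample))
        print("pinned 'en':   ", tokenise(sample, "en"))
```

If two machines report the same ICU version but different default locales (or the other way round), and the pinned-locale output matches while the default-locale output does not, that would point at the locale rather than the library version.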