Support splitting Strings into Unicode Grapheme Cluster #22117

Open
2 tasks done
peppergrayxyz opened this issue Aug 26, 2024 · 1 comment
Labels
Feature Request This issue is made to request a feature.

Comments

Contributor

peppergrayxyz commented Aug 26, 2024

Describe the feature

When working with Unicode, we usually care neither about bytes nor about code points (runes). What we mostly care about is the characters displayed on screen (grapheme clusters). Unicode defines an algorithm (UAX #29, Unicode Text Segmentation) for splitting strings into grapheme clusters, i.e. user-perceived characters. Note that a cluster is one perceived character, not necessarily one terminal column: wide characters such as emoji can occupy two. This feature request is about adding grapheme cluster splitting to builtin.

Use Case

Anyone working with a UI who wants to know:

  • how long a string is (in characters displayed on screen)
  • where the pointer is on screen
    (neither bytes nor runes provide this information)
  • how to use format strings with Unicode strings

Example:

This text should be right aligned:

examples := [
	'\u006E\u0303',
	'\U0001F3F3\uFE0F\u200D\U0001F308',
	'ห์', 
	'ปีเตอร์'
]

println("0123456789abcdefgh")
for text in examples {
	println("${text:10}")
}

Output:

0123456789abcdefgh
         ñ
    🏳️‍🌈
        ห์
   ปีเตอร์

But it isn't.

Proposed Solution

Add a method that splits a string into grapheme clusters:

hello := 'Hello World 🏳️‍🌈'
hello_graphemes := hello.graphemes() // [`H`, `e`, `l`, `l`, `o`, ` `, `W`, `o`, `r`, `l`, `d`, ` `, `🏳️‍🌈`]
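
As a rough illustration of what `graphemes()` could do, here is a minimal user-land sketch. It only approximates the extended grapheme cluster rules of UAX #29: it attaches a hand-picked set of extending code points (combining diacritical marks, variation selectors, the zero width joiner and the Thai combining marks) to the previous cluster, and it joins whatever follows a zero width joiner, which is looser than the real rule. `approx_graphemes` and `is_extender` are hypothetical names, not existing builtin functions, and a real implementation would need the full Unicode property tables.

```v
// Hypothetical helper, not part of V's builtin module.
// Reports whether this sketch attaches `r` to the preceding cluster.
// The ranges below are a small hand-picked subset, not the full
// Grapheme_Cluster_Break property data.
fn is_extender(r rune) bool {
	c := u32(r)
	// Combining Diacritical Marks block
	if c >= 0x0300 && c <= 0x036F {
		return true
	}
	// Variation selectors (e.g. U+FE0F, emoji presentation)
	if c >= 0xFE00 && c <= 0xFE0F {
		return true
	}
	// Zero width joiner
	if c == 0x200D {
		return true
	}
	// Thai combining vowels, tone marks and other signs
	if c == 0x0E31 || (c >= 0x0E34 && c <= 0x0E3A) || (c >= 0x0E47 && c <= 0x0E4E) {
		return true
	}
	return false
}

// Hypothetical function: splits `s` into approximate grapheme clusters.
// Assumption: every rune directly after a ZWJ is joined to the current
// cluster. UAX #29 only does this between pictographic characters.
fn approx_graphemes(s string) []string {
	mut clusters := []string{}
	mut current := []rune{}
	mut after_zwj := false
	for r in s.runes() {
		// Start a new cluster unless `r` extends the current one.
		if current.len > 0 && !after_zwj && !is_extender(r) {
			clusters << current.string()
			current = []rune{}
		}
		current << r
		after_zwj = u32(r) == 0x200D
	}
	// Flush the last cluster, if any.
	if current.len > 0 {
		clusters << current.string()
	}
	return clusters
}
```

With this sketch, `approx_graphemes('ปีเตอร์')` should yield five clusters, the same split as listed under "Proposed behavior", while `'ปีเตอร์'.runes()` yields seven runes. Strings containing code points outside the listed ranges (Hangul jamo, regional indicator flags, other scripts) would still be split incorrectly.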

Current Behavior

examples := [
	'\u006E\u0303',
	'\U0001F3F3\uFE0F\u200D\U0001F308',
	'ห์', 
	'ปีเตอร์'
]

for text in examples {
	println("0123456789abcdefgh")
	println(text)
	println(text.runes())
}

Output:

0123456789abcdefgh
ñ
[`n`, `̃`]
0123456789abcdefgh
🏳️‍🌈
[`🏳`, `️`, `‍`, `🌈`]
0123456789abcdefgh
ห์
[`ห`, `์`]
0123456789abcdefgh
ปีเตอร์
[`ป`, `ี`, `เ`, `ต`, `อ`, `ร`, `์`]

Proposed behavior:

examples := [
	'\u006E\u0303',
	'\U0001F3F3\uFE0F\u200D\U0001F308',
	'ห์', 
	'ปีเตอร์'
]

for text in examples {
	println("0123456789abcdefgh")
	println(text)
	println(text.graphemes())
}

Expected output:

0123456789abcdefgh
ñ
[`ñ`]
0123456789abcdefgh
🏳️‍🌈
[`🏳️‍🌈`]
0123456789abcdefgh
ห์
[`ห์`]
0123456789abcdefgh
ปีเตอร์
[`ปี`, `เ`, `ต`, `อ`, `ร์`]

Further suggestions

  • consider removing runes (or replacing their implementation so that it uses grapheme clusters instead of code points):
    • What is the rationale for having them?
    • In which situations do you want to work with code points but not grapheme clusters?
  • consider using grapheme clusters for the width calculation in format strings
  • consider making grapheme clusters a first-class citizen and hiding bytes behind a call

e.g.

string[n] ... access n-th grapheme
string.len ... number of graphemes
string.bytes()[n] ... access n-th byte
string.bytes().len ... number of bytes

Other Information

Unicode Reference and some more info on the background

This feature would also fix this bug:

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

Version used

0.4.7

Environment details (OS name and version, etc.)

V full version: V 0.4.7 7baff15
OS: linux, "Manjaro Linux"
Processor: 16 cpus, 64bit, little endian, AMD Ryzen 7 7840U w/ Radeon  780M Graphics

getwd: /home/pepper
vexe: /usr/lib/vlang/v
vexe mtime: 2024-08-26 17:34:57

vroot: NOT writable, value: /usr/lib/vlang
VMODULES: OK, value: /home/pepper/.vmodules
VTMP: OK, value: /tmp/v_1000

Git version: git version 2.46.0
Git vroot status: Error: fatal: not a git repository (or any of the parent directories): .git
.git/config present: false

CC version: cc (GCC) 14.2.1 20240805
thirdparty/tcc status: thirdparty-linux-amd64 0134e9b9-dirty

Note

You can use the 👍 reaction to increase the issue's priority for developers.

Please note that only the 👍 reaction to the issue itself counts as a vote.
Other reactions and those to comments will not be taken into account.

@peppergrayxyz peppergrayxyz added the Feature Request This issue is made to request a feature. label Aug 26, 2024

Wajinn commented Sep 14, 2024

You may want to take a look at uniseg, or possibly consult magic003, if available.
