Description
How I discovered it
I wanted to get the text (including emoji) that contained a particular link, but slicing my string with the entity's Offset and Length gave the wrong result. The values are correct for UTF-16, but not for my original string, which Go stores as UTF-8.
Telegram calculates Length and Offset in UTF-16 code units. With plain ASCII text there is no problem, since every ASCII character is exactly one UTF-16 code unit and also exactly one byte in UTF-8, so both ways of indexing agree. Once an emoji is used, they diverge, because an emoji takes a different number of UTF-8 bytes than UTF-16 code units.
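To make the mismatch concrete, here is a small sketch that measures a string in UTF-8 bytes, runes, and UTF-16 code units. The helper name lengths and the third sample (an emoji outside the Basic Multilingual Plane) are my own additions for illustration. The emoji are written as escapes so the variation selector U+FE0F stays visible.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// lengths reports the size of s in UTF-8 bytes, Unicode code points (runes),
// and UTF-16 code units. Hypothetical helper, for illustration only.
func lengths(s string) (bytes, runes, units int) {
	return len(s), utf8.RuneCountInString(s), len(utf16.Encode([]rune(s)))
}

func main() {
	samples := []string{
		"Click Me",                         // ASCII only
		"\u27a1\ufe0fClick Me\u2b05\ufe0f", // the linked text from the test message
		"\U0001F600",                       // assumed extra sample: an emoji outside the BMP
	}
	for _, s := range samples {
		b, r, u := lengths(s)
		fmt.Printf("%s: %d bytes, %d runes, %d UTF-16 units\n", s, b, r, u)
	}
}
```

For these samples the counts are 8 / 8 / 8 for the ASCII text, 20 bytes / 12 runes / 12 UTF-16 units for the linked text, and 4 / 1 / 2 for the last one. So neither byte indexes nor rune indexes can be relied on to match Telegram's offsets in general.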
How I solved this particular problem
I used the unicode/utf16 package to encode the original text as UTF-16, slice out the part I wanted, and then convert that part back to a UTF-8 string.
The Code
Given an update of type Update, I wanted to extract each piece of text with an embedded link by using the Entities field.
The original message was "➡️Click Me⬅️ or ➡️Click Me⬅️" with "https://www.example.com/" embedded on both (just as a test).
Not Working Code
The following code slices the string directly, without unicode/utf16:
fmt.Println(*update.ChannelPost.Entities)
for _, e := range *update.ChannelPost.Entities {
	// Get the whole update Text
	str := update.ChannelPost.Text
	// Get the text I need. Slicing a Go string indexes by byte,
	// while Offset and Length are UTF-16 code units.
	str = str[e.Offset : e.Offset+e.Length]
	fmt.Println(str)
}
Output
[{text_link 0 12 https://www.example.com/ <nil>} {text_link 16 12 https://www.example.com/ <nil>}]
➡️Click
�️ or ➡�
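As you can see, the first element is cut short: its 12 bytes cover only the first emoji (6 bytes in UTF-8) and "Click ", so "Me" and the second emoji are missing. The second element is simply broken, because the slice starts and ends in the middle of a multi-byte character.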
Working Code
The following code works (using unicode/utf16):
fmt.Println(*update.ChannelPost.Entities)
// For each entity
for _, e := range *update.ChannelPost.Entities {
	// Get the whole update Text
	str := update.ChannelPost.Text
	// Encode it into UTF-16 code units
	utfEncodedString := utf16.Encode([]rune(str))
	// Decode just the piece of string I need
	runeString := utf16.Decode(utfEncodedString[e.Offset : e.Offset+e.Length])
	// Transform []rune into string
	str = string(runeString)
	fmt.Println(str)
}
Output
[{text_link 0 12 https://www.example.com/ <nil>} {text_link 16 12 https://www.example.com/ <nil>}]
➡️Click Me⬅️
➡️Click Me⬅️
Both elements come out exactly as they should.
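The same idea can be wrapped in a small helper with a bounds check, so that an entity with an out-of-range Offset or Length does not cause a panic. This is only a sketch: utf16Slice is a hypothetical name, not part of the bot library, and the offsets in main are copied from the entity output above.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16Slice returns the part of text covered by offset and length, both
// measured in UTF-16 code units. ok is false when the range falls outside text.
// Hypothetical helper, not part of any Telegram library.
func utf16Slice(text string, offset, length int) (s string, ok bool) {
	units := utf16.Encode([]rune(text))
	if offset < 0 || length < 0 || offset+length > len(units) {
		return "", false
	}
	return string(utf16.Decode(units[offset : offset+length])), true
}

func main() {
	// The test message, with emoji written as escapes (arrow + U+FE0F).
	text := "\u27a1\ufe0fClick Me\u2b05\ufe0f or \u27a1\ufe0fClick Me\u2b05\ufe0f"
	// Offset/Length pairs taken from the two text_link entities.
	for _, e := range []struct{ Offset, Length int }{{0, 12}, {16, 12}} {
		if s, ok := utf16Slice(text, e.Offset, e.Length); ok {
			fmt.Println(s)
		}
	}
}
```

Returning a boolean instead of panicking is a design choice of this sketch; inside the loop over Entities you would call it with update.ChannelPost.Text, e.Offset and e.Length.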
Conclusion
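As you can see, the Offset and Length values are the same in both runs. They are correct as long as they are applied to the UTF-16 encoding of the text rather than used as byte indexes into the UTF-8 string.
Hope this helps anyone having the same issue!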