
Message.Entities returns Length of UTF16 encoded string, not UTF8 supported by Golang #231

Open
@Fef0


How I discovered it

I wanted to get the text (including emoji) that contained a particular link, but I always got the right Offset with a wrong Length: the value is correct for UTF-16, but not for my original string in UTF-8.
Telegram calculates Length and Offset in UTF-16 code units, while slicing a Go string counts UTF-8 bytes. With plain ASCII text there is no problem at all, since every ASCII character is one byte in UTF-8 and one code unit in UTF-16. Once an emoji is used the two counts diverge, because an emoji takes several bytes in UTF-8 but only one or two code units in UTF-16, and the calculation starts to be wrong.
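To make the mismatch concrete, here is a minimal sketch that counts the same strings three ways. The helper name utf16Len is made up for this illustration, and the grinning-face emoji (U+1F600) is an extra sample that is not part of the original message; it is included only because it is a character that needs two UTF-16 code units.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// utf16Len (hypothetical helper) returns how many UTF-16 code units are
// needed to encode s. This is the unit Telegram uses for Offset and Length.
func utf16Len(s string) int {
	return len(utf16.Encode([]rune(s)))
}

func main() {
	samples := []string{
		"Click Me", // plain ASCII
		"\u27a1\ufe0fClick Me\u2b05\ufe0f", // arrow emoji + variation selectors
		"\U0001F600", // assumption: extra sample outside the Basic Multilingual Plane
	}
	for _, s := range samples {
		fmt.Printf("%q bytes=%d runes=%d utf16=%d\n",
			s, len(s), utf8.RuneCountInString(s), utf16Len(s))
	}
}
```

For the "Click Me" segment with both arrows this reports 20 bytes but 12 UTF-16 code units, which is the Length Telegram sends. The last sample shows that counting runes is not a safe substitute either: it is one rune but two UTF-16 code units.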

How I solved this particular problem

I used the unicode/utf16 package to encode the original text as UTF-16, slice out the part I wanted, and then convert it back to a UTF-8 string.

The Code

Given update, a value of type Update, I wanted to extract the text of each embedded link by using the Entities field.
The original message was "➡️Click Me⬅️ or ➡️Click Me⬅️" with "https://www.example.com/" embedded in both (just as a test).

Not Working Code

Using the following code (not using unicode/utf16):

fmt.Println(*update.ChannelPost.Entities)
for _, e := range *update.ChannelPost.Entities {
	// Get the whole update Text
	str := update.ChannelPost.Text
	// Get the text I need
	str = str[e.Offset : e.Offset+e.Length]
	fmt.Println(str)
}

Output

[{text_link 0 12 https://www.example.com/ <nil>} {text_link 16 12 https://www.example.com/ <nil>}]
➡️Click 
�️ or ➡�

As you can see, the first element is cut short (the "Me" and the second emoji are missing), while the second element starts and ends in the middle of a multi-byte character and is simply broken.
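The broken output can be reproduced without a bot. The sketch below assumes the message text from this issue (written with escapes so the U+FE0F variation selectors are explicit) and uses a hypothetical helper, byteSlice, that repeats the mistake of treating the UTF-16 Offset and Length as byte indexes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sampleText is the test message from this issue.
const sampleText = "\u27a1\ufe0fClick Me\u2b05\ufe0f or \u27a1\ufe0fClick Me\u2b05\ufe0f"

// byteSlice (hypothetical helper) slices text by bytes, as the non-working
// code does, and also reports whether the result is still valid UTF-8.
func byteSlice(text string, offset, length int) (string, bool) {
	s := text[offset : offset+length]
	return s, utf8.ValidString(s)
}

func main() {
	// Same Offset/Length pairs as in the printed entities.
	for _, e := range []struct{ Offset, Length int }{{0, 12}, {16, 12}} {
		s, valid := byteSlice(sampleText, e.Offset, e.Length)
		fmt.Printf("%q valid UTF-8: %v\n", s, valid)
	}
}
```

The first slice is valid but too short, because 12 bytes only cover the first emoji and "Click ". The second one begins on the last byte of an arrow and ends on the first byte of a variation selector, which is where the replacement characters in the output come from.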

Working Code

The following piece of code works (using unicode/utf16):

fmt.Println(*update.ChannelPost.Entities)
// For each entity
for _, e := range *update.ChannelPost.Entities {
	// Get the whole update Text
	str := update.ChannelPost.Text
	// Encode it into utf16
	utfEncodedString := utf16.Encode([]rune(str))
	// Decode just the piece of string I need
	runeString := utf16.Decode(utfEncodedString[e.Offset : e.Offset+e.Length])
	// Transform []rune into string
	str = string(runeString)
	fmt.Println(str)
}

Output

[{text_link 0 12 https://www.example.com/ <nil>} {text_link 16 12 https://www.example.com/ <nil>}]
➡️Click Me⬅️
➡️Click Me⬅️

Elements are just as they should be.
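The same approach can be wrapped in a small reusable function. This is only a sketch: entityText is a name I made up, it is not part of the library, and the bounds check is an addition of mine so that a mismatched entity returns false instead of panicking.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// entityText (hypothetical helper) returns the part of text covered by an
// entity whose offset and length are given in UTF-16 code units.
// ok is false when the range does not fit inside the text.
func entityText(text string, offset, length int) (s string, ok bool) {
	units := utf16.Encode([]rune(text))
	if offset < 0 || length < 0 || offset+length > len(units) {
		return "", false
	}
	return string(utf16.Decode(units[offset : offset+length])), true
}

func main() {
	// The test message from this issue, with explicit variation selectors.
	text := "\u27a1\ufe0fClick Me\u2b05\ufe0f or \u27a1\ufe0fClick Me\u2b05\ufe0f"
	// Offset/Length pairs as printed from the entities above.
	for _, e := range []struct{ Offset, Length int }{{0, 12}, {16, 12}} {
		if s, ok := entityText(text, e.Offset, e.Length); ok {
			fmt.Println(s)
		}
	}
}
```

With the real types, the call would look like entityText(update.ChannelPost.Text, e.Offset, e.Length). The helper re-encodes the whole text on every call, which is fine for a few entities; for long messages with many entities it is probably better to encode once and reuse the slice.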

Conclusion

As you can see, the Offset and Length are the same in both runs and are actually correct once the text is treated as UTF-16.
Hope this helps anyone having the same issue!
