The documentation for the newly added int.toUnicode() predicate says:
Returns the unicode character for the receiver seen as a unicode code point
This is slightly misleading because CodeQL strings consist of UTF-16 code units, not code points. Supplementary code points (> U+FFFF) therefore result in two CodeQL string characters (demonstrated by this query). It might also be good to describe the predicate's behavior for invalid code point values. It does not seem to have a result for surrogate code points either, e.g. 55296.toUnicode().
Also, "Unicode" should be capitalized.
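To illustrate why a supplementary code point occupies two UTF-16 code units, here is a minimal Python sketch of the standard surrogate-pair split. The function name `utf16_code_units` is hypothetical and is not part of CodeQL; it only models the encoding arithmetic.

```python
def utf16_code_units(code_point):
    """Split a Unicode scalar value into its UTF-16 code units.

    Hypothetical helper for illustration; assumes the input is a valid,
    non-surrogate code point in the range 0..0x10FFFF.
    """
    if code_point <= 0xFFFF:
        # Basic Multilingual Plane: a single code unit.
        return [code_point]
    # Supplementary plane: subtract 0x10000, then split the remaining
    # 20 bits into a high (lead) and a low (trail) surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


# 128512 is U+1F600, a supplementary code point.
print([hex(unit) for unit in utf16_code_units(128512)])
```

A string built from such a pair has a UTF-16 length of 2, which matches the length reported for supplementary code points in this issue.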
I would recommend the following description (or similar):
Returns the Unicode character for the receiver seen as a Unicode code point. Because CodeQL strings consist of UTF-16 code units, supplementary code points (that is > U+FFFF) result in a CodeQL string of length 2. This predicate has no result if the int receiver does not represent a valid Unicode code point, or represents the code point of a surrogate character.
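The proposed wording can be modeled roughly as follows. This is a Python sketch of the described semantics, not CodeQL's implementation: `to_unicode_model` and `utf16_length` are hypothetical names, `None` stands in for "no result", and treating negative values as invalid is an assumption.

```python
from typing import Optional


def to_unicode_model(code_point: int) -> Optional[str]:
    """Model of the proposed description; None stands in for 'no result'."""
    # Assumption: anything outside 0..0x10FFFF is not a valid code point.
    if not 0 <= code_point <= 0x10FFFF:
        return None
    # Surrogate code points (U+D800..U+DFFF) have no result.
    if 0xD800 <= code_point <= 0xDFFF:
        return None
    return chr(code_point)


def utf16_length(s: str) -> int:
    """Length of s in UTF-16 code units (two bytes per unit)."""
    return len(s.encode("utf-16-le")) // 2


print(utf16_length(to_unicode_model(65)))      # BMP code point
print(utf16_length(to_unicode_model(128512)))  # supplementary code point
print(to_unicode_model(55296))                 # surrogate: no result
```

Python's own `len` counts code points, so the sketch measures UTF-16 code units explicitly to mirror how the issue describes CodeQL string length.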
This requires changes to the built-in documentation (which is why I created the issue here) as well as the language specification.
Yes, something like 128512.toUnicode() will result in a string where the length() is 2.
And yes, invalid/surrogate characters have no result.
So you're right the documentation might be a bit misleading.
String lengths are hard, and they are not a very useful measure, but they are probably the best we've got for describing what happens for code points like 55296.
I tried to rewrite your suggestion into something that is more explicit about the length of the string (and what kind of length it is), but that didn't turn out well.
So I think I might go with your suggestion. I'll let you know.