@juntae6942 juntae6942 commented Sep 13, 2025

Closes: #35426

Currently, HtmlUtils.htmlUnescape() does not correctly handle numeric character references for Unicode supplementary characters (e.g., emojis).

For example, a numeric character reference such as &#128512; (😀, U+1F600) is incorrectly converted to the single garbled character U+F600, because the code point is truncated to 16 bits.

Steps to Reproduce

import org.springframework.web.util.HtmlUtils;

// The class name is illustrative; any class with a main method works.
public class HtmlUnescapeReproducer {

    public static void main(String[] args) {
        // Test character: 'Grinning Face' emoji (😀)
        // Unicode code point: U+1F600
        // Hexadecimal: 1F600
        // Decimal: 128512

        // 1. Input value as a decimal HTML entity
        String inputDecimal = "&#128512;";

        // 2. Input value as a hexadecimal HTML entity
        String inputHex = "&#x1F600;";

        // 3. The expected result after correct conversion
        String expectedOutput = "😀";

        System.out.println("--- Decimal HTML Entity Test ---");
        System.out.println("Input: " + inputDecimal);

        // Call the HtmlUtils.htmlUnescape() method
        String actualOutputDecimal = HtmlUtils.htmlUnescape(inputDecimal);

        System.out.println("Actual Output: " + actualOutputDecimal);
        System.out.println("Expected Output: " + expectedOutput);
        System.out.println("Result matches expected: " + expectedOutput.equals(actualOutputDecimal));

        System.out.println("\n--- Hexadecimal HTML Entity Test ---");
        System.out.println("Input: " + inputHex);

        // Call the HtmlUtils.htmlUnescape() method
        String actualOutputHex = HtmlUtils.htmlUnescape(inputHex);

        System.out.println("Actual Output: " + actualOutputHex);
        System.out.println("Expected Output: " + expectedOutput);
        System.out.println("Result matches expected: " + expectedOutput.equals(actualOutputHex));
    }
}

Cause

The root cause was a narrowing cast to a 16-bit char in HtmlCharacterEntityDecoder. The cast keeps only the low 16 bits, so any Unicode code point above U+FFFF loses its most significant bits (U+1F600 becomes U+F600).
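
The truncation can be reproduced without Spring. The following is a minimal sketch, not the actual decoder code: the class and the viaCharCast helper are hypothetical names that only mimic the lossy narrowing step described above.

```java
// Hypothetical demo class; it does not reproduce HtmlCharacterEntityDecoder itself.
class CharCastTruncation {

    // Mimics the lossy step: narrowing an int code point to a 16-bit char.
    static String viaCharCast(int codePoint) {
        return String.valueOf((char) codePoint);
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600; // decimal 128512

        String truncated = viaCharCast(grinningFace);

        // Only the low 16 bits survive the cast: 0x1F600 & 0xFFFF == 0xF600.
        System.out.printf("U+%04X%n", (int) truncated.charAt(0)); // prints U+F600
        System.out.println(truncated.length());                   // prints 1
    }
}
```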

Solution

This PR resolves the issue by replacing the direct (char) cast with a call to StringBuilder.appendCodePoint().

The appendCodePoint() method handles the full range of Unicode code points. It encodes a supplementary character as a surrogate pair (two char values), so numeric character references above U+FFFF are unescaped without data loss. A corresponding unit test has been added to verify this fix.
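
As a rough illustration of the approach (a sketch under assumptions, not the code from this PR), the snippet below decodes the numeric part of a character reference with StringBuilder.appendCodePoint(). The class name and the decodeNumeric helper are hypothetical; the real decoder also handles named entities and malformed input, which this sketch ignores.

```java
// Hypothetical sketch; names are illustrative and not taken from Spring.
class AppendCodePointSketch {

    // Parses the digits of a numeric reference (radix 10 or 16) and appends
    // the resulting code point without narrowing it to a char.
    static String decodeNumeric(String digits, int radix) {
        int codePoint = Integer.parseInt(digits, radix);
        StringBuilder sb = new StringBuilder();
        sb.appendCodePoint(codePoint); // emits a surrogate pair above U+FFFF
        return sb.toString();
    }

    public static void main(String[] args) {
        String fromHex = decodeNumeric("1F600", 16);
        String fromDecimal = decodeNumeric("128512", 10);

        System.out.println(fromHex.length());                  // prints 2
        System.out.println(fromHex.codePointAt(0) == 0x1F600); // prints true
        System.out.println(fromHex.equals(fromDecimal));       // prints true
    }
}
```

Both inputs yield the same two-char string (high surrogate U+D83D followed by low surrogate U+DE00), which is what the reproducer above expects from htmlUnescape().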

Signed-off-by: potato <65760583+juntae6942@users.noreply.github.com>
@juntae6942 juntae6942 force-pushed the fix/spring-framework-35426-htmlunescape-unicode branch from bc095df to 369ffe4 Compare September 14, 2025 04:41
Signed-off-by: potato <65760583+juntae6942@users.noreply.github.com>
@juntae6942 juntae6942 force-pushed the fix/spring-framework-35426-htmlunescape-unicode branch from 369ffe4 to a6efa2a Compare September 14, 2025 04:48
Signed-off-by: potato <65760583+juntae6942@users.noreply.github.com>
@rstoyanchev rstoyanchev added the in: web Issues in web modules (web, webmvc, webflux, websocket) label Nov 4, 2025
@bclozel bclozel self-assigned this Nov 17, 2025
@bclozel bclozel added type: bug A general bug and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels Nov 17, 2025
@bclozel bclozel added this to the 7.0.1 milestone Nov 17, 2025
@bclozel bclozel added for: backport-to-6.2.x Marks an issue as a candidate for backport to 6.2.x and removed for: backport-to-6.2.x Marks an issue as a candidate for backport to 6.2.x labels Nov 17, 2025
bclozel pushed a commit that referenced this pull request Nov 17, 2025
See gh-35477

Signed-off-by: potato <65760583+juntae6942@users.noreply.github.com>
@bclozel bclozel closed this in 87d95dc Nov 17, 2025
bclozel pushed a commit that referenced this pull request Nov 17, 2025
See gh-35477

Signed-off-by: potato <65760583+juntae6942@users.noreply.github.com>
bclozel added a commit that referenced this pull request Nov 17, 2025
@github-actions github-actions bot added the status: backported An issue that has been backported to maintenance branches label Nov 17, 2025
Labels

in: web Issues in web modules (web, webmvc, webflux, websocket) status: backported An issue that has been backported to maintenance branches type: bug A general bug

Development

Successfully merging this pull request may close these issues.

HtmlUtils.htmlUnescape() incorrect for numeric character references >= &#x10000; / &#65536;

4 participants