Encoding issues with Eclipse WTP HTML format special chars #545

source-knights · 2020-03-23T22:31:38Z

Hi, I am using the Spotless Maven plugin version 1.28.0 and Eclipse WTP 4.13.0 (I tried previous versions as well). I'm on Windows 10. I tried 3 different developer machines, all showing the same issue.

Whenever I use Eclipse WTP / Spotless to format HTML 5 files, the German special chars such as üöäÜÖÄß and the Euro sign € are changed to "Ã¼Ã¶Ã¤ÃœÃ–Ã„ÃŸâ‚¬". I understand that this is what the UTF-8 bytes of these chars look like if the file is wrongly read with a non-UTF-8 encoding. But as I use UTF-8 in all editors, in the HTML itself and in the Spotless config, I don't understand why the formatter changes the files like that.
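The garbling described above can be reproduced outside of Spotless. The following is a minimal sketch, assuming the wrong decoder is windows-1252 (the thread later mentions CP 1252 as a common Windows default); the class and method names are made up for illustration and are not part of Spotless or Eclipse WTP.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    /** Encodes text as UTF-8, then (wrongly) decodes those bytes as windows-1252. */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // Assumption: windows-1252 stands in for the "old" charset.
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Unicode escapes for the eight sample chars keep this file independent
        // of the encoding the compiler assumes for the source.
        String original = "\u00FC\u00F6\u00E4\u00DC\u00D6\u00C4\u00DF\u20AC";
        String garbled = misdecode(original);
        // prints "8 chars -> 17 chars"
        System.out.println(original.length() + " chars -> " + garbled.length() + " chars");
    }
}
```

Each umlaut becomes two chars and the Euro sign becomes three, which matches the garbled title reported in this issue.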

I managed to reproduce this in a simple Maven project with only the pom.xml below and the pasted HTML file.

Sample HTML5 file (which I save as UTF-8; Eclipse, IntelliJ and even Notepad++ all lead to the same problem):

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>üöäÜÖÄß€</title>
</head>
<body>
Test
</body>
</html>

My pom

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.sourceknights.test</groupId>
  <artifactId>spotlesstest</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>spotlesstest</name>
  
  <build>
    <plugins>
      <plugin>
        <groupId>com.diffplug.spotless</groupId>
        <artifactId>spotless-maven-plugin</artifactId>
        <version>1.28.0</version>
        <configuration>
          <encoding>UTF-8</encoding>
          <formats>
            <format>
              <encoding>UTF-8</encoding>
              <includes>
                <include>src/**/*.html</include>
              </includes>
              <eclipseWtp>
                <!-- Specify the WTP formatter type (XML, JS, ...) -->
                <type>HTML</type>
                <!-- Optional, available versions: https://github.com/diffplug/spotless/tree/master/lib-extra/src/main/resources/com/diffplug/spotless/extra/eclipse_wtp_formatters -->
                <version>4.13.0</version>
              </eclipseWtp>
            </format>
          </formats>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Does anyone have an idea what I am doing wrong? All these special chars are proper UTF-8 chars and allowed in HTML5, so they should not be changed.

Thxalot and stay healthy


source-knights · 2020-03-23T22:35:45Z

Just to clarify, after formatting with mvn spotless:apply the HTML5 file is changed to

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Ã¼Ã¶Ã¤ÃœÃ–Ã„ÃŸâ‚¬</title>
</head>
<body>

</body>
</html>

nedtwigg · 2020-03-24T04:13:03Z

My untested suspicion is that this bug is specific to the Eclipse WTP formatter, e.g. you would not see this if you used the replace step. My guess is that somewhere in this shim code, we have to tell Eclipse to use UTF-8.

spotless/_ext/eclipse-wtp/src/main/java/com/diffplug/spotless/extra/eclipse/wtp/EclipseHtmlFormatterStepImpl.java

Lines 60 to 70 in 8aab108

    
    raw = super.format(raw);
    // Not sure how Eclipse binds the JS formatter to HTML. The formatting is accomplished manually instead.
    IStructuredDocument document = (IStructuredDocument) new HTMLDocumentLoader().createNewStructuredDocument();
    document.setPreferredLineDelimiter(LINE_DELIMITER);
    document.set(raw);
    StructuredDocumentProcessor<CodeFormatter> jsProcessor = new StructuredDocumentProcessor<CodeFormatter>(
            document, IHTMLPartitions.SCRIPT, JsRegionProcessor.createFactory(htmlFormatterIndent));
    jsProcessor.apply(jsFormatter);
    return document.get();

Since we're only passing Strings back and forth, and Java Strings are always Unicode, it shouldn't matter, but it wouldn't shock me if there is Eclipse code that roundtrips through binary while assuming an old charset unless you explicitly set it. But it's easy for us to make a test case that confirms whether or not this is Eclipse-WTP specific, and if it is, then there aren't that many places to look for a fix. @fvgh does this seem plausible to you?

source-knights · 2020-03-24T08:16:11Z

Actually I currently use the replace step after the Eclipse WTP step to put the special chars back in as a workaround. So the replace step does not have an encoding problem, only Eclipse WTP does. Sadly the workaround leads to problems with line length: because each special char temporarily takes 2 chars, a line can exceed the max line length and get wrapped when it should not be.

I also tested Eclipse WTP with XML; that is fine and leaves the üöä as they are.

fvgh · 2020-03-24T08:17:00Z

Java internally uses UTF-16 (originally it used UCS-2, but to my understanding they switched).
Spotless uses the configured encoding (in your case UTF-8) for reading and writing.
So according to your configuration, Spotless should do a UTF-8 to UTF-16 conversion for reading and the reverse conversion afterwards.
Could you provide a HEX dump of the input file?
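The read/write conversion described above can be illustrated with plain JDK calls. This is only a sketch of the idea, not the actual Spotless code; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripDemo {
    /** Decodes file bytes into a String and encodes them back, both as UTF-8. */
    static byte[] roundTrip(byte[] fileBytes) {
        // Reading: UTF-8 bytes become UTF-16 code units inside the String.
        String inMemory = new String(fileBytes, StandardCharsets.UTF_8);
        // A formatter would transform inMemory here; this sketch leaves it unchanged.
        // Writing: the String is encoded back to UTF-8 bytes.
        return inMemory.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The bytes C3 BC are the UTF-8 encoding of one small u-umlaut.
        byte[] input = {(byte) 0xC3, (byte) 0xBC};
        // prints "true"
        System.out.println(Arrays.equals(input, roundTrip(input)));
    }
}
```

As long as both directions name the same charset explicitly, valid input bytes come back unchanged.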

When opening the modified file, be aware that neither WTP nor Spotless adds a byte order mark (BOM). If the input contains no BOM, the output contains no BOM.

*NIX users tend not to care about the BOM. If an application sees an extension code, it looks up the UTF extensions anyway.
For Windows developers the BOM is crucial, since some editors still default to CP 1252 unless they find a BOM at the beginning of the file.
Be aware that a BOM is optional according to the standard, and I by no means intend to encourage BOM usage.

Could you provide a HEX dump of the output file? I would like to check whether a BOM got lost or (as I expect) the output is a valid translation of the input without a BOM.

I expect that you switched all your IDEs to use UTF-8 by default, right? If not, I recommend it when you want to work with UTF. I had trouble in the past when a developer (using a JetBrains editor) messed up a UTF-8 file, since there was no BOM.

fvgh · 2020-03-24T08:27:10Z

@source-knights Sorry, just found a mistake in my previous comment. I would like to see the HEX of input and output.
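For anyone who needs to produce such a HEX dump without a dedicated tool, here is a minimal sketch using only the JDK. The class name and the fallback sample are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class HexDumpDemo {
    /** Renders bytes as space-separated, two-digit uppercase hex values. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // With a file path argument, dump that file; otherwise dump a small sample
        // (small u-umlaut followed by the Euro sign, encoded as UTF-8).
        byte[] bytes = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "\u00FC\u20AC".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(bytes));
    }
}
```

In a correct UTF-8 file, ü shows up as C3 BC. A leading EF BB BF would indicate a UTF-8 BOM.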

fvgh · 2020-03-24T08:30:26Z

@nedtwigg I quickly added a test on the WTP side to deal with UTF-8 characters. There were no problems. But I must admit, I am not 100% sure that we handle a BOM correctly. Currently the reading/writing just passes the byte sequence on to the formatters. Not sure whether this is a good idea.

source-knights · 2020-03-24T08:57:18Z

Hi, here are the HEX contents. Please also see my comment above that the Eclipse WTP XML formatter does not have that issue.

Input (the one with correct üöäÜÖÄß€):

Output:

fvgh · 2020-03-24T11:23:17Z

@source-knights I may have found the problem. In the meantime, could you set the Java system property file.encoding to UTF-8? I am afraid I have no better workaround.

source-knights · 2020-03-24T11:53:08Z

I can confirm all fine when I use
mvn spotless:apply -Dfile.encoding=UTF-8

Thx for looking into this so quickly. If there is anything I can do to help, just shout.
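A plausible reason the -Dfile.encoding=UTF-8 workaround helps is that JDK calls which omit the charset fall back to the JVM default, which that property influences on older JDKs. Which call inside WTP was actually affected is not stated in this thread, so the following is only a sketch of the general mechanism with made-up names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class DefaultCharsetDemo {
    /** Encodes with whatever the JVM default charset happens to be. */
    static byte[] encodeWithDefault(String text) {
        return text.getBytes(Charset.defaultCharset());
    }

    /** Encodes with an explicit charset, independent of JVM settings. */
    static byte[] encodeExplicit(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u00FC\u00F6\u00E4"; // three small umlauts
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("same bytes as UTF-8: "
                + Arrays.equals(encodeWithDefault(text), encodeExplicit(text)));
    }
}
```

On a Windows machine with an older JDK the second line would likely print false unless the property is set. On JDK 18 and later the default is UTF-8 anyway, so the difference would not show up there.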

fvgh · 2020-03-24T16:46:54Z

I took the liberty of deleting a few of my previous comments regarding error analysis. I was in a hurry and lacking caffeine. The comments were not correct.

The Spotless framework ensures a conversion from the specified format to the internal format UTF-16. That's also problematic when it comes to the BOM, since Java does not strip it.

However, @nedtwigg was right to suspect my WTP implementation. It always needs to use UTF-16, since Spotless already did the decoding, as I highlighted in my initial comment.
The BOM is currently stripped by WTP. This should be discussed in a separate issue, since it does not make sense to give it to the formatters in the first place, as stated before.
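The remark that Java does not strip the BOM can be checked with a few lines of JDK code. This is a minimal sketch; the stripBom helper is hypothetical and is not how Spotless or WTP handle it.

```java
import java.nio.charset.StandardCharsets;

final class BomDemo {
    static final char BOM = '\uFEFF';

    /** Hypothetical helper: drops a leading BOM char if one is present. */
    static String stripBom(String text) {
        return (!text.isEmpty() && text.charAt(0) == BOM) ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 BOM, followed by the three ASCII bytes of "<p>".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'p', '>'};
        String decoded = new String(withBom, StandardCharsets.UTF_8);
        System.out.println(decoded.length());           // prints 4: the BOM survives decoding
        System.out.println(stripBom(decoded).length()); // prints 3
    }
}
```

So any component that receives the decoded String sees the BOM as an ordinary first character unless something removes it explicitly.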

source-knights · 2020-03-25T12:37:06Z

Thxalot for the quick fix. Now I just need TypeScript checks as a Maven plugin... Will look into that later, maybe I can code it :)

nedtwigg · 2020-04-02T07:28:59Z

Fixed in gradle 3.28.1, maven 1.29.0

nedtwigg added the bug-unconfirmed label Mar 24, 2020

fvgh changed the title ~~Encoding issues with Eclipse WTP HTML format and german special chars~~ Encoding issues with Eclipse WTP HTML format special chars Mar 24, 2020

nedtwigg mentioned this issue Mar 24, 2020

Fix Eclipse WTP encoding handling #546

Merged

nedtwigg closed this as completed in 94010f4 Mar 26, 2020

fvgh added a commit that referenced this issue Mar 28, 2020

Replaced spotless-eclipse-wtp 3.15.2 by 3.15.3. Fixes #545.

27e94e6

fvgh mentioned this issue Mar 28, 2020

Replaced spotless-eclipse-wtp 3.15.2 by 3.15.3. Fixes #545. #550

Merged

nedtwigg pushed a commit that referenced this issue Mar 28, 2020

Replaced spotless-eclipse-wtp 3.15.2 by 3.15.3. Fixes #545.

64849c8
