8195686: ISO-8859-8-i charset cannot be decoded, should be mapped to ISO-8859-8 #20690

psawant19 · 2024-08-23T10:38:38Z

Mapping ISO-8859-8-I charset to ISO-8859-8.
The following two aliases are added as part of this change:
ISO-8859-8-I
ISO8859-8-I

Bug report: https://bugs.openjdk.org/browse/JDK-8195686

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue
Change requires a CSR request matching fixVersion 24 to be approved (needs to be created)

Issue

JDK-8195686: ISO-8859-8-i charset cannot be decoded, should be mapped to ISO-8859-8 (Enhancement - P4)

Reviewers

Steven Loomis (@srl295 - Committer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20690/head:pull/20690
$ git checkout pull/20690

Update a local copy of the PR:
$ git checkout pull/20690
$ git pull https://git.openjdk.org/jdk.git pull/20690/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20690

View PR using the GUI difftool:
$ git pr show -t 20690

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20690.diff

Webrev

Link to Webrev Comment

Signed-off-by: Pratiksha.Sawant <Pratiksha.Sawant@ibm.com>

bridgekeeper · 2024-08-23T10:39:25Z

👋 Welcome back psawant19! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2024-08-23T10:40:05Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk · 2024-08-23T10:40:32Z

@psawant19 The following labels will be automatically applied to this pull request:

build
i18n
nio

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

mlbridge · 2024-08-23T10:43:30Z

Webrevs

00: Full (5eb2f6ca)

psawant19 · 2024-08-23T11:06:00Z

I have attached a test case for the charset issue.

Without the charset fix, the following failure is seen:

ISO-8859-8I charset testing
ISO-8859-8 bytes: 1C 1E DF FE 3F FD 
Exception in thread "main" java.io.UnsupportedEncodingException: ISO-8859-8-I
	at java.base/java.lang.String.lookupCharset(String.java:861)
	at java.base/java.lang.String.getBytes(String.java:1795)
	at iso88598.main(iso88598.java:8)

After applying the fix, the ISO-8859-8-I charset names resolve and produce the same bytes as ISO-8859-8:

ISO-8859-8I charset testing
ISO-8859-8 bytes: 1C 1E DF FE 3F FD 
ISO-8859-8-I bytes: 1C 1E DF FE 3F FD 
ISO8859-8-I bytes: 1C 1E DF FE 3F FD

iso88598.txt
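
The attached file is not reproduced in this thread. A minimal reconstruction that would print output in the same shape might look like the sketch below. The class name `Iso88598Repro`, the sample string, and the hex formatting are assumptions and are not taken from iso88598.txt.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical reconstruction of a standalone check; not the attached file.
class Iso88598Repro {
    // Formats bytes as space-separated upper-case hex, e.g. "E0 E1".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes text under the given charset name, or reports that the name is unknown.
    static String encodeOrReport(String text, String charsetName) {
        try {
            return hex(text.getBytes(charsetName));
        } catch (UnsupportedEncodingException e) {
            return "unsupported: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String text = "\u05D0\u05D1\u05D2"; // alef, bet, gimel (sample input, an assumption)
        for (String name : new String[] {"ISO-8859-8", "ISO-8859-8-I", "ISO8859-8-I"}) {
            System.out.println(name + " bytes: " + encodeOrReport(text, name));
        }
    }
}
```

On a JDK without the two new names, the second and third lines would report the name as unsupported instead of throwing.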

psawant19 · 2024-08-23T11:07:47Z

@jaikiran, could you please review my PR?

jaikiran · 2024-08-23T11:16:24Z

Hello Pratiksha, this is not an area that I have knowledge in. Naoto and Justin review changes in this area and I believe they will take a look at this when they are around.

Having said that, I notice that in your comment you mention that you ran a test with this change that fixes the issue. It looks like that was tested as a standalone application? Could you add that as a jtreg test to reproduce the issue and verify the fix?

naotoj · 2024-08-23T16:43:18Z

Hi,
Could you please elaborate on the rationale for implementing this encoding? As ISO-8859 encodings are pretty much obsolete, not sure it is worth adding this encoding now.

jmehrens · 2024-08-24T01:59:14Z

@naotoj The origin comes from an old JavaMail ticket that Bill Shannon was working on. The link is here: jakartaee/mail-api#302

I'm picking up where Bill left off and @psawant19 is just addressing the matching JDK bug. I can add the mappings to JakartaMail, but Bill wanted an evaluation of the root issue in the JDK before doing that. The history is in the linked ticket.

AlanBateman · 2024-08-25T07:13:48Z

/csr

openjdk · 2024-08-25T07:14:20Z

@AlanBateman has indicated that a compatibility and specification (CSR) request is needed for this pull request.

@psawant19 please create a CSR request for issue JDK-8195686 with the correct fix version. This pull request cannot be integrated until the CSR request is approved.

AlanBateman · 2024-08-25T07:21:10Z

I've added the "csr" label as this is adding support for "ISO8859-8-I".

Naoto asked me about it but I'm not 100% sure if it's an alias or a different charset. I think this topic may require input from those more familiar with charsets in environments that require bidi processing. Or, if the mappings are available, I think we can check whether they are identical to ISO8859-8.
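
One way to check whether two single-byte mappings are identical is to decode every byte value under both charsets and compare the results. The sketch below is only an illustration: `MappingCompare` and `sameSingleByteMapping` are hypothetical names, and since the JDK has no separate ISO-8859-8-I table, the second charset would have to come from elsewhere (for example, a custom provider).

```java
import java.nio.charset.Charset;

// Hypothetical helper for comparing two single-byte charsets; not from the PR.
class MappingCompare {
    // Returns true if every byte value 0x00..0xFF decodes to the same text in both charsets.
    // Unmappable bytes decode to U+FFFD in both cases, so they compare equal as well.
    static boolean sameSingleByteMapping(Charset a, Charset b) {
        for (int i = 0; i < 256; i++) {
            byte[] one = {(byte) i};
            if (!new String(one, a).equals(new String(one, b))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Charset hebrew = Charset.forName("ISO-8859-8");
        // In a real check, the second argument would be the candidate ISO-8859-8-I implementation.
        System.out.println(sameSingleByteMapping(hebrew, hebrew));
        System.out.println(sameSingleByteMapping(hebrew, Charset.forName("ISO-8859-1")));
    }
}
```

This only covers the decoding direction of single-byte charsets; it says nothing about bidi controls or presentation.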

psawant19 · 2024-08-26T13:03:17Z

"ISO-8859-8-I" is a charset name for the character encoding "ISO-8859-8" (https://en.wikipedia.org/wiki/ISO-8859-8-I).

We found two files in the JDK code base where charset aliases are added:

“src/java.xml/share/classes/com/sun/org/apache/xerces/internal/util/EncodingMap.java”
“/make/data/charsetmapping/charsets”

The “ISO-8859-8-I” charset is referenced in headers as the charset of the email contents by a few clients when the email is generated from the Middle East and China. As it is supposed to be a duplicate of ISO-8859-8, and ISO-8859-8-I is already supported in EncodingMap.java, supporting this encoding in the charsets file as well makes the behaviour consistent throughout the JDK.

There is a ticket raised in angus-mail for a similar issue: eclipse-ee4j/angus-mail#147

magicus · 2024-08-26T13:13:35Z

/label -build

openjdk · 2024-08-26T13:14:13Z

@magicus
The build label was successfully removed.

naotoj · 2024-08-27T17:01:19Z

I looked at this issue a bit more. In the IANA Charset registry (https://www.iana.org/assignments/character-sets/character-sets.xhtml), on which the Charset class is based, ISO-8859-8-I is not an alias of ISO-8859-8; it is defined as a distinct Preferred MIME name. So I don't think the currently proposed solution is correct (it would return ISO-8859-8-I as an alias of ISO-8859-8). Also, RFC 1556, in which the ISO-8859-8-I encoding is defined, defines other encodings as well, i.e., ISO-8859-6-I, ISO-8859-6-E, and ISO-8859-8-E. Why are they not relevant, but ISO-8859-8-I is?
Considering these, I am still not sure to introduce these new encodings now, also because there has not been any request from the time Bill Shannon worked (circa 2018), unless Arabic/Hebrew speaking communities jump in and provide a rationale to support them.

jmehrens · 2024-09-13T03:12:39Z

@naotoj does the mapping need to be removed from:

jdk/src/java.xml/share/classes/com/sun/org/apache/xerces/internal/util/EncodingMap.java

Line 770 in 5e5942a

    
           aIANA2JavaMap.put("ISO-8859-8-I",      "ISO8859_8"); // added since this encoding only differs w.r.t. presentation

I ask because JakartaMail / Angus Mail is a similar use case to this code.
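
The Xerces entry above is a plain name-to-name table, and a mail library could apply the same idea on its own side without any JDK change. The sketch below shows one possible shape. `MimeCharsets`, `resolve`, and the fallback table are hypothetical and are not part of JakartaMail, Angus Mail, or the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;
import java.util.Map;

// Hypothetical application-level lookup; not an existing library API.
class MimeCharsets {
    // MIME labels that differ from a JDK charset only in presentation (bidi) handling.
    private static final Map<String, String> FALLBACKS = Map.of(
            "iso-8859-8-i", "ISO-8859-8",
            "iso8859-8-i", "ISO-8859-8");

    // Resolves a MIME charset label, preferring whatever the running JDK supports natively.
    // Syntactically illegal names still raise IllegalCharsetNameException.
    static Charset resolve(String mimeName) {
        if (Charset.isSupported(mimeName)) {
            return Charset.forName(mimeName);
        }
        String fallback = FALLBACKS.get(mimeName.toLowerCase(Locale.ROOT));
        if (fallback == null) {
            throw new UnsupportedCharsetException(mimeName);
        }
        return Charset.forName(fallback);
    }

    public static void main(String[] args) {
        byte[] alef = {(byte) 0xE0};
        System.out.println(new String(alef, resolve("ISO-8859-8-I")).equals("\u05D0"));
    }
}
```

Because the native lookup is tried first, the fallback stops being used as soon as a JDK or an installed provider supplies the name itself.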

naotoj · 2024-09-13T16:31:21Z

@jmehrens I would like to, but I don't know the possible issues that would be caused by the removal. So my take is no.

jmehrens · 2024-09-13T23:52:30Z

@naotoj Makes sense. I did find a few links:

https://blog.netbsd.org/tnf/entry/handling_non_utf_8_hebrew

https://support.oracle.com/knowledge/Oracle%20Cloud/2991085_1.html

Any advice on adding the alias to JakartaMail? I see web search results of libraries using what is done in xerces so I'm trying to balance your advice with that.

naotoj · 2024-09-16T17:40:38Z

Sorry, but I cannot speak for Jakarta Mail. If they see ISO-8859-8-I encoding important, they may introduce it as a new charset (again it is not an alias to ISO-8859-8)

jmehrens · 2024-09-17T03:01:02Z

@naotoj

> Sorry, but I cannot speak for Jakarta Mail. If they see ISO-8859-8-I encoding important, they may introduce it as a new charset (again it is not an alias to ISO-8859-8)

Understood. I'll close out those tickets then with alternatives.

> ...Considering these, I am still not sure to introduce these new encodings now, also because there has not been any request from the time Bill Shannon worked (circa 2018)

Well that is not exactly true. The following are all the same ticket from 2018 as a request from JavaMail/JakartaMail:

https://bugs.openjdk.org/browse/JDK-8195686 (Ticket created by JavaMail user against OpenJDK by @jmiserez)
ISO-8859-8-i should be mapped to ISO-8859-8 (Hebrew, see caveats) jakartaee/mail-api#302 (migrated ticket from JavaMail to JakartaMail)
https://github.com/javaee/javamail/issues/302 (referenced in JDK-8195686 by @jmiserez)

The OpenJDK ticket JDK-8195686 has not had a proper evaluation since 2018. However, it looks like this PR has that covered, and I'm grateful for that.

Then in May of 2024 the following was created:

eclipse-ee4j/angus-mail#147 by @davecrighton on the Angus Mail project. Then in June @psawant19 commented on that ticket and later created this PR in OpenJDK. So 3 unique users and all related to JavaMail/JakartaMail/Angus Mail on this very topic.

It seems pretty clear that we would have to contribute the new Charset implementation to move this forward.

jmiserez · 2024-09-17T05:29:33Z

As the original bug submitter I might add that adding a mapping from ISO-8859-8-i to ISO-8859-8 is almost certainly correct and makes sense in the real world.

The character encodings for ISO-8859-8 and ISO-8859-8-i charsets are exactly the same, and the distinction is only due to historical reasons.

Email clients in the past did not "know about" right-to-left languages. Instead, the text was sent as regular ISO-8859-8 mail, line by line, but with each line reversed. The reversed lines are displayed LTR (left-to-right) as-is. This is what's known as "visual ordering", and it is required for old email clients.

Newer email clients can do right-to-left, i.e. their text display engines started to support RTL display. So it was no longer necessary to send emails in "visual order" with reversed lines. But now there's a problem: how does the email client know whether the text is in "visual order" (displayed as-is LTR) or in "logical order" (displayed as RTL text)?

Thus ISO-8859-8-i was introduced. The charset decoding is exactly the same as ISO-8859-8, the only difference is in instructing the email client to display the lines not as-is LTR, but RTL (more precisely the "-i" stands for "implicit mode", where the directionality depends on the content).

Old email clients cannot show these mails, as they do not know about ISO-8859-8-i and do not support RTL display anyway.

(Sidenote: there are also "ISO-8859-8" mails in the wild that are actually in logical order already. RTL applications are pretty good at figuring this out heuristically nowadays.)

The only drawback to adding the alias from ISO-8859-8-i to ISO-8859-8 arises if you have a very old application (email client) that cannot do RTL display, doesn't look at the charset, and has no heuristics for RTL, but runs on the newest JDK. Instead of showing an "unsupported charset" error, it would then read the email as LTR with each line reversed.

psawant19 · 2024-09-18T07:58:57Z

Based on our analysis, we've identified that the file “EncodingMap.java” includes an entry where "ISO-8859-8-I" is defined as an alias for "ISO8859_8." This entry is found in the headstream repository, and we believe it makes sense to include this in the charsets file as well.

Moreover, the original bug submitter, jmiserez, has expressed agreement with our proposed solution, as noted in the discussion here.

Even if we decide to create a new charset mapping for "ISO-8859-8-I," it would essentially mirror "ISO-8859-8," differing only in the naming convention. This would function similarly to creating an alias in the charsets file.

Therefore, we propose that this approach is valid and appropriate for implementation.

davecrighton · 2024-10-03T08:06:16Z

@naotoj In light of @jmiserez's and @psawant19's comments, does this change the position of the OpenJDK team?

We are interested as we are currently maintaining a fork of Jakarta mail in order to allow our customers to use this charset and would like to limit the amount of time we need to do this for.

For what it is worth our customer has deployed this into production and is successfully processing ISO-8859-8-i without any complaints from users.

Appreciate your work on reviewing this.

jmiserez · 2024-10-03T08:52:01Z

One more thing: I forgot to explain why the alias ISO-8859-8-i -> ISO-8859-8 would definitely be correct.

Java strings are stored in logical order. That is true for both LTR and RTL languages. This is plainly apparent from the OpenJDK String source code, but also explicitly mentioned/explained e.g. by official tutorials such as here: https://docs.oracle.com/javase/tutorial/2d/text/textlayoutbidirectionaltext.html#ordering_text

ISO-8859-8-i texts are always sent in logical order (by definition). So decoding an ISO-8859-8-i text into a Java string using the ISO-8859-8 alias will result in the correct order of characters in the Java string, i.e. logical order, and thus is always 100% correct by definition.

Technically speaking, and for completeness' sake, here is the full list of cases for regular ISO-8859-8 today:

1. ISO-8859-8 texts may contain either LTR language content, in which case the text is correctly decoded to a Java string in logical order. -> OK
2. ISO-8859-8 texts may also contain RTL language content in logical order (newer applications already do this), in which case the text is also correctly decoded to a Java string in logical order. -> OK.
3. But: If a ISO-8859-8 text contains RTL language content in visual order (old applications, historically the case), the text would be decoded to a Java string in visual order. This is actually technically incorrect and may be a source of bugs (e.g. concatenation won't work correctly). However this behavior cannot be changed in OpenJDK anymore as (old) applications may rely on it.

So: Case 2 is what would happen if the alias was added. Now as long as nobody adds an "auto-reverse visual to logical order" heuristic for RTL ISO-8859-8 text decoding in OpenJDK (which I'm fairly certain can't / mustn't be done), using a simple alias ISO-8859-8-i -> ISO-8859-8 will thus always be correct. The alias will result in case 2, i.e. texts will always be decoded into the correct Java string in logical order.
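
The difference between the logical-order and visual-order cases can be illustrated with a four-letter Hebrew word. This is a simplified sketch: `BidiOrderDemo` is a hypothetical name, the sample word is an assumption, and reversing a whole line is only valid for purely right-to-left text (real visual-to-logical conversion needs the Unicode Bidirectional Algorithm).

```java
import java.nio.charset.Charset;

// Hypothetical demo class; not from the PR.
class BidiOrderDemo {
    static final Charset HEBREW = Charset.forName("ISO-8859-8");

    // Decoding is a pure byte-to-char mapping; it never reorders anything.
    static String decode(byte[] bytes) {
        return new String(bytes, HEBREW);
    }

    // Simplified: reverses one line that contains only right-to-left characters.
    static String reverseLine(String line) {
        return new StringBuilder(line).reverse().toString();
    }

    public static void main(String[] args) {
        // shin, lamed, vav, final mem in logical (reading) order
        String word = "\u05E9\u05DC\u05D5\u05DD";
        byte[] logical = {(byte) 0xF9, (byte) 0xEC, (byte) 0xE5, (byte) 0xED};
        // The same word as a legacy client would send it: the line reversed.
        byte[] visual = {(byte) 0xED, (byte) 0xE5, (byte) 0xEC, (byte) 0xF9};

        System.out.println(decode(logical).equals(word));             // true: logical order in, logical order out
        System.out.println(decode(visual).equals(word));              // false: the string stays in visual order
        System.out.println(reverseLine(decode(visual)).equals(word)); // true: reordering is left to the application
    }
}
```

The decoder produces the same characters for the same bytes in every case; only the order in which the sender wrote the bytes differs.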

srl295

This PR is the right way to handle it.

> As ISO-8859 encodings are pretty much obsolete, not sure it is worth adding this encoding now.

Yes, but ISO-8859-8-I is still referenced by WHATWG as well. It's up to the application layer to distinguish between visual and logical order; it doesn't make sense for a converter to try to do anything.

Anyway, it would be ISO-8859-8-E that would have explicit visual controls in it. ISO-8859-8-I as an encoding will match exactly the data found in the wild for ISO-8859-8.

IBM's and ICU's mapping tables have had this equivalence for 25+ years. Merging this PR corrects the oversight in the ISO-8859-8 compatibility.

I think it would be fine to say that ISO-8859-8-E is not supported here, as it would be ISO-8859-8 / ISO-8859-8-I but with additional controls requiring a shaper. That could be mentioned in a comment.

srl295 · 2024-10-09T20:24:37Z

@jmiserez wrote:

> But: If a ISO-8859-8 text contains RTL language content in visual order (old applications, historically the case), the text would be decoded to a Java string in visual order. This is actually technically incorrect and may be a source of bugs (e.g. concatenation won't work correctly). However this behavior cannot be changed in OpenJDK anymore as (old) applications may rely on it.

In other words, Java may have been incorrectly handling ISO-8859-8 all this time if content was in visual order. Putting in this alias means that ISO-8859-8-I will be handled correctly.

psawant19 · 2024-10-10T05:23:45Z

Thank you @jmiserez and @srl295 for the approval.

psawant19 · 2024-10-10T05:30:41Z

@AlanBateman Since the mapping is just an alias to ISO-8859-8, do we still need a CSR request to be created for the pull request?

srl295 · 2024-10-10T11:09:28Z

@naotoj does it make sense?

naotoj · 2024-10-10T16:32:48Z

> @naotoj does it make sense?

Sorry, but I still don't believe that making "ISO-8859-8-I" as an alias to "ISO-8859-8" is the right solution, per the IANA character sets definition (https://www.iana.org/assignments/character-sets/character-sets.xhtml). The current PR would make "ISO-8859-8-I" charset appear in Charset.forName("ISO-8859-8").aliases(), but not in Charset.availableCharsets() which is deemed incorrect to me.
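
The distinction between a canonical name and an alias can be observed with any charset that already has aliases. The sketch below lists the aliases of a charset that do not appear as keys of `Charset.availableCharsets()`; `AliasVisibility` and `aliasOnlyNames` are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not from the PR.
class AliasVisibility {
    // Returns the aliases of cs that are not keys of Charset.availableCharsets(),
    // which maps canonical names to charset objects.
    static Set<String> aliasOnlyNames(Charset cs) {
        Set<String> canonicalNames = Charset.availableCharsets().keySet();
        Set<String> result = new TreeSet<>();
        for (String alias : cs.aliases()) {
            if (!canonicalNames.contains(alias)) {
                result.add(alias);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Charset hebrew = Charset.forName("ISO-8859-8");
        System.out.println("canonical name: " + hebrew.name());
        System.out.println("alias-only names: " + aliasOnlyNames(hebrew));
    }
}
```

With the alias approach, "ISO-8859-8-I" would show up in the second line of output and nowhere among the canonical names.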

That said, I wonder whether this issue could be better addressed through the Charset SPI. This way, mail servers can install the "ISO-8859-8-I" charset themselves, and do not need to rely on the underlying JDK, which may or may not have that charset.

justin-curtis-lu · 2024-10-10T16:51:09Z

> Sorry, but I still don't believe that making "ISO-8859-8-I" as an alias to "ISO-8859-8" is the right solution, per the IANA character sets definition (https://www.iana.org/assignments/character-sets/character-sets.xhtml). The current PR would make "ISO-8859-8-I" charset appear in Charset.forName("ISO-8859-8").aliases(), but not in Charset.availableCharsets() which is deemed incorrect to me.

I agree. From the Charset specification,

> If a charset listed in the IANA Charset Registry is supported by an implementation of the Java platform then its canonical name must be the name listed in the registry. Many charsets are given more than one name in the registry, in which case the registry identifies one of the names as MIME-preferred. If a charset has more than one registry name then its canonical name must be the MIME-preferred name and the other names in the registry must be valid aliases.

Practically speaking it does seem to be an alias, but implementing as such would violate the Charset specification. So either defining as a new Charset for ISO-8859-8-I (if there is sufficient demand) or as Naoto pointed out, utilize the CharsetProvider would seem like appropriate solutions to me. A pro to the SPI solution is that you can also easily include all the other bidi supported implicit/explicit Charsets as well.
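
A rough sketch of the SPI route follows, assuming a charset that carries its own canonical name and simply reuses the ISO-8859-8 coders. The class names are hypothetical, registry aliases are left out, and this is not code from the PR. For a real deployment the provider would need to be a public class with a public no-argument constructor, listed in a `META-INF/services/java.nio.charset.spi.CharsetProvider` file.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical provider; names are illustrative only.
class Iso88598IProvider extends CharsetProvider {
    // A distinct charset with its own canonical name that delegates to ISO-8859-8.
    static final class Iso88598I extends Charset {
        private final Charset base = Charset.forName("ISO-8859-8");

        Iso88598I() {
            super("ISO-8859-8-I", new String[0]); // aliases omitted in this sketch
        }

        @Override
        public boolean contains(Charset cs) {
            return cs instanceof Iso88598I || base.contains(cs);
        }

        // Caveat: the returned coders report ISO-8859-8 from their charset() method.
        @Override
        public CharsetDecoder newDecoder() {
            return base.newDecoder();
        }

        @Override
        public CharsetEncoder newEncoder() {
            return base.newEncoder();
        }
    }

    private final Charset charset = new Iso88598I();

    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(charset).iterator();
    }

    @Override
    public Charset charsetForName(String name) {
        return charset.name().equalsIgnoreCase(name) ? charset : null;
    }
}
```

Once registered, `Charset.forName("ISO-8859-8-I")` would return this charset, and it would be listed by `Charset.availableCharsets()` under its own name rather than as an alias of ISO-8859-8.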

srl295 · 2024-10-10T19:51:27Z

Fair enough and feel free to reject my review if need be. It seems like from an API perspective, you are both saying it should be a new Charset provider (though with identical behavior) but separate and not an alias. That preserves the invariant about IANA registration. It does still seem that the JDK is probably currently treating ISO-8859-8 as if it were ISO-8859-8-I. I wonder why the implementation was done the way it is, but that’s only of historical interest.

jmehrens · 2024-10-11T13:36:49Z

> ...(if there is sufficient demand)...

I don't fully understand the conditional acceptance. Can't @psawant19 abandon the alias PR and use the existing ISO-8859-8 source from OpenJDK to create new ISO-8859-8-I Charset? The level of effort to share common code, proxy wrap, and so forth between the two Charsets wouldn't be much of a lift or long-term debt. If the community is willing to do the work, then acceptance is really a willingness to approve the change. Are all solutions housed in OpenJDK for this a no?

> Naoto pointed out, utilize the CharsetProvider would seem like appropriate solutions to me.

That has been the solution suggested for years. The alternatives have been documented in the JavaMail/JakartaMail FAQ. I copied them into the ticket here:
eclipse-ee4j/angus-mail#147 (comment)

I'll leave that AngusMail ticket open until this comes to a close.

srl295 · 2024-10-11T13:41:29Z

> Can't @psawant19 abandon the alias PR and use the existing ISO-8859-8 source from OpenJDK to create new ISO-8859-8-I Charset?

That would seem to be what @naotoj stated would make the API contract (concerning IANA identity) correct.

jmehrens · 2024-10-11T15:55:06Z

> That would seem to be what @naotoj stated would make the API contract (concerning IANA identity) correct.

Correct. I gathered that point. What I was trying to convey is that the contribution of the intellectual property is from OpenJDK itself so there is proven track record of quality of the code.

The alias route is dead, done, rejected. Rejecting a PR that is a 'clone of another charset' would reflect either a compatibility concern or an unwillingness to accept the new charset.

Just trying to find a path forward on this. My intent is to figure out why the charset approach would be rejected on the grounds that ISO-8859-8-I is "obsolete", does not have "sufficient demand", or is not "important" enough. These are the rejection words sprinkled through the thread.

Contributors are here to help work on this. Working on obsolete, unpopular, unimportant stuff is what we do sometimes. We just need direction.

srl295 · 2024-10-11T15:59:25Z

I noticed that the embedded xerces treats 8859-8-I as 8859-8 here:

jdk/src/java.xml/share/classes/com/sun/org/apache/xerces/internal/util/EncodingMap.java

Line 770 in 7276a1b

aIANA2JavaMap.put("ISO-8859-8-I", "ISO8859_8"); // added since this encoding only differs w.r.t. presentation

Mapping ISO-8859-8-I charset to ISO-8859-8.

5eb2f6c

Signed-off-by: Pratiksha.Sawant <Pratiksha.Sawant@ibm.com>

openjdk bot added the rfr Pull request is ready for review label Aug 23, 2024

openjdk bot added build build-dev@openjdk.org nio nio-dev@openjdk.org i18n i18n-dev@openjdk.org labels Aug 23, 2024

openjdk bot added the csr Pull request needs approved CSR before integration label Aug 25, 2024

openjdk bot removed the build build-dev@openjdk.org label Aug 26, 2024

jmehrens mentioned this pull request Aug 29, 2024

Continuation of #107 UnsupportedEncodingException eclipse-ee4j/angus-mail#147

Open

srl295 approved these changes Oct 9, 2024
