Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8195686: ISO-8859-8-i charset cannot be decoded, should be mapped to ISO-8859-8 #20690

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

psawant19
Copy link

@psawant19 psawant19 commented Aug 23, 2024

Mapping ISO-8859-8-I charset to ISO-8859-8.
Below mentioned 2 aliases are added as part of this:-
ISO-8859-8-I
ISO8859-8-I

The bug report for the same:- https://bugs.openjdk.org/browse/JDK-8195686


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change requires a CSR request matching fixVersion 24 to be approved (needs to be created)

Issue

  • JDK-8195686: ISO-8859-8-i charset cannot be decoded, should be mapped to ISO-8859-8 (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20690/head:pull/20690
$ git checkout pull/20690

Update a local copy of the PR:
$ git checkout pull/20690
$ git pull https://git.openjdk.org/jdk.git pull/20690/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20690

View PR using the GUI difftool:
$ git pr show -t 20690

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20690.diff

Webrev

Link to Webrev Comment

Signed-off-by: Pratiksha.Sawant <Pratiksha.Sawant@ibm.com>
@bridgekeeper
Copy link

bridgekeeper bot commented Aug 23, 2024

👋 Welcome back psawant19! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Aug 23, 2024

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 23, 2024
@openjdk
Copy link

openjdk bot commented Aug 23, 2024

@psawant19 The following labels will be automatically applied to this pull request:

  • build
  • i18n
  • nio

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added build build-dev@openjdk.org nio nio-dev@openjdk.org i18n i18n-dev@openjdk.org labels Aug 23, 2024
@mlbridge
Copy link

mlbridge bot commented Aug 23, 2024

Webrevs

@psawant19
Copy link
Author

I have attached a test case for the charset issue.

Without the charset fix, below issue is seen:

ISO-8859-8I charset testing
ISO-8859-8 bytes: 1C 1E DF FE 3F FD 
Exception in thread "main" java.io.UnsupportedEncodingException: ISO-8859-8-I
	at java.base/java.lang.String.lookupCharset(String.java:861)
	at java.base/java.lang.String.getBytes(String.java:1795)
	at iso88598.main(iso88598.java:8)

After applying the fix, able to decode characters using ISO-8859-8-I charset.

ISO-8859-8I charset testing
ISO-8859-8 bytes: 1C 1E DF FE 3F FD 
ISO-8859-8-I bytes: 1C 1E DF FE 3F FD 
ISO8859-8-I bytes: 1C 1E DF FE 3F FD

iso88598.txt

@psawant19
Copy link
Author

@jaikiran, could you please review my PR.

@jaikiran
Copy link
Member

Hello Pratiksha, this is not an area that I have knowledge in. Naoto and Justin review changes in this area and I believe they will take a look at this when they are around.

Having said that, I notice that in your comment you mention that you ran a test with this change that fixes the issue. It looks like that was tested as a standalone application? Could you add that as a jtreg test to reproduce the issue and verify the fix?

@naotoj
Copy link
Member

naotoj commented Aug 23, 2024

Hi,
Could you please elaborate the rationale to implement this encoding? As ISO-8859 encodings are pretty much obsolete, not sure it is worth adding this encoding now.

@jmehrens
Copy link

jmehrens commented Aug 24, 2024

@naotoj The origin comes from an old JavaMail ticket that Bill Shannon was working on. The link is here: jakartaee/mail-api#302

I'm picking up where Bill left off and @psawant19 is just addressing the matching jdk bug. I can add the mappings to JakaraMail but Bill wanted the evaluation of root issue in the JDK before doing that. The history is in the linked ticket.

@AlanBateman
Copy link
Contributor

/csr

@openjdk openjdk bot added the csr Pull request needs approved CSR before integration label Aug 25, 2024
@openjdk
Copy link

openjdk bot commented Aug 25, 2024

@AlanBateman has indicated that a compatibility and specification (CSR) request is needed for this pull request.

@psawant19 please create a CSR request for issue JDK-8195686 with the correct fix version. This pull request cannot be integrated until the CSR request is approved.

@AlanBateman
Copy link
Contributor

AlanBateman commented Aug 25, 2024

I've added the "csr" label as this is adding support for "ISO8859-8-I".

Naoto asked me about it but I'm not 100% sure if it's an alias or a different charset. I think this topic may require input from those more familiar with charsets in environment that require bidi processing. Or if the mappings are available then I think we can see if they are identical to ISO8859-8.

@psawant19
Copy link
Author

psawant19 commented Aug 26, 2024

"ISO-8859-8-I" is a charset name for character encoding "ISO-8859-8".(https://en.wikipedia.org/wiki/ISO-8859-8-I).

We had found 2 files where the aliases for charsets are added in jdk code base.

“src/java.xml/share/classes/com/sun/org/apache/xerces/internal/util/EncodingMap.java”
“/make/data/charsetmapping/charsets”

“ISO-8859-8-I” charset is referenced in the headers as the charset of the email contents in few clients when the email is generated from Middle East and China. As it is supposed to be a duplicate of ISO-8859-8, and we are supporting this ISO-8859-8-I in EncodingMap.java, supporting this encoding in charsets file also makes the behaviour consistent through the JDK.

There is a ticket raised in angus-mail for similar issue :- eclipse-ee4j/angus-mail#147

@magicus
Copy link
Member

magicus commented Aug 26, 2024

/label -build

@openjdk openjdk bot removed the build build-dev@openjdk.org label Aug 26, 2024
@openjdk
Copy link

openjdk bot commented Aug 26, 2024

@magicus
The build label was successfully removed.

@naotoj
Copy link
Member

naotoj commented Aug 27, 2024

I looked at this issue a bit more. Looking at the IANA Charset registry (https://www.iana.org/assignments/character-sets/character-sets.xhtml) which Charset class is based on, ISO-8859-8-I is not an alias to ISO-8859-8, but it is defined as a distinct Preferred MIME name. So I don't think current proposed solution is correct. (It would return ISO-8859-8-I as an alias to ISO-8859-8). Also, looking at the RFC-1556, in which this ISO-8859-8-I encoding is defined, there are other encodings, i.e., ISO-8859-6-I, ISO-8859-6-E, and ISO-8859-8-E. Why are they not relevant, but ISO-8859-8-I is?
Considering these, I am still not sure to introduce these new encodings now, also because there has not been any request from the time Bill Shannon worked (circa 2018), unless Arabic/Hebrew speaking communities jumped in and provide rationale to support them.

@jmehrens
Copy link

@naotoj does the mapping need to be removed from:

aIANA2JavaMap.put("ISO-8859-8-I", "ISO8859_8"); // added since this encoding only differs w.r.t. presentation

I ask because JakartaMail /Angus Mail is a similar usecase to this code.

@naotoj
Copy link
Member

naotoj commented Sep 13, 2024

@jmehrens I would like to, but I don't know the possible issues that would be caused by the removal. So my take is no.

@jmehrens
Copy link

@naotoj Makes sense. I did find a few links:

https://blog.netbsd.org/tnf/entry/handling_non_utf_8_hebrew

https://support.oracle.com/knowledge/Oracle%20Cloud/2991085_1.html

Any advice on adding the alias to JakartaMail? I see web search results of libraries using what is done in xerces so I'm trying to balance your advice with that.

@naotoj
Copy link
Member

naotoj commented Sep 16, 2024

Sorry, but I cannot speak for Jakarta Mail. If they see ISO-8859-8-I encoding important, they may introduce it as a new charset (again it is not an alias to ISO-8859-8)

@jmehrens
Copy link

@naotoj

Sorry, but I cannot speak for Jakarta Mail. If they see ISO-8859-8-I encoding important, they may introduce it as a new charset (again it is not an alias to ISO-8859-8)

Understood. I'll close out those tickets then with alternatives.

...Considering these, I am still not sure to introduce these new encodings now, also because there has not been any request from the time Bill Shannon worked (circa 2018)

Well that is not exactly true. The following are all the same ticket from 2018 as a request from JavaMail/JakartaMail:

The OpenJDK ticket JDK-8195686 has not had a proper evaluation since 2018. However, looks like this PR has that covered and I'm grateful for that.

Then in May of 2024 the following was created:

eclipse-ee4j/angus-mail#147 by @davecrighton on the Angus Mail project. Then in June @psawant19 commented on that ticket and later created this PR in OpenJDK. So 3 unique users and all related to JavaMail/JakartaMail/Angus Mail on this very topic.

It seems pretty clear that we would have to contribute the new Charset implementation to move this forward.

@jmiserez
Copy link

jmiserez commented Sep 17, 2024

As the original bug submitter I might add that adding a mapping from ISO-8859-8-i to ISO-8859-8 is almost certainly correct and makes sense in the real world.

The character encodings for ISO-8859-8 and ISO-8859-8-i charsets are exactly the same, and the distinction is only due to historical reasons.

Email clients in the past did not "know about" right-to-left languages, instead the text was sent as regular ISO-8859-8 mail but sent line-by-line but with each line reversed. The reversed lines are displayed LTR (left-to-right) as-is. This is what's known as "visual ordering", and is required for old email clients.

Newer email clients can do right-to-left, i.e. their text display engines started to support RTL display. So it was no longer necessary to send emails in "visual order" with reversed lines. But now there's a problem: how does the email client know whether the text is in "visual order" (displayed as-is LTR) or in "logical order" (displayed as RTL text).

Thus ISO-8859-8-i was introduced. The charset decoding is exactly the same as ISO-8859-8, the only difference is in instructing the email client to display the lines not as-is LTR, but RTL (more precisely the "-i" stands for "implicit mode", where the directionality depends on the content).

Old email clients cannot show these mails, as they do not know about ISO-8859-8-i and do not support RTL display anyways.

(Sidenote: there are also "ISO-8859-8" mails in the wild that are actually in logical order already. RTL applications are pretty good at figuring this out heuristically nowadays.)

The only drawback to adding the alias from ISO-8859-8-i to ISO-8859-8 is if you have a very old application (email client) that cannot do RTL display , doesn't look at the charset, has no heuristics for RTL, but used the newest JDK. Instead of showing an "unsupported charset" error it would then read the email as LTR with each line reversed.

@psawant19
Copy link
Author

psawant19 commented Sep 18, 2024

Based on our analysis, we've identified that the file “EncodingMap.java” includes an entry where "ISO-8859-8-I" is defined as an alias for "ISO8859_8." This entry is found in the headstream repository, and we believe it makes sense to include this in the charsets file as well.

Moreover, the original bug submitter, jmiserez has expressed agreement with our proposed solution, as noted in the discussion here.

Even if we decide to create a new charset mapping for "ISO-8859-8-I," it would essentially mirror "ISO-8859-8," differing only in the naming convention. This would function similarly to creating an alias in the charsets file.

Therefore, we propose that this approach is valid and appropriate for implementation.

@davecrighton
Copy link

davecrighton commented Oct 3, 2024

@naotoj In light of @jmiserez and @psawant19 's comments does this change the position of the openjdk team?

We are interested as we are currently maintaining a fork of Jakarta mail in order to allow our customers to use this charset and would like to limit the amount of time we need to do this for.

For what it is worth our customer has deployed this into production and is successfully processing ISO-8859-8-i without any complaints from users.

Appreciate your work on reviewing this.

@jmiserez
Copy link

jmiserez commented Oct 3, 2024

One more thing: I forgot to explain why the alias ISO-8859-8-i -> ISO-8859-8 would definitely be correct.

Java strings are stored in logical order. That is true for both LTR and RTL languages. This is plainly apparent from the OpenJDK String source code, but also explicitly mentioned/explained e.g. by official tutorials such as here: https://docs.oracle.com/javase/tutorial/2d/text/textlayoutbidirectionaltext.html#ordering_text

ISO-8859-8-i texts are always sent in logical order (by definition). So decoding a ISO-8859-8-i text into a Java string using the ISO-8859-8 alias will result in the correct order of characters in the Java string, i.e. logical order, and thus is always 100% correct by definition.

Technically speaking, and for completeness sake here is the full list of cases for regular ISO-8859-8 today:

  1. ISO-8859-8 texts may contain either LTR language content, in which case the text is correctly decoded to a Java string in logical order. -> OK
  2. ISO-8859-8 texts may also contain RTL language content in logical order (newer applications already do this), in which case the text is also correctly decoded to a Java string in logical order. -> OK.
  3. But: If a ISO-8859-8 text contains RTL language content in visual order (old applications, historically the case), the text would be decoded to a Java string in visual order. This is actually technically incorrect and may be a source of bugs (e.g. concatenation won't work correctly). However this behavior cannot be changed in OpenJDK anymore as (old) applications may rely on it.

So: Case 2 is what would happen if the alias was added. Now as long as nobody adds a "auto-reverse visual to logical order" heuristic for RTL ISO-8859-8 text decoding in OpenJDK (which I'm fairly certain can't / mustn't be done), using a simple alias ISO-8859-8-i -> ISO-8859-8 will thus always be correct. The alias will result in case 2, i.e. texts will always be decoded into the correct Java string in logical order.

Copy link
Member

@srl295 srl295 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is the right way to handle it.

As ISO-8859 encodings are pretty much obsolete, not sure it is worth adding this encoding now.

Yes, but ISO-8859-8-I is still referenced by WHATWG as well. It's up to an application layer to make a distinction as far as the visual or logical order, it doesn't make sense for a converter to try to do anything.

Anyway, it would be ISO-8859-8-E which would have explicit visual controls in it. ISO-8859-8-I as an encoding will match exactly what data in the wild for ISO-8859-8.

IBM and ICU's mapping tables have had this equivalent for 25+ years. Merging this PR corrects the oversight in the ISO-8859-8 compatibility.

I think it would be fine to say that ISO-8859-8-E is not supported here, as it would be ISO-8859-8 / ISO-8859-8-I but with additional controls requiring a shaper. That could be mentioned in a comment.

@srl295
Copy link
Member

srl295 commented Oct 9, 2024

@jmiserez wrote:

But: If a ISO-8859-8 text contains RTL language content in visual order (old applications, historically the case), the text would be decoded to a Java string in visual order. This is actually technically incorrect and may be a source of bugs (e.g. concatenation won't work correctly). However this behavior cannot be changed in OpenJDK anymore as (old) applications may rely on it.

In other words, Java may have been incorrectly handling ISO-8859-8 all this time if content was in visual order. Putting in this alias means that ISO-8859-8-I will be handled correctly.

@psawant19
Copy link
Author

Thank you @jmiserez and @srl295 for the approval.

@psawant19
Copy link
Author

@AlanBateman Since the mapping is just an alias to ISO-8859-8 do we still need CSR request to be created for the pull request?

@srl295
Copy link
Member

srl295 commented Oct 10, 2024

@naotoj does it make sense?

@naotoj
Copy link
Member

naotoj commented Oct 10, 2024

@naotoj does it make sense?

Sorry, but I still don't believe that making "ISO-8859-8-I" as an alias to "ISO-8859-8" is the right solution, per the IANA character sets definition (https://www.iana.org/assignments/character-sets/character-sets.xhtml). The current PR would make "ISO-8859-8-I" charset appear in Charset.forName("ISO-8859-8").aliases(), but not in Charset.availableCharsets() which is deemed incorrect to me.

That said, I just wonder if this issue can better be addressed exploiting the Charset SPI. This way mail servers can install "ISO-8859-8-I" charset by themselves. This means that mail servers do not need to rely on the underlying JDK which may or may not have that charset.

@justin-curtis-lu
Copy link
Member

justin-curtis-lu commented Oct 10, 2024

Sorry, but I still don't believe that making "ISO-8859-8-I" as an alias to "ISO-8859-8" is the right solution, per the IANA character sets definition (https://www.iana.org/assignments/character-sets/character-sets.xhtml). The current PR would make "ISO-8859-8-I" charset appear in Charset.forName("ISO-8859-8").aliases(), but not in Charset.availableCharsets() which is deemed incorrect to me.

I agree. From the Charset specification,

If a charset listed in the IANA Charset Registry is supported by an implementation of the Java platform then its canonical name must be the name listed in the registry. Many charsets are given more than one name in the registry, in which case the registry identifies one of the names as MIME-preferred. If a charset has more than one registry name then its canonical name must be the MIME-preferred name and the other names in the registry must be valid aliases.

Practically speaking it does seem to be an alias, but implementing as such would violate the Charset specification. So either defining as a new Charset for ISO-8859-8-I (if there is sufficient demand) or as Naoto pointed out, utilize the CharsetProvider would seem like appropriate solutions to me. A pro to the SPI solution is that you can also easily include all the other bidi supported implicit/explicit Charsets as well.

@srl295
Copy link
Member

srl295 commented Oct 10, 2024 via email

@jmehrens
Copy link

...(if there is sufficient demand)...

I don't fully understand the conditional acceptance. Can't @psawant19 abandon the alias PR and use the existing ISO-8859-8 source from OpenJDK to create new ISO-8859-8-I Charset? The level off effort to share common code, proxy wrap, or so forth between two Charsets wouldn't be that much of a lift or long term debt. If the community is willing to to the work then acceptance is really a willingness to approve the change. Are all housed OpenJDK solutions around this a no?

Naoto pointed out, utilize the CharsetProvider would seem like appropriate solutions to me.

That has been the solution suggested for years. They have been documented JavaMail/JakartaMail FAQ. I copied them into the ticket here:
eclipse-ee4j/angus-mail#147 (comment)

I'll leave that AngusMail ticket open until this comes to a close.

@srl295
Copy link
Member

srl295 commented Oct 11, 2024

Can't @psawant19 abandon the alias PR and use the existing ISO-8859-8 source from OpenJDK to create new ISO-8859-8-I Charset?

That would seem to be what @naotoj stated would make the API contract (concerning IANA identity) correct.

@jmehrens
Copy link

That would seem to be what @naotoj stated would make the API contract (concerning IANA identity) correct.

Correct. I gathered that point. What I was trying to convey is that the contribution of the intellectual property is from OpenJDK itself so there is proven track record of quality of the code.

Alias route is dead, done, rejected. Rejecting a PR on that route that is a 'clone of another charset' is either compatiblely concern or a unwillingness to accept the new charset.

Just trying to find a path forward on this. Thus my intent is to figure out why charset approach would be rejected on the grounds that ISO-8859-8-I is "obsolete", does not have "sufficient demand", or is not "important" enough. These are reject words sprinked in thread.

Contributors are here to help out work on this. Working on obsolete, unpopular, unimportant stuff is what we do sometimes. We just need direction.

@srl295
Copy link
Member

srl295 commented Oct 11, 2024

I noticed that the embedded xerces treates 8859-8-I as 8859-8 here:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
csr Pull request needs approved CSR before integration i18n i18n-dev@openjdk.org nio nio-dev@openjdk.org rfr Pull request is ready for review
Development

Successfully merging this pull request may close these issues.

10 participants