@alexdowad
Contributor

In JIS encoding, hiragana and katakana can be input in more than one form. One form uses JIS X 0201 escape sequences. Another is called 'GR-invoked' kana.

In the context of ISO-2022 encoding, bytes whose most significant bit (MSB) is zero are called "GL" (or "graphics left") and those with the MSB set are called "GR" (or "graphics right"). Regarding the variants of ISO-2022-JP which are called "JIS7" and "JIS8", Wikipedia states:

"Other, older variants known as JIS7 and JIS8 build directly on the 7-bit and 8-bit encodings defined by JIS X 0201 and allow use of JIS X 0201 kana from G1 without escape sequences, using Shift Out and Shift In or setting the eighth bit (GR-invoked), respectively."
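To make the GL/GR distinction concrete, here is a minimal sketch in PHP. The helper names `is_gr_byte` and `gr_to_gl` are hypothetical and exist only for this illustration; they are not part of mbstring. The sketch only shows that a GR byte is one with the eighth bit set, and that clearing that bit gives the 7-bit form of the same byte.

```php
<?php
// Hypothetical helpers for illustration only; not part of mbstring.

/** True when the byte has its most significant bit set (a "GR" byte). */
function is_gr_byte(int $byte): bool
{
    return ($byte & 0x80) !== 0;
}

/** Clear the MSB to obtain the 7-bit (GL) form of a GR byte. */
function gr_to_gl(int $byte): int
{
    return $byte & 0x7F;
}

$byte = 0xA3;
printf(
    "0x%02X is %s; 7-bit form is 0x%02X\n",
    $byte,
    is_gr_byte($byte) ? 'GR' : 'GL',
    gr_to_gl($byte)
);
```

For the byte 0xA3, the 7-bit form is 0x23.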

In harmony with this, we have always accepted bytes from 0xA3-0xDF and decoded them to the corresponding hiragana/katakana. However, at some point I accidentally broke output for these kana. You can see the problem on 3v4l.org by running this program:

<?php
echo bin2hex(mb_convert_encoding("\xA3", 'JIS', 'JIS'));

The results are:

Output for 8.2rc1 - rc3
1b244200231b2842
Output for 7.4.0 - 7.4.33, 8.0.1 - 8.0.25, 8.1.12
1b2849231b2842
Output for 8.1.0 - 8.1.11
1b284923

You can see that from 8.1.0 - 8.1.11, there was a missing escape sequence at the end. That was caused because the flush functions were not being called properly, and has already been fixed. However, the results also show that the output for 8.2rc1-rc3 is completely invalid: it tries to emit a JIS X 0208 sequence, but with 0x00 as one of the JIS X 0208 bytes, which is illegal.
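The three hex strings are easier to compare once they are split into escape sequences and data bytes. The following is a rough sketch, not how mbstring parses JIS: `describe_jis_bytes` is a hypothetical helper, and its table covers only the three escape sequences that occur in the results above (ESC ( B for ASCII, ESC ( I for JIS X 0201 kana, ESC $ B for JIS X 0208).

```php
<?php
// Hypothetical helper for illustration; this is not how mbstring parses JIS.
function describe_jis_bytes(string $hex): array
{
    // Only the escape sequences that appear in the results above.
    $escapes = [
        "\x1B(B"  => 'ESC ( B (ASCII)',
        "\x1B(I"  => 'ESC ( I (JIS X 0201 kana)',
        "\x1B\$B" => 'ESC $ B (JIS X 0208)',
    ];

    $bytes = hex2bin($hex);
    $len = strlen($bytes);
    $out = [];
    $i = 0;
    while ($i < $len) {
        $seq = substr($bytes, $i, 3);
        if (isset($escapes[$seq])) {
            // A known three-byte escape sequence.
            $out[] = $escapes[$seq];
            $i += 3;
        } else {
            // Anything else is reported as a plain data byte.
            $out[] = sprintf('byte 0x%02X', ord($bytes[$i]));
            $i++;
        }
    }
    return $out;
}

foreach (['1b244200231b2842', '1b2849231b2842', '1b284923'] as $hex) {
    echo $hex, ': ', implode(', ', describe_jis_bytes($hex)), "\n";
}
```

Read this way, the 8.2rc output selects JIS X 0208 and then emits the byte pair 0x00 0x23, while the 7.4/8.0/8.1.12 output selects JIS X 0201 kana, emits 0x23, and switches back to ASCII.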

Add the missing code so that the new text conversion filters behave the same as the old ones when outputting hiragana/katakana in JIS encoding.
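As a rough model of the intended output for a single GR-invoked kana byte, the sketch below rebuilds the byte string that the older versions produced. `jis_kana_output_sketch` is a hypothetical name used only here; the actual fix is in mbstring's C conversion filters and is not reproduced. The sketch assumes a single input byte and does no validation beyond checking that the eighth bit is set.

```php
<?php
// Hypothetical helper for illustration; the real fix lives in mbstring's C code.
function jis_kana_output_sketch(int $grByte): string
{
    if (($grByte & 0x80) === 0) {
        throw new InvalidArgumentException('Expected a byte with the MSB set');
    }

    $kanaEscape  = "\x1B(I"; // switch to JIS X 0201 kana
    $asciiEscape = "\x1B(B"; // switch back to ASCII at the end of the string

    // The kana byte is written in its 7-bit form between the two escapes.
    return $kanaEscape . chr($grByte & 0x7F) . $asciiEscape;
}

echo bin2hex(jis_kana_output_sketch(0xA3)), "\n"; // prints 1b2849231b2842
```

On a PHP build with mbstring, comparing this against `bin2hex(mb_convert_encoding("\xA3", 'JIS', 'JIS'))` is one way to check which of the outputs listed above a given version produces.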

FYA @cmb69 @Girgias @nikic @kamil-tekiela

Member

@cmb69 cmb69 left a comment

Thank you!

Please apply to PHP-8.2 (and merge up as usual).

@Girgias
Member

Girgias commented Nov 22, 2022

> Thank you!
>
> Please apply to PHP-8.2 (and merge up as usual).

Do you mean 8.1 and upwards?

@cmb69
Member

cmb69 commented Nov 22, 2022

> Do you mean 8.1 and upwards?

> You can see that from 8.1.0 - 8.1.11, there was a missing escape sequence at the end. That was caused because the flush functions were not being called properly, and has already been fixed.

So, only PHP-8.2 upwards, I think. :)

@Girgias
Member

Girgias commented Nov 22, 2022

> > Do you mean 8.1 and upwards?
>
> > You can see that from 8.1.0 - 8.1.11, there was a missing escape sequence at the end. That was caused because the flush functions were not being called properly, and has already been fixed.
>
> So, only PHP-8.2 upwards, I think. :)

Indeed, missed that paragraph.

@alexdowad
Contributor Author

Thank you for the review!

@alexdowad alexdowad closed this Nov 22, 2022
@alexdowad alexdowad deleted the jisfix branch December 4, 2022 21:53