Fullwidth punctuation missing from character classes appendix #185

Open
r12a opened this issue Feb 26, 2020 · 10 comments
Labels
jlreq-doc:future [JLReq-doc] Discussion items to be considered for future version(s) of JLreq document

Comments

@r12a
Contributor

r12a commented Feb 26, 2020

A.6 Full stops (cl-06)
A.7 Commas (cl-07)
https://w3c.github.io/jlreq/#cl-06
https://w3c.github.io/jlreq/#cl-07

． | 002E | FULL STOP
， | 002C | COMMA

These rows contain U+FF0E FULLWIDTH FULL STOP and U+FF0C FULLWIDTH COMMA in the first column, but ASCII code points and names in the 2nd & 3rd. This appears to be incorrect.

We seem to have a similar issue wrt parentheses too.
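
The mismatch described above can be checked mechanically. The following is a minimal sketch, assuming each appendix row can be represented as a (character, code point, name) tuple; the `rows` data and the `check_row` helper are hypothetical and not part of jlreq. The displayed characters are written as escapes so that the fullwidth forms are unambiguous.

```python
import unicodedata

# Hypothetical representation of the two rows quoted above:
# (displayed character, listed code point, listed name).
rows = [
    ("\uFF0E", "002E", "FULL STOP"),  # first column shows the fullwidth full stop
    ("\uFF0C", "002C", "COMMA"),      # first column shows the fullwidth comma
]

def check_row(char, listed_cp, listed_name):
    """Compare the listed code point and name with what the character really is."""
    actual_cp = f"{ord(char):04X}"
    actual_name = unicodedata.name(char)
    matches = actual_cp == listed_cp and actual_name == listed_name
    return matches, actual_cp, actual_name

for row in rows:
    ok, cp, name = check_row(*row)
    status = "OK" if ok else f"mismatch: character is U+{cp} {name}"
    print(row[1], row[2], "->", status)
```

Run against the two rows above, this should report a mismatch for both, since the displayed characters are U+FF0E and U+FF0C rather than U+002E and U+002C.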

@himorin
Contributor

himorin commented Feb 27, 2020

@kidayasuo I don't think the fact that U+FF0E etc. are included in JIS X 0208 is related to #166, but I think we discussed at the F2F that we would not update the definitions to include the corresponding fullwidth characters (but would consider it in future updates, as in #166). Do you remember our discussions?

@xfq
Member

xfq commented Feb 27, 2020

If we want to update them, here are the fullwidth punctuation marks that need to be updated:

（	0028	LEFT PARENTHESIS
［	005B	LEFT SQUARE BRACKET
｛	007B	LEFT CURLY BRACKET
）	0029	RIGHT PARENTHESIS
］	005D	RIGHT SQUARE BRACKET
｝	007D	RIGHT CURLY BRACKET
！	0021	EXCLAMATION MARK
？	003F	QUESTION MARK
：	003A	COLON
；	003B	SEMICOLON
．	002E	FULL STOP
，	002C	COMMA

And some fullwidth symbols, digits, and Latin letters in https://w3c.github.io/jlreq/#cl-19 might need updating too (Greek and Cyrillic letters seem to be correct). Note that the brackets appear in more than one character class.
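
For reference, the fullwidth code points and names for the punctuation listed above can be derived rather than typed by hand. This is only a sketch: it relies on the fact that the fullwidth forms U+FF01 to U+FF5E mirror ASCII U+0021 to U+007E at a fixed offset of 0xFEE0, and the helper name `to_fullwidth` is made up for illustration.

```python
import unicodedata

# U+FF01..U+FF5E mirror U+0021..U+007E at a fixed distance.
FULLWIDTH_OFFSET = 0xFEE0

def to_fullwidth(ch):
    """Return the fullwidth counterpart of a printable ASCII character."""
    cp = ord(ch)
    if not 0x21 <= cp <= 0x7E:
        raise ValueError(f"no fullwidth form at a fixed offset for U+{cp:04X}")
    return chr(cp + FULLWIDTH_OFFSET)

# The ASCII characters whose code points currently appear in the list above.
ascii_punctuation = "([{)]}!?:;.,"

for ch in ascii_punctuation:
    fw = to_fullwidth(ch)
    # Tab-separated, matching the layout of the list above.
    print(f"{fw}\t{ord(fw):04X}\t{unicodedata.name(fw)}")
```

The same helper would also cover the fullwidth digits and Latin letters mentioned for cl-19, since they fall in the same range.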

@macnmm
Copy link
Contributor

macnmm commented Feb 28, 2020

Latin punctuation in the ASCII range should not be confused with full-width punctuation (not ASCII) in terms of their use or the mojikumi class they belong to. '(' is not the same class or spacing as '（'. I would argue that '(' and ')' are not eligible for use in Japanese composition or warichuu and the text must be '（' and '）'. Am I misreading the table?
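
As a rough illustration that the two parentheses are distinct characters with different width behaviour, one can compare their Unicode East_Asian_Width property. Note the assumption here: East_Asian_Width is a Unicode property, not the JLReq or JIS X 4051 mojikumi class, so this only shows that the characters differ, not which class they belong to. The `describe` helper is hypothetical.

```python
import unicodedata

# ASCII parentheses paired with their fullwidth counterparts (as escapes).
pairs = [("(", "\uFF08"), (")", "\uFF09")]

def describe(ch):
    """Return (code point, Unicode name, East_Asian_Width value) for a character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.east_asian_width(ch))

for ascii_ch, fullwidth_ch in pairs:
    print(describe(ascii_ch))      # East_Asian_Width "Na" (narrow)
    print(describe(fullwidth_ch))  # East_Asian_Width "F" (fullwidth)
```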

@kidayasuo
Copy link
Contributor

JLReq describes characters as if there were no such thing as a “fullwidth” version (i.e. the characters in the fullwidth compatibility area of Unicode). This is part of its effort to make the description as independent as possible of the technology of the time. It tried to separate the concept of a “character” from its style, such as its width, following Unicode’s principle. However, this made the character class appendix confusing.

Since this is, in a sense, inherent in how JLReq is written, changing it would be a major piece of work. I believe it is the kind of work that should be done in a major rewrite of JLReq, or as a new document.

(JLReq is a record of what is and has been done in print. Its line layout rules assume, and sometimes depend on, a workflow that involves manual inspection and manual adjustment. It is clear that we need line layout rules for the digital architecture. That is what I meant by the major rewrite.)
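
The idea of separating a character from its width style is visible in Unicode's own data: the fullwidth forms carry a `<wide>` compatibility decomposition to their ordinary counterparts. The sketch below is illustrative only and is not taken from JLReq; `compat_info` is a made-up helper name.

```python
import unicodedata

fullwidth_full_stop = "\uFF0E"

def compat_info(ch):
    """Show how Unicode relates a compatibility character to its base character."""
    return {
        # "<wide> 002E" marks U+FF0E as a wide variant of U+002E.
        "decomposition": unicodedata.decomposition(ch),
        # Canonical normalization keeps the fullwidth form unchanged.
        "nfc": unicodedata.normalize("NFC", ch),
        # Compatibility normalization folds it to the ordinary full stop.
        "nfkc": unicodedata.normalize("NFKC", ch),
    }

info = compat_info(fullwidth_full_stop)
print(info)
```

This is also why text that has been through NFKC normalization loses the distinction between the two forms.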

@kidayasuo kidayasuo added the jlreq-doc:future [JLReq-doc] Discussion items to be considered for future version(s) of JLreq document label Jun 11, 2020
@r12a
Contributor Author

r12a commented Jun 11, 2020

But i think it is a clear error to have the name 002E | FULL STOP alongside ． (which is the fullwidth character) in the table. Alternatives may include:

  1. change the character in the table
  2. change the Unicode name and code point value in the table
  3. replace the Unicode name & code point with text explaining that the character is ambiguous wrt its code point assignment in Unicode
  4. include both ordinary and fullwidth characters, code points and names in the same row

@kidayasuo
Contributor

The Unicode name of U+002E is FULL STOP, right? I think solution #1 is reasonable in that it is in line with how JLReq is written and is the smallest change. Let me discuss this with the original authors in the TF.

I personally believe ignoring fullwidth compatibility characters is confusing and it should be fixed at some point.

@r12a
Contributor Author

r12a commented Jun 11, 2020

I also found that confusing, initially. (And sometimes still trip up over it.)

I also worry about changing the character to ASCII full stop, since it suggests that that is what authors should use, which i believe is incorrect. The distinction between the two as described in jlreq may be logically feasible, but in practise, especially without all the clever handling described in jlreq about optimal character widths, i think people are better off using the fullwidth forms, and i think they do use them generally.

Therefore, i'd be more inclined to change the label to U+FF0E FULLWIDTH FULL STOP rather than change the character displayed in the chart. That also makes it easy to understand the jlreq doc, because otherwise you have to get your head around the idea that this proportionally spaced character needs to be regarded as having width in order to follow the text.

@kidayasuo
Contributor

Bin-sensei, on the JLReq TF mailing list, explained how this happened: JLReq inherited the character classes from JIS X 4051, which indicates code points by JIS X 0213 plane, column and row. The Japanese period is translated to U+002E. He explained the situation in the NOTE at the beginning of the appendix. Since it is explained there, he believes we can leave it as-is.

The discussion is continuing. You can jump in on the mailing list. I will translate.

@xfq
Member

xfq commented Jun 12, 2020

Link to the note: https://w3c.github.io/jlreq/#h-note-283 (see especially the text after "To work around this issue...")

The method in jlreq (and JIS X 4051) may be logically correct, but I still think the correct code point should be used in order to reduce confusion.


(FWIW, clreq does not use this method. Personally, I think both the methods in clreq and jlreq have their own advantages, and are just different ways of thinking.)

@kidayasuo
Contributor

It seems the JLReq TF agrees that it is OK to make this change. Let’s take approach #2 among the possible approaches Richard suggested.

As this is not a small change and the changes need to be thoroughly reviewed, we are deferring it to the next update.


5 participants