Escaped characters in source #3683

Open
edent opened this issue May 14, 2018 · 2 comments
Labels
good first issue Ideal for someone new to a WHATWG standard or software project

Comments

@edent
Contributor

edent commented May 14, 2018

The current source file has a large number of encoded entities. This makes it rather hard to edit and read. As UTF-8 is everywhere, is it time to replace these with their Unicode representation?

For example:

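  <li value="9"><cite lang="sh">&#x0426;&#x0440;&#x043D;&#x0430; &#x043C;&#x0430;&#x0447;&#x043A;&#x0430;, &#x0431;&#x0435;&#x043B;&#x0438; &#x043C;&#x0430;&#x0447;&#x043E;&#x0440;</cite>, 1998</li>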

Becomes:

  <li value="9"><cite lang="sh">Црна мачка, бели мачор</cite>, 1998</li>

And

<p w-nodev>In an algorithm, steps in <span data-x="synchronous section">synchronous
  sections</span> are marked with &#x231B;.</p>

Could be changed to:

<p w-nodev>In an algorithm, steps in <span data-x="synchronous section">synchronous
  sections</span> are marked with ⌛.</p>

There is one obvious exception - invisible / non-printing characters.

Would you be interested in a pull request to transform all the &#x... references to their decoded equivalents?

This builds upon the HTML5.3 work done in w3c/html#1280
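
One way such a transformation could be approached is sketched below in Python. It is only an illustration, not the methodology of any actual pull request. The function names `should_decode` and `decode_visible_refs` are hypothetical. Treating Unicode general categories `C*` (controls, format characters) and `Z*` (separators) as "invisible / non-printing" is an assumption, and so is leaving markup-significant characters such as `&` and `<` escaped.

```python
import re
import unicodedata

# Hexadecimal numeric character references such as &#x231B;
HEX_REF = re.compile(r"&#[xX]([0-9A-Fa-f]+);")

# Assumption: these must stay escaped, or decoding would change the markup.
MARKUP_SIGNIFICANT = set("&<>\"'")


def should_decode(ch: str) -> bool:
    """Return True if the character is safe to store literally."""
    if ch in MARKUP_SIGNIFICANT:
        return False
    # Assumption: general categories C* (control, format, surrogate, ...)
    # and Z* (separators) stand in for "invisible / non-printing".
    return unicodedata.category(ch)[0] not in "CZ"


def decode_visible_refs(source: str) -> str:
    """Replace hex references with literal characters where it is safe."""
    def replace(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            return match.group(0)  # not a valid code point, leave as-is
        ch = chr(code_point)
        return ch if should_decode(ch) else match.group(0)

    return HEX_REF.sub(replace, source)


print(decode_visible_refs("sections</span> are marked with &#x231B;.</p>"))
```

References to a no-break space (`&#x00A0;`) or a zero-width space (`&#x200B;`) are left untouched by this sketch, because both fall into the excluded categories.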

@annevk
Member

annevk commented May 14, 2018

I think that'd be fine. We already adopted UTF-8 to some extent as per 0b37b53. It'd be good if the PR message includes the methodology, as this might be somewhat error-prone.
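
As a rough illustration of how such a methodology could be checked, the Python sketch below compares the source before and after conversion once every character reference in both has been decoded. If the conversion only swapped escapes for the characters they denote, the two decoded texts should be identical. The helper names `decoded_text_matches` and `differing_lines` are hypothetical. `html.unescape` is the Python standard library function, and using it as the reference decoder is an assumption of this sketch.

```python
import html


def decoded_text_matches(before: str, after: str) -> bool:
    """True if both versions denote the same text once references are decoded."""
    return html.unescape(before) == html.unescape(after)


def differing_lines(before: str, after: str) -> list[int]:
    """Return 1-based numbers of lines whose decoded text differs."""
    before_lines = before.split("\n")
    after_lines = after.split("\n")
    if len(before_lines) != len(after_lines):
        raise ValueError("conversion must not add or remove lines")
    return [
        number
        for number, (old, new) in enumerate(zip(before_lines, after_lines), start=1)
        if html.unescape(old) != html.unescape(new)
    ]


original = "steps are marked with &#x231B;."
converted = "steps are marked with ⌛."
print(decoded_text_matches(original, converted))
print(differing_lines(original, converted))
```

A non-empty result from `differing_lines` points at the lines a reviewer would need to inspect by hand.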

@r12a

r12a commented May 17, 2018

I agree with the idea of avoiding escapes unless necessary. FWIW, there are some other situations where escapes can occasionally be useful, although I doubt there are many of those in the HTML spec:

  1. some bidi examples, especially those including markup or punctuation, so that the sequence doesn't get messed up and become difficult to read in the source (although straightforward monodirectional sequences of RTL scripts are usually best stored as Unicode characters, as they don't cause confusion)
  2. any place you don't want normalisation to affect the character sequence
  3. sometimes isolated combining characters are easier to manage as escapes.
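
The three situations above could be approximated per character with Python's `unicodedata` module, as in the sketch below. This is a heuristic under stated assumptions, not a rule from the thread. The function name `escape_reason` is hypothetical, the set of bidi control classes is chosen for illustration, and NFC is assumed as the normalisation form of interest. It looks only at single characters, so it cannot judge whether a longer bidi example is confusing in the source.

```python
import unicodedata
from typing import Optional

# Bidi classes of the explicit embedding, override and isolate controls.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}
# Directional marks (LRM, RLM, ALM) have ordinary strong classes, so list them.
BIDI_MARKS = {"\u200e", "\u200f", "\u061c"}


def escape_reason(ch: str) -> Optional[str]:
    """Return why a character may be better kept as an escape, or None."""
    if ch in BIDI_MARKS or unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES:
        return "bidi control"
    if unicodedata.category(ch).startswith("M"):
        return "combining mark"
    # Assumption: NFC is the form a tool might silently apply to the source.
    if unicodedata.normalize("NFC", ch) != ch:
        return "changes under NFC"
    return None


for ch in ["\u202b", "\u0301", "\u212b", "Ц"]:
    print(f"U+{ord(ch):04X}", escape_reason(ch))
```

A conversion script could consult a check like this before decoding a reference and leave the escape in place whenever a reason is returned.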

(A lot of people seem to find the utility at https://r12a.github.io/app-conversion/ useful for converting to/from escapes. I was just wondering whether it would be useful to point to it or something similar for the benefit of people writing source code contributions.)

@zcorpan zcorpan added the good first issue Ideal for someone new to a WHATWG standard or software project label Sep 1, 2018
annevk pushed a commit that referenced this issue Oct 30, 2018
mustaqahmed pushed a commit to mustaqahmed/html that referenced this issue Feb 15, 2019
mustaqahmed pushed a commit to mustaqahmed/html that referenced this issue Feb 15, 2019
4 participants