Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix errors in the tokenizer and tree construction tests. #136

Merged
merged 23 commits into from
Jul 5, 2021

Commits on Jun 26, 2021

  1. -- in a comment isn't an error

    `<!-- -- -->` is not an error. The middle `--` cause the tokenizer to
    enter the comment end state. The "anything else" clause appends two `-`
    to the comment token's data and the current input character is
    reconsumed in the comment state.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    5a7149f View commit details
    Browse the repository at this point in the history
  2. <!-----> is not an error

    Starting from the data state, we have
    1. `<` switches to the tag open state
    2. `!` switches to the markup declaration open state
    3. `--` switches to the comment start state
    4. `-` (the third one) switches to the comment start dash state
    5. `-` (the fourth one) switches to the comment end state
    6. `-` (the fifth one) appends `-` to the comment and does not change
       state
    7. `>` emits the comment token and switches to the data state.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    c70ef7d View commit details
    Browse the repository at this point in the history
  3. Add DOCTYPE errors (and remove one)

    A DOCTYPE token is an error in one of three cases:
    1. The token's name is not `html`;
    2. The token's public identifier is not missing;
    3. The token's system identifier is not missing and the token's system
       identifier isn't `about:legacy-compat`.
    
    This appears to have changed at some point from a much more complex set
    of conditions.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    4c2b23a View commit details
    Browse the repository at this point in the history
  4. Fix errors in doctypes

    A `>` after `DOCTYPE` is a missing-doctype-name parse error but it is
    not also a missing-whitespace-before-doctype-name parse error.
    
    Doctypes with a public identifier is a (currently unnamed) parse error.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    4022182 View commit details
    Browse the repository at this point in the history
  5. Fix entity errors

    Named character references in attributes whose last character is not `;`
    and for which the next input character is `=` (or ASCII alphanumeric,
    but this isn't tested here), flushes the code points consumed as a
    character reference _without_ adding a parse error.
    
    Named character references not in attributes whose last character is not
    `;` are errors, regardless of the following character as noted in the
    `#new-errors` section but without an entry in `#errors`, the number of
    errors are wrong. (See
    html5lib#107).
    
    Separately, this adds the missing expected-doctype-but-got-start-tag
    error.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    607e334 View commit details
    Browse the repository at this point in the history
  6. <!doctype html><script><! has only one error

    Here are the (abbreviated) steps (intermingling the tree-construction
    and tokenizer).
    
    1. `<script>` token causes `html` and `head` elements to be inserted and
       is processed in the "in head" insertion mode
    2. `<script>` token switches the tokenizer to the script data state and
       switches to the "text" insertion mode
    3. `<` switches to the script data less-than sign state
    4. `!` switches to the script data escape start state and emits `<!`
    5. EOF is reconsumed in the script data state
    6. EOF emits an EOF token
    7. EOF token (in the "text" insertion mode) is a parse error,
       `<script>` is popped off the stack of open elements, switches back to
       the "in head" insertion mode and reprocesses the token
    8. EOF token (in "in head") triggers the "anything else clause" which
       pops the `head` element, switches to "after head", inserts a `body`
       token, switches to "in body", and reprocesses
    9. EOF token (in "in body") stops parsing with no error because the
       stack of open elements contains `html` and `body`
    
    Only step 7 adds a parse error.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    745869f View commit details
    Browse the repository at this point in the history
  7. Include the new error in #errors

    Without it, the number of errors is incorrect.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    e295593 View commit details
    Browse the repository at this point in the history
  8. Configuration menu
    Copy the full SHA
    c9e881e View commit details
    Browse the repository at this point in the history
  9. The space is not foster parented

    ```
    A<table><tr> B</tr> </em>C</table>
    ```
    
    The second space isn't foster parented. The `</tr>` triggers the
    "anything else" clause of the in table text insertion mode which causes
    ` B` to be foster parented. The following space clears the pending table
    character tokens list and thus the `</em>` inserts the space without
    foster parenting.
    
    This is also clear from the text node `A BC` in the result. A foster
    parented second space would result in `A B C`.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    25b393d View commit details
    Browse the repository at this point in the history
  10. End tag causes two parse errors

    ```
    <math><annotation-xml></svg>x
    ```
    
    The `</svg>` is parsed in foreign context because
    1. The stack of open elements is not empty;
    2. The adjusted current node (`annotation-xml` element) is not in the
       HTML namespace;
    3. The adjusted current node is not a MathML text integration point;
    4. The adjusted current node is not a MathML text integration point
       (separate condition from 3);
    5. The adjusted current node is a MathML `annotation-xml` element but
       the token (`</svg>`) is not a start tag;
    6. The token is not a start tag (also the `annotation-xml` isn't an HTML
       integration point because it lacks a particular attribute);
    7. The token isn't a character token (also the `annotation-xml` isn't an
       HTML integration point).
    
    Thus the "any other end tag" clause of 12.2.6.5 "The rules for parsing
    tokens in foreign context" applies.
    
    The current node's tag name (`annotation-xml`) is not the same as the
    tag name of the token (`svg`) so this is a parse error.
    
    Since there's no `svg` element in the stack of open elements, the loop
    in step 3 through 6 of the "any other end tag" clause exits when the
    `body` element is reached. and the token is processed according to the
    current insertion mode, "in body."
    
    The `</svg>` matches the "any other end tag" of "in body." Since the
    MathML `annotation-xml` element is not an HTML element but is special
    which is a second parse error.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    20620b5 View commit details
    Browse the repository at this point in the history
  11. Add missing errors

    These are tricky. A `<math>` tag as the first token in a document
    fragment whose context node is `td` (or `th`, but this isn't tested
    here) is fine. One in a document fragment whose context node is `tr`,
    `thead`, `tbody`, or `tfoot` is a parse error and the elements are
    foster parented.
    
    The table element start tags after the `<mo>` tag are parsed as html and
    there are no `tr` (for `tr` elements) elements or `thead`, `tbody`, or
    `tfoot` (for the others) elements in table scope which is a parse error.
    I just invented some sames for those.
    
    The `</table>` tag after the `<mo>` is parsed as foreign but it doesn't
    match `<mo>` so it's a parse error.
    
    Finally, the EOF occurs while a bunch of elements are open which is a
    parse error.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    5ef4bd4 View commit details
    Browse the repository at this point in the history
  12. Add missing errors

    The `</td>` is parsed in the "in cell" insertion mode and is an error
    because it doesn't match the open `span` element. Each of the three
    characters in `Foo` is reparented and is an error.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    45d2577 View commit details
    Browse the repository at this point in the history
  13. More missing errors

    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    dd97d25 View commit details
    Browse the repository at this point in the history
  14. Configuration menu
    Copy the full SHA
    cc5b122 View commit details
    Browse the repository at this point in the history
  15. Configuration menu
    Copy the full SHA
    e599f25 View commit details
    Browse the repository at this point in the history
  16. Add new errors to errors

    If the `#errors` section should have the same number of lines as errors
    (see html5lib#107), then the
    NULL-character errors need to be accounted for.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    5e88d5c View commit details
    Browse the repository at this point in the history
  17. Configuration menu
    Copy the full SHA
    be61df3 View commit details
    Browse the repository at this point in the history
  18. Remove duplicate errors

    Also fix up the `#new-errors` section to make the line and column
    numbers match the others.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    05319d0 View commit details
    Browse the repository at this point in the history
  19. Remove duplicate errors

    These all appear to be duplicated errors (or more like an explanation of
    the error).
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    b6520ef View commit details
    Browse the repository at this point in the history
  20. Fix template errors

    Each character in the table (caused by the `<col>`) gets reparented and
    causes an error.
    
    The second `<a>` gets reparented but there's already one open, so that's
    a second error but the open one is not in scope so that's a third error.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    fa35532 View commit details
    Browse the repository at this point in the history
  21. Fix EOF error line/columns, add to #errors

    Fix the line and column numbers.
    
    Currently, the number of errors in the `#errors` section of the test is
    the correct number of errors. Counting the ones in `#new-errors` breaks
    hundreds of tests because they contain an equivalent error in `#errors`.
    So this adds the eof in comment error to `#errors` as well.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    65e3699 View commit details
    Browse the repository at this point in the history
  22. Add missing errors

    Adds the other spec-mandated errors. When an error is caused by a tag,
    the line and column numbers are the line and column the tag starts in.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    26f2b8a View commit details
    Browse the repository at this point in the history
  23. Remove duplicated error

    This error line appears to have been duplicated by mistake.
    stevecheckoway committed Jun 26, 2021
    Configuration menu
    Copy the full SHA
    535e74b View commit details
    Browse the repository at this point in the history