Test the encoding sniffing algorithm (aka meta prescan) #130

Open
wants to merge 2 commits into
base: master
Test the (meta) prescan algorithm
This change adds a `preparsed` subdirectory in the `encoding` directory,
with tests for which the result of the *encoding sniffing algorithm* at
https://html.spec.whatwg.org/#encoding-sniffing-algorithm is the
expected result — that is, tests for which the expected result is the
output of running *only* the encoding sniffing algorithm (of which the
main sub-algorithm is the so-called “meta prescan”) — without
also running the tokenization state machine and tree-construction stage.

This change also adds a README file that explicitly documents what the
expected results for the encoding tests are, based on whether or not
they’re in the `preparsed` subdirectory.

Without those changes, it’s unclear whether the expected results shown
in the existing tests are the output of fully parsing the test data
(through the tokenization state machine and tree-construction stage) or
only the output of the encoding sniffing algorithm. And without those
changes, we also don’t have any tests a system can use for testing only
the output of the encoding sniffing algorithm.

Fixes #28
sideshowbarker committed Aug 24, 2020
commit 1e10bdb64b6fc9bc43005a6d07f0b2d1b98a27af
39 changes: 39 additions & 0 deletions encoding/README.md
@@ -0,0 +1,39 @@
Encoding Tests
==============

Each file containing encoding tests holds any number of tests, with
consecutive tests separated by two newlines (LF) and the last test
followed by a single newline at the end of the file:

[TEST]LF
LF
[TEST]LF
LF
[TEST]LF

...where [TEST] is the format documented below.

Encoding test format
====================

Each test must begin with a string "\#data", followed by a newline (LF).
All subsequent lines until a line that says "\#encoding" are the test data
and must be passed to the system being tested unchanged, except with the
final newline (on the last line) removed.

Then there must be a line that says "\#encoding", followed by a newline
(LF), followed by a string indicating an encoding name, followed by a newline
(LF). The encoding name indicated is the expected character encoding for
the output with the given test data as input.
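
As an illustration of the format described above, here is a minimal sketch of
a reader for one test file. It is not part of the test suite; the function name
`parse_encoding_tests` is hypothetical, and it assumes the marker lines are
literally `#data` and `#encoding` (the backslash in this README being only a
Markdown escape). It works on bytes so that the test data reaches the system
under test unchanged.

```python
from typing import List, Tuple


def parse_encoding_tests(raw: bytes) -> List[Tuple[bytes, str]]:
    """Split the raw bytes of a test file into (test data, encoding name) pairs."""
    tests: List[Tuple[bytes, str]] = []
    data = None          # list of data lines while inside a #data section
    expect_name = False  # True right after a "#encoding" line
    for line in raw.split(b"\n"):
        if expect_name:
            # Joining with LF drops the final newline of the test data.
            tests.append((b"\n".join(data), line.decode("ascii")))
            data, expect_name = None, False
        elif data is None:
            if line == b"#data":
                data = []
            # Anything else here is the blank separator between tests.
        elif line == b"#encoding":
            expect_name = True
        else:
            data.append(line)
    return tests
```

Blank lines inside the test data are preserved, because the separator between
tests is only looked for after an encoding name has been read.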

For the tests in the `preparsed` subdirectory, the encoding name indicated
is the expected result of running the *encoding sniffing algorithm* at
https://html.spec.whatwg.org/#encoding-sniffing-algorithm with the given
test data as input; that is, it's the expected result of running *only* the
*encoding sniffing algorithm* — without also running the tokenization state
machine and tree-construction stage defined in the spec.

For all tests outside the subdirectory named `preparsed`, the encoding name
indicated is instead the expected character encoding for the output after
fully parsing the given test data; that is, it's the expected character
encoding for the output after running the tokenization state machine and
tree-construction stage.
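
A test harness might apply the directory rule above roughly as follows. This is
only a sketch: `pick_detector`, `sniff_only`, and `full_parse` are hypothetical
names, standing for whatever callables the system being tested exposes for the
encoding sniffing algorithm alone and for a full parse, respectively.

```python
from pathlib import PurePosixPath
from typing import Callable

# A detector takes the raw test data and returns an encoding name.
Detector = Callable[[bytes], str]


def pick_detector(test_path: str, sniff_only: Detector, full_parse: Detector) -> Detector:
    """Return the callable whose result the file's expected encodings describe."""
    # Tests under a directory named "preparsed" expect the sniffing-only result;
    # all other tests expect the result after fully parsing the data.
    in_preparsed = "preparsed" in PurePosixPath(test_path).parts
    return sniff_only if in_preparsed else full_parse
```

The check looks at whole path components, so a file that merely has
`preparsed` in its name is still treated as a full-parse test.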