Test the (meta) prescan algorithm

This change adds a `preparsed` subdirectory in the `encoding` directory, with tests whose expected result is the output of running *only* the *encoding sniffing algorithm* at https://html.spec.whatwg.org/#encoding-sniffing-algorithm (of which the main sub-algorithm is the so-called “meta prescan”), without also running the tokenization state machine and tree-construction stage.

This change also adds a README file that explicitly documents what the expected results for the encoding tests are, based on whether or not they’re in the `preparsed` subdirectory.

Without those changes, it’s unclear whether the expected results shown in the existing tests are for the output of fully parsing the test data (through the tokenization state machine and tree-construction stage) or just the output of the encoding sniffing algorithm. And without those changes, we also don’t have any tests a system can use for testing only the output from the encoding sniffing algorithm.

Fixes #28
html5lib · sideshowbarker · Aug 21, 2020 · Aug 24, 2020 · Aug 24, 2020 · 1e10bdb64b6fc9bc43005a6d07f0b2d1b98a27af
commit 1e10bdb64b6fc9bc43005a6d07f0b2d1b98a27af
diff --git a/encoding/README.md b/encoding/README.md
@@ -0,0 +1,39 @@
+Encoding Tests
+==============
+
+Each file containing encoding tests has any number of tests separated by
+two newlines (LF) and a single newline before the end of the file:
+
+    [TEST]LF
+    LF
+    [TEST]LF
+    LF
+    [TEST]LF
+
+...where [TEST] is the format documented below.
+
+Encoding test format
+====================
+
+Each test must begin with a string "\#data", followed by a newline (LF).
+All subsequent lines until a line that says "\#encoding" are the test data
+and must be passed to the system being tested unchanged, except with the
+final newline (on the last line) removed.
+
+Then there must be a line that says "\#encoding", followed by a newline
+(LF), followed by a string indicating an encoding name, followed by a newline
+(LF). The encoding name indicated is the expected character encoding for
+the output with the given test data as input.
+
+For the tests in the `preparsed` subdirectory, the encoding name indicated
+is the expected result of running the *encoding sniffing algorithm* at
+https://html.spec.whatwg.org/#encoding-sniffing-algorithm with the given
+test data as input; that is, it's the expected result of running *only* the
+*encoding sniffing algorithm* — without also running the tokenization state
+machine and tree-construction stage defined in the spec.
+
+For all tests outside the subdirectory named `preparsed`, the encoding name
+indicated is instead the expected character encoding for the output after
+fully parsing the given test data; that is, it's the expected character
+encoding for the output after running the tokenization state machine and
+tree-construction stage.
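
The test-file format described in the README can be read with a short parser. The following is a minimal Python sketch and is not part of this commit. The function name `parse_encoding_tests` is hypothetical, and the sketch assumes the markers appear in the test files as the literal lines `#data` and `#encoding` (the backslashes in the README being Markdown escapes). It works on bytes because the test data must reach the system under test unchanged.

```python
from typing import List, Tuple


def parse_encoding_tests(raw: bytes) -> List[Tuple[bytes, str]]:
    """Return (test_data, expected_encoding) pairs from one test file.

    Hypothetical helper: assumes the literal marker lines "#data" and
    "#encoding", with LF line endings, as the README describes.
    """
    tests: List[Tuple[bytes, str]] = []
    lines = raw.split(b"\n")
    data_lines = None  # None until a "#data" line has been seen
    i = 0
    while i < len(lines):
        line = lines[i]
        if data_lines is None:
            # Between tests: skip blank separator lines until "#data".
            if line == b"#data":
                data_lines = []
        elif line == b"#encoding":
            # Joining with LF omits the newline after the last data line,
            # which the format says must be removed.
            data = b"\n".join(data_lines)
            name = lines[i + 1] if i + 1 < len(lines) else b""
            tests.append((data, name.decode("ascii").strip()))
            data_lines = None
            i += 1  # skip the encoding-name line just consumed
        else:
            data_lines.append(line)
        i += 1
    return tests


# Illustrative input only; these two tests are made up for the example.
sample = (
    b"#data\n<meta charset=utf-8>\n#encoding\nutf-8\n"
    b"\n"
    b"#data\n<p>hi\n<p>there\n#encoding\nwindows-1252\n"
)
for data, encoding in parse_encoding_tests(sample):
    print(encoding, data)
```

The parser itself is the same for every test file. Whether each expected encoding is compared against the output of the encoding sniffing algorithm alone or against the result of a full parse depends on whether the file lives in the `preparsed` subdirectory.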