Plan for encoding/simpler prefetching in html5ever

This issue collects some structural changes I want to make to html5ever (and its integration into Servo).
The intent is both for me to use it as a rubber duck and for reviewers to check that the plan makes sense, so complaints and questions are welcome.


### Current architecture

For simplicity, the following diagram omits XML parsing, which is not really relevant for now.

```
                              │                                                                              
                              │ Bytes                                                                        
                              │                                                                              
                              ▼                                        ──┐                                   
                        ┌───────────┐    Bytes       ┌─────────────┐     │                                   
               Utf8     │ServoParser├───────────────►│   Decoder   │     │                                   
         ┌──────────────┤(HTML only)│◄───────────────┤Bytes -> Utf8│     │ Servo                                   
         │              └─┬─────────┘    Utf8        └─────────────┘     │                                   
         │                │       ▲                                    ──┘                                   
         │                │       │                                                                          
         │                │       │                                                                          
         │                │       │                                                                          
         │           Utf8 │       │ blocking <script> tags                                                            
         │                │       │                                                                          
         │                │       │                                                                          
         ▼                ▼       │                                    ──┐                                   
┌────────────────┐  ┌─────────────┴──────────────┐                       │                                   
│    Prefetcher  │  │      HTML Parser           │                       │                                   
│                │  │                            │                       │                                   
│   ┌─────────┐  │  │     ┌─────────┐            │                       │                                   
│   │Tokenizer│  │  │     │Tokenizer│            │                       │                                   
│   └───┬─────┘  │  │     └─┬───────┘            │                       │                                   
│       │        │  │Tokens │     ▲              │                       │                                   
│       │ Tokens │  │       │     │              │                       │                                   
│       │        │  │       ▼     │ <script> tags│                       │ html5ever                                  
│       ▼        │  │    ┌────────┴───┐          │                       │                                   
│ ┌─────────────┐│  │    │Tree Builder│          │                       │                                   
│ │Prefetch Sink││  │    └──────┬─────┘          │                       │                                   
│ └─────────────┘│  │           │                │                       │                                   
└────────────────┘  │           │ Tree ops       │                       │                                   
                    │           ▼                │                       │                                   
                    │      ┌─────────┐           │                       │                                   
                    │      │Tree Sink│           │                       │                                   
                    │      └─────────┘           │                       │                                   
                    └────────────────────────────┘                       │                                  
                                                                       ──┘                                   
```

There are multiple downsides to this architecture:

* The prefetcher blocks the main thread (https://github.com/servo/servo/issues/36482) and tokenizes everything a second time. Tokenization can take a significant amount of time (250ms for https://html.spec.whatwg.org)
* Encoding support with `<meta charset>` tags is hard, because the decoder is far away from the parser
* Reentrancy requires awkward interior mutability in html5ever, because the `TreeSink` can invoke the `ServoParser` again with `document.write`

### Encoding support
When the parser encounters a `<meta charset>` tag, it cannot proceed immediately. Instead, a message needs to be bubbled up from the parser, through the tokenizer, to the place where the decoding happens, and there we need to [`change the encoding while parsing`](https://html.spec.whatwg.org/#change-the-encoding). This is why the parser architecture would benefit from the decoder being closer to the tree builder.
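
To make the "bubble up" step concrete, here is a toy sketch using only the standard library. Everything in it (`Encoding`, `FeedResult`, `toy_parse`, `parse_document`) is a hypothetical name made up for illustration and not html5ever's API. It also only models the simplest case, where the already-decoded text is thrown away and all bytes are decoded again with the new encoding. The spec algorithm has more branches than that.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Encoding {
    Utf8,
    /// Toy stand-in: every byte maps to the code point with the same value.
    Latin1,
}

/// Hypothetical message bubbled up from the parser to whoever owns the decoder.
enum FeedResult {
    Done,
    EncodingIndicator(Encoding),
}

fn decode(bytes: &[u8], encoding: Encoding) -> String {
    match encoding {
        Encoding::Utf8 => String::from_utf8_lossy(bytes).into_owned(),
        Encoding::Latin1 => bytes.iter().map(|&b| b as char).collect(),
    }
}

/// Stand-in for tokenizer + tree builder: it only recognizes one hard-coded tag.
fn toy_parse(text: &str) -> FeedResult {
    if text.to_ascii_lowercase().contains("<meta charset=\"iso-8859-1\">") {
        FeedResult::EncodingIndicator(Encoding::Latin1)
    } else {
        FeedResult::Done
    }
}

/// The layer that owns the raw bytes reacts to the indicator by decoding again.
fn parse_document(bytes: &[u8]) -> (Encoding, String) {
    let tentative = Encoding::Utf8;
    let text = decode(bytes, tentative);
    match toy_parse(&text) {
        FeedResult::EncodingIndicator(new) if new != tentative => (new, decode(bytes, new)),
        _ => (tentative, text),
    }
}
```

Calling `parse_document(b"<meta charset=\"iso-8859-1\">caf\xE9")` first decodes the input as UTF-8, sees the indicator, and then decodes the same bytes again as Latin-1. The further the owner of the bytes is from the code that sees the tag, the more layers this result has to travel through.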

Right now the decoding happens in the `network_decoder` and `network_input` fields on `ServoParser`: https://github.com/servo/servo/blob/4e9993128b81b5a3757970786d47fb165ed3ebca/components/script/dom/servoparser/mod.rs#L111-L116.
The problem with naively moving the decoding process into html5ever is that all input would be decoded twice (once in the prefetcher and once in the "main" parser).

The final design plan is to have an (optional) wrapper around the `Tokenizer` which handles both decoding and buffering of input.
Pretty much all of that is already implemented in https://github.com/servo/html5ever/pull/590 with the `DecodingParser` type. Note that this type also takes care of the intricacies of `document.write`, which might benefit other users of `html5ever`. Nico requested that this `DecodingParser` stay behind a feature flag.

### Parallel Parsing / Real Prefetching

My current plan is very similar to https://github.com/servo/servo/pull/19203, except that I want the input stream to live in the parser thread. Otherwise you have to send (read: clone) the input each time you invoke the parser. The [input stream](https://html.spec.whatwg.org/#input-stream) is a spec concept that buffers input which has been received from the network but has not yet been processed by the tokenizer. It is also where the decoding from bytes to UTF-8 happens. `html5ever` does not currently implement this.
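
As a rough illustration of what an input stream has to do, the following sketch buffers network chunks and hands out decoded text, carrying an incomplete multi-byte sequence over to the next chunk. The `InputStream` type and its methods are hypothetical names, and the sketch assumes the input is always UTF-8. A real implementation would need a decoder for arbitrary encodings.

```rust
use std::collections::VecDeque;

/// Hypothetical input stream; html5ever does not currently provide this type.
#[derive(Default)]
struct InputStream {
    /// Trailing bytes of an incomplete UTF-8 sequence from the previous chunk.
    pending: Vec<u8>,
    /// Decoded text that the tokenizer has not consumed yet.
    decoded: VecDeque<String>,
}

impl InputStream {
    /// Called whenever bytes arrive from the network.
    fn push_bytes(&mut self, chunk: &[u8]) {
        self.pending.extend_from_slice(chunk);
        let bytes = std::mem::take(&mut self.pending);
        let mut rest: &[u8] = &bytes;
        let mut out = String::new();
        loop {
            match std::str::from_utf8(rest) {
                Ok(text) => {
                    out.push_str(text);
                    rest = &[];
                    break;
                }
                Err(error) => {
                    let (valid, after) = rest.split_at(error.valid_up_to());
                    // `valid` is well-formed by definition of `valid_up_to`.
                    out.push_str(std::str::from_utf8(valid).unwrap());
                    match error.error_len() {
                        // Invalid sequence: replace it and keep decoding.
                        Some(len) => {
                            out.push('\u{FFFD}');
                            rest = &after[len..];
                        }
                        // Sequence cut off by the chunk boundary: wait for more bytes.
                        None => {
                            rest = after;
                            break;
                        }
                    }
                }
            }
        }
        self.pending = rest.to_vec();
        if !out.is_empty() {
            self.decoded.push_back(out);
        }
    }

    /// Called by the tokenizer when it wants more input.
    fn next_chunk(&mut self) -> Option<String> {
        self.decoded.pop_front()
    }
}
```

Because the buffer is owned by a single struct, it can live entirely in the parser thread, and the main thread only has to move each byte chunk across once.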

Below is a somewhat simplified diagram. The `extra info` sent to the parser thread mostly relates to `document.write` and is not included here for simplicity.

```
                                                          
 ┌────────────────┐                   ┌─────────────────┐ 
 │  Main Thread   │                   │  Parser Thread  │ 
 │                │      Bytes        │  ┌────────────┐ │ 
 │                ├─────────────────► │  │Input Stream│ │ 
 │ ┌───────────┐  │  (+ extra info)   │  └────┬───────┘ │ 
 │ │ServoParser│  │                   │       ▼         │ 
 │ └────┬──────┘  │                   │  ┌─────────┐    │ 
 │      │         │   Prefetch Ops    │  │Tokenizer│    │ 
 │      │ Parse Op│◄──────────────────┤  └────┬────┘    │ 
 │      ▼         │     Parse Ops     │       ▼         │ 
 │┌─────────────┐ │                   │  ┌───────────┐  │ 
 ││ParseExecutor│ │                   │  │TreeBuilder│  │ 
 │└─────────────┘ │                   │  └───────────┘  │ 
 └────────────────┘                   └─────────────────┘ 
```

A parse operation in the diagram above could be something like `AppendChild` or `SetQuirksMode`, mirroring the methods of the current [`TreeSink`](https://docs.rs/html5ever/latest/html5ever/interface/trait.TreeSink.html) trait.

Notice how this design allows us to support reentrancy without interior mutability in html5ever: the parser thread does not need to know about reentrant parsing at all, since it just processes input from the main thread.
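
The message flow in the diagram could be sketched with two `std::sync::mpsc` channels as shown below. This is a toy under heavy assumptions: `ParseOp`, `ParseExecutor`, `spawn_parser_thread` and `parse_chunks` are hypothetical names, the "parser" only recognizes bare `<name>` tags and appends them all to the document node, prefetch ops and `document.write` are left out, and chunks are assumed to split on character boundaries.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical operations, loosely mirroring `TreeSink` methods.
#[derive(Debug, PartialEq)]
enum ParseOp {
    CreateElement { id: usize, name: String },
    AppendChild { parent: usize, child: usize },
    SetQuirksMode(bool),
}

/// Main-thread side: applies operations to a toy tree (node 0 is the document).
struct ParseExecutor {
    names: Vec<String>,
    children: Vec<Vec<usize>>,
    quirks: bool,
}

impl ParseExecutor {
    fn new() -> Self {
        // Toy rule: quirks mode stays on until a doctype is reported.
        Self { names: vec!["#document".into()], children: vec![Vec::new()], quirks: true }
    }

    fn apply(&mut self, op: ParseOp) {
        match op {
            ParseOp::CreateElement { id, name } => {
                assert_eq!(id, self.names.len(), "toy sketch: ids arrive in order");
                self.names.push(name);
                self.children.push(Vec::new());
            }
            ParseOp::AppendChild { parent, child } => self.children[parent].push(child),
            ParseOp::SetQuirksMode(quirks) => self.quirks = quirks,
        }
    }
}

/// Parser-thread side: owns its buffer and state as plain local variables.
fn spawn_parser_thread() -> (mpsc::Sender<Vec<u8>>, mpsc::Receiver<ParseOp>) {
    let (bytes_tx, bytes_rx) = mpsc::channel::<Vec<u8>>();
    let (ops_tx, ops_rx) = mpsc::channel::<ParseOp>();
    thread::spawn(move || {
        let mut buffer = String::new(); // stand-in for the input stream
        let mut next_id = 1;
        for chunk in bytes_rx {
            buffer.push_str(&String::from_utf8_lossy(&chunk));
            // Handle every complete `<...>` currently in the buffer.
            while let Some(end) = buffer.find('>') {
                let tag: String = buffer.drain(..=end).collect();
                let Some(start) = tag.find('<') else { continue };
                let name = tag[start + 1..tag.len() - 1].trim();
                let ops = if name.starts_with('!') {
                    vec![ParseOp::SetQuirksMode(false)]
                } else if name.is_empty() || name.starts_with('/') {
                    Vec::new()
                } else {
                    let id = next_id;
                    next_id += 1;
                    vec![
                        ParseOp::CreateElement { id, name: name.to_string() },
                        ParseOp::AppendChild { parent: 0, child: id },
                    ]
                };
                for op in ops {
                    if ops_tx.send(op).is_err() {
                        return; // the main thread went away
                    }
                }
            }
        }
    });
    (bytes_tx, ops_rx)
}

/// Main thread: forward bytes, then apply whatever operations come back.
fn parse_chunks(chunks: &[&[u8]]) -> ParseExecutor {
    let (bytes_tx, ops_rx) = spawn_parser_thread();
    for chunk in chunks {
        bytes_tx.send(chunk.to_vec()).expect("parser thread is alive");
    }
    drop(bytes_tx); // closing the channel signals end of input
    let mut executor = ParseExecutor::new();
    for op in ops_rx {
        executor.apply(op);
    }
    executor
}
```

Note that the parser thread never calls back into the main thread. All of its state is ordinary `let mut` data, which is the property that lets reentrancy be handled entirely on the main-thread side.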

### Ordering of changes
The current plan is to
1) Move buffering of input into `html5ever`. This makes everything else easier, but it will be a significant breaking change to the API!
2) Implement parallel HTML parsing in servo, to be able to implement encoding support without decoding everything twice.
3) Implement support for `<meta charset>`

