[bug] DocumentFragment parse behavior wrt `body` tag #3023

flavorjones · 2023-11-15T18:00:07Z

Please describe the bug

#!/usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri"
end

doc = <<~HTML
 <html>
   <body class="foo">
     <div>hello</div>
   </body>
 </html>
HTML

puts Nokogiri::HTML4::DocumentFragment.parse(doc).to_html
puts "---"
puts Nokogiri::HTML5::DocumentFragment.parse(doc).to_html

emits:

<body>
  
    <div>hello</div>
  

</body>
---

  
    <div>hello</div>

Expected behavior

I would expect:

the body tag to maintain the attributes declared on it
the HTML4 and HTML5 parsers to have the same behavior

Additional context

@marcoroth reported at RubyConf! woot

The text was updated successfully, but these errors were encountered:

flavorjones · 2024-04-08T22:02:32Z

@marcoroth I spent some time on this today, and I have a potential workaround for you that I'd like your feedback on, and if it works for Stimulus Reflex then I'll put some work in to make it easier to use.

first some questions about the use case

The "fragment" above is not a fragment -- it's a complete document. If possible, in these cases, it's better to parse it with Document.parse than DocumentFragment.parse.

I'd like to understand the use case of Stimulus Reflex a bit better to understand if there's a better API that Nokogiri could be offering here. Do you really not know if the content is a document or a fragment?

transform the code into an equivalent

First let me convert the code above to use DocumentFragment.new instead of .parse:

#!/usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
end

html = <<~HTML
 <html>
   <body class="foo">
     <div>hello</div>
   </body>
 </html>
HTML

puts Nokogiri::HTML4::DocumentFragment.new(Nokogiri::HTML4::Document.new, html).to_html
puts "---"
puts Nokogiri::HTML5::DocumentFragment.new(Nokogiri::HTML5::Document.new, html).to_html

Note that the output is the same:

<body>

    <div>hello</div>


</body>
---


    <div>hello</div>

diagnosis

By default, both parsers will try to parse the fragment in the context of a body tag.

The HTML4 behavior that we see is due to a hack made a long time ago in c23eb88: if a context node isn't specified, and there happens to be a line that matches /^\s*<body/i, then we return the tree including the body node.

If you change the script above to format the HTML slightly differently:

html = <<~HTML
 <html><body class="foo">
     <div>hello</div>
   </body>
 </html>
HTML

the output changes for the HTML4 fragment parser:


    <div>hello</div>
  

---

    <div>hello</div>

I really don't like this code very much, for a few reasons:

there's no way to parse head fragments as part of the head
the "body" tag should be more consistently returned, and IMHO it should never be returned in the situations where we assume a "context node" of body

but it's been there for 14 years now and I can't really change it without breaking people's code.

We see what we see for the HTML5 fragment parser because it's working exactly as intended: we parse in the context of a body tag, and we return the children of that context node. IMHO this is the correct implementation and I'm hesitant to make it more complicated to try to match the hacky HTML4 implementation.

potential workaround

The HTML5::DocumentFragment.new method supports a "context node" argument, so we can tell the parser more exactly what we want to do. In the above case, because it's a complete document and not a fragment, I think what we really want to do is parse in the context of an html tag, so let's try that:

puts Nokogiri::HTML5::DocumentFragment.new(Nokogiri::HTML5::Document.new, html, "html").to_html

will output

<head></head><body class="foo">
    <div>hello</div>


</body>

A couple of things to call out:

good: the original body tag including attributes is present
the head tag is also present, meaning that if the input had head content, it would be here, too:

html = <<~HTML
 <html>
   <head>
     <link rel="stylesheet" href="mystyle.css">
   </head>
   <body class="foo">
     <div>hello</div>
   </body>
 </html>
HTML

# ...

outputs

<body>

    <link rel="stylesheet" href="mystyle.css">


    <div>hello</div>


</body>
---
<head>
    <link rel="stylesheet" href="mystyle.css">
  </head>
  <body class="foo">
    <div>hello</div>


</body>

Perhaps naively, the HTML5 parser's output matches the input more exactly than the HTML4 parser's output. But I have no idea whether this is helpful for Stimulus Reflex's use case.

Does this work for you? If not, why not? If so, then I'll do some work to make the context node input available to callers of DocumentFragment.parse.

marcoroth · 2024-04-23T11:30:54Z

Hey @flavorjones, thanks for taking a look at this!

The "fragment" above is not a fragment -- it's a complete document. If possible, in these cases, it's better to parse it with Document.parse than DocumentFragment.parse.

Yeah this is true, ideally we could pick the right class to parse an HTML document. We have had efforts (like stimulusreflex/stimulus_reflex#601, stimulusreflex/stimulus_reflex#622) that tried to pick the right class depending on the context.

But sadly we haven't found one Nokogiri class/method combination that solves all the behaviors we saw/needed. It seems like that we would need a combination of the parts of behavior we saw in these classes:

Nokogiri:HTML4::Document.parse
Nokogiri:HTML4::DocumentFragment.parse
Nokogiri:HTML5::Document.parse
Nokogiri:HTML5::DocumentFragment.parse

To explain the use-case a bit more: StimulusReflex exposes a morph method that takes in two arguments. The first argument can be any CSS selector and the second argument is an HTML string.

Usually we see people do thing like:

morph "#my-element", %(<div id="my-element>My element is updated!</div>)

In this case we'd want to use HTML5::DocumentFragment.parse(...) and that works perfectly fine. But there are also valid use-cases where people do something like:

morph "body", %(<body class="body with updated classes">...</body>)

In this case HTML5::DocumentFragment.parse(...) wouldn't work anymore because it strips elements/attributes and transforms the document.

In the phlexing gem I solved it by doing something like this, but this feels somehow hacky:

https://github.com/marcoroth/phlexing/blob/90b3af31ea56b70c101d5775d09f6f2c43a241c7/gem/lib/phlexing/parser.rb#L15-L29

Ideally Nokogiri would expose an API that would parse HTML as-is. So that it parses the HTML without having any spec in mind. And so, that it also wouldn't try to make it spec-compliant. I guess you could say we are looking for something like an AST for HTML that returns a representation of whatever you give it, be it valid/spec-conform or not.

But I would totally understand if this is out-of-scope for a gem like Nokogiri, which is also why I mentioned in stimulusreflex/stimulus_reflex#652 that it might make sense to move off of Nokogiri.

The problem is that Rails doesn't really care if you ship spec-compliant HTML to your browser through a Rails request/response cycle. But since StimulusReflex tries to dynamically update the rendered page by re-rendering the view and by passing it through Nokogiri it breaks the flow if you have non-spec-compliant HTML.

If people would write HTML5-compliant HTML views we wouldn't have this discussion, but since modern browsers got so relaxed with processing invalid HTML it now gets into the way when StimulusReflex tries to update the view using Nokogiri. Maybe we can also do something in Rails that warns people if their controller action returns invalid HTML?

Potential workaround: The HTML5::DocumentFragment.new method supports a "context node" argument

While I think that a "context node" argument would be useful I don't think it really solves our problem, since we'd still have to know which the right context node is given an arbitrary string of HTML.

Working on this parsing topic through different projects and requirements (besides StimulusReflex) I think there's a bigger need for a tool/parser that goes beyond Nokogiri and even Ruby.

I'm really unsure how we can and should proceed here. Maybe it would make sense to talk it through face-to-face so we don't talk past each other. Thank you for the support so far!

flavorjones · 2024-04-23T12:27:13Z

@marcoroth Let's chat in slack, I think we'd benefit from higher-fidelity conversation.

flavorjones · 2024-04-24T21:21:01Z

Circling back to close this off: after a good long chat, I think I understand the StimulusReflex use case much better, and I think Marco understands the challenges presented by the HTML5 spec a bit better.

In summary, I think we agreed on this set of problem statements, which Nokogiri doesn't help answer today:

is the input a document or a fragment?
if it's a fragment, what should the context node be?
if additional parent nodes are created, how do we select the correct child nodes?

Figuring out how to do this as a user of Nokogiri's API is just too hard today. So I made a thing to try to ease this pain: https://github.com/flavorjones/nokogiri-html5-inference

I'm going to close this ticket, and hopefully we can continue the conversation about what's working or not in an issue on that repo!

Thanks again for bringing this up, @marcoroth!

marcoroth · 2024-04-24T23:14:38Z

Thanks for the summary and for talking this through @flavorjones, it's really appreciated! 🙏🏼

stevecheckoway · 2024-04-25T00:12:54Z

I spent a little time testing if this could be parsed as a fragment with a template node as the context but unfortunately a body start tag is ignored in that case.

Parsing using a template node as the context node may help the other cases like parsing td elements.

@flavorjones

…t parsing (#696) This pull request implements the [`nokogiri-html5-inference`](https://github.com/flavorjones/nokogiri-html5-inference) to deal with parsing HTML5 fragments to be used with Selector Morphs. The `nokogiri-html5-inference` gem come about out of the discussions in #652 and sparklemotion/nokogiri#3023. Fixes #652, Resolves #692, Resolves #674 Big thanks to @flavorjones to talking and working this through with me! ## Why should this be added This pull request makes the handling of document fragments for selector morphs more natural/predictable and resolves the last open issues to release the final versions of StimulusReflex 3.5.

flavorjones added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Nov 15, 2023

marcoroth mentioned this issue Nov 26, 2023

Properly handle HTML fragments with self closing tags stimulusreflex/stimulus_reflex#674

Closed

2 tasks

flavorjones added topic/fragment and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Jan 30, 2024

flavorjones mentioned this issue Apr 8, 2024

Document fragment now do not handle 2 siblings inputs stimulusreflex/stimulus_reflex#652

Closed

flavorjones closed this as completed Apr 24, 2024

marcoroth mentioned this issue Apr 24, 2024

Implement nokogiri-html5-inference to properly support HTML fragment parsing stimulusreflex/stimulus_reflex#696

Merged

2 tasks

stevecheckoway mentioned this issue Apr 25, 2024

Using a template context element flavorjones/nokogiri-html5-inference#2

Closed

flavorjones mentioned this issue Jul 15, 2024

Fix fragment quirks mode, and make it easier to specify fragment context #3246

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bug] DocumentFragment parse behavior wrt `body` tag #3023

[bug] DocumentFragment parse behavior wrt `body` tag #3023

flavorjones commented Nov 15, 2023

flavorjones commented Apr 8, 2024

marcoroth commented Apr 23, 2024 •

edited

Loading

flavorjones commented Apr 23, 2024

flavorjones commented Apr 24, 2024

marcoroth commented Apr 24, 2024

stevecheckoway commented Apr 25, 2024

[bug] DocumentFragment parse behavior wrt body tag #3023

[bug] DocumentFragment parse behavior wrt body tag #3023

Comments

flavorjones commented Nov 15, 2023

flavorjones commented Apr 8, 2024

first some questions about the use case

transform the code into an equivalent

diagnosis

potential workaround

marcoroth commented Apr 23, 2024 • edited Loading

flavorjones commented Apr 23, 2024

flavorjones commented Apr 24, 2024

marcoroth commented Apr 24, 2024

stevecheckoway commented Apr 25, 2024

[bug] DocumentFragment parse behavior wrt `body` tag #3023

[bug] DocumentFragment parse behavior wrt `body` tag #3023

marcoroth commented Apr 23, 2024 •

edited

Loading