Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] DocumentFragment parse behavior wrt body tag #3023

Closed
flavorjones opened this issue Nov 15, 2023 · 6 comments
Closed

[bug] DocumentFragment parse behavior wrt body tag #3023

flavorjones opened this issue Nov 15, 2023 · 6 comments

Comments

@flavorjones
Copy link
Member

Please describe the bug

#!/usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri"
end

doc = <<~HTML
 <html>
   <body class="foo">
     <div>hello</div>
   </body>
 </html>
HTML

puts Nokogiri::HTML4::DocumentFragment.parse(doc).to_html
puts "---"
puts Nokogiri::HTML5::DocumentFragment.parse(doc).to_html

emits:

<body>
  
    <div>hello</div>
  

</body>
---

  
    <div>hello</div>
  


Expected behavior

I would expect:

  • the body tag to maintain the attributes declared on it
  • the HTML4 and HTML5 parsers to have the same behavior

Additional context

@marcoroth reported at RubyConf! woot

@flavorjones flavorjones added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Nov 15, 2023
@flavorjones flavorjones added topic/fragment and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Jan 30, 2024
@flavorjones
Copy link
Member Author

@marcoroth I spent some time on this today, and I have a potential workaround for you that I'd like your feedback on, and if it works for Stimulus Reflex then I'll put some work in to make it easier to use.

first some questions about the use case

The "fragment" above is not a fragment -- it's a complete document. If possible, in these cases, it's better to parse it with Document.parse than DocumentFragment.parse.

I'd like to understand the use case of Stimulus Reflex a bit better to understand if there's a better API that Nokogiri could be offering here. Do you really not know if the content is a document or a fragment?

transform the code into an equivalent

First let me convert the code above to use DocumentFragment.new instead of .parse:

#!/usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
end

html = <<~HTML
 <html>
   <body class="foo">
     <div>hello</div>
   </body>
 </html>
HTML

puts Nokogiri::HTML4::DocumentFragment.new(Nokogiri::HTML4::Document.new, html).to_html
puts "---"
puts Nokogiri::HTML5::DocumentFragment.new(Nokogiri::HTML5::Document.new, html).to_html

Note that the output is the same:

<body>

    <div>hello</div>


</body>
---


    <div>hello</div>


diagnosis

By default, both parsers will try to parse the fragment in the context of a body tag.

The HTML4 behavior that we see is due to a hack made a long time ago in c23eb88: if a context node isn't specified, and there happens to be a line that matches /^\s*<body/i, then we return the tree including the body node.

If you change the script above to format the HTML slightly differently:

html = <<~HTML
 <html><body class="foo">
     <div>hello</div>
   </body>
 </html>
HTML

the output changes for the HTML4 fragment parser:


    <div>hello</div>
  

---

    <div>hello</div>
  

I really don't like this code very much, for a few reasons:

  • there's no way to parse head fragments as part of the head
  • the "body" tag should be more consistently returned, and IMHO it should never be returned in the situations where we assume a "context node" of body

but it's been there for 14 years now and I can't really change it without breaking people's code.

We see what we see for the HTML5 fragment parser because it's working exactly as intended: we parse in the context of a body tag, and we return the children of that context node. IMHO this is the correct implementation and I'm hesitant to make it more complicated to try to match the hacky HTML4 implementation.

potential workaround

The HTML5::DocumentFragment.new method supports a "context node" argument, so we can tell the parser more exactly what we want to do. In the above case, because it's a complete document and not a fragment, I think what we really want to do is parse in the context of an html tag, so let's try that:

puts Nokogiri::HTML5::DocumentFragment.new(Nokogiri::HTML5::Document.new, html, "html").to_html

will output

<head></head><body class="foo">
    <div>hello</div>


</body>

A couple of things to call out:

  • good: the original body tag including attributes is present
  • the head tag is also present, meaning that if the input had head content, it would be here, too:
html = <<~HTML
 <html>
   <head>
     <link rel="stylesheet" href="mystyle.css">
   </head>
   <body class="foo">
     <div>hello</div>
   </body>
 </html>
HTML

# ...

outputs

<body>

    <link rel="stylesheet" href="mystyle.css">


    <div>hello</div>


</body>
---
<head>
    <link rel="stylesheet" href="mystyle.css">
  </head>
  <body class="foo">
    <div>hello</div>


</body>

Perhaps naively, the HTML5 parser's output matches the input more exactly than the HTML4 parser's output. But I have no idea whether this is helpful for Stimulus Reflex's use case.

Does this work for you? If not, why not? If so, then I'll do some work to make the context node input available to callers of DocumentFragment.parse.

@marcoroth
Copy link

marcoroth commented Apr 23, 2024

Hey @flavorjones, thanks for taking a look at this!

The "fragment" above is not a fragment -- it's a complete document. If possible, in these cases, it's better to parse it with Document.parse than DocumentFragment.parse.

Yeah this is true, ideally we could pick the right class to parse an HTML document. We have had efforts (like stimulusreflex/stimulus_reflex#601, stimulusreflex/stimulus_reflex#622) that tried to pick the right class depending on the context.

But sadly we haven't found one Nokogiri class/method combination that solves all the behaviors we saw/needed. It seems like that we would need a combination of the parts of behavior we saw in these classes:

  • Nokogiri:HTML4::Document.parse
  • Nokogiri:HTML4::DocumentFragment.parse
  • Nokogiri:HTML5::Document.parse
  • Nokogiri:HTML5::DocumentFragment.parse

To explain the use-case a bit more: StimulusReflex exposes a morph method that takes in two arguments. The first argument can be any CSS selector and the second argument is an HTML string.

Usually we see people do thing like:

morph "#my-element", %(<div id="my-element>My element is updated!</div>)

In this case we'd want to use HTML5::DocumentFragment.parse(...) and that works perfectly fine. But there are also valid use-cases where people do something like:

morph "body", %(<body class="body with updated classes">...</body>)

In this case HTML5::DocumentFragment.parse(...) wouldn't work anymore because it strips elements/attributes and transforms the document.

In the phlexing gem I solved it by doing something like this, but this feels somehow hacky:

https://github.com/marcoroth/phlexing/blob/90b3af31ea56b70c101d5775d09f6f2c43a241c7/gem/lib/phlexing/parser.rb#L15-L29

Ideally Nokogiri would expose an API that would parse HTML as-is. So that it parses the HTML without having any spec in mind. And so, that it also wouldn't try to make it spec-compliant. I guess you could say we are looking for something like an AST for HTML that returns a representation of whatever you give it, be it valid/spec-conform or not.

But I would totally understand if this is out-of-scope for a gem like Nokogiri, which is also why I mentioned in stimulusreflex/stimulus_reflex#652 that it might make sense to move off of Nokogiri.

The problem is that Rails doesn't really care if you ship spec-compliant HTML to your browser through a Rails request/response cycle. But since StimulusReflex tries to dynamically update the rendered page by re-rendering the view and by passing it through Nokogiri it breaks the flow if you have non-spec-compliant HTML.

If people would write HTML5-compliant HTML views we wouldn't have this discussion, but since modern browsers got so relaxed with processing invalid HTML it now gets into the way when StimulusReflex tries to update the view using Nokogiri. Maybe we can also do something in Rails that warns people if their controller action returns invalid HTML?

Potential workaround: The HTML5::DocumentFragment.new method supports a "context node" argument

While I think that a "context node" argument would be useful I don't think it really solves our problem, since we'd still have to know which the right context node is given an arbitrary string of HTML.

Working on this parsing topic through different projects and requirements (besides StimulusReflex) I think there's a bigger need for a tool/parser that goes beyond Nokogiri and even Ruby.

I'm really unsure how we can and should proceed here. Maybe it would make sense to talk it through face-to-face so we don't talk past each other. Thank you for the support so far!

@flavorjones
Copy link
Member Author

@marcoroth Let's chat in slack, I think we'd benefit from higher-fidelity conversation.

@flavorjones
Copy link
Member Author

Circling back to close this off: after a good long chat, I think I understand the StimulusReflex use case much better, and I think Marco understands the challenges presented by the HTML5 spec a bit better.

In summary, I think we agreed on this set of problem statements, which Nokogiri doesn't help answer today:

  1. is the input a document or a fragment?
  2. if it's a fragment, what should the context node be?
  3. if additional parent nodes are created, how do we select the correct child nodes?

Figuring out how to do this as a user of Nokogiri's API is just too hard today. So I made a thing to try to ease this pain: https://github.com/flavorjones/nokogiri-html5-inference

I'm going to close this ticket, and hopefully we can continue the conversation about what's working or not in an issue on that repo!

Thanks again for bringing this up, @marcoroth!

@marcoroth
Copy link

Thanks for the summary and for talking this through @flavorjones, it's really appreciated! 🙏🏼

@stevecheckoway
Copy link
Contributor

I spent a little time testing if this could be parsed as a fragment with a template node as the context but unfortunately a body start tag is ignored in that case.

Parsing using a template node as the context node may help the other cases like parsing td elements.

marcoroth added a commit to stimulusreflex/stimulus_reflex that referenced this issue May 5, 2024
…t parsing (#696)

This pull request implements the
[`nokogiri-html5-inference`](https://github.com/flavorjones/nokogiri-html5-inference)
to deal with parsing HTML5 fragments to be used with Selector Morphs.

The `nokogiri-html5-inference` gem come about out of the discussions in
#652 and sparklemotion/nokogiri#3023.

Fixes #652, Resolves
#692, Resolves
#674

Big thanks to @flavorjones to talking and working this through with me!

## Why should this be added

This pull request makes the handling of document fragments for selector
morphs more natural/predictable and resolves the last open issues to
release the final versions of StimulusReflex 3.5.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants