Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] parsing a fragment in-context may yield a different DOM than parsing the same document due to quirks mode handling #2646

Closed
dan42 opened this issue Sep 13, 2022 · 7 comments · Fixed by #2736
Labels
topic/gumbo Gumbo HTML5 parser
Milestone

Comments

@dan42
Copy link

dan42 commented Sep 13, 2022

I'm not sure if the title above is correct, but please consider these two ways of parsing a <table> within a <p>:

require "nokogiri"
tags = "<p>A <table><tr><td>B</table> C"
doc = Nokogiri::HTML5.parse("<div>"+tags)
puts doc
doc.at("div").inner_html = tags
puts doc

output:

<html><head></head><body><div><p>A <table><tbody><tr><td>B</td></tr></tbody></table> C</p></div></body></html>
<html><head></head><body><div><p>A </p><table><tbody><tr><td>B</td></tr></tbody></table> C</div></body></html>

I'm fairly certain the output should not be different. In both cases we're parsing the same tags, and the parent element is a <div>. Note that Nokogiri::HTML5.fragment(tags) produces the same structure as Nokogiri::HTML5.parse(tags). So this bug/behavior looks specific to coerce somehow. But in the code sample above if we replace HTML5.parse by HTML5.fragment, the two puts doc produce the same output. So honestly I'm not sure what's going on.

@dan42 dan42 added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Sep 13, 2022
@flavorjones
Copy link
Member

Thanks for opening this issue! Great question.

I dug in a bit and here's a code snippet that explains a bit about what's happening, which is that inner_html= parses a HTML5::DocumentFragment from the tags string:

#! /usr/bin/env ruby

require "nokogiri"

tags = "<p>A <table><tr><td>B</table> C"
doc = Nokogiri::HTML5.parse("<div>"+tags)
doc.at_css("div").to_s # => "<div><p>A <table><tbody><tr><td>B</td></tr></tbody></table> C</p></div>"

doc = Nokogiri::HTML5.parse("<div>")
doc.at_css("div").inner_html = tags
doc.at_css("div").to_s # => "<div><p>A </p><table><tbody><tr><td>B</td></tr></tbody></table> C</div>"

# making explicit the calls which implement Node#inner_html=
doc = Nokogiri::HTML5.parse("<div>")
frag = Nokogiri::HTML5::DocumentFragment.new(doc, tags, doc.at_css("div"))
frag.to_s # => "<p>A </p><table><tbody><tr><td>B</td></tr></tbody></table> C"
doc.at_css("div").children = frag
doc.at_css("div").to_s # => "<div><p>A </p><table><tbody><tr><td>B</td></tr></tbody></table> C</div>"

I expect that the DOM structure should be identical, whether parsed as a document or as a fragment with a <div> context node.

I'll schedule some time to dig into the gumbo fragment parsing. Pinging @stevecheckoway for awareness and advice.

@flavorjones flavorjones added topic/gumbo Gumbo HTML5 parser and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Sep 14, 2022
@dan42
Copy link
Author

dan42 commented Sep 14, 2022

But the strange thing is that this isn't just about fragment parsing.

require "nokogiri"

tags = "<p>A <table><tr><td>B</table> C"

base = Nokogiri::HTML5.fragment("<div>")
frag = Nokogiri::HTML5::DocumentFragment.new(base.document, tags, base.at_css("div"))
frag.to_s #=> "<p>A <table><tbody><tr><td>B</td></tr></tbody></table> C</p>"

In this case the result is correct even though we're using fragment parsing. The bug seems to occur only when parsing a fragment WHILE using a document that was previously created via document parsing.

@flavorjones flavorjones added this to the v1.15.0 milestone Nov 22, 2022
@flavorjones
Copy link
Member

flavorjones commented Nov 27, 2022

OK, I've narrowed this down to whether the parser is in "quirks mode" or not.

Minimal repro

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
end

require "nokogiri"

tags = "<p><table></table>"

doc = Nokogiri::HTML5.parse("<div>"+tags)
doc.at_css("div").to_s # => "<div><p><table></table></p></div>"

doc = Nokogiri::HTML5.parse("<div>")
doc.at_css("div").inner_html = tags
doc.at_css("div").to_s # => "<div><p></p><table></table></div>"

Diagnosis

In the C code for Gumbo.fragment there's this clause which in this case sets NO_QUIRKS mode:

  // Quirks mode.
  VALUE doc = rb_funcall(doc_fragment, rb_intern_const("document"), 0);
  VALUE dtd = rb_funcall(doc, internal_subset, 0);
  if (NIL_P(dtd)) {
    quirks_mode = GUMBO_DOCTYPE_NO_QUIRKS;
  } else {
    VALUE dtd_name = rb_funcall(dtd, name, 0);
    VALUE pubid = rb_funcall(dtd, rb_intern_const("external_id"), 0);
    VALUE sysid = rb_funcall(dtd, rb_intern_const("system_id"), 0);
    quirks_mode = gumbo_compute_quirks_mode(
                    NIL_P(dtd_name) ? NULL : StringValueCStr(dtd_name),
                    NIL_P(pubid) ? NULL : StringValueCStr(pubid),
                    NIL_P(sysid) ? NULL : StringValueCStr(sysid)
                  );
  }

What's happening in this case is that if we parse <div><p><table></table> as a document, then gumbo puts the document into GUMBO_DOCTYPE_QUIRKS and so when handle_in_body sees the table tag this clause doesn't close the p tag:

  if (tag_is(token, kStartTag, GUMBO_TAG_TABLE)) {
    if (
      get_document_node(parser)->v.document.doc_type_quirks_mode
        != GUMBO_DOCTYPE_QUIRKS
    ) {
      maybe_implicitly_close_p_tag(parser, token);
    }

However, when we parse <p><table></table> as a fragment in the context of the div tag, then quirks mode is reset to NO_QUIRKS and so that clause does close the p tag.

Possible solutions

@stevecheckoway I wanted to ask you about this behavior. Rather than try to re-compute quirks mode during fragment parsing initialization, would it be preferable to persist the quirks mode as an attribute on the Nokogiri::HTML5::Document object, and then have any fragment parsing within that document use that value?

This is what I'm suggesting:

diff --git a/ext/nokogiri/gumbo.c b/ext/nokogiri/gumbo.c
index c4211d3..5150145 100644
--- a/ext/nokogiri/gumbo.c
+++ b/ext/nokogiri/gumbo.c
@@ -361,6 +361,7 @@ parse_continue(VALUE parse_args)
   build_tree(doc, (xmlNodePtr)doc, output->document);
   VALUE rdoc = noko_xml_document_wrap(args->klass, doc);
   rb_iv_set(rdoc, "@url", args->url_or_frag);
+  rb_iv_set(rdoc, "@quirks_mode", INT2NUM(output->document->v.document.doc_type_quirks_mode));
   args->doc = NULL; // The Ruby runtime now owns doc so don't delete it.
   add_errors(output, rdoc, args->input, args->url_or_frag);
   return rdoc;
@@ -517,7 +518,10 @@ fragment(
   // Quirks mode.
   VALUE doc = rb_funcall(doc_fragment, rb_intern_const("document"), 0);
   VALUE dtd = rb_funcall(doc, internal_subset, 0);
-  if (NIL_P(dtd)) {
+  VALUE document_quirks_mode = rb_iv_get(doc, "@quirks_mode");
+  if (!NIL_P(document_quirks_mode)) {
+    quirks_mode = NUM2INT(document_quirks_mode);
+  } else if (NIL_P(dtd)) {
     quirks_mode = GUMBO_DOCTYPE_NO_QUIRKS;
   } else {
     VALUE dtd_name = rb_funcall(dtd, name, 0);

The result of this patch is that the fragment "inherits" the quirks mode from the document, and applies it, and so the result is consistent:

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
end

require "nokogiri"

tags = "<p><table></table>"

doc = Nokogiri::HTML5.parse("<div>"+tags)
doc.at_css("div").to_s # => "<div><p><table></table></p></div>"

doc = Nokogiri::HTML5.parse("<div>")
doc.at_css("div").inner_html = tags
doc.at_css("div").to_s # => "<div><p><table></table></p></div>"

If that seems like the right thing to do, I'm happy to write up some tests and submit a PR. Let me know if you have other ideas?

Update: I guess one more idea is simply to have the fragment default to QUIRKS_MODE? I haven't done the work to understand when/why the document is put into quirks mode, so I'm not sure if this is appropriate ... but handle_initial does seem to default to QUIRKS.

@flavorjones flavorjones changed the title [bug] parsing of <p> is different when a context is specified via coerce [bug] parsing of <p> is different when a context is specified via coerce Nov 27, 2022
@flavorjones flavorjones changed the title [bug] parsing of <p> is different when a context is specified via coerce [bug] parsing a fragment in-context may yield a different DOM than parsing the same document due to quirks mode handling Nov 27, 2022
@stevecheckoway
Copy link
Contributor

I'm sorry I didn't respond earlier. I missed this issue entirely.

The underlying bug is the fragment() incorrectly detects the quirks mode of the document. If there is no DOCTYPE, then the document should be in quirks mode.

See "Anything else" in The "initial" insertion mode.

@stevecheckoway I wanted to ask you about this behavior. Rather than try to re-compute quirks mode during fragment parsing initialization, would it be preferable to persist the quirks mode as an attribute on the Nokogiri::HTML5::Document object, and then have any fragment parsing within that document use that value?

I think this is exactly the right approach (modulo fixing my bug when there's no DOCTYPE). The rules for parsing HTML fragments explicitly say that the document created for fragment parsing has the same quirks as the context node's document.

Update: I guess one more idea is simply to have the fragment default to QUIRKS_MODE? I haven't done the work to understand when/why the document is put into quirks mode, so I'm not sure if this is appropriate ... but handle_initial does seem to default to QUIRKS.

If the context node comes from a real document, then we should do what the standard says and copy the quirks. If there is no real context node, then this is a judgment call but I think we should go with no quirks mode for two reasons.

  1. The document will be in quirks mode when a DOCTYPE is omitted but that's a parse error.
  2. p elements contain phrasing content which doesn't include table elements and the quirks mode puts the table inside the p.

This is the only place in the parser that I'm aware of that the quirks mode actually matters. (There's a limited quirks mode that never seems to matter.)

irb(main):001:0> Nokogiri::HTML5.parse('<!DOCTYPE html><p><table></table>').to_s
=> "<!DOCTYPE html><html><head></head><body><p></p><table></table></body></html>"
irb(main):002:0> Nokogiri::HTML5.parse('<p><table></table>').to_s
=> "<html><head></head><body><p><table></table></p></body></html>"

@flavorjones flavorjones modified the milestones: v1.15.0, v1.14.0 Nov 27, 2022
@flavorjones
Copy link
Member

flavorjones commented Nov 28, 2022

@stevecheckoway Thanks for the complete and thoughtful response. I'll open a PR in the next day or so and ask for your review to make sure the implementation matches this description.

flavorjones added a commit that referenced this issue Dec 20, 2022
and make sure the behavior of quirks mode matches the discussion at
issue #2646.
@flavorjones
Copy link
Member

see PR at #2736

flavorjones added a commit that referenced this issue Dec 20, 2022
and make sure the behavior of quirks mode matches the discussion at
issue #2646.
@flavorjones
Copy link
Member

Will be fixed in the v1.14.0 release!

flavorjones added a commit that referenced this issue Jun 21, 2024
This more completely implements the guidance originally agreed upon in
issue #2646, as there is no "context node com[ing] from a real
document" when a tag name is provided.
flavorjones added a commit that referenced this issue Jun 22, 2024
…ext (#3246)

**What problem is this PR intended to solve?**

Coming from the discussion at #3203, I wanted to improve the fragment
parsing API

- as discussed in #2646, parse in no-quirks mode if a context element
name is provided (not a context `Node`, just the name)
- allow passing `:context` kwarg to `DocumentFragment.new` and `.parse`
- deprecate the positional options hash to `.parse` per notes at #3200


**Have you included adequate test coverage?**

Included additional coverage for the API changes


**Does this change affect the behavior of either the C or the Java
implementations?**

HTML5 is only in CRuby.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic/gumbo Gumbo HTML5 parser
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants