SAX::Parser errors when it encounters non-predefined entities. #1926

searls · 2019-09-22T03:40:55Z

Describe the bug

When an XML document contains non-predefined entities—even if the document defines those entities up-front—it will error when parsing with nokogiri's SAX parser.

Note that this warning from libxml2's docs seems to hint that getting this right is hard:

WARNING: handling entities on top of the libxml2 SAX interface is difficult!!! If you plan to use non-predefined entities in your documents, then the learning curve to handle then using the SAX API may be long. If you plan to use complex documents, I strongly suggest you consider using the DOM interface instead and let libxml deal with the complexity rather than trying to do it yourself.

To Reproduce

#! /usr/bin/env ruby

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE Stuff [
  <!ELEMENT stuff (#PCDATA)>
  <!ENTITY THING "a thing">
  ]>
  <stuff>&THING;</stuff>
XML

require "nokogiri"
require "pp"

puts "----> parsing with DOM parser"
doc = Nokogiri::XML.parse(xml)
pp doc

puts "----> parsing with SAX parser"
class StuffDoc < Nokogiri::XML::SAX::Document
  def error(s)
    raise s
  end
end

Nokogiri::XML::SAX::Parser.new(StuffDoc.new).parse(xml)

When run, this will output:

----> parsing with DOM parser
#(Document:0x3fd9cdca51ac {
  name = "document",
  children = [
    #(DTD:0x3fd9cdca96a8 {
      name = "Stuff",
      children = [
        #(ElementDecl:0x3fd9cdca862c { name = "stuff" }),
        #(EntityDecl:0x3fd9cdcad8ac {
          name = "THING",
          children = [ #(Text "a thing")]
          })]
      }),
    #(Element:0x3fd9cdcac86c {
      name = "stuff",
      children = [ #(EntityReference:0x3fd9cdcb1f9c { name = "THING" })]
      })]
  })
----> parsing with SAX parser
Traceback (most recent call last):
	4: from demo.rb:24:in `<main>'
	3: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:83:in `parse'
	2: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:110:in `parse_memory'
	1: from /Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/lib/nokogiri/xml/sax/parser.rb:110:in `parse_with'
demo.rb:20:in `error': Entity 'THING' not defined (RuntimeError)

Expected behavior

I honestly just don't want this to explode. I'd prefer to get a literal string of the entity (e.g. "&THING;" in this case).

Environment

# Nokogiri (1.10.4)
    ---
    warnings: []
    nokogiri: 1.10.4
    ruby:
      version: 2.6.3
      platform: x86_64-darwin18
      description: ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-darwin18]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/ports/x86_64-apple-darwin18.6.0/libxml2/2.9.9"
      libxslt_path: "/Users/justin/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/nokogiri-1.10.4/ports/x86_64-apple-darwin18.6.0/libxslt/1.1.33"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      libxslt_patches:
      - 0001-Fix-security-framework-bypass.patch
      compiled: 2.9.9
      loaded: 2.9.9

Additional context

This is a real problem for one important document, the JMDict XML file, which is a daily export of the most prominent community-maintained Japanese-English dictionary on the Internet. JMDict uses dozens of custom entities for tagging entries with various metadata. However, because the file is over 100MB, it's more appropriate for SAX parsing, which is how folks might run into this problem. (One example)


flavorjones · 2019-09-22T17:52:42Z

Hey @searls, thanks for the clear bug report. I wanted to acknowledge that I saw this and let you know I will likely not be able to dig into it immediately, but I agree with your take that the described behavior is problematic.

tenderlove · 2019-09-23T16:03:56Z

I took a quick look into this. I knew I had looked into it at one point, and now the code is jogging my memory. The SAX parser will actually build a DOM object in order to support entity substitution. I'm not sure if it actually builds the full DOM, but it definitely keeps the entity references on the document structs. In order to support this we need to make sure the myDoc part of the struct is filled out.

I thought the easiest approach would be to initialize the SAX parser with all the default callbacks, but that doesn't seem to work because the context pointer we're using is a nokogiriSAXTuple * where I think all the default callbacks are expecting an xmlParserCtxtPtr.

I think there are two approaches we could take to fix this:

1. Embed xmlParserCtxt at the head of our nokogiriSAXTuple struct
2. Implement all the default SAX callbacks and unwrap the nokogiriSAXTuple

I'm not sure if the first approach is possible, and the second approach sounds like a bigger patch.

tenderlove · 2019-09-23T16:17:08Z

I forgot to mention, there is a callback for when an entity is encountered. But the huge bummer is that the callback is expected to return an xmlEntityPtr. We can't simply return NULL, otherwise you'll get the same error (but with a callback executed!).

With this patch:

diff --git a/ext/nokogiri/xml_sax_parser.c b/ext/nokogiri/xml_sax_parser.c
index 1a5f6c5f..7cc25524 100644
--- a/ext/nokogiri/xml_sax_parser.c
+++ b/ext/nokogiri/xml_sax_parser.c
@@ -7,7 +7,7 @@ static ID id_start_document, id_end_document, id_start_element, id_end_element;
 static ID id_start_element_namespace, id_end_element_namespace;
 static ID id_comment, id_characters, id_xmldecl, id_error, id_warning;
 static ID id_cdata_block, id_cAttribute;
-static ID id_processing_instruction;
+static ID id_processing_instruction, id_get_entity;
 
 static void start_document(void * ctx)
 {
@@ -251,6 +251,21 @@ static void processing_instruction(void * ctx, const xmlChar * name, const xmlCh
   );
 }
 
+static xmlEntityPtr get_entity(void * ctx, const xmlChar *name)
+{
+  VALUE rb_content;
+  VALUE self = NOKOGIRI_SAX_SELF(ctx);
+  VALUE doc = rb_iv_get(self, "@document");
+
+  rb_funcall( doc,
+              id_get_entity,
+              1,
+              NOKOGIRI_STR_NEW2(name)
+  );
+
+  return NULL;
+}
+
 static void deallocate(xmlSAXHandlerPtr handler)
 {
   NOKOGIRI_DEBUG_START(handler);
@@ -276,6 +291,7 @@ static VALUE allocate(VALUE klass)
   handler->error = error_func;
   handler->cdataBlock = cdata_block;
   handler->processingInstruction = processing_instruction;
+  handler->getEntity = get_entity;
   handler->initialized = XML_SAX2_MAGIC;
 
   return Data_Wrap_Struct(klass, NULL, deallocate, handler);
@@ -303,6 +319,7 @@ void init_xml_sax_parser()
   id_error          = rb_intern("error");
   id_warning        = rb_intern("warning");
   id_cdata_block    = rb_intern("cdata_block");
+  id_get_entity     = rb_intern("get_entity");
   id_cAttribute     = rb_intern("Attribute");
   id_start_element_namespace = rb_intern("start_element_namespace");
   id_end_element_namespace = rb_intern("end_element_namespace");

And this script:

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE Stuff [
  <!ELEMENT stuff (#PCDATA)>
  <!ENTITY THING "a thing">
  ]>
  <stuff>&THING;</stuff>
XML

require "nokogiri"
require "pp"

puts "----> parsing with SAX parser"
class StuffDoc < Nokogiri::XML::SAX::Document
  def get_entity name
    p [__method__, name]
  end

  def error(s)
    p [__method__, s]
  end
end

parser = Nokogiri::XML::SAX::Parser.new(StuffDoc.new)
parser.parse(xml)

Output is:

----> parsing with SAX parser
[:get_entity, "THING"]
[:get_entity, "THING"]
[:error, "Entity 'THING' not defined\n"]

flavorjones · 2019-09-27T13:04:59Z

@tenderlove Thanks for looking into this. I've got performance concerns about option 2 you outlined above. If you recall the Fairy Wing Throwdown from 2011, I suspected that the poor performance of the SAX parser was due to callbacks. Maybe we should actually benchmark what happens if we implement all the defaults so we know for sure if this is reasonable to do?

nwellnhof · 2023-12-27T13:44:24Z

You have to implement the getEntity callback in libxml2's SAX interface. This should return a pointer to an xmlEntity struct which can be created with xmlNewEntity(). Make sure to set the doc argument to NULL, so the document's entity table won't be polluted. etype should be XML_INTERNAL_PREDEFINED_ENTITY for literal string or XML_INTERNAL_GENERAL_ENTITY for XML data to be parsed. To simplify memory management, it should be safe to free the old struct if getEntity is invoked again.

flavorjones · 2024-03-18T02:12:30Z

OK, so I have a branch where I've implemented a default getEntity callback that invokes libxml2's xmlSAX2GetEntity to return the correct entity from the document:

diff --git a/ext/nokogiri/xml_sax_parser.c b/ext/nokogiri/xml_sax_parser.c
index 989ad9eb..622e8159 100644
--- a/ext/nokogiri/xml_sax_parser.c
+++ b/ext/nokogiri/xml_sax_parser.c
@@ -265,6 +265,12 @@ processing_instruction(void *ctx, const xmlChar *name, const xmlChar *content)
             );
 }
 
+static xmlEntityPtr
+get_entity(void *ctx, const xmlChar *name)
+{
+  return xmlSAX2GetEntity(NOKOGIRI_SAX_CTXT(ctx), name);
+}
+
 static size_t
 memsize(const void *data)
 {
@@ -300,6 +306,7 @@ allocate(VALUE klass)
   handler->cdataBlock = cdata_block;
   handler->processingInstruction = processing_instruction;
   handler->initialized = XML_SAX2_MAGIC;
+  handler->getEntity = get_entity;
 
   return self;
 }

The good news is that with this change, errors are no longer reported for entities that are properly declared in the DTD.

The less-good news is that the expansion of the entity is passed to the characters callback, meaning that detecting it as an entity as such is not easy. We could add a callback Nokogiri::XML::SAX::Document#get_entity which could be passed the name and the expansion, but there doesn't seem to be a way to prevent libxml2 from invoking the characters callback.

This change would have some implications for the design of eiwa which uses the Eiwa::Tag::Entity class.

@searls what are your thoughts here? Using Nokogiri with above patch, I can make all eiwa tests pass by applying this patch:

diff --git a/lib/eiwa/jmdict/doc.rb b/lib/eiwa/jmdict/doc.rb
index 8d2d4155..eeb91880 100644
--- a/lib/eiwa/jmdict/doc.rb
+++ b/lib/eiwa/jmdict/doc.rb
@@ -62,15 +62,7 @@ def characters(s)
       # end
 
       def error(msg)
-        if (matches = msg.match(/Entity '(\S+)' not defined/))
-          # See: http://github.com/sparklemotion/nokogiri/issues/1926
-          code = matches[1]
-          @current.set_entity(code, ENTITIES[code])
-        elsif msg == "Detected an entity reference loop\n"
-          # Do nothing and hope this does not matter.
-        else
-          raise Eiwa::Error.new("Parsing error: #{msg}")
-        end
+        raise Eiwa::Error.new("Parsing error: #{msg}")
       end
 
       # def cdata_block string
diff --git a/lib/eiwa/jmdict/entities.rb b/lib/eiwa/jmdict/entities.rb
index cf218d60..553e952c 100644
--- a/lib/eiwa/jmdict/entities.rb
+++ b/lib/eiwa/jmdict/entities.rb
@@ -13,7 +13,7 @@ module Jmdict
       "adj-ku" => "`ku' adjective (archaic)",
       "adj-na" => "adjectival nouns or quasi-adjectives (keiyodoshi)",
       "adj-nari" => "archaic/formal form of na-adjective",
-      "adj-no" => "nouns which may take the genitive case particle `no'",
+      "adj-no" => "nouns which may take the genitive case particle 'no'",
       "adj-pn" => "pre-noun adjectival (rentaishi)",
       "adj-shiku" => "`shiku' adjective (archaic)",
       "adj-t" => "`taru' adjective",
@@ -34,7 +34,7 @@ module Jmdict
       "chem" => "chemistry term",
       "chn" => "children's language",
       "col" => "colloquialism",
-      "comp" => "computer terminology",
+      "comp" => "computing",
       "conj" => "conjunction",
       "cop" => "copula",
       "cop-da" => "copula",
@@ -101,6 +101,7 @@ module Jmdict
       "quote" => "quotation",
       "rare" => "rare",
       "rkb" => "Ryuukyuu-ben",
+      "rK" => "rarely used kanji form",
       "sens" => "sensitive",
       "shogi" => "shogi term",
       "sl" => "slang",
diff --git a/lib/eiwa/tag/entity.rb b/lib/eiwa/tag/entity.rb
index 12f4f8c7..a75263a9 100644
--- a/lib/eiwa/tag/entity.rb
+++ b/lib/eiwa/tag/entity.rb
@@ -4,10 +4,13 @@ class Entity < Any
       attr_reader :code, :text
 
       def initialize(code: nil, text: nil)
-        @code = code
         @text = text
       end
 
+      def add_characters(s)
+        @text = s
+      end
+
       def set_entity(code, text)
         @code = code
         @text = text

nwellnhof · 2024-03-18T03:22:42Z

IIRC, custom SAX parsers can only work in entity replacement mode (XML_PARSE_NOENT). Without this option, the callback sequence is a bit nonsensical.

flavorjones · 2024-03-18T13:39:27Z

@nwellnhof Just to make sure I understand your meaning -- are you saying that there's no way to avoid the characters callback being invoked for entities?

nwellnhof · 2024-03-18T14:21:08Z

> are you saying that there's no way to avoid the characters callback being invoked for entities?

Yes, but when substituting entities, this should be what you want.

searls · 2024-03-19T12:32:08Z

@flavorjones this seems about right? As long as I'm able to retrieve the code (i.e. "uk"), I'm happy.

flavorjones · 2024-06-23T16:22:29Z

@searls Sorry for my delayed response. To be clear, the above proposal does NOT give you access to the entity name (code).

I think we have a few options that might be able to solve this problem fully:

1. implement the entityDecl callback in libxml2's SAX model, which will get invoked when the entity declaration is processed. Eiwa can then build a hash table, and when Entity#characters is called, look it up in the hash table to find the code.
2. implement a new callback in nokogiri for when an entity is encountered and resolved in a document. Unfortunately we can't prevent libxml2 from also invoking the characters callback, which isn't a problem for Eiwa, but also makes this callback not very useful in the general case.
3. not sure if this is a reasonable suggestion, but would you consider dropping @code from the Eiwa::Tag::Entity class? Do you know if this is used by Eiwa users?

WDYT? I'm leaning towards (1) which seems aligned with libxml2's design and I think is more generally useful.
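
For illustration only, option (1) might look roughly like this on the handler side. This is a minimal sketch, not code from Nokogiri or Eiwa: the module name `EntityCodeLookup` and the callback name `entity_decl` are assumptions made up for this example, and the callbacks are invoked by hand instead of by a parser. The entity names and texts are taken from the Eiwa ENTITIES table quoted earlier in this thread.

```ruby
# Hypothetical mixin: `entity_decl` is an assumed callback name, standing in
# for whatever a binding of libxml2's entityDecl callback would be called.
module EntityCodeLookup
  # Assumed to run once per <!ENTITY name "content"> declaration.
  def entity_decl(name, content)
    codes_by_text[content] << name
  end

  # Receives the entity's expansion as plain text, so the code has to be
  # recovered by looking the text up again.
  def characters(text)
    resolved << [codes_by_text.fetch(text, []), text]
  end

  # Expansion text => list of entity names that expand to it.
  def codes_by_text
    @codes_by_text ||= Hash.new { |hash, key| hash[key] = [] }
  end

  def resolved
    @resolved ||= []
  end
end

# Simulate the callback sequence by hand (no parser involved).
handler = Class.new { include EntityCodeLookup }.new
handler.entity_decl("conj", "conjunction")
handler.entity_decl("cop", "copula")
handler.entity_decl("cop-da", "copula")
handler.characters("conjunction")
handler.characters("copula")
p handler.resolved
# => [[["conj"], "conjunction"], [["cop", "cop-da"], "copula"]]
```

One caveat the sketch makes visible: a reverse lookup by expansion text is ambiguous whenever two entities share the same text, as "cop" and "cop-da" do here, so the handler can only narrow the code down to a list of candidates.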

searls · 2024-06-23T21:27:33Z

I think the first course of action would be best, as it would improve the utility of SAX for all libxml2 users. Good idea!

nwellnhof · 2024-06-24T18:23:22Z

I'd also suggest a default implementation of the entityDecl and getEntity callbacks similar to the following code in libxml2's test suite: https://gitlab.gnome.org/GNOME/libxml2/-/blob/master/runtest.c?ref_type=heads#L721

flavorjones · 2024-06-30T18:55:57Z

*deep breath* this has been a real journey!

I've got a branch (which just needs some final polish) that:

- removes our crufty SAX-tuple struct and replaces it with code that is compatible with libxml2's default SAX handlers
- uses some of the default SAX handlers to ensure DTDs, entity declarations, and entity references are handled properly (fixing this issue)
- implements a new SAX callback, Document#reference, which is invoked for non-predefined entities when #replace_entities is false (the default)

I've also fixed some quirky behavior in the JRuby impl along the way.

As a result, the fix to eiwa is something like:

diff --git a/lib/eiwa/jmdict/doc.rb b/lib/eiwa/jmdict/doc.rb
index 8d2d4155..6d554119 100644
--- a/lib/eiwa/jmdict/doc.rb
+++ b/lib/eiwa/jmdict/doc.rb
@@ -53,6 +53,10 @@ def characters(s)
         @current.add_characters(s)
       end

+      def reference(name, content)
+        @current.set_entity(name, content)
+      end
+
       # def comment string
       #   puts "comment #{string}"
       # end

which, since it does nothing on older versions of Nokogiri, is backwards-compatible. (I will submit a PR to eiwa shortly.)
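
To make the backwards-compatibility point concrete, here is a small sketch of a handler that defines both callbacks. The `reference(name, content)` signature comes from the diff above; the module name `EntityAwareText` and the `events` accumulator are assumptions invented for this example, and the callbacks are called by hand so the snippet runs without Nokogiri installed.

```ruby
# Hypothetical mixin intended for a Nokogiri::XML::SAX::Document subclass.
module EntityAwareText
  # Ordinary character data.
  def characters(text)
    events << [:text, text]
  end

  # Entity name plus replacement text. A Nokogiri version without this
  # callback never calls the method, so defining it is harmless there.
  def reference(name, content)
    events << [:entity, name, content]
  end

  def events
    @events ||= []
  end
end

# Simulate the two kinds of callbacks by hand (no parser involved).
handler = Class.new { include EntityAwareText }.new
handler.characters("some text")
handler.reference("THING", "a thing")
p handler.events
# => [[:text, "some text"], [:entity, "THING", "a thing"]]
```

With Nokogiri loaded, the same module could presumably be included in a subclass of Nokogiri::XML::SAX::Document and passed to Nokogiri::XML::SAX::Parser.new as in the reproduction script at the top of this issue.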

I hope to have that branch up in a PR tomorrow! 😅 Thanks for your patience.

searls · 2024-07-01T13:20:45Z

Nice! Sounds like a big win from a tech debt perspective

On CRuby, this fixes the fact that the parser was registering errors when encountering general (non-predefined) entities. Now these entities are resolved properly and converted into `#characters` callbacks. Fixes #1926. On JRuby, the SAX parser now respects the `#replace_entities` attribute, which was previously ignored AND defaulted incorrectly to `true`. The default now matches CRuby -- `false` -- and the parser behavior matches CRuby with respect to entities. Fixes #614. This commit also includes some granular tests of how the sax parser handles different entities under different circumstances, which should be clarifying for user reports like #1284 and #1500 that expect predefined entities and character references to be treated like parsed entities (which they aren't).

The behavior here is relatively complex, being a function of entity type and `#replace_entities` value, but there are sufficient tests and both Java and C impls behave identically. Related to the problem described at #1926.

flavorjones · 2024-07-01T20:59:53Z

PR is at #3265

**What problem is this PR intended to solve?**

#1926 described an issue wherein the SAX parser was not correctly resolving and replacing internal entities, and was instead reporting an error for each entity reference. This PR includes a fix for that problem.

I've removed the unnecessary "SAX tuple" from the SAX implementation, replacing it with the `_private` struct member that libxml2 makes available. Then I set up the parser context structs so that we can use libxml2's standard SAX callbacks where they're useful (which is how I addressed the above issue).

This PR also introduces a new feature, a SAX handler callback `Document#reference` which allows callers to get entity-specific name and replacement text information (rather than relying on the `Document#characters` callback). This can be used to solve the original issue in #1926 with this code: searls/eiwa#11

The behavior of the SAX parser with respect to entities is complex enough that I wrote up a short doc in the `XML::SAX::Document` docstring with a table and explanation. I've also added warnings to remind users that `#replace_entities` is not safe to set when parsing untrusted documents.

In the Java implementation, I've fixed the `#replace_entities` option in the SAX parser context and set it to the proper default (`false`), fixing #614. I've also corrected the value of the URI argument to `Document#start_element_namespace` which was a blank string when it should have been `nil`.

I've added quite a bit of testing around the SAX parser's handling of entities. I added and clarified quite a bit of documentation around SAX parsing generally. Exception messages have been clarified in a couple of places, and made consistent between the C and Java implementations. This should address questions asked in issues #1500 and #1284.

Finally, I cleaned up some of the C code that implements SAX parsing, naming functions more explicitly (and moving towards some kind of standard naming convention).

Closes #1926. Closes #614.

**Have you included adequate test coverage?**

Yes!

**Does this change affect the behavior of either the C or the Java implementations?**

Yes, but the implementations are much more consistent with each other now.

flavorjones added the needs/research label Sep 22, 2019

flavorjones added the topic/entities label Sep 24, 2021

flavorjones mentioned this issue Sep 24, 2021

Add failing test case to keep entities in sax parsers #1500

Closed

searls mentioned this issue Mar 12, 2024

[bug] After encountering 100 unknown XML entities, the SAX parser stops calling Nokogiri::XML::SAX::Document#error #3147

Closed

flavorjones mentioned this issue Mar 18, 2024

Misc. codes are no longer being read correctly searls/eiwa#10

Closed

flavorjones added this to the v1.17.0 milestone Jun 30, 2024

flavorjones mentioned this issue Jun 30, 2024

replace_entities ignored for SAX parser (nokogiri 1.5.0-java / jruby 1.6.6) #614

Closed

flavorjones mentioned this issue Jul 1, 2024

fix, feat, docs: improve sax parser entity handling #3265

Merged

flavorjones mentioned this issue Jul 1, 2024

sax parser with replace_entities = false still replaces entities #1284

Closed

flavorjones added a commit to flavorjones/eiwa that referenced this issue Jul 1, 2024

fix: in Nokogiri >= v1.17.0, use SAX::Document#reference

04cdbda

This works around the issues reported at: - sparklemotion/nokogiri#1926 - sparklemotion/nokogiri#3147 Closes searls#10.

flavorjones mentioned this issue Jul 1, 2024

fix: in Nokogiri >= v1.17.0, use SAX::Document#reference searls/eiwa#11

Merged

flavorjones closed this as completed in #3265 Jul 2, 2024

flavorjones closed this as completed in 0d157f5 Jul 2, 2024
