-
-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SAX::Parser errors when it encounters non-predefined entities. #1926
Comments
Hey @searls, thanks for the clear bug report. I wanted to acknowledge that I saw this and let you know I will likely not be able to dig into it immediately, but I agree with your take that the described behavior is problematic. |
I did a quick look in to this. I knew I had looked in to this at one point, and now the code is jogging my memory. The SAX parser will actually build a DOM object in order to support entity substitution. I'm not sure if it actually builds the full DOM, but it definitely keeps the entity references on the document structs. In order to support this we need to make sure the I thought the easiest approach would be to initialize the SAX parser with all the default callbacks, but that doesn't seem to work because the context pointer we're using is a I think there are two approaches we could take to fix this:
I'm not sure if the first approach is possible, and the second approach sounds like a bigger patch. |
I forgot to mention, there is a callback for when an entity is encountered. But the huge bummer is that the callback is expected to return a With this patch: diff --git a/ext/nokogiri/xml_sax_parser.c b/ext/nokogiri/xml_sax_parser.c
index 1a5f6c5f..7cc25524 100644
--- a/ext/nokogiri/xml_sax_parser.c
+++ b/ext/nokogiri/xml_sax_parser.c
@@ -7,7 +7,7 @@ static ID id_start_document, id_end_document, id_start_element, id_end_element;
static ID id_start_element_namespace, id_end_element_namespace;
static ID id_comment, id_characters, id_xmldecl, id_error, id_warning;
static ID id_cdata_block, id_cAttribute;
-static ID id_processing_instruction;
+static ID id_processing_instruction, id_get_entity;
static void start_document(void * ctx)
{
@@ -251,6 +251,21 @@ static void processing_instruction(void * ctx, const xmlChar * name, const xmlCh
);
}
+static xmlEntityPtr get_entity(void * ctx, const xmlChar *name)
+{
+ VALUE rb_content;
+ VALUE self = NOKOGIRI_SAX_SELF(ctx);
+ VALUE doc = rb_iv_get(self, "@document");
+
+ rb_funcall( doc,
+ id_get_entity,
+ 1,
+ NOKOGIRI_STR_NEW2(name)
+ );
+
+ return NULL;
+}
+
static void deallocate(xmlSAXHandlerPtr handler)
{
NOKOGIRI_DEBUG_START(handler);
@@ -276,6 +291,7 @@ static VALUE allocate(VALUE klass)
handler->error = error_func;
handler->cdataBlock = cdata_block;
handler->processingInstruction = processing_instruction;
+ handler->getEntity = get_entity;
handler->initialized = XML_SAX2_MAGIC;
return Data_Wrap_Struct(klass, NULL, deallocate, handler);
@@ -303,6 +319,7 @@ void init_xml_sax_parser()
id_error = rb_intern("error");
id_warning = rb_intern("warning");
id_cdata_block = rb_intern("cdata_block");
+ id_get_entity = rb_intern("get_entity");
id_cAttribute = rb_intern("Attribute");
id_start_element_namespace = rb_intern("start_element_namespace");
id_end_element_namespace = rb_intern("end_element_namespace"); And this script: xml = <<~XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Stuff [
<!ELEMENT stuff (#PCDATA)>
<!ENTITY THING "a thing">
]>
<stuff>&THING;</stuff>
XML
require "nokogiri"
require "pp"
puts "----> parsing with SAX parser"
class StuffDoc < Nokogiri::XML::SAX::Document
def get_entity name
p [__method__, name]
end
def error(s)
p [__method__, s]
end
end
parser = Nokogiri::XML::SAX::Parser.new(StuffDoc.new)
parser.parse(xml) Output is:
|
@tenderlove Thanks for looking into this. I've got performance concerns about option 2 you outlined above. If you recall the Fairy Wing Throwdown from 2011, I suspected that the poor performance of the SAX parser was due to callbacks. Maybe we should actually benchmark what happens if we implement all the defaults so we know for sure if this is reasonable to do? |
You have to implement the |
OK, so I have a branch where I've implemented a default diff --git a/ext/nokogiri/xml_sax_parser.c b/ext/nokogiri/xml_sax_parser.c
index 989ad9eb..622e8159 100644
--- a/ext/nokogiri/xml_sax_parser.c
+++ b/ext/nokogiri/xml_sax_parser.c
@@ -265,6 +265,12 @@ processing_instruction(void *ctx, const xmlChar *name, const xmlChar *content)
);
}
+static xmlEntityPtr
+get_entity(void *ctx, const xmlChar *name)
+{
+ return xmlSAX2GetEntity(NOKOGIRI_SAX_CTXT(ctx), name);
+}
+
static size_t
memsize(const void *data)
{
@@ -300,6 +306,7 @@ allocate(VALUE klass)
handler->cdataBlock = cdata_block;
handler->processingInstruction = processing_instruction;
handler->initialized = XML_SAX2_MAGIC;
+ handler->getEntity = get_entity;
return self;
} The good news is that with this change, errors are no longer reported for entities that are properly declared in the DTD. The less-good news is that the expansion of the entity is passed to the This change would have some implications for the design of @searls what are your thoughts here? Using Nokogiri with above patch, I can make all eiwa tests pass by applying this patch: diff --git a/lib/eiwa/jmdict/doc.rb b/lib/eiwa/jmdict/doc.rb
index 8d2d4155..eeb91880 100644
--- a/lib/eiwa/jmdict/doc.rb
+++ b/lib/eiwa/jmdict/doc.rb
@@ -62,15 +62,7 @@ def characters(s)
# end
def error(msg)
- if (matches = msg.match(/Entity '(\S+)' not defined/))
- # See: http://github.com/sparklemotion/nokogiri/issues/1926
- code = matches[1]
- @current.set_entity(code, ENTITIES[code])
- elsif msg == "Detected an entity reference loop\n"
- # Do nothing and hope this does not matter.
- else
- raise Eiwa::Error.new("Parsing error: #{msg}")
- end
+ raise Eiwa::Error.new("Parsing error: #{msg}")
end
# def cdata_block string
diff --git a/lib/eiwa/jmdict/entities.rb b/lib/eiwa/jmdict/entities.rb
index cf218d60..553e952c 100644
--- a/lib/eiwa/jmdict/entities.rb
+++ b/lib/eiwa/jmdict/entities.rb
@@ -13,7 +13,7 @@ module Jmdict
"adj-ku" => "`ku' adjective (archaic)",
"adj-na" => "adjectival nouns or quasi-adjectives (keiyodoshi)",
"adj-nari" => "archaic/formal form of na-adjective",
- "adj-no" => "nouns which may take the genitive case particle `no'",
+ "adj-no" => "nouns which may take the genitive case particle 'no'",
"adj-pn" => "pre-noun adjectival (rentaishi)",
"adj-shiku" => "`shiku' adjective (archaic)",
"adj-t" => "`taru' adjective",
@@ -34,7 +34,7 @@ module Jmdict
"chem" => "chemistry term",
"chn" => "children's language",
"col" => "colloquialism",
- "comp" => "computer terminology",
+ "comp" => "computing",
"conj" => "conjunction",
"cop" => "copula",
"cop-da" => "copula",
@@ -101,6 +101,7 @@ module Jmdict
"quote" => "quotation",
"rare" => "rare",
"rkb" => "Ryuukyuu-ben",
+ "rK" => "rarely used kanji form",
"sens" => "sensitive",
"shogi" => "shogi term",
"sl" => "slang",
diff --git a/lib/eiwa/tag/entity.rb b/lib/eiwa/tag/entity.rb
index 12f4f8c7..a75263a9 100644
--- a/lib/eiwa/tag/entity.rb
+++ b/lib/eiwa/tag/entity.rb
@@ -4,10 +4,13 @@ class Entity < Any
attr_reader :code, :text
def initialize(code: nil, text: nil)
- @code = code
@text = text
end
+ def add_characters(s)
+ @text = s
+ end
+
def set_entity(code, text)
@code = code
@text = text |
IIRC, custom SAX parsers can only work in entity replacement mode (XML_PARSE_NOENT). Without this option, the callback sequence is a bit nonsensical. |
@nwellnhof Just to make sure I understand your meaning -- are you saying that there's no way to avoid the |
Yes, but when substituting entities, this should be what you want. |
@flavorjones this seems about right? As long as I'm able to retrieve the code (i.e. "uk"), I'm happy. |
@searls Sorry for my delayed response. To be clear, the above proposal does NOT give you access to the entity name (code). I think we have a few options that might be able to solve this problem fully:
WDYT? I'm leaning towards (1) which seems aligned with libxml2's design and I think is more generally useful. |
I think the first course of action would be best, as it would improve the utility of SAX for all libxml2 users. Good idea! |
I'd also suggest a default implementation of the |
deep breath this has been a real journey! I've got a branch (which just needs some final polish) that
I've also fixed some quirky behavior in the JRuby impl along the way. As a result, the fix to eiwa is something like: diff --git a/lib/eiwa/jmdict/doc.rb b/lib/eiwa/jmdict/doc.rb
index 8d2d4155..6d554119 100644
--- a/lib/eiwa/jmdict/doc.rb
+++ b/lib/eiwa/jmdict/doc.rb
@@ -53,6 +53,10 @@ def characters(s)
@current.add_characters(s)
end
+ def reference(name, content)
+ @current.set_entity(name, content)
+ end
+
# def comment string
# puts "comment #{string}"
# end which, since it does nothing on older versions of Nokogiri, is backwards-compatible. (I will submit a PR to eiwa shortly.) I hope to have that branch up in a PR tomorrow! 😅 Thanks for your patience. |
Nice! Sounds like a big win from a tech debt perspective |
On CRuby, this fixes the fact that the parser was registering errors when encountering general (non-predefined) entities. Now these entities are resolved properly and converted into `#characters` callbacks. Fixes #1926. On JRuby, the SAX parser now respects the `#replace_entities` attribute, which was previously ignored AND defaulted incorrectly to `true`. The default now matches CRuby -- `false` -- and the parser behavior matches CRuby with respect to entities. Fixes #614. This commit also includes some granular tests of how the sax parser handles different entities under different circumstances, which should be clarifying for user reports like #1284 and #1500 that expect predefined entities and character references to be treated like parsed entities (which they aren't).
The behavior here is relatively complex, being a function of entity type and `#replace_entities` value, but there are sufficient tests and both Java and C impls behave identically. Related to the problem described at #1926.
PR is at #3265 |
This works around the issues reported at: - sparklemotion/nokogiri#1926 - sparklemotion/nokogiri#3147 Closes searls#10.
The behavior here is relatively complex, being a function of entity type and `#replace_entities` value, but there are sufficient tests and both Java and C impls behave identically. Related to the problem described at #1926.
**What problem is this PR intended to solve?** #1926 described an issue wherein the SAX parser was not correctly resolving and replacing internal entities, and was instead reporting an error for each entity reference. This PR includes a fix for that problem. I've removed the unnecessary "SAX tuple" from the SAX implementation, replacing it with the `_private` struct member that libxml2 makes available. Then I set up the parser context structs so that we can use libxml2's standard SAX callbacks where they're useful (which is how I addressed the above issue). This PR also introduces a new feature, a SAX handler callback `Document#reference` which allows callers to get entity-specific name and replacement text information (rather than relying on the `Document#characters` callback). This can be used to solve the original issue in #1926 with this code: searls/eiwa#11 The behavior of the SAX parser with respect to entities is complex enough that I wrote up a short doc in the `XML::SAX::Document` docstring with a table and explanation. I've also added warnings to remind users that `#replace_entities` is not safe to set when parsing untrusted documents. In the Java implementation, I've fixed the `#replace_entities` option in the SAX parser context and set it to the proper default (`false`), fixing #614. I've also corrected the value of the URI argument to `Document#start_element_namespace` which was a blank string when it should have been `nil`. I've added quite a bit of testing around the SAX parser's handling of entities. I added and clarified quite a bit of documentation around SAX parsing generally. Exception messages have been clarified in a couple of places, and made consistent between the C and Java implementations. This should address questions asked in issues #1500 and #1284. Finally, I cleaned up some of the C code that implements SAX parsing, naming functions more explicitly (and moving towards some kind of standard naming convention). Closes #1926. Closes #614. **Have you included adequate test coverage?** Yes! **Does this change affect the behavior of either the C or the Java implementations?** Yes, but the implementations are much more consistent with each other now.
Describe the bug
When an XML document contains non-predefined entities—even if the document defines those entities up-front—it will error when parsing with nokogiri's SAX parser.
Note that this warning from libxml2's docs seem to hint that getting this right is hard:
To Reproduce
When run, this will output:
Expected behavior
I honestly just don't want this to explode. I'd prefer to get a literal string of the entity (e.g.
"&THING;"
in this case.Environment
Additional context
This is a real problem for one important document, the JMDict XML file, which is a daily export of the most prominent community-maintained Japanese-English dictionary on the Internet. JMDict uses dozens of custom entities for tagging entries with various metadata. However, because the file is over 100MB, it's more appropriate for SAX parsing, which is how folks might run into this problem. (One example)
The text was updated successfully, but these errors were encountered: