substitute_entities #67
Comments
I just realised that attributes are returned as UTF-8, which makes it rather inconsistent to return entity-encoded strings for text-nodes:
Hi! Thanks for using Nokogiri, and for reporting your issues. We love hearing from our users. I'm sorry I didn't respond sooner, but when a test case is not provided, it often means we need to schedule some time to reproduce the issues you're talking about.
First, I completely agree with you that setting such behavior globally is a Really Bad Idea. Unfortunately, since Nokogiri uses libxml2, we are bound to its API, and as you can see here and read about here, this is simply how libxml2 implements the functionality, and there's not much we can do about that in the short term.
Next, you are absolutely correct that passing 'true' results in an error. For now, passing a 1 or a 0 should be sufficient. However, it does appear that this functionality is broken in at least some versions of libxml2. I'll double check with a straight-C program, and let you know if I can get it to work at all.
Lastly, I am a little confused. You seem to be conflating the issue of entity-escaping with encoding. These are two distinct functions. You can choose your own encoding when you serialize the document. If you want UTF-8 in your above example, simply run:
If you have a specific case (other than libxml2's apparently-broken default entity escaping setting) in which Nokogiri isn't behaving the way you expect it to, please send us a failing test which is explicit about what your expectations are.
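To make the escaping-versus-encoding distinction above concrete, here is a minimal sketch in plain Ruby. It uses only the standard String#encode with its xml: :text option and does not call Nokogiri or libxml2, so it shows the general idea rather than what either library does internally. The sample string is made up for illustration.

```ruby
# Plain-Ruby illustration (no Nokogiri involved): escaping and encoding
# are separate steps. The sample text is an assumption for this sketch.
text = "caf\u00e9 & cr\u00e8me"

# Escaping only: the markup-significant "&" becomes an entity, while the
# accented characters stay as ordinary UTF-8 characters.
escaped_utf8 = text.encode("UTF-8", xml: :text)

# Escaping plus a target encoding that cannot represent the accented
# characters: those fall back to hexadecimal character references.
escaped_ascii = text.encode("US-ASCII", xml: :text)

puts escaped_utf8
puts escaped_ascii
```

The "&" is escaped in both results because it is significant in markup. The accented characters only turn into character references when the chosen output encoding has no way to represent them.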
Commit 7c06969 fixes doc strings and includes a (failing) substitute_entities test.
I see. In that case, it would probably be a good idea to dictate one behaviour or the other. That way, people won't run into incompatibilities.
I assume that what happens internally is that if you try to serialize a string that can't be represented in the target encoding (e.g. iso-8859-1, utf-8 etc.), you get entities instead. So when you simply call
Removing substitute_entities= and load_external_subsets=. There are other non-global ways to do this. Closed by 0a7479a.
According to the documentation, you can change the internal string encoding that Nokogiri uses. There are a couple of problems with this.
First, it doesn't work as advertised. If you set substitute_entities to a boolean, you get an error. If you set it to an integer, it doesn't change behaviour. I'm sure this can be fixed somehow.
The second problem is more severe. Setting such behaviour globally is a really bad idea (tm). If two modules both use Nokogiri and expect different behaviour, they would not be able to coexist. You should at the very least move this setting to the document level, and perhaps even provide an alternative to Node#to_s that returns a string in an explicit encoding. In any case, I was rather confused by the choice of html-escaped encoding for strings. I would have expected utf-8 to be the default.
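The coexistence problem described above can be sketched in plain Ruby. Everything here is hypothetical and exists only for illustration: GlobalConfig and SketchDocument are invented names, not Nokogiri classes, and the flag does nothing except record a preference.

```ruby
# Hypothetical names throughout; none of this is Nokogiri API.

# A process-wide setting, shared by every caller.
module GlobalConfig
  class << self
    attr_accessor :substitute_entities
  end
  self.substitute_entities = false
end

# A per-document setting, owned by whoever created the document.
class SketchDocument
  attr_reader :substitute_entities

  def initialize(substitute_entities: false)
    @substitute_entities = substitute_entities
  end
end

# Two libraries sharing the global flag: the last writer wins.
GlobalConfig.substitute_entities = true   # library A's preference
GlobalConfig.substitute_entities = false  # library B silently overrides it
seen_by_a = GlobalConfig.substitute_entities

# With a document-level flag, each library keeps its own choice.
doc_a = SketchDocument.new(substitute_entities: true)
doc_b = SketchDocument.new(substitute_entities: false)

puts "global flag as seen by A: #{seen_by_a}"
puts "doc_a: #{doc_a.substitute_entities}, doc_b: #{doc_b.substitute_entities}"
```

In the global case, library A ends up running with library B's preference without knowing it. In the document-level case, the two preferences never interact.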