Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing XML preceded by Unicode BOM triggers an error #155

Closed
dylanmckay opened this issue Aug 22, 2017 · 12 comments
Closed

Parsing XML preceded by Unicode BOM triggers an error #155

dylanmckay opened this issue Aug 22, 2017 · 12 comments

Comments

@dylanmckay
Copy link

When loading an XML file that begins with a byte order mark, xml-rs raises a parsing error.

Found indirectly when using the xmltree crate

Here is the first few bytes of my XML file (cat ATA6616C.atdf|xxd|head -n2)

00000000: efbb bf3c 3f78 6d6c 2076 6572 7369 6f6e  ...<?xml version
00000010: 3d27 312e 3027 2065 6e63 6f64 696e 673d  ='1.0' encoding=
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: MalformedXml(Error { pos: 1:1, kind: Syntax("Unexpected characters outside the root element: \u{feff}") })',
/checkout/src/libcore/result.rs:860
dylanmckay added a commit to avr-rust/avrd that referenced this issue Aug 22, 2017
This would cause a parsing error because the XML library does not handle
BOMs.

Required to work around netvl/xml-rs#155.
@dylanmckay
Copy link
Author

You can see the entire files that reproduce the error in the commit I just referenced above this comment.

@wucke13
Copy link

wucke13 commented Jul 1, 2019

Having the same problem

Error: Error { pos: 1:1, kind: Syntax("Unexpected characters outside the root element: \u{feff}") }

@lovasoa
Copy link

lovasoa commented May 19, 2020

Hello ! An user recently reported a bug to me that tracks down to this. I see there was a pull request opened in 2017 that has never been merged. @netvl, would you be ready to merge a PR that would implement the approach you mention in the PR of just adding a thin Read wrapper that ignores leading byte order marks ?

@netvl
Copy link
Owner

netvl commented May 20, 2020

@lovasoa if you want to create such a wrapper, you don't really need to add it to this library - it would be much easier to publish it as a standalone crate, because it certainly has utility outside of the domain of XML parsing.

As I mentioned elsewhere, in the reimplementation branch this problem will be fixed automatically due to how encoding detection with encoding_rs works.

@lovasoa
Copy link

lovasoa commented May 20, 2020

Yes, I understand I can work around this bug in my own programs, but I was proposing to work on a patch that will fix it for everyone, since it seems to affect many users. Of course, once you merge your branch adding support for other encodings, you can just ditch this Read wrapper.

@lovasoa
Copy link

lovasoa commented May 20, 2020

If you agree to merge a temporary solution, you can just close this bug today and hopefully never have to reopen it.

@netvl
Copy link
Owner

netvl commented May 20, 2020

work around this bug in my own programs

That's not what I meant. I meant that the wrapper does not really belong to the library, and it certainly won't be enable by default because it would require changing types in the API. Therefore, it would be much more beneficial to the wider community to have this as a separate, independent crate, rather than a part of xml_rs.

Having it as a separate crate does not have any downsides (except maybe another crate in dependencies, but with Cargo it's not a big problem), but in terms of general reuse it would be much better, for example, for people who don't need XML parsing but do want to remove BOMs for other purposes.

Even when used with this library the code would look almost the same:

// external crate
use bom_remover::BomRemover;

let file = File::open("file.xml").unwrap();
let file = BufReader::new(file):
let file = BomRemover::new(file);
let reader = EventReader::new(file);
// built into xml_rs
use xml::BomRemover;

let file = File::open("file.xml").unwrap();
let file = BufReader::new(file):
let file = BomRemover::new(file);
let reader = EventReader::new(file);

As a bonus point, you won't really need to wait for my review :) And I'm totally all right with adding a link to this new crate somewhere in the docs with a note that if people want to parse XML documents with BOMs, they can use this crate, with some examples.

@lovasoa
Copy link

lovasoa commented May 20, 2020

it would require changing types in the API

Why is that exactly ? Why would we need to expose this wrapper ? It should be completely transparent. The read wrapper will be created and stored as a private field in EventReader. Is there something that I do not see ?

The problem with having an external crate is that most people who want to parse XML in rust are simply not aware of this bug. They are not actively looking for ways to work around it. Just like me, they write their program, test it, everything works, and some time later, in production, the code encounters an xml string that starts with a BOM, and fails. I really think that this library is the right place to fix it, and I'm ready to help.

@lovasoa
Copy link

lovasoa commented May 20, 2020

Maybe I misunderstood your comment on the original PR ? You were not proposing a wrapper that would be internal to xml-rs ? How would you then fix the bug when parsing XML from a str, which may also start with a BOM character ?

@tp-m
Copy link
Contributor

tp-m commented Dec 17, 2021

In #213 I propose a from_bytes() convenience wrapper for this purposes.

@arielmendoza
Copy link

arielmendoza commented Apr 12, 2023

let mut xmlcontent=std::io::read_to_string(xmlfile).unwrap().replace("\u{feff}", "");

@kornelski
Copy link
Collaborator

BOM is now supported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants