Description
When looking at the code I noticed that there was a break in #[inline]
propagation on functions. Given that #[inline]
is non-transitive, cross crate inline optimization can be very easily disrupted if the chain breaks at the wrong depth.
When running tests with this setup:
use quick_xml::events::Event;
use quick_xml::reader::Reader;
static XML: &str = include_str!(r"XML");
fn main() {
quick_xml();
}
fn quick_xml() {
let mut reader = Reader::from_str(XML);
reader.trim_text(true);
let mut buf = Vec::with_capacity(1024 * 8);
loop {
match reader.read_event_into(&mut buf) {
Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
Ok(Event::Eof) => break,
Ok(Event::Start(_)) => {}
Ok(Event::Text(_)) => {}
_ => (),
}
buf.clear();
}
}
I noticed that although this function has #[inline]
:
#[inline]
pub fn read_event_into<'b>(&mut self, buf: &'b mut Vec<u8>) -> Result<Event<'b>> {
self.read_event_impl(buf)
}
The next funtion its calling does not:
fn read_event_impl<'i, B>(&mut self, mut buf: B) -> Result<Event<'i>>
where
R: XmlSource<'i, B>,
{
read_event_impl!(self, buf, self.reader, read_until_open, read_until_close)
}
In trying to see the effects of the break in the #[inline]
propgation, I ran a few tests. I went through and at first added #[inline]
to the functions so that there is an end-to-end #[inline]
chain, and then later went through and changed it to be #[inline(always)]
to see what the differences would be.
I ran lto=true
as well to see if there is any benefit to #[inline]
or if it just ultimately matches what a manual #[inline]
decoration brings, looking to see if rather than adding the attribute, just give a heads up in the docs that lto would benefit this crate to some degree.
The xml file was a 1.3GB file from the worksheet of this dataset.
Benchmarks
quick_xml: master branch
Time (mean ± σ): 5.941 s ± 0.018 s [User: 5.614 s, System: 0.350 s]
Range (min … max): 5.915 s … 5.976 s 10 runs
quick_xml: master branch + lto
Time (mean ± σ): 4.623 s ± 0.049 s [User: 4.231 s, System: 0.395 s]
Range (min … max): 4.579 s … 4.719 s 10 runs
// I might have missed coverage for this one, but the main functions were gotten for sure
quick_xml: inline
Time (mean ± σ): 5.236 s ± 0.240 s [User: 4.877 s, System: 0.353 s]
Range (min … max): 5.123 s … 5.916 s 10 runs
quick_xml: inline(always)
Time (mean ± σ): 4.232 s ± 0.022 s [User: 3.843 s, System: 0.385 s]
Range (min … max): 4.201 s … 4.255 s 10 runs
quick_xml: inline(always) + lto
Time (mean ± σ): 4.148 s ± 0.017 s [User: 3.744 s, System: 0.393 s]
Range (min … max): 4.116 s … 4.162 s 10 runs
Seeing that the changes even benefited lto
, I think this is a worth while add.
Profile
One thing I did notice was a memcpy
instruction that was 25% of the samples:
I'm not sure if this is part of the program or if its just an artifact of the profiling.