Non-Cascaded #[inline] Propagation #678

Open
@RoloEdits

Description


When looking at the code I noticed that there was a break in #[inline] propagation on functions. Given that #[inline] is non-transitive, cross-crate inline optimization can very easily be disrupted if the chain breaks at the wrong depth.
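
A minimal, self-contained sketch of that non-transitivity (the function names here are hypothetical, not from quick_xml): marking `outer` with #[inline] exports its body for cross-crate inlining, but `inner` carries no attribute, so without LTO a downstream crate that inlines `outer` still pays a real call into this crate for `inner`:

```rust
// `outer` can be inlined into a downstream crate because #[inline]
// exports its body in the crate metadata, but `inner` has no
// attribute, so its body is not exported -- the chain breaks at
// this boundary unless LTO is enabled.
#[inline]
pub fn outer(x: u32) -> u32 {
    inner(x) // cross-crate inlining stops here without LTO
}

fn inner(x: u32) -> u32 {
    x.wrapping_mul(3).wrapping_add(1)
}

fn main() {
    println!("{}", outer(2)); // prints 7
}
```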

When running tests with this setup:

use quick_xml::events::Event;
use quick_xml::reader::Reader;

static XML: &str = include_str!(r"XML");

fn main() {
    quick_xml();
}

fn quick_xml() {
    let mut reader = Reader::from_str(XML);
    reader.trim_text(true);

    let mut buf = Vec::with_capacity(1024 * 8);

    loop {
        match reader.read_event_into(&mut buf) {
            Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
            Ok(Event::Eof) => break,
            Ok(Event::Start(_)) => {}
            Ok(Event::Text(_)) => {}
            _ => (),
        }
        buf.clear();
    }
}

I noticed that although this function has #[inline]:

    #[inline]
    pub fn read_event_into<'b>(&mut self, buf: &'b mut Vec<u8>) -> Result<Event<'b>> {
        self.read_event_impl(buf)
    }

The next function it calls does not:

    fn read_event_impl<'i, B>(&mut self, mut buf: B) -> Result<Event<'i>>
    where
        R: XmlSource<'i, B>,
    {
        read_event_impl!(self, buf, self.reader, read_until_open, read_until_close)
    }

To see the effects of the break in #[inline] propagation, I ran a few tests. First I went through and added #[inline] to the functions so that there is an end-to-end #[inline] chain, and then I went through again and changed it to #[inline(always)] to see what the differences would be.
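
As a reminder of the semantics (hypothetical helpers, not quick_xml code): #[inline] is only a hint that also makes the body available for cross-crate inlining, while #[inline(always)] strongly requests inlining at every call site where the body is visible:

```rust
// `hinted` carries the weaker hint; `forced` the stronger request.
// Neither attribute changes behavior, only codegen, so the
// difference is observable only in the emitted assembly and in
// benchmarks like the ones below.
#[inline]
pub fn hinted(x: u64) -> u64 {
    forced(x)
}

#[inline(always)]
fn forced(x: u64) -> u64 {
    x + 1
}

fn main() {
    println!("{}", hinted(41)); // prints 42
}
```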

I also ran with lto = true to see whether the attribute still adds anything under LTO, or whether LTO alone ultimately matches what manual #[inline] decoration brings; if the latter, rather than adding the attribute, the docs could just give a heads up that this crate benefits from LTO to some degree.
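
For reference, the LTO runs used a release profile along these lines (the exact settings are my assumption; in Cargo, lto = true selects "fat" LTO):

```toml
[profile.release]
lto = true
```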

The XML file was a 1.3 GB file from the worksheet of this dataset.

Benchmarks

quick_xml: master branch
  Time (mean ± σ):      5.941 s ±  0.018 s    [User: 5.614 s, System: 0.350 s]
  Range (min … max):    5.915 s …  5.976 s    10 runs

quick_xml: master branch + lto
  Time (mean ± σ):      4.623 s ±  0.049 s    [User: 4.231 s, System: 0.395 s]
  Range (min … max):    4.579 s …  4.719 s    10 runs

// I might have missed coverage for this one, but the main functions were covered for sure
quick_xml: inline
  Time (mean ± σ):      5.236 s ±  0.240 s    [User: 4.877 s, System: 0.353 s]
  Range (min … max):    5.123 s …  5.916 s    10 runs
  
quick_xml: inline(always)
  Time (mean ± σ):      4.232 s ±  0.022 s    [User: 3.843 s, System: 0.385 s]
  Range (min … max):    4.201 s …  4.255 s    10 runs

quick_xml: inline(always) + lto
  Time (mean ± σ):      4.148 s ±  0.017 s    [User: 3.744 s, System: 0.393 s]
  Range (min … max):    4.116 s …  4.162 s    10 runs

Percentage over master:
(chart omitted)

Seeing that the changes benefited even the LTO build, I think this is a worthwhile addition.

Profile

Current master:
(profile screenshot omitted)

#[inline]:
(profile screenshot omitted)

#[inline(always)]:
(profile screenshot omitted)

#[inline(always)] + lto:
(profile screenshot omitted)

One thing I did notice was a memcpy that accounted for 25% of the samples:
(profile screenshot omitted)

I'm not sure if this is part of the program or if it's just an artifact of the profiling.

Metadata

Assignees

No one assigned

    Labels

    enhancement, help wanted, optimization (Issues related to reducing time needed to parse XML or to memory consumption)

    Projects

    No projects

    Milestone

    No milestone
