mmap: try THP via madvise #113

Open
wants to merge 2 commits into base: v0.2.2_release

Conversation

nmanthey
Copy link

Huge pages bring performance benefits for memory intensive applications.
They can be used either by statically pre-reserving huge pages, or via
transparent huge pages (THP); the latter is the simpler option.

While some distributions enable transparent huge pages by default, other
distributions choose to allow this feature only when the madvise system
call is used.

This change adds a madvise system call to the memory allocation path. On
Unix systems, the mmap invocation is now followed by a madvise call that
asks the kernel to back the memory with transparent huge pages, if
possible.
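
For illustration, here is a minimal sketch of the mmap-then-madvise pattern (not the exact patch; it assumes Linux and the libc crate, and elides the flags and file-descriptor handling that MmapRegion actually supports):

use std::io;
use std::ptr::null_mut;

/// Map `size` bytes of anonymous memory and hint the kernel to back it
/// with transparent huge pages. Sketch only; error handling simplified.
fn map_with_thp_hint(size: usize) -> io::Result<*mut libc::c_void> {
    // SAFETY: null hint address and valid flags; the kernel picks the mapping.
    let addr = unsafe {
        libc::mmap(
            null_mut(),
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        )
    };
    if addr == libc::MAP_FAILED {
        return Err(io::Error::last_os_error());
    }
    // Best effort: the mapping stays usable even if the kernel rejects the
    // hint (e.g. THP disabled), so the return value can be ignored.
    // SAFETY: `addr`/`size` describe the mapping created above.
    unsafe { libc::madvise(addr, size, libc::MADV_HUGEPAGE) };
    Ok(addr)
}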

Note: this is a prototype implementation of huge page support. No
performance testing has been performed yet; we expect results similar to
those reported in https://arxiv.org/abs/2004.14378. Once this data is
available, a configuration layer should be added so that this change can
be enabled or disabled.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>

Testing Done

Currently, I have only compiled firecracker with the change; I will run its test suite next and add the results here, as well as look into performance testing.

The test suite mostly passes, except for some tests that fail due to timeouts, which might be caused by me running the tests on a mobile CPU.

Update dependencies

firecracker$ cargo update
    Updating crates.io index
    Updating git repository `https://github.com/firecracker-microvm/kvm-bindings`
    Updating git repository `https://github.com/firecracker-microvm/kvm-ioctls`
...
    Updating unicode-xid v0.2.0 -> v0.2.1
      Adding vm-memory v0.2.2 ($HOME/firecracker/local-crates/vm-memory)
    Removing vm-memory v0.2.2
    Updating winapi v0.3.8 -> v0.3.9

Actually Build

firecracker$ tools/devtool build |& tee build.log
[Firecracker devtool] Starting build (debug, musl) ...
 Downloading crates ...
...
   Compiling vmm-sys-util v0.6.1
   Compiling vm-memory v0.2.2 (/firecracker/local-crates/vm-memory)
   Compiling timerfd v1.1.1
...
    Finished dev [unoptimized + debuginfo] target(s) in 38.89s
[Firecracker devtool] Build successful.
[Firecracker devtool] Binaries placed under $HOME/firecracker/build/cargo_target/x86_64-unknown-linux-musl/debug

Required Changes in Firecracker (to pick up this change)

Place this vm-memory repository into "local-crates/vm-memory" in the firecracker repository.

In firecracker, use the following changes on top of the (current) master commit:

firecracker$ git diff HEAD^^
diff --git a/Cargo.toml b/Cargo.toml
index 799712f7..88da28f0 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -1,6 +1,9 @@
 [workspace]
 members = ["src/firecracker", "src/jailer"]

+[patch.crates-io]
+vm-memory = { path = 'local-crates/vm-memory' }
+
 [profile.dev]
 panic = "abort"

diff --git a/tools/devtool b/tools/devtool
index 2c2a9551..d870cef5 100755
--- a/tools/devtool
+++ b/tools/devtool
@@ -367,6 +367,7 @@ run_devctr() {
     # Use 'z' on the --volume parameter for docker to automatically relabel the
     # content and allow sharing between containers.
     docker run "${docker_args[@]}" \
+        --network=host \
         --rm \
         --volume /dev:/dev \
         --volume "$FC_ROOT_DIR:$CTR_FC_ROOT_DIR:z" \

@jiangliu
Copy link
Member

We have done the same optimization out of the vm-memory crate. It's really tricky to deal with preallocation and transparent huge pages. How about leaving this part to the VMM implementation?

@nmanthey
Copy link
Author

out of the vm-memory crate

Can you point to your implementation of this? I'd argue that having this available in a configurable way right in the crate might be easier for other consumers to adopt.

Can you also share the benchmarks you performed to see the effect of this?

src/mmap_unix.rs Outdated
@@ -154,6 +154,8 @@ impl MmapRegion {
return Err(Error::Mmap(io::Error::last_os_error()));
}

let _ret = unsafe { libc::madvise(addr, size, libc::MADV_HUGEPAGE) };
Copy link
Collaborator

Hi! Apologies for the late reply :( I was just wondering, what's the difference between using this madvise call and providing MAP_HUGETLB as part of flags?

Copy link
Member

MAP_HUGETLB uses hugetlbfs, while MADV_HUGEPAGE uses transparent huge pages for memfd or anonymous mappings.
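
To make the distinction concrete, a hugetlbfs-backed mapping might look like the sketch below (Linux and the libc crate assumed); it only succeeds when huge pages have been reserved up front (e.g. via vm.nr_hugepages), whereas the madvise variant in this PR keeps a regular anonymous mapping and merely hints the kernel:

use std::io;
use std::ptr::null_mut;

/// Sketch of a hugetlbfs-backed mapping: the pages come from the reserved
/// huge page pool, and mmap fails when that pool is empty.
fn map_hugetlb(size: usize) -> io::Result<*mut libc::c_void> {
    // SAFETY: anonymous mapping with a null hint address and valid flags.
    let addr = unsafe {
        libc::mmap(
            null_mut(),
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_HUGETLB,
            -1,
            0,
        )
    };
    if addr == libc::MAP_FAILED {
        return Err(io::Error::last_os_error());
    }
    Ok(addr)
}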

@jiangliu
Copy link
Member

out of the vm-memory crate

Can you point to your implementation of this? I'd argue that having this available in a configurable way right in the crate might be easier for other consumers to adopt.

Can you also share the benchmarks you performed to see the effect of this?

We have done some benchmarks with internal workloads, and the improvement depends on several factors. In most cases it achieves a 10-20% latency improvement.

@alexandruag
Copy link
Collaborator

Thanks for the details! It seems like there are some trade-offs to consider when using THP, so it shouldn't always be enabled. Does that sound right? In that case, since the madvise call happens separately from creating the mapping, having the VMM invoke madvise out-of-band as Gerry suggested is definitely an option. Another one is to add a method to GuestRegionMmap (and potentially a top-level one to GuestMemoryMmap too) which invokes madvise after the regions have been created. It would be interesting to hear more thoughts here as well.
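
As a rough sketch of that second option (the method name and the as_ptr()/size() accessors are assumptions here, not a confirmed part of the crate's API), such a helper on an existing region could look like:

use std::io;

impl MmapRegion {
    /// Hypothetical helper: ask the kernel to back this already-created
    /// mapping with transparent huge pages. Name and placement are a sketch.
    pub fn advise_thp(&self) -> io::Result<()> {
        // SAFETY: the pointer and length describe a mapping owned by `self`.
        let ret = unsafe {
            libc::madvise(
                self.as_ptr() as *mut libc::c_void,
                self.size(),
                libc::MADV_HUGEPAGE,
            )
        };
        if ret < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }
}

A top-level variant on GuestMemoryMmap would then simply iterate over its regions and call this per region.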

Allow to control whether we will use huge pages via options.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
@nmanthey
Copy link
Author

I added a cherry-picked commit from my current development tree that is based on v0.3.0 of this package. To not move the PR around, I decided to just add the change on top.

The change introduces the options in a minimal way, without properly forwarding them to the actual user of the created maps. Do you suggest continuing this effort and lifting the options struct into the signatures of all relevant functions, so that a caller can always choose how to set them? That seems to result in a lot of code (and interface) changes.

The alternatives I can see would be a singleton for these properties, or something else global that allows controlling the behavior, such as environment variables.

I currently prefer a singleton, any comments?
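
For what it's worth, one possible shape for such a singleton (purely a sketch; the names are hypothetical) is a process-wide atomic flag that the mapping code consults before issuing the madvise call:

use std::sync::atomic::{AtomicBool, Ordering};

/// Process-wide switch controlling whether new mappings get the THP hint.
static USE_THP: AtomicBool = AtomicBool::new(false);

/// Called once by the consumer (e.g. the VMM) before any memory is mapped.
pub fn set_thp_enabled(enabled: bool) {
    USE_THP.store(enabled, Ordering::Relaxed);
}

/// Checked inside the mapping code to decide whether to call madvise.
pub fn thp_enabled() -> bool {
    USE_THP.load(Ordering::Relaxed)
}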

Wrt #120, can we see the VMM change that uses huge pages? I guess that change does not support using transparent huge pages, does it?

@alexandruag
Copy link
Collaborator

Hmm, it seems like there are several proposed changes in flight that would benefit from a better interface for building GuestMemoryMmap objects (i.e. this PR, #120, #125, not to mention things like adding more flexibility when creating file-backed mappings). However, I don't think it's super clear at the moment what the right solution is. How about we provide the THP-related functionality as part of a method that can be called from the outside for now, and later add an option when revamping the higher-level interfaces?

From what I understand, #120 is sort of orthogonal to the transparent huge pages aspect (i.e. regions with THP are not considered to be backed by huge pages w.r.t. the semantics in there), but I might be wrong and hopefully someone will correct me if that's the case.
