mmap: try THP via madvise #113

Open
wants to merge 2 commits into base: v0.2.2_release

Conversation

nmanthey
Copy link

Huge pages bring performance benefits for memory intensive applications.
They can be used either by statically pre-reserving huge pages, or via
transparent huge pages (THP); the latter is the simpler option.

While some distributions enable transparent huge pages by default, other
distributions choose to allow this feature only when the madvise system
call is used.

This change adds a madvise system call to the memory allocation path. On
Unix systems, the mmap invocation is now followed by a madvise call that
asks the kernel to back the memory with transparent huge pages, if
possible.
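
For illustration, here is a minimal sketch of the mmap-then-madvise pattern (not the exact patch; it assumes Linux and the libc crate, and elides the flags and file-descriptor handling that MmapRegion actually supports):

use std::io;
use std::ptr::null_mut;

/// Map `size` bytes of anonymous memory and hint the kernel to back it
/// with transparent huge pages. Sketch only; error handling simplified.
fn map_with_thp_hint(size: usize) -> io::Result<*mut libc::c_void> {
    // SAFETY: null hint address and valid flags; the kernel picks the mapping.
    let addr = unsafe {
        libc::mmap(
            null_mut(),
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        )
    };
    if addr == libc::MAP_FAILED {
        return Err(io::Error::last_os_error());
    }
    // Best effort: the mapping stays usable even if the kernel rejects the
    // hint (e.g. THP disabled), so the return value can be ignored.
    // SAFETY: `addr`/`size` describe the mapping created above.
    unsafe { libc::madvise(addr, size, libc::MADV_HUGEPAGE) };
    Ok(addr)
}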

Note: this is a prototype implementation of huge page support. No
performance testing has been performed yet; we expect results similar to
those reported in https://arxiv.org/abs/2004.14378. Once this data is
available, a configuration layer should be added so that this change can
be enabled or disabled.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>

Testing Done

Currently, I have only compiled firecracker with the change; I will run its test suite next and add the results here, as well as look into performance testing.

The test suite mostly passes, except for some tests that fail due to timeouts, which might be caused by me running the tests on a mobile CPU.

Update dependencies

firecracker$ cargo update
    Updating crates.io index
    Updating git repository `https://github.com/firecracker-microvm/kvm-bindings`
    Updating git repository `https://github.com/firecracker-microvm/kvm-ioctls`
...
    Updating unicode-xid v0.2.0 -> v0.2.1
      Adding vm-memory v0.2.2 ($HOME/firecracker/local-crates/vm-memory)
    Removing vm-memory v0.2.2
    Updating winapi v0.3.8 -> v0.3.9

Actually Build

firecracker$ tools/devtool build |& tee build.log
[Firecracker devtool] Starting build (debug, musl) ...
 Downloading crates ...
...
   Compiling vmm-sys-util v0.6.1
   Compiling vm-memory v0.2.2 (/firecracker/local-crates/vm-memory)
   Compiling timerfd v1.1.1
...
    Finished dev [unoptimized + debuginfo] target(s) in 38.89s
[Firecracker devtool] Build successful.
[Firecracker devtool] Binaries placed under $HOME/firecracker/build/cargo_target/x86_64-unknown-linux-musl/debug

Required Changes in Firecracker (to pick up this change)

Place this vm-memory repository into "local-crates/vm-memory" in the firecracker repository.

In firecracker, use the following changes on top of the (current) master commit:

firecracker$ git diff HEAD^^
diff --git a/Cargo.toml b/Cargo.toml
index 799712f7..88da28f0 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -1,6 +1,9 @@
 [workspace]
 members = ["src/firecracker", "src/jailer"]

+[patch.crates-io]
+vm-memory = { path = 'local-crates/vm-memory' }
+
 [profile.dev]
 panic = "abort"

diff --git a/tools/devtool b/tools/devtool
index 2c2a9551..d870cef5 100755
--- a/tools/devtool
+++ b/tools/devtool
@@ -367,6 +367,7 @@ run_devctr() {
     # Use 'z' on the --volume parameter for docker to automatically relabel the
     # content and allow sharing between containers.
     docker run "${docker_args[@]}" \
+        --network=host \
         --rm \
         --volume /dev:/dev \
         --volume "$FC_ROOT_DIR:$CTR_FC_ROOT_DIR:z" \

@jiangliu
Copy link
Member

We have done the same optimization out of the vm-memory crate. It's really tricky to deal with preallocation and transparent huge pages. How about leaving this part to the VMM implementation?

@nmanthey
Copy link
Author

out of the vm-memory crate

Can you point to your implementation of this? I'd argue that having this available in a configurable way right in the crate might be easier for other consumers to adopt.

Can you also share the benchmarks you performed to see the effect of this?

src/mmap_unix.rs Outdated
@@ -154,6 +154,8 @@ impl MmapRegion {
return Err(Error::Mmap(io::Error::last_os_error()));
}

let _ret = unsafe { libc::madvise(addr, size, libc::MADV_HUGEPAGE) };
Copy link
Collaborator

Hi! Apologies for the late reply :( I was just wondering, what's the difference between using this madvise call and providing MAP_HUGETLB as part of flags?

Copy link
Member

MAP_HUGETLB uses hugetlbfs, while MADV_HUGEPAGE uses transparent huge pages for memfd or anonymous mappings.
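
To make the distinction concrete, a hugetlbfs-backed mapping might look like the sketch below (Linux and the libc crate assumed); it only succeeds when huge pages have been reserved up front (e.g. via vm.nr_hugepages), whereas the madvise variant in this PR keeps a regular anonymous mapping and merely hints the kernel:

use std::io;
use std::ptr::null_mut;

/// Sketch of a hugetlbfs-backed mapping: the pages come from the reserved
/// huge page pool, and mmap fails when that pool is empty.
fn map_hugetlb(size: usize) -> io::Result<*mut libc::c_void> {
    // SAFETY: anonymous mapping with a null hint address and valid flags.
    let addr = unsafe {
        libc::mmap(
            null_mut(),
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_HUGETLB,
            -1,
            0,
        )
    };
    if addr == libc::MAP_FAILED {
        return Err(io::Error::last_os_error());
    }
    Ok(addr)
}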

@jiangliu
Copy link
Member

out of the vm-memory crate

Can you point to your implementation of this? I'd argue that having this available in a configurable way right in the crate might be easier for other consumers to adopt.

Can you also share the benchmarks you performed to see the effect of this?

We have done some benchmarks with internal workloads, and the improvement depends on several factors. In most cases it achieves a 10-20% latency improvement.

@alexandruag
Copy link
Collaborator

Thanks for the details! It seems like there are some trade-offs to consider when using THP, so it shouldn't always be enabled. Does that sound right? In that case, since the madvise call happens separately from creating the mapping, having the VMM invoke madvise out-of-band as Gerry suggested is definitely an option. Another one is to add a method to GuestRegionMmap (and potentially a top-level one to GuestMemoryMmap too) which invokes madvise after the regions have been created. It would be interesting to hear more thoughts here as well.
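
As a rough sketch of that second option (the method name and the as_ptr()/size() accessors are assumptions here, not a confirmed part of the crate's API), such a helper on an existing region could look like:

use std::io;

impl MmapRegion {
    /// Hypothetical helper: ask the kernel to back this already-created
    /// mapping with transparent huge pages. Name and placement are a sketch.
    pub fn advise_thp(&self) -> io::Result<()> {
        // SAFETY: the pointer and length describe a mapping owned by `self`.
        let ret = unsafe {
            libc::madvise(
                self.as_ptr() as *mut libc::c_void,
                self.size(),
                libc::MADV_HUGEPAGE,
            )
        };
        if ret < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }
}

A top-level variant on GuestMemoryMmap would then simply iterate over its regions and call this per region.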

Allow to control whether we will use huge pages via options.

Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
@nmanthey
Copy link
Author

I added a cherry-picked commit from my current development tree that is based on v0.3.0 of this package. To not move the PR around, I decided to just add the change on top.

The change introduces the options in a minimal way, without properly forwarding them to the actual user of the created maps. Do you suggest continuing this effort and lifting the options struct into the signatures of all relevant functions, so that a caller can always choose how to set them? That seems to result in a lot of code (and interface) changes.

The alternatives I can see would be a singleton for these properties, or something else global that allows controlling the behavior, such as environment variables.

I currently prefer a singleton, any comments?
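
For what it's worth, one possible shape for such a singleton (purely a sketch; the names are hypothetical) is a process-wide atomic flag that the mapping code consults before issuing the madvise call:

use std::sync::atomic::{AtomicBool, Ordering};

/// Process-wide switch controlling whether new mappings get the THP hint.
static USE_THP: AtomicBool = AtomicBool::new(false);

/// Called once by the consumer (e.g. the VMM) before any memory is mapped.
pub fn set_thp_enabled(enabled: bool) {
    USE_THP.store(enabled, Ordering::Relaxed);
}

/// Checked inside the mapping code to decide whether to call madvise.
pub fn thp_enabled() -> bool {
    USE_THP.load(Ordering::Relaxed)
}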

Wrt #120, can we see the VMM change that uses huge pages? I guess that change does not support using transparent huge pages, does it?

@alexandruag
Copy link
Collaborator

Hmm, it seems like there are several proposed changes in flight that would benefit from a better interface for building GuestMemoryMmap objects (i.e. this PR, #120, #125, not to mention things like adding more flexibility when creating file-backed mappings). However, I don't think it's super clear at the moment what the right solution is. How about we provide the THP-related functionality as part of a method that can be called from the outside for now, and later add an option when revamping the higher-level interfaces?

From what I understand, #120 is sort of orthogonal to the transparent huge pages aspect (i.e. regions with THP are not considered to be backed by huge pages w.r.t. the semantics in there), but I might be wrong and hopefully someone will correct me if that's the case.
