
Direct I/O and Transparent HugePages #7420


Open
wants to merge 1 commit into master

Conversation

pavelfatin

Abstract: --direct-io for bypassing page cache (and using THP on Linux): up to 3-6x faster uncached loading, fewer pageouts, no page cache pollution.

Using LLMs involves large data files: between Llama 3 70B and Mixtral 8x22B, loading 50-100 GB models is now quite common. This requires not just fast CPUs/GPUs and ample RAM, but also fast storage devices.

Fortunately, consumer PCIe 4.0 NVMe SSDs offer read speeds up to 7.5 GB/s, and PCIe 5.0 SSDs—up to 14.5 GB/s—no fancy RAID arrays required! So loading an 80 GB model should take just 5 seconds—problem solved. Right? Not so fast. If you look at iotop, the numbers are different from those in CrystalDiskMark screenshots. There are several reasons for that.

Llama.cpp is not a typical application because it:

  1. reads very large files,
  2. reads the files completely,
  3. reads the files sequentially,
  4. reads the files in one step,
  5. doesn't transform the read data,
  6. both reads data and allocates memory for that data,
  7. keeps all the read data in memory,
  8. often takes almost all available memory.

Since operating systems don't normally optimize for this use case, there are two bottlenecks:

  1. buffered I/O,
  2. virtual memory.

Buffered I/O

Regular I/O is buffered and goes through the page cache. This works well for typical workloads—small random repeated reads. But in this case, it's not ideal because:

  • the data has to be copied between kernel space and user space, which takes extra time,
  • the caching algorithms introduce overhead,
  • the memory might be allocated twice,
  • it causes page cache pollution, and the OS may reclaim pages from elsewhere,
  • unless there's roughly twice the model size in memory, other memory also has to be reclaimed,
  • caching across runs only pays off if there's roughly twice the model size in memory,
  • and even then, reading the cached data requires copying.
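
For illustration, a minimal sketch (not the actual loader code) of what buffered loading boils down to: every read() first fills the page cache, and the data is then copied into the user buffer, so it effectively exists in memory twice.

// Minimal sketch of buffered loading (illustrative only; error handling trimmed).
// Each read() goes through the page cache: the kernel first reads into cache pages,
// then copies them into `buf`, and the application copies them again into `out`.
#include <fcntl.h>
#include <unistd.h>
#include <vector>

static void load_buffered(const char * path, std::vector<char> & out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;
    std::vector<char> buf(1 << 20); // 1 MiB intermediate buffer
    ssize_t n;
    while ((n = read(fd, buf.data(), buf.size())) > 0) {
        out.insert(out.end(), buf.data(), buf.data() + n); // extra user-space copy
    }
    close(fd);
}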

Memory-mapped I/O

One way to address these shortcomings is to use memory-mapped I/O. The advantages are:

  • there's no double copying,
  • there's no double memory allocation,
  • data can be cached between consecutive program runs.

However, memory-mapped I/O is also optimized primarily for small random repeated reads and has some drawbacks:

  • sequential reading of large files may not necessarily be fast (even with prefetching, which is not always available),
  • there's still the general caching subsystem overhead,
  • memory is more likely to be paged out—this doesn't require writing to disk; to the OS it's just "file cache", while the default swappiness on Linux is 60 (this can be addressed by memory locking, but only partially).
  • this still causes page cache pollution, because reading fills up the page cache,
  • it's harder to estimate actual memory use,
  • memory is not immediately reclaimed on exit, which may subsequently slow down other programs (or the same program loading different files),
  • transparent huge pages are not available for regular memory-mapped files.
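
For comparison, a minimal sketch of the file-backed memory-mapped path on Linux (illustrative only): the pages stay in the page cache, and the hints below are the main knobs for prefetching.

// Minimal sketch of file-backed memory mapping (illustrative only).
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void * map_file(const char * path, size_t & size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    size = (size_t) st.st_size;
    void * addr = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); // the mapping keeps the file referenced
    if (addr == MAP_FAILED) return nullptr;
    posix_madvise(addr, size, POSIX_MADV_SEQUENTIAL); // hint: sequential access
    posix_madvise(addr, size, POSIX_MADV_WILLNEED);   // hint: prefetch
    return addr; // the data remains ordinary page cache, subject to reclaim
}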

Direct I/O

Another way is to employ direct I/O, which bypasses the page cache entirely. The advantages are:

  • there's no double copying (DMA to user-space buffers),
  • there's no double memory allocation,
  • reading can be performed as fast as possible by the hardware (provided that the buffer is large),
  • there's no general caching subsystem overhead,
  • reading doesn't affect the system's page cache,
  • data is more likely to be retained in memory, even without locking,
  • it's easy to estimate actual memory use,
  • memory is reclaimed on exit and is available for further use,
  • it's possible to use transparent huge pages.
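
A minimal sketch of the read path, assuming Linux and a page-aligned destination buffer (this is not the PR's actual read_direct): the file is opened with O_DIRECT, and large aligned reads go straight into the destination, bypassing the page cache.

// Minimal sketch of direct I/O on Linux (illustrative only; error handling trimmed).
// O_DIRECT requires the destination buffer, the file offset, and the length to be
// aligned (page alignment is a safe choice); `dst` and `size` are assumed aligned.
// Building may require _GNU_SOURCE for O_DIRECT to be declared.
#include <algorithm>
#include <fcntl.h>
#include <unistd.h>

static bool read_file_direct(const char * path, char * dst, size_t size) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) return false;
    const size_t chunk = 64u * 1024 * 1024; // large chunks keep the device busy
    size_t done = 0;
    while (done < size) {
        size_t  len = std::min(chunk, size - done);
        ssize_t n   = pread(fd, dst + done, len, (off_t) done);
        if (n <= 0) { close(fd); return false; }
        done += (size_t) n;
    }
    close(fd);
    return true;
}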

In contrast to regular disk benchmarks (CrystalDiskMark or fio), the program not only reads data but also allocates memory for that data, which with 4K pages can take as much time as the reading itself. This is not an issue if reading is relatively slow, but if reading is fast, allocation becomes a significant bottleneck. Linux supports transparent huge pages, which speed up memory allocation and provide an additional speed boost; under the default madvise THP mode, huge pages can be requested for anonymous memory mappings via madvise. (THP also seems to increase inference speed by 1% as an added bonus.)
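
A sketch of the allocation side under the same assumptions: an anonymous private mapping, rounded up to a huge-page multiple and marked with MADV_HUGEPAGE, so the kernel's madvise THP mode can back it with 2 MB pages; the resulting pointer is also page-aligned and therefore usable as a direct-I/O destination.

// Minimal sketch of an anonymous mapping with transparent huge pages (Linux only).
#include <cstddef>
#include <sys/mman.h>

static void * alloc_anonymous(size_t size) {
    const size_t huge = 2u * 1024 * 1024;   // assume 2 MiB huge pages
    size = (size + huge - 1) / huge * huge; // round up to a huge-page multiple
    void * addr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) return nullptr;
#ifdef MADV_HUGEPAGE
    madvise(addr, size, MADV_HUGEPAGE);     // request THP (effective in madvise mode)
#endif
    return addr;
}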

The only drawback compared to memory-mapped I/O is that there's no caching between consecutive program runs when loading the same file. (However, even with memory-mapped I/O, program initialization takes some time, and there may be GPU offloading, so it's not instantaneous. Because both methods are fast in absolute terms, in practice, direct I/O can feel almost as fast as pre-cached memory-mapped I/O.)

Use cases

Memory-mapped I/O lets you load an already cached file faster. But if you load a file for the first time (or after some time), load several different files consecutively, or simply want to keep the page cache clean, you may benefit from direct I/O.

The exact gain depends on multiple factors, including:

  • Hardware. The faster your storage device is, the greater the gain you can expect.
  • OS. The effect may be greater on Linux due to THP.
  • Filesystem.
  • Compression.
  • Encryption.
  • File size relative to memory size.

Technical details

In general, using direct I/O is more complicated because it requires alignment. However, in this case, the implementation fits the existing infrastructure perfectly by introducing anonymous mappings, which make it possible to align the data and coalesce reads. Non-tensor data is read using buffered I/O and so is cached.
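
To illustrate the alignment point, a hypothetical helper (not taken from the PR): a tensor's byte range in the file is expanded to page boundaries before the direct read, and the caller then uses the data at the original in-page offset.

// Hypothetical illustration: expanding an unaligned file range to page boundaries
// so it can be read with direct I/O into page-aligned memory.
#include <cstdint>

struct aligned_range {
    uint64_t begin; // rounded-down file offset to start reading at
    uint64_t end;   // rounded-up file offset to stop reading at
    uint64_t skip;  // bytes to skip at the start of the buffer to reach the payload
};

static aligned_range align_range(uint64_t offset, uint64_t length, uint64_t page) {
    aligned_range r;
    r.begin = offset / page * page;
    r.end   = (offset + length + page - 1) / page * page;
    r.skip  = offset - r.begin;
    return r;
}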

Direct I/O is supported on Linux, Windows, macOS, and FreeBSD. THP is supported on Linux. It's verified that the code compiles and runs on Linux, Windows, and macOS.

To be on the safe side, the --direct-io option is currently opt-in, but it could subsequently become the default in --no-mmap mode (with a complementary --no-direct-io).

Benchmarks

When measuring the effect, note that:

  • You must clear the page cache before each test (even for direct I/O, because there's also memory allocation).
    Use echo 1 > /proc/sys/vm/drop_caches (and possibly echo 1 > /proc/sys/vm/compact_memory) on Linux, and purge on macOS, or simply reboot the machine.
    However, you may also test the case when the page cache is completely filled by a different file.
  • As a baseline, compile the program for CPU only. (You may need to clean or disable ccache before the recompilation.)
  • Run tests directly rather than in a VM, because the latter adds I/O and memory access translation.
  • Besides the loading time, you may also use iotop (or a similar utility) to estimate the reading speed directly.

Configuration: Ryzen 9 7950X, 94 GB DDR5-6400, Crucial T705 2TB, Ubuntu 22.04.4, Linux 6.9.1, EXT4.

Model: Mixtral 8x22B Instruct Q4_K_M (85 GB).

Each measurement was taken multiple times and the results were averaged.

Mode                             Time, s   Comparison
--no-mmap, polluted page cache   47.3      6.5
--no-mmap, clean page cache      46.2      6.3
mmap, polluted page cache        22.3      3.1
mmap, clean page cache           20.8      2.9
--direct-io                       7.3      1.0
mmap, cached                      4.5      0.6

@github-actions github-actions bot added script Script related examples python python script changes server labels May 20, 2024
@teleprint-me
Contributor

teleprint-me commented May 20, 2024

Am I reading this right? It only took 7s to load? Increased speed by a factor of 3 to 7x?

Contributor

github-actions bot commented May 20, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 555 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8429.03ms p(95)=20077.14ms fails=, finish reason: stop=501 truncated=54
  • Prompt processing (pp): avg=102.18tk/s p(95)=489.97tk/s
  • Token generation (tg): avg=34.46tk/s p(95)=47.62tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=direct-io commit=5f097e2485e864e79e9cddbc9c0e037ee1e7095b

(Charts: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing)

@pavelfatin
Author

Am I reading this right? It only took 7s to load? Increased speed by a factor of 3 to 7x?

Yes, that's correct. (The exact numbers obviously depend on the configuration.)

@mofosyne mofosyne added the Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level label May 21, 2024
Member

@slaren slaren left a comment

I am not sure that I understand the purpose of anonymous mmap + direct IO. Is it just to enable huge pages? If so, does that really need to be done through an anonymous mmap? Why not use a normal buffer instead? Is it necessary to pre-populate the entire model when using huge pages in this way? In any case, I think it is confusing to do this under the flag of mmap.

On another note, I found that the readahead size has a huge effect on the performance of MAP_POPULATE. I use blockdev --setra 10240 /dev/sdx on startup, and this allows me to load the model at 7 GB/s from disk. With the default value, I only get about 2 GB/s.

@pavelfatin
Author

I am not sure that I understand the purpose of anonymous mmap + direct IO. Is it just to enable huge pages?

Compared to file-backed mappings, anonymous mappings:

  • Can be read as quickly as possible regardless of the speed of prefetch, which might also be unavailable.
  • Don't incur the general caching overhead.
  • Don't pollute the system's page cache.
  • Free the memory on program exit.
  • Result in fewer pageouts (according to swappiness).
  • Better reflect actual memory use.
  • Would let us subsequently provide the progress callback.
  • Let us enable THP, which is available for private anonymous mappings on madvise.

So, the speed-up is not the only improvement of --direct-io.

does that really need to be done through an anonymous mmap? Why not use a normal buffer instead?

Compared to plain buffers, llama_mmap:

  • Provides memory alignment (which is required for direct I/O).
  • Provides a continuous section of memory regardless of the buffer API and GPU offloading.
  • Lets us align read offsets and lengths, which are not aligned in the file.
  • Lets us coalesce reads and use the large buffer.
  • Lets us reuse the already existing infrastructure for mapping/unmapping.

Is it necessary to pre-populate the entire model when using huge pages in this way?

The entire model must be preloaded in essentially one step, one way or another (regardless of THP). When it comes to synchronous reads, it's better to use a very large buffer and coalesce reads. However, the maximum read length is effectively 2 or 4 GB (API limit), so, in contrast to file-backed mmap, it would be possible to subsequently implement the progress callback.
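
For reference, a single Linux read()/pread() call transfers at most about 2 GB (0x7ffff000 bytes), so coalesced reads are issued in capped chunks anyway; a hypothetical sketch of such a loop with a per-chunk progress callback (names are illustrative, not llama.cpp API):

// Hypothetical sketch: coalesced reads issued in chunks below the per-call limit,
// reporting progress after each chunk.
#include <algorithm>
#include <functional>
#include <unistd.h>

static bool pread_chunked(int fd, char * dst, size_t size, off_t offset,
                          const std::function<void(size_t, size_t)> & progress) {
    const size_t max_chunk = 1u << 30; // 1 GiB, safely below the ~2 GB limit
    size_t done = 0;
    while (done < size) {
        size_t  len = std::min(max_chunk, size - done);
        ssize_t n   = pread(fd, dst + done, len, offset + (off_t) done);
        if (n <= 0) return false;
        done += (size_t) n;
        progress(done, size);
    }
    return true;
}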

it is confusing to do this under the flag of mmap

Anonymous mmap is technically also mmap (so llama_anonymous_mmap is a llama_mmap), and there's a comment to clarify this:

// either file or anonymous mappings
this->use_mmap = use_mmap || use_direct_io;
this->use_direct_io = use_direct_io;

init_mappings has a bool anonymous parameter to make this clearer. However, for the code that uses mappings, a mapping is just a mapping—there's no essential difference whether it's file-backed or anonymous.

And, practically, this lets us reuse the existing infrastructure for mappings with minimal modifications.

(Thank you for the suggestions! I've addressed the review comments.)

@slaren
Member

slaren commented May 21, 2024

Technically it is mmap, but in this case mmap is only being used as a memory allocation mechanism, and it doesn't provide any of the advantages that users expect from memory-mapped files. At this point, it becomes an implementation detail that is not really relevant to the user. It could just as well be replaced with posix_memalign and would achieve the same result, other than being able to reuse the existing infrastructure. However, I am not convinced that reusing it is a good idea. Generally, memory-mapping the entire file and reading it at once with MAP_POPULATE, even when offloading, is ok because the OS can use the file as the on-disk backing of the data if there is not enough physical memory to hold the entire model file in memory. With an anonymous map, this is no longer the case, and the allocation will either fail or be backed by swap.

I think a better way to support this might be to read each tensor to a page-aligned buffer and memcpy it to the backend buffer. The tensors are typically large enough that it should be able to achieve full I/O throughput, and the additional memcpy is unlikely to affect the overall performance significantly. What do you think?

It would also be possible to modify the buffer/tensor allocation to ensure that all the tensors addresses are page-aligned.
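
For concreteness, a rough sketch of the approach suggested here (hypothetical names, not actual llama.cpp code): each tensor is read with direct I/O into a page-aligned staging buffer and then memcpy'd into its backend buffer; a real implementation would reuse the staging buffer and use multiple threads.

// Rough sketch of per-tensor loading via a page-aligned staging buffer (hypothetical).
#include <cstdlib>
#include <cstring>
#include <unistd.h>

static bool load_tensor_staged(int fd_direct, off_t file_offset, size_t size, void * backend_dst) {
    const size_t page = (size_t) sysconf(_SC_PAGESIZE);
    // expand the unaligned tensor range to page boundaries for O_DIRECT
    off_t  begin = file_offset / (off_t) page * (off_t) page;
    size_t skip  = (size_t) (file_offset - begin);
    size_t len   = (skip + size + page - 1) / page * page;

    void * staging = nullptr;
    if (posix_memalign(&staging, page, len) != 0) return false;

    // short reads near the end of the file are fine as long as the tensor bytes arrived
    bool ok = pread(fd_direct, staging, len, begin) >= (ssize_t) (skip + size);
    if (ok) {
        memcpy(backend_dst, (char *) staging + skip, size); // extra copy, but tensors are large
    }
    free(staging);
    return ok;
}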

@pavelfatin
Author

it becomes an implementation detail that is not really relevant to the user

To the user, the mmap call (rather than the option) is an implementation detail, in the same way mmap in malloc is an implementation detail. The --direct-io option doesn't say that it's implemented via mmap, so users shouldn't expect the behavior of memory-mapped files in such a case. The --direct-io option is supposed to behave differently.

The file-backed mode is still available and is used by default. The --direct-io option is opt-in. So there's no surprise.

the OS can use the file as the backing on disk of the data if there is not enough physical memory to hold the entire model file in memory

For that particular task, it's not ideal, because it makes inference impractically slow, so there's the mlock workaround. (However, that's somewhat irrelevant—each mode has pros and cons.)

It would also be possible to modify the buffer/tensor allocation to ensure that all the tensors addresses are page-aligned.

The data in the file is not aligned.

What do you think?

It seems that reusing the existing infrastructure for mappings is the most straightforward, consistent, and technically sound way to implement the --direct-io option. We should differentiate between user-level options and implementation details. It's probably possible to implement a completely separate mechanism and overcome various technical obstacles somehow. However, there's already an existing mechanism that fits the task perfectly.

@slaren
Member

slaren commented May 21, 2024

To the user, the mmap call (rather than the option) is an implementation detail, in the same way mmap in malloc is an implementation detail. The --direct-io option doesn't say that it's implemented via mmap, so users shouldn't expect the behavior of memory-mapped files in such a case. The --direct-io option is supposed to behave differently.

Thanks, I was confused about the way the use_direct_io option interacts with use_mmap. It seems to override it completely.

For that particular task, it's not ideal, because it makes inference impractically slow, so there's the mlock workaround. (However, that's somewhat irrelevant—each mode has pros and cons.)

The most common use case where users have less system memory than the model size is when offloading the model to a GPU. In this case it is still desirable to support direct I/O to improve loading time, but loading the entire model into a buffer in one step prevents this.

The data in the file is not aligned.

If there is a reasonable improvement in performance, the tensor alignment of GGUF files can be increased.

Generally I would agree with you that reusing the existing infrastructure makes sense, but in this case it comes with a significant disadvantage.

@pavelfatin
Author

If there is a reasonable improvement in performance, the tensor alignment of GGUF files can be increased.

Alignment is required for direct I/O, and the page size is determined at runtime. (And there's just no need to complicate things, given that there's already a simple and effective solution.)

In this case it is still desirable to support direct IO to improve loading time, but loading the entire model to a buffer in one step prevents this.

It's not exactly in one step—the GPU context is loaded first and then unmapped, freeing the memory.

Generally I would agree with you that reusing the existing infrastructure makes sense, but in this case it comes with a significant disadvantage.

It actually fits the task perfectly, because it:

  • Provides memory alignment (which is required for direct I/O).
  • Provides a continuous section of memory regardless of the buffer API and GPU offloading.
  • Lets us align read offsets and lengths (required for direct I/O), which are not aligned in the file.
  • Lets us coalesce reads and use the large buffer.
  • Lets us reuse the already existing infrastructure for mapping/unmapping.

So we can both meet all the requirements and use minimal modifications—a win-win.

@slaren
Member

slaren commented May 21, 2024

It's not really necessary for direct I/O; it is only necessary to avoid an additional memory copy (you can read the tensor starting at the page boundary and discard the irrelevant part of the first page later).

It's not exactly in one step—the GPU context is loaded first and then unmapped, freeing the memory.

That's not the case; each file is only mapped once, and from this mapping it is copied to the GPU buffer. At the end, portions of the file that are not needed on the CPU are unmapped if possible (not on Windows). But you cannot rely on the first and last offsets to load only the data necessary for the context. With partial offloading, very often the entire model is kept mapped on the CPU because both the first and last tensors in the model file are kept on the CPU, even if you are offloading every layer except one. When using mmap, this is mostly ok because we can assume that the OS will evict the memory-mapped file from physical memory with very little overhead if necessary. This is not the case with an anonymous map.

I already addressed your copy-pasted list of reasons, and I already pointed out that the only real advantage there is that it can reuse the existing infrastructure. However, the amount of code that this is saving is not really significant, and the disadvantages are big enough to be deal breakers.

@pavelfatin
Author

It's not really necessary for direct I/O, it is only necessary to avoid an additional memory copy

Yes, but then the I/O is not truly direct (zero-copy). And that's more complicated.

That's not the case, each file is only mapped once, and from this mapping it is copied to the GPU buffer.

Every file/memory region is mapped once. However, while file-backed mappings are populated completely (on mapping), anonymous mappings are populated selectively, per context—if you load a model with GPU offloading, there are actually two progress stages (though that may depend on the model).

Each option has pros and cons and can provide benefits for particular use cases. What's more, each option is probably not ideal in some absolute sense and can be improved. There are various caveats to both the mmap and no-mmap modes. That's how software is built—there's always a balance between complexity and functionality.

The --direct-io option has already been implemented, requires minimal modifications, is opt-in, works well for many use cases, and can offer immediate benefit to the users.

Even if it's not absolutely perfect, the --direct-io option meets the criteria for usable MVP. It adds more to the project than subtracts. Since the option is opt-in, no user experience is disrupted. Since the modifications are minimal, the code is mostly the same.

There seems to be no reason to discard the work that's already been done. We can use the first version to try the concept in practice, get user feedback, and build upon the initial implementation.

@slaren
Member

slaren commented May 21, 2024

I am not asking you to discard what has been done, I think this is great work in the right direction, but the implementation needs changes. I made a quick and very bad implementation in e9095e6 to test loading each tensor separately with multiple threads, and for me this achieves the same I/O throughput, and overall reduces the load time. Please take a look and consider using a similar approach.

@pavelfatin
Author

It seems that it should be possible to implement a more fine-grained unmapping algorithm.

And it's probably possible to use OfferVirtualMemory for partial "unmapping" of memory on Windows, using the same technique as for PrefetchVirtualMemory.

That would fully address your concern and would benefit the mmap mode as well.

If so, then it makes sense to continue with the existing version, which reuses the mapping infrastructure.
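
For reference, a hedged sketch of that Windows idea (assuming it works out in practice): OfferVirtualMemory marks a committed, page-aligned private range as reusable by the OS, which is the closest analogue to partially releasing an anonymous mapping; the range must not be touched again unless it is reclaimed or recommitted.

// Hedged sketch: "unmapping" a page-aligned subrange of private memory on Windows
// by offering it back to the OS (Windows 8.1+).
#include <windows.h>

static bool offer_range(void * addr, size_t size) {
    return OfferVirtualMemory(addr, size, VmOfferPriorityLow) == ERROR_SUCCESS;
}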

@pavelfatin
Author

Alternatively, we may merge the initial version and then build upon it and tweak as we see fit.

@pavelfatin
Author

I have only so much time for the contribution, sorry. I would appreciate it if you could merge the pull request and add any further enhancements later. (But if it can't be merged, so be it.)

@pavelfatin
Author

I've added the fine-grained mapping/unmapping, including unmapping of memory on Windows. This is a quick implementation, which can be optimized further and needs more testing, but everything seems to work. So it seems possible to implement proper loading/unloading using the existing infrastructure.

@pavelfatin
Author

I've added optional per-tensor loading (with detailed progress indicator), which also fits well into the current scheme, with minimal modifications.

Without prefetch, the loading is slower but not by that much (by around 10%), because tensor data is relatively large.

@pavelfatin
Author

I previously reiterated the items to emphasize that there are real benefits to reusing the existing mmap infrastructure besides mere simplicity, because each of the items:

  • is either required or desirable,
  • is not available by default otherwise,
  • is provided by the mmap infrastructure "as is", for free.

Even if it were possible to jump through hoops and implement every item somehow differently, that doesn't mean there are no benefits to reusing mmap. Reusing mmap lets us satisfy all the requirements in an elegant and efficient way, and if it also means minimal modifications—so much the better. There's no goal of avoiding mmap as such—the corresponding implementation already exists and is used by default.

The fine-grained mapping/unmapping supports efficient partial offloading, including on Windows (and also benefits the regular mmap mode). Both per-tensor and per-context loading modes are supported.

So, the solution seems efficient, consistent, and sound.

Please review the updated proposal and consider merging the pull request.

@slaren
Member

slaren commented May 23, 2024

I do not agree that there are any real benefits to reusing the mmap infrastructure. The code is more complicated, and it has significant downsides that I have already discussed. I understand that the latest commit addresses some of them, but that comes at the expense of adding more complexity. I compared the (very bad) implementation that I shared with you before with the latest commit, and despite the very obvious inefficiencies it has, it achieves the same load time without offloading, and it is about 50% faster when offloading. I also think that it is a simpler implementation that fits better in the current infrastructure and would be easier to maintain in the future.

I made a brief comment about this earlier that may have been missed, but at least under Linux it is already possible to saturate the disk I/O with mmap by increasing the readahead size. Since we already have a way to achieve the same load performance, I do not think there is any urgency in merging this PR. It's absolutely fine if you are not interested in making the requested changes, but in that case I will withdraw from reviewing this PR.

@pavelfatin
Author

The thing with that example is that the I/O is not truly direct (zero-copy). It uses only half of the solution (anonymous mmap is the other half). Avoiding copying in the kernel only to then copy in user space defeats the purpose of direct I/O. I've tested that version on various files, and it shows only minor or no improvement over the standard mmap. That's not quite the same as the 3x speed-up.

I've tried the blockdev --setra 10240 /dev/sdx command (using echo 1 > /proc/sys/vm/drop_caches between the tests) and it doesn't seem to do much (~5%). Besides, it has to be run manually, requires administrative privileges, and affects the whole disk/system. That's not practical (and is not directly related).

The proposed solution is both efficient and practical, so it may be worth the effort.

That said, I've suggested the improvement, it's up to the project to decide.

@pavelfatin
Author

It could be that I've explained the reason for reusing mmap a bit vaguely, focusing on specific details. While there are many benefits to using anonymous mmap, the main reason is that this makes direct I/O possible to begin with. It lets us align both memory and reads and avoid copying. Anonymous mmap is not an arbitrary detail; it's an integral part of the solution and complements read_direct.

One of the goals of the existing mmap infrastructure is precisely to avoid copying. Mapping a file into memory and then copying that memory wouldn't make sense and would undermine one of the main benefits of mmap. The same is true for direct I/O. Implementing the required infrastructure from the ground up could have been complicated. But since such an infrastructure already exists, reusing it lets us achieve the goal in an elegant and consistent way.

In the case of GPU offloading and insufficient RAM, a file-backed mapping with nonselective preloading is not ideal. The OS doesn't know which parts must be loaded first and which parts must be paged out. This may work suboptimally in practice, and users may want to have enough RAM for GPU offloading anyway. Keeping the entire model mapped just for the first and last tensors is also not ideal and can increase pageouts. What's more, it's better to unmap tensors as soon as they are offloaded rather than wait for the entire offloading to complete. The fine-grained mapping/unmapping would let us load/unload the necessary parts explicitly and support this particular use case more thoroughly. It also benefits the regular mmap mode—currently for unmapping, but potentially also for preloading, using madvise(..., MADV_POPULATE_READ) in populate, which would fit perfectly into the existing scheme.
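
A small sketch of the populate side of that idea, assuming a new enough kernel: MADV_POPULATE_READ (Linux 5.14+) synchronously prefaults a range of a mapping, which is what a per-context populate step could call, with a weaker fallback for older systems.

// Sketch of prefaulting a mapped range with MADV_POPULATE_READ (Linux 5.14+),
// falling back to a plain prefetch hint where the flag is unavailable.
#include <sys/mman.h>

static bool populate_range(void * addr, size_t size) {
#ifdef MADV_POPULATE_READ
    if (madvise(addr, size, MADV_POPULATE_READ) == 0) return true;
#endif
    return posix_madvise(addr, size, POSIX_MADV_WILLNEED) == 0; // weaker, asynchronous hint
}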

When looking at the code, keep in mind that some improvements are more general and are not just for direct I/O. As for the direct I/O implementation:

  • it doesn't add extra libraries,
  • it doesn't add extra source files,
  • it doesn't add extra headers,
  • it doesn't modify the API,
  • it doesn't change the architecture in general,
  • it's consistent with the existing infrastructure,
  • it's relatively small and localized to model loading.

The --direct-io option is opt-in and can offer faster uncached loading, fewer pageouts, and no page cache pollution. This may benefit users who want to quickly load a model for the first time (or after some time), load several different models consecutively (including in scripts), or keep the page cache clean. The implementation is not that complicated, while the feature can make a real difference in many use cases. The option may become increasingly relevant as models become ever larger and storage devices become faster.

@pavelfatin
Author

Here's how we can use get_mapping_ranges and populate to implement per-context loading for regular mmap, with minimal modifications: f096f5d.

(The MADV_POPULATE_READ API is probably too new for now, but this shows that it would be fully compatible with the existing infrastructure.)

@pavelfatin
Author

I've rebased the changes on top of the latest master branch to resolve the merge conflict.

I'm open to amending the code as long as that preserves the performance improvement and keeps the direct I/O direct.

That said, the implementation appears balanced and reasonable. The anonymous mmap is somewhat non-obvious, but overall it makes sense. There had been an issue with excessive preloading, but it has now been fixed. The fine-grained preloading is a general improvement and is compatible with possible further enhancements.

There are essentially just two new snippets of code: read_direct and llama_anonymous_mmap, which are coherent and self-contained, so it should be relatively easy to maintain and refactor the code.

Although it's not absolutely essential, it's a really useful feature that actually makes a difference to the users. Many utilities, such as dd, losetup, and pv already support --direct-io, so the option would be familiar and consistent.

@Nexesenex
Contributor

Hello Pavel.
Could you rebase your PR up to the current master?
I'd be interested in using it in my fork of KoboldCPP.

@ggerganov ggerganov added the demo Demonstrate some concept or idea, not intended to be merged label Jul 24, 2024
--direct-io for bypassing page cache (and using THP on Linux)

Up to 3-6x faster uncached loading, fewer pageouts, no page cache pollution.
@pavelfatin
Author

I've rebased the changes on top of the latest master branch.

After ~4 months, it still seems that the feature is useful and the implementation is reasonable and effective.

What's more, it's compatible with the subsequently added and possible further enhancements, including:

  • Implement non-mapped async IO for CUDA on Windows.  #7896

    This would benefit from direct I/O (read_direct) and selective preloading (either per-context or per-tensor).

    And it would be possible to implement this for all three I/O modes (buffered, memory-mapped, direct).

  • WIP: Use DirectStorage with CUDA interop to more efficient load tensors #7796

    Direct I/O for CPU is direct-to-RAM; direct I/O for GPU is direct-to-VRAM. They behave similarly with respect to OS caching. Conceptually, they are the same feature, --direct-io.

    DirectStorage is only for GPUs and only for some hardware/software configurations. Besides, very large models are often loaded into RAM rather than VRAM. So the regular direct I/O is always good to have, both for direct "CPU I/O" and for more direct "GPU I/O" when DirectStorage is not available.

    This would also benefit from selective preloading.

These items are a logical continuation of supporting direct I/O, which establishes a good foundation, feature- and implementation-wise (the option is immediately useful; the functions are reusable; the code can adapt to future refactorings).

I've addressed the comments, explained the rationale, and clarified the technical details.

@slaren @ggerganov Please consider merging the PR so the effort won't go to waste.

@ggerganov
Member

@pavelfatin I don't plan to merge these changes. My understanding, based on the discussion here, is that there are alternative approaches to implement this that would be much easier to maintain and would bring most if not all the advantages of the proposed PR. Since the functionality is not critically important, I think it is better to investigate other simpler implementations.
