Direct I/O and Transparent HugePages #7420
Conversation
Am I reading this right? It only took 7s to load? Increased speed by a factor of 3 to 7x?
Yes, that's correct. (The exact numbers obviously depend on the configuration.)
I am not sure that I understand the purpose of anonymous mmap + direct I/O. Is it just to enable huge pages? If so, does that really need to be done through an anonymous mmap? Why not use a normal buffer instead? Is it necessary to pre-populate the entire model when using huge pages in this way? In any case, I think it is confusing to do this under the flag of mmap.
On another note, I found that the readahead size has a huge effect on the performance of MAP_POPULATE. I use `blockdev --setra 10240 /dev/sdx` on startup, and this allows me to load the model at 7 GB/s from disk. With the default value, I only get about 2 GB/s.
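For illustration, here is a hedged sketch of per-file hints that serve a similar purpose from inside a process (plain POSIX calls; this is an aside, not something the PR does, and `blockdev --setra` remains the way to tune device-level readahead):

```cpp
// Illustrative only: per-file read-ahead hints, assuming Linux/POSIX.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st{};
    fstat(fd, &st);

    // Hint that the file will be read sequentially (encourages larger readahead).
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    // The same idea for a mapping: advise sequential access and prefetching.
    void * addr = mmap(nullptr, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr != MAP_FAILED) {
        madvise(addr, (size_t) st.st_size, MADV_SEQUENTIAL);
        madvise(addr, (size_t) st.st_size, MADV_WILLNEED);
        munmap(addr, (size_t) st.st_size);
    }
    close(fd);
    return 0;
}
```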
Compared to file-backed mappings, anonymous mappings:
So, the speed-up is not the only improvement.
Compared to the buffers, `llama_mmap`:
The entire model must be preloaded in essentially one step, one way or another (regardless of THP). When it comes to synchronous reads, it's better to use a very large buffer and coalesce reads. However, the maximum read length is effectively 2 or 4 GB (API limit), so, in contrast to file-backed mmap, it would be possible to subsequently implement the progress callback.
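As a hedged illustration of the per-call size limit mentioned above: a large synchronous read can be split into capped chunks, which also makes a progress callback straightforward. The helper below is a sketch with made-up names, not code from this PR.

```cpp
// Hedged sketch: a large synchronous read split into chunks below the per-call
// limit, with a progress callback. The names and the chunk size are illustrative.
#include <unistd.h>
#include <cstddef>
#include <cstdint>
#include <functional>

bool read_all(int fd, void * dst, size_t size, off_t off,
              const std::function<void(float)> & progress) {
    const size_t max_chunk = 1u << 30; // 1 GiB, well below the ~2 GB read() limit
    size_t done = 0;
    while (done < size) {
        size_t  want = size - done < max_chunk ? size - done : max_chunk;
        ssize_t n    = pread(fd, (uint8_t *) dst + done, want, off + (off_t) done);
        if (n <= 0) {
            return false; // error or unexpected end of file
        }
        done += (size_t) n;
        if (progress) {
            progress((float) done / (float) size); // per-chunk progress reporting
        }
    }
    return true;
}
```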
Anonymous mmap is technically also mmap. And, practically, this lets us reuse the existing infrastructure for mappings with minimal modifications. (Thank you for the suggestions! I've addressed the review comments.)
Technically it is mmap, but in this case mmap is only being used as a memory allocation mechanism, and doesn't provide any of the advantages that users expect from memory-mapped files. At this point, it becomes an implementation detail that is not really relevant to the user. It could just as well be replaced with a plain memory allocation.

I think a better way to support this might be to read each tensor into a page-aligned buffer and memcpy it to the backend buffer. The tensors are typically large enough that it should be able to achieve full I/O throughput, and the additional memcpy is unlikely to affect the overall performance significantly. What do you think? It would also be possible to modify the buffer/tensor allocation to ensure that all the tensor addresses are page-aligned.
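A minimal sketch of the approach suggested here, assuming Linux, a file opened with `O_RDONLY | O_DIRECT`, and a hypothetical helper name (this is not llama.cpp API):

```cpp
#include <unistd.h>
#include <cstdlib>
#include <cstring>
#include <cstdint>
#include <stdexcept>

// fd is assumed to have been opened with O_RDONLY | O_DIRECT.
// Reads `size` bytes at file offset `off` into `dst` via a page-aligned bounce buffer.
static void read_tensor_direct(int fd, void * dst, size_t size, off_t off) {
    const size_t align = (size_t) sysconf(_SC_PAGESIZE);

    // O_DIRECT requires the offset, length, and buffer address to be aligned.
    const off_t  aligned_off = off - (off % (off_t) align);
    const size_t head        = (size_t) (off - aligned_off);
    const size_t aligned_len = (size + head + align - 1) / align * align;

    void * buf = nullptr;
    if (posix_memalign(&buf, align, aligned_len) != 0) {
        throw std::runtime_error("allocation failed");
    }

    ssize_t n = pread(fd, buf, aligned_len, aligned_off);
    if (n < 0 || (size_t) n < head + size) {
        free(buf);
        throw std::runtime_error("read failed");
    }

    // The extra copy into the (possibly unaligned) destination buffer.
    memcpy(dst, (const uint8_t *) buf + head, size);
    free(buf);
}
```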
To the user, the mmap call (rather than the option) is an implementation detail, in the same way mmap inside malloc is an implementation detail. The file-backed mode is still available and is used by default.
For that particular task, it's not ideal, because it makes inference impractically slow, so there's the mlock workaround. (However, that's somewhat irrelevant—each mode has pros and cons.)
The data in the file is not aligned.
It seems that reusing the existing infrastructure for mappings is the most straightforward, consistent, and technically sound way to implement the `--direct-io` option.
Thanks, I was confused about the way the
The most common use case where users have less system memory than the model size is when offloading the model to a GPU. In this case it is still desirable to support direct I/O to improve loading time, but loading the entire model to a buffer in one step prevents this.
If there is a reasonable improvement in performance, the tensor alignment of GGUF files can be increased. Generally I would agree with you that reusing the existing infrastructure makes sense, but in this case it comes with a significant disadvantage.
It's required for direct I/O. The page size is dynamic. (And there's just no need to complicate things, given that there's already a simple and effective solution.)
It's not exactly in one step—the GPU context is loaded first and then unmapped, freeing the memory.
It actually fits the task perfectly, because it:
So we can both meet all the requirements and use minimal modifications—a win-win.
It's not really necessary for direct I/O; it is only necessary to avoid an additional memory copy (you can read the tensor starting at the page boundary and discard the irrelevant part of the first page later).
That's not the case: each file is only mapped once, and from this mapping it is copied to the GPU buffer. At the end, portions of the file that are not necessary on the CPU are unmapped if possible (not on Windows). But you cannot rely on the

I already addressed your copy-pasted list of reasons, and I already pointed out that the only real advantage there is that it can reuse the existing infrastructure. However, the amount of code that this is saving is not really significant, and the disadvantages are big enough to be deal breakers.
Yes, but then the I/O is not truly direct (zero-copy). And that's more complicated.
Every file/memory region is mapped once. However, while file-based mappings are populated completely (on mapping), anonymous mappings are populated selectively, per-context—if you load a model with GPU offloading, there's actually a two-stage progress (but that may depend on the model).

Each option has pros and cons and can provide benefits for particular use cases. What's more, each option is probably not ideal in some absolute sense and can be improved. There are various caveats to both the mmap and no-mmap modes. That's how software is built—there's always a balance between complexity and functionality.

Even if it's not absolutely perfect, there seems to be no reason to discard the work that's already been done. We can use the first version to try the concept in practice, get user feedback, and build upon the initial implementation.
I am not asking you to discard what has been done; I think this is great work in the right direction, but the implementation needs changes. I made a quick and very bad implementation in e9095e6 to test loading each tensor separately with multiple threads, and for me this achieves the same I/O throughput and overall reduces the load time. Please take a look and consider using a similar approach.
It seems that it should be possible to implement a more fine-grained unmapping algorithm, which would fully address your concern and would benefit the mmap mode as well. If so, then it makes sense to continue with the existing version, which reuses the mapping infrastructure.
Alternatively, we may merge the initial version and then build upon it and tweak as we see fit.
I have only so much time for this contribution, sorry. I would appreciate it if you could merge the pull request and add any further enhancements later. (But if it can't be merged, so be it.)
I've added the fine-grained mapping/unmapping and unmapping of memory on Windows. This is a quick implementation, which can be optimized further and needs more testing, but everything seems to work. So it seems possible to implement proper loading/unloading using the existing infrastructure.
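For reference, the general technique behind fine-grained unmapping on Linux can be sketched as follows (illustrative only; the function name and bookkeeping are assumptions, not the PR's actual code):

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// `base` is the start of an existing mapping; `first`/`last` are byte offsets
// into it. Releases only the pages that are fully covered by [first, last).
static void unmap_range(uint8_t * base, size_t first, size_t last) {
    const size_t page = (size_t) sysconf(_SC_PAGESIZE);

    const size_t lo = (first + page - 1) & ~(page - 1); // round up to a page boundary
    const size_t hi = last & ~(page - 1);               // round down to a page boundary

    if (hi > lo) {
        // On Linux, munmap() may be applied to a page-aligned subrange of a mapping.
        munmap(base + lo, hi - lo);
    }
}
```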
I've added optional per-tensor loading (with a detailed progress indicator), which also fits well into the current scheme, with minimal modifications. Without prefetch, the loading is slower but not by that much (by around 10%), because tensor data is relatively large.
I previously reiterated the items to emphasize that there are real benefits to reusing the existing mmap infrastructure besides mere simplicity, because each of the items:
Even if it were possible to jump through hoops and implement every item somehow differently, this doesn't mean that there are no benefits to reusing mmap. Reusing mmap lets us satisfy all the requirements in an elegant and efficient way, and if it also lets us use minimal modifications—so much the better. There's no goal of avoiding mmap as such—the corresponding implementation already exists and is used by default.

The fine-grained mapping/unmapping supports efficient partial offloading, including on Windows (and also benefits the regular mmap mode). Both per-tensor and per-context loading modes are supported. So, the solution seems efficient, consistent, and sound. Please review the updated proposal and consider merging the pull request.
I do not agree that there are any real benefits to reusing the mmap infrastructure. The code is more complicated, and it has significant downsides that I have already discussed. I understand that the latest commit addresses some of them, but that comes at the expense of adding more complexity.

I compared the (very bad) implementation that I shared with you before with the latest commit, and despite the very obvious inefficiencies that it has, it achieves the same load time without offloading, and it is about 50% faster when offloading. I also think that it is a simpler implementation that fits better in the current infrastructure and would be easier to maintain in the future.

I made a brief comment about this earlier that may have been missed, but at least under Linux it is already possible to saturate the disk I/O with mmap by increasing the readahead size. Since we already have a way to achieve the same load performance, I do not think there is any urgency in merging this PR. It's absolutely fine if you are not interested in making the requested changes, but in that case I will withdraw from reviewing this PR.
The thing with the example is that the I/O is not truly direct (zero-copy). It uses only half of the solution (and anonymous mmap is the second half). Avoiding copying in the kernel only to then copy in user space defeats the purpose of direct I/O. I've tested that version on various files, and it shows only minor or no improvement over the standard mmap. That's not quite the same as the 3x speed-up.

The proposed solution is both efficient and practical, so it may be worth the effort. That said, I've suggested the improvement; it's up to the project to decide.
It could be that I've explained the reason for reusing mmap a bit vaguely, focusing on specific details. While there are many benefits to using anonymous mmap, the main reason is that it makes direct I/O possible to begin with. It lets us align both memory and reads and avoid copying. Anonymous mmap is not an arbitrary detail; it's an integral part of the solution and complements the direct reads.

One of the goals of the existing mmap infrastructure is precisely to avoid copying. Mapping a file into memory and then copying that memory wouldn't make sense and would undermine one of the main benefits of mmap. The same is true for direct I/O. Implementing the required infrastructure from the ground up could have been complicated. But since such an infrastructure already exists, reusing it lets us achieve the goal in an elegant and consistent way.

In the case of GPU offloading and insufficient RAM, a file-based mapping with nonselective preloading is not ideal. The OS doesn't know which parts must be loaded first and which parts must be paged out. This may work suboptimally in practice, and users may want to have enough RAM for GPU offloading anyway. Keeping the entire model mapped just for the first and last tensors is also not ideal and can increase pageouts. What's more, it's better to unmap tensors as soon as they are offloaded rather than wait for the entire offloading to complete. The fine-grained mapping/unmapping would let us load/unload the necessary parts explicitly and support this particular use case more thoroughly. This also benefits the regular mmap mode, currently in unmapping, but potentially also in preloading.

When looking at the code, keep in mind that some improvements are more general and are not just for direct I/O. As for the direct I/O implementation:
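To make the idea concrete, here is a hedged, self-contained sketch of the core combination described above (an anonymous mapping hinted for THP, filled with `O_DIRECT` reads), assuming Linux; all names, sizes, and error handling are illustrative rather than the PR's actual code:

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1 // for O_DIRECT on Linux/glibc
#endif
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    // Bypass the page cache.
    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st{};
    fstat(fd, &st);

    const size_t page = (size_t) sysconf(_SC_PAGESIZE);
    const size_t len  = ((size_t) st.st_size + page - 1) & ~(page - 1);

    // An anonymous mapping is page-aligned by construction, so it satisfies the
    // O_DIRECT alignment requirements, and it can be hinted for huge pages.
    uint8_t * buf = (uint8_t *) mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    madvise(buf, len, MADV_HUGEPAGE); // request transparent huge pages (best effort)

    // Direct reads go from the device straight into the mapping, no page cache.
    size_t done = 0;
    while (done < len) {
        ssize_t n = pread(fd, buf + done, len - done, (off_t) done);
        if (n <= 0) { break; } // EOF (the rounded-up tail stays zero-filled) or error
        done += (size_t) n;
    }
    printf("loaded %zu of %lld bytes\n", done, (long long) st.st_size);

    munmap(buf, len);
    close(fd);
    return 0;
}
```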
I've rebased the changes on top of the latest master branch to resolve the merge conflict. I'm open to amending the code as long as that preserves the performance improvement and keeps the direct I/O direct.

That said, the implementation appears balanced and reasonable. The anonymous mmap is a bit unobvious, but overall it does make sense. There had been an issue with excessive preloading, but it has now been fixed. The fine-grained preloading is a general improvement and is compatible with possible further enhancements. There are essentially just two new snippets of code:

Although it's not absolutely essential, it's a really useful feature that actually makes a difference to users. Many utilities, such as
Hello Pavel.
`--direct-io` for bypassing page cache (and using THP on Linux). Up to 3-6x faster uncached loading, fewer pageouts, no page cache pollution.
I've rebased the changes on top of the latest master branch. After ~4 months, it still seems that the feature is useful and the implementation is reasonable and effective. What's more, it's compatible with the subsequently added enhancements and with possible further ones, including:
These items are a logical continuation of supporting direct I/O, which establishes a good foundation, feature- and implementation-wise (the option is immediately useful; the functions are reusable; the code can adapt to future refactorings). I've addressed the comments, explained the rationale, and clarified the technical details. @slaren @ggerganov Please consider merging the PR so the effort won't go to waste.
@pavelfatin I don't plan to merge these changes. My understanding, based on the discussion here, is that there are alternative approaches to implement this that would be much easier to maintain and would bring most, if not all, of the advantages of the proposed PR. Since the functionality is not critically important, I think it is better to investigate other, simpler implementations.
Abstract:
`--direct-io` for bypassing page cache (and using THP on Linux): up to 3-6x faster uncached loading, fewer pageouts, no page cache pollution.

Using LLMs involves large data files: between Llama 3 70B and Mixtral 8x22B, loading 50-100 GB models is now quite common. This requires not just fast CPUs/GPUs and ample RAM, but also fast storage devices.
Fortunately, consumer PCIe 4.0 NVMe SSDs offer read speeds up to 7.5 GB/s, and PCIe 5.0 SSDs—up to 14.5 GB/s—no fancy RAID arrays required! So loading an 80 GB model should take just 5 seconds—problem solved. Right? Not so fast. If you look at `iotop`, the numbers are different from those in CrystalDiskMark screenshots. There are several reasons for that.

Llama.cpp is not a typical application because it:
Since operating systems don't normally optimize for this use case, there are two bottlenecks:
Buffered I/O
Regular I/O is buffered and goes through the page cache. This works well for typical workloads—small random repeated reads. But in this case, it's not ideal because:
Memory-mapped I/O
One way to address these shortcomings is to use memory-mapped I/O. The advantages are:
However, memory-mapped I/O is also optimized primarily for small random repeated reads, and it also has some drawbacks:
Direct I/O
Another way is to employ direct I/O, which bypasses the page cache entirely. The advantages are:
In contrast to regular disk benchmarks (CrystalDiskMark or `fio`), the program not only reads data but also allocates memory for that data, which can take as much time as reading when using 4K pages. This is not an issue if reading is relatively slow, but if reading is fast, allocation is a significant bottleneck. Linux supports transparent huge pages, which speed up memory allocation and provide an additional speed boost. THP is enabled for anonymous memory mappings via `madvise`, which works with the default system configuration. (THP also seems to increase inference speed by 1% as an added bonus.)

The only drawback compared to memory-mapped I/O is that there's no caching between consecutive program runs when loading the same file. (However, even with memory-mapped I/O, program initialization takes some time, and there may be GPU offloading, so it's not instantaneous. Because both methods are fast in absolute terms, in practice, direct I/O can feel almost as fast as pre-cached memory-mapped I/O.)
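As a small, hedged aside, the system-wide THP policy referred to above can be inspected programmatically (Linux only; this snippet is illustrative and not part of the PR):

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // The bracketed entry is the active policy, e.g. "always [madvise] never".
    std::ifstream f("/sys/kernel/mm/transparent_hugepage/enabled");
    std::string mode;
    if (f && std::getline(f, mode)) {
        std::cout << "THP policy: " << mode << "\n";
    } else {
        std::cout << "THP policy not available (not Linux?)\n";
    }
    return 0;
}
```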
Use cases
Memory-mapped I/O lets you load an already cached file faster. But if you load a file for the first time (or after some time), load several different files consecutively, or simply want to keep the page cache clean, you may benefit from direct I/O.
The exact gain depends on multiple factors, including:
Technical details
In general, using direct I/O is more complicated because it requires alignment. However, in this case, the implementation fits the existing infrastructure well: introducing anonymous mappings allows aligning the data and coalescing reads. Non-tensor data is read using buffered I/O and so is cached.
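A hedged sketch of the read-coalescing idea mentioned above (the structures and the gap threshold are illustrative assumptions, not the PR's implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct range { size_t off; size_t len; };

// Merge ranges whose gap is at most `max_gap` bytes into single larger reads.
static std::vector<range> coalesce(std::vector<range> ranges, size_t max_gap) {
    std::sort(ranges.begin(), ranges.end(),
              [](const range & a, const range & b) { return a.off < b.off; });

    std::vector<range> out;
    for (const range & r : ranges) {
        if (!out.empty() && r.off <= out.back().off + out.back().len + max_gap) {
            // Extend the previous read to cover this range as well.
            const size_t end = std::max(out.back().off + out.back().len, r.off + r.len);
            out.back().len = end - out.back().off;
        } else {
            out.push_back(r);
        }
    }
    return out;
}
```

For instance, `coalesce({{0, 4096}, {4096, 4096}}, 0)` yields a single `{0, 8192}` read.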
Direct I/O is supported on Linux, Windows, macOS, and FreeBSD. THP is supported on Linux. It's verified that the code compiles and runs on Linux, Windows, and macOS.
To be on the safe side, the `--direct-io` option is currently opt-in, but it can subsequently become the default in `--no-mmap` mode (with a complementary `--no-direct-io`).

Benchmarks
When measuring the effect, note that:
- Use `echo 1 > /proc/sys/vm/drop_caches` (and possibly `echo 1 > /proc/sys/vm/compact_memory`) on Linux, and `purge` on macOS, or simply reboot the machine. However, you may also test the case when the page cache is completely filled by a different file.
- (`ccache` before the recompilation.)
- Use `iotop` (or a similar utility) to estimate the reading speed directly.

Configuration: Ryzen 9 7950X, 94 GB DDR5-6400, Crucial T705 2TB, Ubuntu 22.04.4, Linux 6.9.1, EXT4.
Model: Mixtral 8x22B Instruct Q4_K_M (85 GB).
The measurements were taken multiple times, and the average was computed.
- `--no-mmap`, polluted page cache
- `--no-mmap`, clean page cache
- `--direct-io`