Make server resource file checksumming code memory-efficient #3924
Problem description and proposed solution
At the moment, the server resource file checksum calculation code reads the entire file into a memory buffer, which is handed over to an MD5 hasher. The file is then copied to the HTTP cache directory by writing the buffer contents to a file in that directory. This algorithm is simple and works well for smaller resource files, but it has O(n) space complexity, meaning it may cause heap or address space exhaustion for bigger files, such as IMG archives, textures and sounds, and for resources with many files. This is a realistic concern, especially since #3203: in fact, Discord user Ben reported an out-of-memory server crash stemming from this code when trying to add big IMG archives to a resource.
These changes give the algorithm constant space complexity by not reading the entire file into a buffer and using the CChecksum::GenerateChecksumFromFile method instead. Because no buffer with the file contents is available any longer, FileSize is used to compute the resource file size hint, and FileCopy is used to copy the file to the HTTP cache directory.
Remarks
The new code should be much more scalable, but it has a higher constant overhead due to the additional file operations for fetching the file size and copying the file. Luckily, opening a file to get its size is a fairly cheap operation in most sane environments, and reading it again to make a copy should hit the OS disk cache in RAM, making it also fairly cheap. All in all, I couldn't measure a negative performance impact for the common case of small resource files on a local filesystem. (Of course, I could indeed measure a favorable difference in memory usage during resource load.)
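As a concrete illustration of the constant-memory approach, here is a minimal sketch of streaming a file through an incremental hasher in fixed-size chunks. The ToyHasher below is a hypothetical stand-in (a simple FNV-1a accumulator), not MTA's actual CChecksum/MD5 implementation; the point is the update pattern, which keeps memory usage at O(1) regardless of file size.

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

// Stand-in for an incremental hasher. The real code would use an MD5
// context (e.g. CChecksum), which supports the same Update/Final pattern.
struct ToyHasher {
    std::uint64_t state = 1469598103934665603ull;  // FNV-1a offset basis
    void Update(const unsigned char* data, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i) {
            state ^= data[i];
            state *= 1099511628211ull;  // FNV-1a prime
        }
    }
    std::uint64_t Final() const { return state; }
};

// Streams the file through the hasher in fixed-size chunks, so memory
// usage stays constant no matter how big the file is.
std::uint64_t ChecksumFromFile(const char* path) {
    std::ifstream file(path, std::ios::binary);
    ToyHasher hasher;
    std::vector<unsigned char> chunk(64 * 1024);  // 64 KiB scratch buffer
    while (file) {
        file.read(reinterpret_cast<char*>(chunk.data()),
                  static_cast<std::streamsize>(chunk.size()));
        std::streamsize got = file.gcount();
        if (got > 0)
            hasher.Update(chunk.data(), static_cast<std::size_t>(got));
    }
    return hasher.Final();
}
```

Since only one chunk is resident at a time, the buffer size trades syscall count against memory and can stay small even for multi-gigabyte IMG archives.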
In my view, a bigger potential concern would be the possibility of TOCTOU bugs introduced by the repeated operations over the same file: an external thread could race against the server thread to swap the contents of the file being checksummed after the checksum is computed, but before the file is copied to the HTTP cache directory. After giving this concern some thought, I concluded that, fortunately, it's not a problem in this case, because clients would detect a checksum mismatch after downloading the file, refusing to deal with it and thus sidestepping any security concerns. (Conceptually, this is similar to what would already happen if the HTTP cache directory gets corrupted.)
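The argument above can be replayed sequentially in a small sketch. Everything here is hypothetical and illustrative: the file names, WriteFile/ReadFile helpers, and the toy FNV-1a Checksum are stand-ins for the real MD5-based machinery. The race is simulated by swapping the file between the hash step and the copy step; the client-side verification still catches it.

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

// Tiny stand-in checksum over a file's bytes (the real code uses MD5).
std::uint64_t Checksum(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    std::uint64_t h = 1469598103934665603ull;
    char c;
    while (file.get(c)) {
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ull;
    }
    return h;
}

void WriteFile(const std::string& path, const std::string& data) {
    std::ofstream(path, std::ios::binary) << data;
}

std::string ReadFile(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    std::stringstream ss;
    ss << file.rdbuf();
    return ss.str();
}

// Replays the TOCTOU sequence: hash, swap, copy. The swapped bytes end up
// in the cache, but the advertised checksum no longer matches them.
bool ClientDetectsSwap() {
    WriteFile("resource.dat", "original contents");
    std::uint64_t advertised = Checksum("resource.dat");    // server hashes the file
    WriteFile("resource.dat", "swapped contents");          // racing writer replaces it
    WriteFile("http-cache.dat", ReadFile("resource.dat"));  // server copies to the cache
    bool mismatch = Checksum("http-cache.dat") != advertised;
    std::remove("resource.dat");
    std::remove("http-cache.dat");
    return mismatch;  // true: the client would refuse the downloaded file
}
```

The swap is never silently accepted; the worst outcome is a failed download, which matches the "corrupted cache directory" failure mode that already exists.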
While implementing these changes, I've also simplified a needlessly verbose if-else chain in the client code related to resource file checksum checks, and renamed the CResourceFile field m_uiFileSize to m_uiFileSizeHint to emphasize that, due to TOCTOU races and/or HTTP cache directory poisoning, it may not always be accurate.