
Make server resource file checksumming code memory-efficient #3924

Merged

@AlexTMjugador AlexTMjugador commented Jan 4, 2025

Problem description and proposed solution

At the moment, the server resource file checksum calculation code reads the entire file into a memory buffer, which is then handed to an MD5 hasher. The file is then copied to the HTTP cache directory by writing the buffer contents to a file in that directory. This algorithm is simple and works well for smaller resource files, but it has O(n) space complexity, meaning that it may cause heap or address space exhaustion for bigger files, such as IMG archives, textures, and sounds, and for resources with many files, which is a realistic concern, especially since #3203. In fact, Discord user Ben reported an out-of-memory server crash stemming from this code when trying to add big IMG archives to a resource.
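For illustration, the problematic pattern looks roughly like the sketch below. The function names are hypothetical, and a simple byte sum stands in for the MD5 hasher, since the point here is the memory profile rather than the hash algorithm: the whole file sits in `buffer` while it is hashed and again while it is written back out.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for the MD5 hasher: a simple byte sum.
std::uint64_t HashBuffer(const std::vector<char>& buffer)
{
    std::uint64_t sum = 0;
    for (char c : buffer)
        sum += static_cast<unsigned char>(c);
    return sum;
}

// O(n) space: the entire file lives in `buffer` while it is hashed
// and then written to the cache path.
std::uint64_t ChecksumAndCopyNaive(const std::string& src, const std::string& dst)
{
    std::ifstream   in(src, std::ios::binary);
    std::vector<char> buffer((std::istreambuf_iterator<char>(in)),
                             std::istreambuf_iterator<char>());

    std::uint64_t checksum = HashBuffer(buffer);

    std::ofstream out(dst, std::ios::binary);
    out.write(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    return checksum;
}
```

A multi-gigabyte IMG archive passed through this path means a multi-gigabyte allocation, which is exactly the failure mode reported.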

These changes give the algorithm constant space complexity by not reading the entire file into a buffer and using the `CChecksum::GenerateChecksumFromFile` method instead. Because no buffer with the file contents is available any longer, `FileSize` is used to compute the resource file size hint, and `FileCopy` is used to copy the file to the HTTP cache directory.
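The constant-space idea can be sketched as follows. This is not the actual `CChecksum::GenerateChecksumFromFile` or `FileSize` implementation; it is an illustrative version that streams the file through a small fixed buffer (a byte sum again stands in for MD5), so peak memory usage no longer depends on file size:

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Constant-space checksum: read the file through a fixed-size window,
// in the spirit of CChecksum::GenerateChecksumFromFile. The byte-sum
// hash is a stand-in for MD5 to keep the sketch self-contained.
std::uint64_t ChecksumFromFile(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    char          chunk[64 * 1024];            // fixed 64 KiB window, O(1) space
    std::uint64_t sum = 0;

    while (in.read(chunk, sizeof(chunk)) || in.gcount() > 0)
    {
        for (std::streamsize i = 0; i < in.gcount(); ++i)
            sum += static_cast<unsigned char>(chunk[i]);
    }
    return sum;
}

// Size hint without loading the file, analogous to what FileSize provides.
std::uint64_t FileSizeHint(const std::string& path)
{
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    return static_cast<std::uint64_t>(in.tellg());
}
```

The copy step works the same way: streaming the source into the destination chunk by chunk (as `FileCopy` does) rather than round-tripping through a whole-file buffer.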

Remarks

The new code should be much more scalable, but it has a higher constant overhead due to the additional file operations needed to fetch the file's size and copy it. Luckily, opening a file to get its size is a fairly cheap operation in most sane environments, and reading it again to make a copy should hit the OS disk cache in RAM, making it also fairly cheap. All in all, I couldn't measure a negative performance impact for the common case of small resource files on a local filesystem. (Of course, I could indeed measure a favorable difference in memory usage during resource load.)

In my view, a bigger potential concern would be the possibility of TOCTOU bugs introduced by the repeated operations over the same file: an external thread could race against the server thread to swap the contents of the file being checksummed after its checksum is computed, but before it's copied to the HTTP cache directory. After giving this concern some thought, I concluded that, fortunately, it's not a problem in this case, because clients would detect a checksum mismatch after downloading the file and refuse to deal with it, thus sidestepping any security concerns. (Conceptually, this is similar to what would already happen if the HTTP cache directory got corrupted.)
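The reason the race is harmless can be made concrete with a small sketch of the client-side check. The names below are illustrative, not MTA's actual API; the same byte-sum stand-in for MD5 is used as in the earlier sketches:

```cpp
#include <cstdint>
#include <string>

// The server advertises a checksum; the client recomputes it over the
// bytes it actually downloaded. If a racing writer swapped the file
// between server-side hashing and the HTTP cache copy, the two values
// differ and the client rejects the download.
bool ClientAcceptsDownload(std::uint64_t advertisedChecksum,
                           const std::string& downloadedBytes)
{
    std::uint64_t actual = 0;
    for (char c : downloadedBytes)
        actual += static_cast<unsigned char>(c);   // stand-in for MD5
    return actual == advertisedChecksum;
}
```

A swapped file can make the client reject a download, but it cannot make the client accept bytes that don't match the advertised checksum, which is what keeps the race a reliability issue rather than a security one.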

While implementing these changes, I've also simplified a needlessly verbose if-else chain in the client code related to resource file checksum checks, and renamed the `CResourceFile` field `m_uiFileSize` to `m_uiFileSizeHint` to emphasize that, due to TOCTOU races and/or HTTP cache directory poisoning, it may not always be accurate.

@botder botder merged commit cb90339 into multitheftauto:master Jan 4, 2025
6 checks passed
@botder botder added the enhancement New feature or request label Jan 4, 2025
@botder botder added this to the 1.6.1 milestone Jan 4, 2025