Corruption of files during system power loss or system crash #98360

Open
jadoc opened this issue Oct 20, 2024 · 1 comment · May be fixed by #98361

jadoc commented Oct 20, 2024

Tested versions

Reproducible in 4.3 stable and 4.4 at HEAD. Looking at the source history, I believe this bug has existed since at least 2014.

System information

Windows 11 and Ubuntu 24

Issue description

Godot's FileAccess is used both to save resources in the editor and, by game developers, to save game state. To reduce the risk of files being left in an intermediate state in the event of an error, FileAccess can write to a temporary file and then move that file over the existing one. This is the default behavior in the editor and in any game where OS.set_use_file_access_save_and_swap(true) is used. While this is good enough to protect against errors and crashes in Godot itself, it does not provide an atomic operation that protects against power loss or a crash of the operating system.

Before renaming the temporary file, it's essential to ensure that the newly written contents have actually been committed to the underlying storage and aren't still sitting in the OS buffers. Otherwise, the effects of the rename operation may be written to disk before the contents of the file. Power loss or OS crash during this state could leave a partially written file in place of the original, with no direct way to recover the original.

On POSIX systems, committing to the underlying storage can be accomplished with the fsync() system call. When called with a file descriptor, fsync() will block until all outstanding writes associated with that file descriptor have been acknowledged by the underlying storage device as being stable against power loss. On Windows, the equivalent of fsync() is FlushFileBuffers().

Note that fflush() is distinct from fsync(). The former operates between the process and the operating system, and the latter between the operating system and the storage device.
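To make the ordering concrete, here is a minimal Python sketch of the write, flush, sync, rename sequence described above. It is an illustration only and not Godot's FileAccess code or the change proposed in the linked pull request; the `atomic_write` name and the `.tmp-` prefix are made up for this example. Python's `os.fsync()` is documented to call the native fsync() on Unix and the MS _commit() function on Windows.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data`; hypothetical helper for illustration."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the rename
    # stays within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # process buffers -> operating system (like fflush)
            os.fsync(f.fileno())  # operating system -> storage device (like fsync)
        # Rename over the original only after the contents have been synced.
        os.replace(tmp_path, path)
    except BaseException:
        # Best-effort cleanup of the temporary file, then re-raise.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Dropping the `os.fsync()` line gives the pattern this issue describes: the rename may reach the disk before the file contents do.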

Examples of other libraries and applications properly using fsync() after writing to a temporary file but before renaming it:

I noticed this problem while scrolling on Reddit. One post told the story of how a project was ruined because their computer lost power while saving. Immediately upon reading, it struck me as a classic case of failing to sync the filesystem when attempting to do atomic writes. Looking at FileAccess, my suspicions were confirmed. With a trivial search, I was able to find another post where the exact same thing happened.

The response from other users to these posts is generally to admonish the poster to use source control. While using source control is important, they are missing that this issue isn't specific to the editor; it can corrupt game save files as well.

Steps to reproduce

VM Setup

Reproduction requires simulating a system failure. I did this with VMs in VirtualBox and a USB flash drive. Using a thumb drive slows down disk operations compared to my high speed internal SSD and makes it much easier to hit the race condition between writing the file contents and renaming the file. With VirtualBox, it's easy to pass a single USB thumb drive through to the guest operating system.

For Ubuntu, I formatted the drive as ext4.

For Windows, I formatted the thumb drive with NTFS. Additionally, I had to get the Windows guest operating system to treat the thumb drive like an internal hard disk rather than an external device that could be removed at any time, meaning writes should be cached by the operating system. This is done by opening Device Manager in the Windows guest, identifying the correct USB drive under Disk Drives, right clicking it and selecting Properties, going to the policy tab, and changing the Removal Policy from Quick removal to Better performance.

Ubuntu inside VirtualBox sometimes hung on boot after the hypervisor reset the guest. Resetting the guest again was effective in getting a good reboot.

I was unable to get 3D acceleration working in the Windows VM guest, so Godot was unable to initialize OpenGL in the MRP. To work around this, I hacked a simple command-line interface into the MRP that can be used in Godot's headless mode. Simply run godot --headless followed by one or more of the commands listed below.

MRP

The reproduction project provides a simple GUI that pseudo-randomly generates two different 100 MB files: file A and file B. Either file A or file B can then be copied to a third file: file C. Finally, file C can be compared against either file A or file B. Files A, B, and C are all placed in the project directory.
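For readers without the attachment, the generate and copy operations can be approximated with the rough Python sketch below. This is not the MRP's actual code (the MRP is a Godot project); the function names, the seeds, and the chunk size are assumptions chosen for illustration.

```python
import random
import shutil

CHUNK = 1024 * 1024  # write in 1 MiB pieces to keep memory use small


def generate_file(path: str, seed: int, size: int = 100 * 1024 * 1024) -> None:
    """Write `size` pseudo-random bytes to `path`, reproducibly from `seed`."""
    rng = random.Random(seed)
    remaining = size
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(CHUNK, remaining)
            # Draw n bytes' worth of random bits and serialize them.
            f.write(rng.getrandbits(8 * n).to_bytes(n, "little"))
            remaining -= n


def copy_file(src: str, dst: str) -> None:
    """Copy file contents; no explicit sync is performed here."""
    shutil.copyfile(src, dst)
```

Using a different seed for each of file A and file B makes their contents differ, so a later comparison can tell which of the two file C holds.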

  1. Place the MRP on the thumb drive and mount it in the guest.
  2. Launch the project and generate both files A and B. In headless mode, use the gen_a and gen_b commands.
  3. Push the button to copy file A to file C. For headless, use copy_a.
  4. On Ubuntu, run the fsync command. On Windows, wait 30 seconds.
  5. Push the button to copy file B to file C. For headless, use copy_b.
  6. When the interface indicates that the copy is complete, wait ~4 seconds. The exact time to wait will depend on the system and may take some tuning.
  7. When the designated waiting time has elapsed, immediately have the hypervisor reset the guest. In VirtualBox, this is done by pressing the Host+R key combination. It may be necessary to disable a warning dialog.
  8. After rebooting, run the MRP again.
  9. Compare file C to both file A and file B. File C should match either file A or file B. If file C matches neither file A nor file B, then it has been corrupted. For headless, the comparison can be done with cmp_a and cmp_b.
  10. If no corruption is found, repeat the process by copying whichever file doesn't match file C. Go to step 6.
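The check in step 9 boils down to a three-way classification. A small Python sketch of that logic follows; it is separate from the MRP's own cmp_a and cmp_b commands, and the `classify` name and its return values are invented for this example.

```python
import filecmp


def classify(file_c: str, file_a: str, file_b: str) -> str:
    """Report whether file C holds A's contents, B's contents, or neither."""
    # shallow=False forces a byte-by-byte comparison instead of
    # trusting file size and modification time.
    if filecmp.cmp(file_c, file_a, shallow=False):
        return "A"
    if filecmp.cmp(file_c, file_b, shallow=False):
        return "B"
    return "corrupted"
```

A result of "A" or "B" means the save-and-swap either completed or left the previous contents intact; "corrupted" is the failure this issue describes.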

On Ubuntu, I'm able to reproduce the corruption in file C in about 1 out of 4 tries. On Windows, I can reproduce the corruption on almost every try.

Minimal reproduction project (MRP)

godot-nosync-repro.zip

jadoc linked a pull request Oct 20, 2024 that will close this issue

jadoc commented Oct 20, 2024

#98361 is the proposed fix. With the change, I am unable to repeat the issue on either Windows or Ubuntu.
