Description
By default, Linux limits tmpfs to 50% of system RAM. This is normally a good thing, but the simple convert scripts write all tensor data to a temporary file, typically on tmpfs, before saving the output file, so conversion fails with this exception if the converted model is larger than half of system RAM (ref):
```
Traceback (most recent call last):
  File "/home/cebtenzzre/src/forks/llama.cpp/convert-baichuan-hf-to-gguf.py", line 279, in <module>
    gguf_writer.add_tensor(new_name, data)
  File "/home/cebtenzzre/src/forks/llama.cpp/gguf-py/gguf/gguf.py", line 622, in add_tensor
    tensor.tofile(self.temp_file)
OSError: Not enough free space to write 140247040 bytes
```
This is annoying. You can set `TMPDIR=/var/tmp` to work around this, but then you need twice as much free disk space as the size of the output file (once for the temp file and once for the final output) - which I don't always have.
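For what it's worth, the workaround works because Python's `tempfile` module honors `TMPDIR`:

```python
import tempfile

# with TMPDIR=/var/tmp in the environment this prints /var/tmp;
# otherwise it falls back to /tmp, which is tmpfs on many distros
print(tempfile.gettempdir())
```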
The smallest change that would fix this problem would be to provide a way to effectively disable `use_temp_file` on Linux, while still supporting e.g. `/var/tmp` if desired. That way, I could leverage 100% of my RAM, plus my swap space, to convert these models. If we choose this route, we should make sure each tensor's converted data is freed as soon as it is written, to avoid unnecessary swapping - right now it is not.
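For concreteness, here is a minimal sketch of what that staging logic could look like. `TensorStager` and its method names are hypothetical, not the current gguf-py API; the `SpooledTemporaryFile` branch mirrors what `add_tensor` already does today:

```python
import shutil
import tempfile

import numpy as np


class TensorStager:
    """Hypothetical sketch: stage tensor data in a temp file (current
    behavior) or directly in memory (the proposed opt-out)."""

    def __init__(self, use_temp_file: bool = True):
        self.use_temp_file = use_temp_file
        if use_temp_file:
            # spills to disk past max_size; honors TMPDIR=/var/tmp
            self.temp_file = tempfile.SpooledTemporaryFile(
                mode="w+b", max_size=256 * 1024 * 1024)
        else:
            self.tensors: list[np.ndarray] = []

    def add_tensor(self, data: np.ndarray) -> None:
        if self.use_temp_file:
            data.tofile(self.temp_file)
        else:
            self.tensors.append(data)

    def write_tensor_data(self, fout) -> None:
        if self.use_temp_file:
            self.temp_file.seek(0)
            shutil.copyfileobj(self.temp_file, fout)
            self.temp_file.close()
        else:
            # drop each buffer as soon as it is written, so memory is
            # reclaimed incrementally instead of swapping the whole model
            while self.tensors:
                self.tensors.pop(0).tofile(fout)
```

Keeping the temp-file path as the default would leave `/var/tmp` available on machines where RAM plus swap isn't enough.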
We can't just change the simple scripts to convert on-the-fly: GGUF stores all of the tensor metadata ahead of the tensor data, and since the scripts load one pytorch file at a time, collecting that metadata up front would mean a second pass over the input tensors, with a high I/O cost.
We could make convert.py's LazyUnpickler part of the gguf module. Lazy loading exposes tensor shapes and types without reading any data, so the metadata pass is nearly free and memory usage drops to roughly one tensor at a time instead of the entire model, while hopefully still keeping the non-LLaMA conversion scripts relatively simple.
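To illustrate why this helps, here is a simplified stand-in for the lazy-loading idea; `LazyTensor` and `write_tensors` are illustrative names, not convert.py's actual classes:

```python
from dataclasses import dataclass
from typing import BinaryIO, Callable

import numpy as np


@dataclass
class LazyTensor:
    # shape/dtype come from the pickle metadata; load() reads the
    # actual data from the checkpoint only when called
    shape: tuple[int, ...]
    dtype: np.dtype
    load: Callable[[], np.ndarray]


def write_tensors(fout: BinaryIO, tensors: dict[str, LazyTensor]) -> None:
    # metadata pass is nearly free: shapes and dtypes are known without
    # reading any tensor data, so the header can be written up front ...
    for name, t in tensors.items():
        info = f"{name}: shape={t.shape} dtype={t.dtype}\n"
        fout.write(info.encode())  # stand-in for the real GGUF tensor info
    # ... and the data pass materializes one tensor at a time, so peak
    # memory is roughly one tensor instead of the whole model
    for t in tensors.values():
        data = t.load()
        data.tofile(fout)
        del data  # free before loading the next tensor
```

In convert.py, `load()` ultimately reads the tensor's storage out of the checkpoint on demand, so neither a temp file nor a second full read of the input is needed.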
@ggerganov what do you think?