A distributed, POSIX-compatible filesystem optimized for PyTorch training workloads
TorchFS addresses the gap between general-purpose distributed storage and the sequential, read-heavy patterns of modern deep learning. By plugging directly into the PyTorch ecosystem via FUSE, TorchFS exposes a familiar POSIX interface while retooling its backend for training-centric performance and resilience.
Key design highlights:
- **Epoch-Aware Caching & Prefetching**: utilizes training-step hints to keep upcoming data locally hot, reducing per-epoch latency and sustaining GPU utilization.
- **Horizontal Scalability**: object-striped storage nodes (SNs) add bandwidth and capacity linearly, and can join the cluster by simply registering with the metadata service.
- **Metadata–Data Separation**: a Raft-based metadata cluster (MDNs) manages namespace operations independently of bulk I/O, preventing metadata contention from interrupting tensor streaming.
- **Erasure-Code Resilience**: Reed–Solomon encoding across k + m fragments delivers fault tolerance with low storage overhead, while client-side decode/encode hides node failures from training jobs.
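The prefetching idea can be sketched in a few lines of Python. This is an illustrative sketch only, not TorchFS's actual implementation: a background thread keeps a bounded queue of upcoming items warm, so the consumer (the training step) rarely stalls on I/O. The `depth` parameter here is a hypothetical stand-in for the training-step hint.

```python
import queue
import threading

def prefetched(iterable, depth=4):
    """Yield items from `iterable`, staying up to `depth` items ahead.

    Illustrative only: `depth` plays the role of the training-step hint
    that tells the cache how far ahead to stay.
    """
    q = queue.Queue(maxsize=depth)
    _DONE = object()  # sentinel marking end of iteration

    def producer():
        for item in iterable:
            q.put(item)  # blocks when the queue is full, bounding memory use
        q.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not _DONE:
        yield item

# Items arrive in their original order, but were fetched ahead of the consumer.
samples = list(prefetched(range(8), depth=2))
# samples == [0, 1, 2, 3, 4, 5, 6, 7]
```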
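The cost of a k + m scheme can be worked out with a little arithmetic: any k of the k + m fragments suffice to reconstruct an object, so the system survives m lost fragments at a storage overhead of (k + m)/k. The fragment counts below are hypothetical examples, not TorchFS defaults.

```python
def rs_profile(k: int, m: int) -> tuple[float, int]:
    """Storage overhead and fault tolerance of a Reed-Solomon k+m layout.

    k data fragments plus m parity fragments: any k fragments suffice
    to reconstruct, so up to m simultaneous fragment losses are survivable.
    """
    overhead = (k + m) / k  # stored bytes per logical byte
    tolerance = m           # fragment losses survivable
    return overhead, tolerance

# Hypothetical example: 8 data + 4 parity fragments.
overhead, tolerance = rs_profile(8, 4)
# overhead == 1.5, tolerance == 4 -- versus 5x storage for plain
# 5-way replication with the same 4-failure tolerance.
```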
For more detail on these design decisions, see the paper included in this repository.
- Install the required system packages

```bash
sudo apt update
sudo apt install -y \
    cmake \
    g++ \
    gcc \
    pkg-config \
    liberasurecode-dev \
    librocksdb-dev \
    libprotobuf-dev \
    libgflags-dev \
    libfuse-dev \
    libfuse3-dev \
    flex \
    bison
```

Also make sure you have vcpkg installed on your machine.
- Clone the repository

```bash
git clone https://github.com/danielsp45/torchFS.git
cd torchFS
```

- Install dependencies
```bash
bin/build
```

To run the project, first start the metadata service:

```bash
bin/metadata
```

Then start the storage nodes:

```bash
bin/storage
```

Finally, mount the filesystem:

```bash
bin/client
```
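Once mounted, TorchFS presents ordinary POSIX storage, so existing data-loading code can read from it unchanged. Below is a minimal sketch of a map-style dataset over files in a directory; it duck-types `torch.utils.data.Dataset` (`__len__`/`__getitem__`) without requiring torch itself, and the scratch directory stands in for whatever mount point `bin/client` uses (the real path depends on your configuration).

```python
import os
import tempfile

class FileListDataset:
    """Minimal map-style dataset over files under `root`.

    Duck-types torch.utils.data.Dataset, so it could be handed to a
    DataLoader unchanged; torch itself is not needed for this sketch.
    """
    def __init__(self, root):
        self.paths = sorted(
            os.path.join(root, name) for name in os.listdir(root)
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with open(self.paths[idx], "rb") as f:
            return f.read()

# Demo against a scratch directory standing in for the TorchFS mount point.
demo = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(demo, f"sample{i}.bin"), "wb") as f:
        f.write(bytes([i]))

ds = FileListDataset(demo)
# len(ds) == 3; ds[0] == b"\x00"
```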