A GPipe implementation in PyTorch
Updated Jul 25, 2024 - Python
An I/O benchmark for deep learning applications
Extending DOLFINx with checkpointing functionality
Keras wrapper that autosaves what ModelCheckpoint cannot.
A Python package for checkpointing, saving, and loading objects.
A Python package for performing memory-intensive computations in parallel using chunks and checkpointing.
Code and tutorial on integrating wandb sweeps with Slurm pre-emption
Robust distributed checkpointing and job management system for multi-GPU SLURM workloads
Work in progress
A digital album face recognition manager that isolates images of a specified person from a digital album.
Offline, fail-closed verifier for JSONL telemetry event logs. Emits deterministic audit certificates + human summaries with explicit claims/non-claims for bottleneck and integrity review.
Currently exploring Generative AI to deepen my understanding and skills within web development. Focused on learning how to integrate GenAI into real-world applications and solve practical problems through intelligent automation.
Automatic checkpointing and job resubmission system for robust LLM training on Slurm-based HPC clusters. Collaboration with @vulus98
A criticality-aware H.264 encoder simulation that models how different encoding blocks can be protected with resilience strategies such as retries, ECC, TMR, and checkpointing. This project demonstrates how fault-tolerant design principles can be applied to video compression pipelines.
A Multi-Hop Retrieval Augmented Generation (RAG) system with Multi-Agent LangGraph workflow for intelligent educational analytics. Features real-time workflow visualization, Postgres vector embeddings, and checkpoint-based resumption for robust query processing.
Fault-tolerant distributed training framework with async checkpointing for LLMs