DantinoX: A modular, memory-efficient Transformer implementation in JAX/Flax NNX. Includes Sparse MoE, GQA, Sliding Window Attention, Gradient Accumulation, and Checkpointing.
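As a rough illustration of one of the listed features, here is a minimal sketch of grouped-query attention (GQA) in Flax NNX. The class and parameter names (`GroupedQueryAttention`, `d_model`, `num_kv_heads`) are illustrative assumptions, not DantinoX's actual API; the point is only that fewer key/value heads than query heads shrinks the KV projections (and KV cache) while keeping full query resolution.

```python
# Hypothetical GQA sketch in Flax NNX -- names/shapes are assumptions,
# not DantinoX's real interface.
import jax
import jax.numpy as jnp
from flax import nnx


class GroupedQueryAttention(nnx.Module):
    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int, *, rngs: nnx.Rngs):
        assert num_heads % num_kv_heads == 0
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = d_model // num_heads
        # Queries keep all heads; keys/values share fewer heads to save memory.
        self.q_proj = nnx.Linear(d_model, num_heads * self.head_dim, rngs=rngs)
        self.k_proj = nnx.Linear(d_model, num_kv_heads * self.head_dim, rngs=rngs)
        self.v_proj = nnx.Linear(d_model, num_kv_heads * self.head_dim, rngs=rngs)
        self.o_proj = nnx.Linear(num_heads * self.head_dim, d_model, rngs=rngs)

    def __call__(self, x: jnp.ndarray) -> jnp.ndarray:
        b, t, _ = x.shape
        q = self.q_proj(x).reshape(b, t, self.num_heads, self.head_dim)
        k = self.k_proj(x).reshape(b, t, self.num_kv_heads, self.head_dim)
        v = self.v_proj(x).reshape(b, t, self.num_kv_heads, self.head_dim)
        # Broadcast each KV head across its group of query heads.
        group = self.num_heads // self.num_kv_heads
        k = jnp.repeat(k, group, axis=2)
        v = jnp.repeat(v, group, axis=2)
        scores = jnp.einsum("bqhd,bkhd->bhqk", q, k) / jnp.sqrt(self.head_dim)
        # Causal mask: attend only to current and earlier positions.
        mask = jnp.tril(jnp.ones((t, t), dtype=bool))
        scores = jnp.where(mask, scores, -jnp.inf)
        attn = jax.nn.softmax(scores, axis=-1)
        out = jnp.einsum("bhqk,bkhd->bqhd", attn, v).reshape(b, t, -1)
        return self.o_proj(out)


# Usage: 8 query heads sharing 2 KV heads (a 4:1 group size).
layer = GroupedQueryAttention(256, num_heads=8, num_kv_heads=2, rngs=nnx.Rngs(0))
y = layer(jnp.ones((1, 16, 256)))
```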
Stratified LLM Subsets delivers diverse training data at 100K-1M sample scales across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning-distillation (Llama-Nemotron) sources. Embedding-based k-means clustering ensures maximum diversity across the five high-quality open datasets.
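For intuition, this is a minimal sketch of the embedding-based k-means step (plain Lloyd's algorithm) in JAX. The function name and parameters are assumptions for illustration, not the dataset's published pipeline.

```python
# Hypothetical k-means over document embeddings -- a sketch, not the
# Stratified LLM Subsets implementation.
import jax
import jax.numpy as jnp


def kmeans(embeddings: jnp.ndarray, k: int, steps: int = 20, seed: int = 0):
    """Lloyd's algorithm: returns (centroids, cluster assignments)."""
    key = jax.random.PRNGKey(seed)
    # Initialize centroids from k distinct embeddings.
    idx = jax.random.choice(key, embeddings.shape[0], (k,), replace=False)
    centroids = embeddings[idx]

    def step(centroids, _):
        # Assign each embedding to its nearest centroid (squared L2).
        d = jnp.sum((embeddings[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
        assign = jnp.argmin(d, axis=1)
        # Recompute each centroid as the mean of its assigned points.
        one_hot = jax.nn.one_hot(assign, k)
        counts = one_hot.sum(axis=0)[:, None]
        new_centroids = (one_hot.T @ embeddings) / jnp.maximum(counts, 1.0)
        return new_centroids, assign

    centroids, assigns = jax.lax.scan(step, centroids, None, length=steps)
    return centroids, assigns[-1]
```

Drawing a fixed quota of samples from every cluster then yields the stratified, diversity-maximizing subset the description refers to.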
🎥 Discover Vidar, a unified embodied video foundation model designed for low-resource environments that covers both video understanding and generation.