Congrats on the release! Some of the features are so cool they feel like black magic. Would it be possible to explain the key techniques behind them, or to provide a tutorial/demo so users can reproduce the claimed results?

<img width="870" height="553" alt="Image" src="https://github.com/user-attachments/assets/963fe7ab-6f2a-4dd3-b6ec-4e3076d45ae1" />

In particular, I am interested in this claim:

> Memory-efficient design: Train 200B MoE models on 64k sequence lengths without sequence parallelism through advanced memory optimization techniques

That sounds very challenging unless there is aggressive offloading and recomputation, in which case iteration speed may suffer.
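
For context, here is the rough arithmetic behind my concern. It is only a back-of-envelope sketch, not a description of how this project works. The 16 bytes per parameter figure is the usual mixed-precision Adam accounting (bf16 weights and gradients plus fp32 master weights and two Adam moments). The 80 GB device size, the hidden size of 8192, and reading "64k" as 65536 tokens are my own assumptions, and the function names are hypothetical.

```python
import math

# All constants below are assumptions for illustration, not measurements.
PARAMS = 200e9                 # total parameters, from the "200B MoE" claim
BYTES_PER_PARAM = 2 + 2 + 12   # bf16 weights + bf16 grads + fp32 master/Adam states
GPU_MEM_BYTES = 80e9           # assumed 80 GB accelerator
SEQ_LEN = 65536                # assumed meaning of "64k"
HIDDEN = 8192                  # hypothetical hidden size, not from the release notes


def model_state_bytes(params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory for weights, gradients, and optimizer states, before any sharding."""
    return params * bytes_per_param


def min_gpus_for_model_state(params: float, gpu_mem: float = GPU_MEM_BYTES) -> int:
    """Lower bound on devices if model state is perfectly sharded and nothing else is stored."""
    return math.ceil(model_state_bytes(params) / gpu_mem)


def activation_tensor_bytes(seq_len: int, hidden: int, bytes_per_elem: int = 2) -> int:
    """Size of ONE saved [seq_len, hidden] activation in bf16 for a single sequence."""
    return seq_len * hidden * bytes_per_elem


if __name__ == "__main__":
    print(f"model state: {model_state_bytes(PARAMS) / 1e12:.1f} TB")
    print(f"min GPUs for model state alone: {min_gpus_for_model_state(PARAMS)}")
    print(f"one saved activation: {activation_tensor_bytes(SEQ_LEN, HIDDEN) / 2**30:.1f} GiB")
```

Under these assumptions the model state alone is about 3.2 TB (at least 40 fully sharded 80 GB devices), and each saved `[seq_len, hidden]` tensor costs 1 GiB per layer for a single sequence. Without sequence parallelism, those activations stay on one device unless they are recomputed or offloaded, which is why I expect a throughput cost.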