[Vision] Toward Dynamo 2.0

# Dynamo vision for 2.0

Hi Dynamo developers!

Apologies for the long delay, but we wanted to share a roadmap update. After delivering Dynamo 1.0 at GTC '26, we took a short breather before returning to full development mode.

With Dynamo 1.0, we focused on delivering performant, production-grade LLM inference serving. We also previewed early multimodality and diffusion support, which gave us a foundation to expand Dynamo beyond traditional LLM serving.
  
<img width="1416" height="485" alt="Image" src="https://github.com/user-attachments/assets/8014afea-e0dc-427a-b77c-8a0936500c5e" />

As we march toward Dynamo 2.0 in 2026, we will expand our coverage to include all GPU usage except pre-training.

<img width="1445" height="600" alt="Image" src="https://github.com/user-attachments/assets/bf0ac3c8-9210-4cf7-8800-dc149e218817" />

That broader scope has three major implications. Dynamo will bring agentic capabilities to the LLM and multimodal support delivered last year, extend non-LLM use cases beyond multimodal and diffusion, and serve as a performant and resilient rollout engine for RL frameworks.

For non-LLMs, we will focus on enabling key use cases such as generative recommendation and voice. We also know that many bespoke non-LLM models may not be supported by popular inference engines, so we recently created a Dynamo component called [AITune](https://github.com/ai-dynamo/aitune) to help any PyTorch model run performantly with Dynamo. AITune automatically finds the best-performing backend (TorchAO, TorchInductor, TRT, etc.) and lowers precision with only a few lines of code. This is especially helpful when hundreds of bespoke PyTorch models need to be systematically tuned.

Together, these investments point toward a clear goal: comprehensive support for LLM inference, non-LLM inference, and RL post-training by March 2027.

## Top 5 priorities for Dynamo

Dynamo will focus on the following prioritized workstreams. These priorities are not meant to cover the entire scope, but they are the areas where we believe focused execution will have the highest impact:

* Performance
* Agents
* RL post-training
* Omni 
* Heterogeneous hardware support

In the following sections, we will describe the general approach for each workstream, starting with the major use cases and then connecting them back to cross-cutting initiatives like performance and heterogeneous hardware support. Performance will remain Dynamo's #1 value proposition across all of these efforts.

### Agents
Recently, Dynamo enabled developers to pass [agentic hints](https://docs.nvidia.com/dynamo/user-guides/agents), such as priority, predicted OSL, and latency sensitivity, through nvext (API extension). These hints guide Dynamo to schedule more effectively for improved performance, as described in this [blog article](https://developer.nvidia.com/blog/full-stack-optimizations-for-agentic-inference-with-nvidia-dynamo/). This is only the first step. We plan to evolve our approach to agents by treating agentic execution more like compiler optimization.

Recent research such as ThunderAgent treats agents as workflow of programs, and our view is similar. Just as compilers analyze programs to optimize execution, Dynamo can analyze agentic workflows to improve serving performance. Some workflows are relatively static, while others are highly dynamic. For example, trip planning may invoke a predefined sequence of LLM and tool calls, while a long-running coding task may branch, retry, and adapt continuously.

<img width="1200" height="564" alt="Image" src="https://github.com/user-attachments/assets/d5331e05-be5b-45dc-8382-ad8db226652b" />

For static workflows, we can profile the execution path and extract richer agentic hints that help Dynamo schedule agents and tool calls more effectively. Dynamic workflows create a different but equally important opportunity: Dynamo needs better visibility into how the workflow evolves at runtime. The recently added [agentic tracing capability](https://docs.nvidia.com/dynamo/v1.1.0/user-guides/agents/agent-context-and-tracing) is a step in that direction.

As these workflows become more complex, placement will matter more. Agents, sub-agents, and tool calls may have different latency, throughput, and cost requirements, and agentic profiling can help Dynamo place each part of the workflow on the right hardware. This is another place where the compiler analogy becomes useful: Dynamo should reason about execution plans, resource placement, and runtime behavior together.

Performance is only one side of the agentic serving problem. The Dynamo team is also interested in supporting long-running, and eventually always-on, agents. Dynamo provides its own KV caching component, called KV Block Manager, to offload KV to local disk and remote storage, which helps alleviate KV pressure created by long-running agents. We plan to expand this capability to support distributed KV caching with P2P DRAM-to-DRAM transfer and global shared object/file storage.

To help optimize TCO for always-on agents, we recently released [FlexTensor](https://github.com/ai-dynamo/flextensor), which streams model weights from host memory to GPU memory.

FlexTensor enables models with large memory footprints to run on GPUs with smaller HBM capacity, even when those models would not normally fit entirely in GPU memory. One useful application is running a smarter, larger-parameter agent more cost-effectively on an older GPU SKU, with some tradeoff in latency.

For always-on agents, this tradeoff can be attractive because they are often used asynchronously, where higher latency is less of a concern and TCO savings matter more.

### RL post-training
The same serving principles become even more important in RL post-training, where inference is not a standalone workload but part of the training loop itself. Our focus here is to make Dynamo a powerful rollout engine that can help boost ecosystem frameworks such as VeRL, Slime, NeMo RL, Prime-RL, and Miles. RL workloads look different from normal inference because rollout generation sits directly inside the training loop. When rollouts slow down, the trainer waits. When weight updates are slow or unclear, the policy can become stale. If rollouts fail due to infrastructure issues, the consequences are higher than a single failed inference request. This makes serving reliability and observability directly tied to training efficiency.

<img width="1040" height="600" alt="Image" src="https://github.com/user-attachments/assets/a094c514-16d5-47c1-9d59-30b37c9da010" />

Our first goal is to provide a better rollout serving contract. Dynamo should accept token IDs when a framework owns tokenization, return authoritative output token IDs with aligned logprobs, expose finish reasons and routing metadata, and make weight/LoRA update behavior explicit. This lets RL frameworks and teams retain ownership of algorithms and sample construction while Dynamo focuses on the serving orchestration layer. We also want this contract to be standardized across our backend engine frameworks: SGLang, TRT-LLM, and vLLM.

Once that contract is in place, routing becomes one of the largest opportunities for RL. Multi-turn and agentic RL requests benefit from KV locality, agentic hints, and sticky session behavior, while LoRA-based experiments need routing that understands adapter placement. Over time, Dynamo's router should use policy version, worker load, KV pressure, LoRA availability, and worker pool topology to send rollout requests to the right worker.

Weight movement is another key part of this story. RL loops frequently refresh weights, and larger models make naive reloads too slow. We will continue to build on ModelExpress and NIXL RDMA so rollout workers can receive new weights faster, publish their loaded version, and avoid taking traffic before they are ready.

Beyond routing and weight movement, RL systems need elasticity and fault handling tuned for training loops. Dynamo Planner should autoscale rollout capacity and rate-match with trainer demand across clusters to prevent a rollout worker from stalling the entire run. Dynamo should also provide a stable platform that can drain stale workers cleanly and expose metrics such as trajectories/sec, version lag, queue depth, and rollout tail latency. The goal is for Dynamo to make RL rollout infrastructure faster, more reliable, and easier to operate across engines.

### Omni

Beyond LLMs and RL, we expect diffusion and multimodality to see major breakthroughs this year. We are closely collaborating with innovators like [Hao AI Lab at UCSD](https://haoailab.com/) to bring the latest advances in diffusion to production systems. In a recent blog post, Hao AI Lab announced the [DreamVerse project](https://haoailab.com/blogs/dreamverse/), which enables developers to generate 1080p video with audio in real time from user prompts.

We believe diffusion will fundamentally change media production and reshape the media, gaming, and advertising industries. As agents become capable of generating prompts with very low latency, video diffusion models can reuse previously generated latent space to continue producing the next segment of video. This creates a path toward long-form video and audio generation with far less friction. Low-latency prompt generation will be critical in this workflow, and LPUs may be especially well suited to fill that gap.

That shift opens up entirely new creative workflows. A strong writer could independently create high-resolution, long-form films, while a gamer could provide live inputs to an agent that continuously generates new worlds to explore.

To make these experiences practical, the serving system has to support continuous generation, low-latency coordination, and efficient communication across pipeline stages. The Dynamo team has already been working on streaming inputs and outputs, as well as optimizing network latency between components in diffusion pipelines.

Multimodality will also depend heavily on heterogeneous hardware. Since last year, Dynamo has supported disaggregating embedding from the prefill and decode stages of VLM models. We have been collaborating with the Intel XPU team to place embedding processing on Intel B60 while running prefill and decode on NVIDIA H200, where we have observed strong performance gains. We also expect LPUs to help significantly reduce latency for voice models.

## Conclusion

Dynamo 2.0 is about expanding from high-performance LLM serving into a broader inference and post-training platform. Agents, RL rollouts, diffusion, multimodality, and bespoke non-LLM models each bring different requirements, but they all depend on the same core capabilities: performance, routing, elasticity, observability, fault tolerance, and efficient use of heterogeneous hardware.

Our goal is to make Dynamo the system layer that helps developers serve these workloads reliably and cost-effectively, whether they are building long-running agents, scaling RL post-training, generating real-time multimodal media, or tuning large fleets of PyTorch models. We are excited to build this next chapter with the community and will continue sharing progress as Dynamo moves toward 2.0.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Vision] Toward Dynamo 2.0 #9208

Dynamo vision for 2.0

Top 5 priorities for Dynamo

Agents

RL post-training

Omni

Conclusion

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Vision] Toward Dynamo 2.0 #9208

Description

Dynamo vision for 2.0

Top 5 priorities for Dynamo

Agents

RL post-training

Omni

Conclusion

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions