[DEP] - Consolidating dynamo-serve and dynamo-run component API #10
Conversation
One thing that is missing from the doc for me is the user flow, namely: I'm a user, I come to the project with some goal in mind (what is that goal), and then what are the steps I need to do to achieve that goal, both in the current world (which we claim is not good) and in the proposed new world.
## Goals

**\[Optional \- if not applicable omit\]**

The frontend actually uses `dynamo-run` internally to run the OpenAI-compatible HTTP server and Rust-based processor, while `dynamo-serve` manages the overall service orchestration.
What does it mean that the frontend uses `dynamo-run`?
The frontend component of a graph runs `dynamo-run` as a subprocess in order to start the Rust server/processor. Here's an example: https://github.com/ai-dynamo/dynamo/blob/main/examples/sglang/components/frontend.py
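For concreteness, a minimal sketch of the kind of subprocess wiring described above: the Python frontend spawning `dynamo-run` to start the Rust server/processor. The helper names and the default flag string are illustrative assumptions, not the actual frontend.py code.

```python
# Hedged sketch: how a frontend component might spawn `dynamo-run` as a
# child process. `build_dynamo_run_cmd` and `start_rust_frontend` are
# hypothetical names; the real logic lives in the linked frontend.py.
import shlex
import subprocess


def build_dynamo_run_cmd(flags: str = "in=http out=dyn") -> list:
    """Build the argv list for launching dynamo-run."""
    return ["dynamo-run", *shlex.split(flags)]


def start_rust_frontend(flags: str = "in=http out=dyn") -> subprocess.Popen:
    """Spawn dynamo-run so the Rust HTTP server/processor runs alongside
    this Python component."""
    return subprocess.Popen(build_dynamo_run_cmd(flags))
```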
1. Each Dynamo component **MUST** be runnable using `python3 component.py <flags>`
Can we explain why this is critical? The explanation of "this is more Pythonic/hackable" isn't enough - what is a user trying to do for which they need this, what is the workflow that the lack of this is blocking?
An example is running a single model split across two nodes. Let's use SGLang disaggregated serving as an example. In order to run TP>8, you have to run the SGLang worker with `node_rank=0` on the head node and then `node_rank=1` on the child node. In the current paradigm, that means you have to do:
```shell
# node 1 - spin up the frontend, processor, and shard 1 of the prefill worker (but none of the decode workers)
dynamo serve graphs.agg:Frontend -f configs/agg_head.yaml

# node 2 - spin up shard 2 of the prefill worker
dynamo serve graphs.agg:Frontend -f configs/agg_child.yaml --service-name SGLangWorker

# node 3 - spin up only shard 1 of the decode worker
dynamo serve graphs.disagg:Frontend -f configs/disagg_head.yaml --service-name SGLangDecodeWorker

# node 4 - spin up only shard 2 of the decode worker
dynamo serve graphs.disagg:Frontend -f configs/disagg_child.yaml --service-name SGLangDecodeWorker
```
Note that I am having to run agg on the first 2 nodes, disagg on the next 2, and specify `--service-name` on nodes 2 through 4. It would be great if I could just do:
```shell
# node 1
dynamo-run in=http out=dyn &
python3 prefill.py -f configs/disagg_head.yaml

# node 2
python3 prefill.py -f configs/disagg_child.yaml

# node 3
python3 decode.py -f configs/disagg_child.yaml

# node 4
python3 decode.py -f configs/disagg_child.yaml
```
Going to use vllm_v1 to demonstrate the following. Another example is running each component on a different node. Say I have 4 H100 nodes and I am trying to run:

- a Frontend
- a Load Balancer
- 2 PrefillWorkers
- 2 DecodeWorkers

I'll try a different approach to the one above and instead just run each component separately, without picking and choosing services from a graph. Here's what I have to do to spin things up on node 1:
```shell
# node 1 - spin up the CPU processes (frontend and load balancer) and prefill worker 1
# I have to keep specifying --service-name or else `depends` will force everything
# to spin up, unless I sed the `depends` out... bad UX
dynamo serve components.frontend:Frontend -f config.yaml --service-name Frontend
dynamo serve components.simple_load_balancer:SimpleLoadBalancer -f config.yaml --service-name SimpleLoadBalancer
dynamo serve components.worker:VllmPrefillWorker -f config.yaml
```
Instead, I want to do the following on node 1:

```shell
dynamo run in=http out=dyn
python3 simple_load_balancer.py -f config.yaml
python3 prefill_worker.py -f config.yaml
```
Thanks for the details. On the first example, I am not sure I see a huge delta between the two cases. On the second, I can see some of the issues. In general, kind of odd to see config.yaml used in both cases, I'd expect the pure Python version to be just command line args.
Perhaps more importantly, the use case for this seems to be more for the "Dynamo developer", rather than a normal user? Not to say that isn't important, but just want to make sure we're talking about the same persona.
> On the first example, I am not sure I see a huge delta between the two cases

I guess something like `dynamo serve graphs.agg:Frontend -f configs/agg_child.yaml --service-name SGLangWorker` is a bit unintuitive to me. Why am I accessing the entire graph and selecting the PrefillWorker service from it? Is it not easier just to run the prefill worker's Python code?

Additionally, in order to run this full example across nodes, I have to run the agg graph and then the disagg graph in order to avoid the `depends` statement in the prefill worker. If I don't, the decode worker spins up and errors out because it has no GPUs available.
> In general, kind of odd to see config.yaml used in both cases, I'd expect the pure Python version to be just command line args.

Yeah, CLI args also work. A YAML/JSON parser would be helpful because these frameworks sometimes require many, many args to get things spun up. But this might just be the Dynamo Developer persona talking :)
> Perhaps more importantly, the use case for this seems to be more for the "Dynamo developer", rather than a normal user? Not to say that isn't important, but just want to make sure we're talking about the same persona.

This is an interesting thought, and it does bring up the question of who our ICP is for Dynamo. In my eyes, people who will be deploying our software at data center scale will want maximum flexibility to tinker with our framework (something I don't believe is present if I cannot run Python code with `python3`). Ping to @harryskim for visibility on this discussion.
I would think the form should be something like:

```shell
dynamo serve graph {optional config} --node-list node_list.yaml
```

If you are serving a graph and you need multiple nodes for that graph, that should be encapsulated in `serve`. Having to run the command on multiple nodes with different slices does seem to defeat the purpose of `dynamo serve`; `dynamo serve` serves a graph with the required resources, so by definition it should be able to handle multi-node graphs.

It doesn't today. And maybe, with things like microk8s, we don't need this intermediate state of serving a graph without k8s or Slurm.
vLLM and SGLang both let you run disaggregated serving + KV routing using just raw Python in single/multi-node settings. To be clear: if I can run `python3 sglang.launch_server ...` to start an SGLang prefill worker, why can't I do `python3 prefill_worker.py ...` in Dynamo? That is a degradation in UX to me.
\<bulleted list of pros describing the positive aspects of this solution\>

1. We **SHOULD NOT** deprecate `dynamo serve` - the orchestration UX remains valuable
I will contend that this is incorrect, or rather, that an explicit goal is that there is only one way to run a Dynamo "graph", and we cannot have both `dynamo-run` and `dynamo serve`.
Agreed with you here. We've been using `dynamo-run` to describe the launcher for the Rust frontend and the processor, and less as a way to launch the other components. That's why it's also used in the frontend. It's not really a launcher for the graph; it's a launcher for the Rust bits.
Understood - just not a good situation :)
There should only be one way to:
- define and implement a component in python and rust
- define a graph in python and rust (maybe yaml?)
- launch a graph locally
- launch a graph in k8s and slurm
Agreed there should be one way. I'm not sure I understand why `dynamo-run` is needed if the components can run via their main function and `dynamo serve` launches a graph.
```python
async for result in g:
    yield result

if __name__ == "__main__":
```
Is this code going to be 90% the same in every component? Not saying that's a problem, but want to make sure I understand.
Yea 90% is about right
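To make the "90% the same" concrete, here is a hedged sketch of the sort of per-component main block being discussed; the flag name and the trivial worker body are illustrative, not the actual Dynamo API.

```python
# Sketch of the largely identical main stanza each component would carry:
# parse a config flag, then hand the args to the component's async worker.
import argparse
import asyncio


def build_parser() -> argparse.ArgumentParser:
    # The shared part: every component accepts an optional config file
    parser = argparse.ArgumentParser(description="Run this component standalone")
    parser.add_argument("-f", "--config-file", default=None)
    return parser


async def worker(engine_args):
    # Placeholder for the component-specific body; a real component would
    # create its service here and serve its endpoints.
    return engine_args.config_file


if __name__ == "__main__":
    args = build_parser().parse_args()
    asyncio.run(worker(args))
```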
Something for us to think about then.
The current approach is having these bits in `serve_dynamo.py`, which is also why we cannot run each component without `dynamo-serve`. I personally am not a fan of abstractions, especially since we have not found a stable API yet, which is why I'm advocating for this being explicitly written out. I think it very much helps readability as well.

If we can find a middle ground, I'm 100% open to looking at things differently.
To be clear, that wasn’t a “this is bad and we should think how to fix this”, but rather for us to decide how we feel about it.
An answer might be “we are fine with this for now and will revisit when we see stability or it starts really hurting”
I would separate the "how" from the "what" - meaning we have a set of requirements; the exact look and feel of the decorators can change. They should all be part of the runtime and not part of deploy.

That is, I'm not sure the boilerplate of defining a component, a service, and attaching a method to an endpoint is all really necessary if it can be hidden behind a decorator:

```python
@dynamo_component(namespace="")
class Foo:
    @dynamo_endpoint()
    def process_image(x: type): ...
```
I think the abstractions need to be ironed out
Agreed, it's a lot of repeated boilerplate, and a simple method to wrap these common patterns would follow the DRY principle.
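As one possible middle ground, a small decorator plus a discovery helper could capture the repeated pattern without hiding much. `dynamo_endpoint` and `discover_endpoints` here are hypothetical names, sketched under the assumption that endpoints are plain methods.

```python
# Hypothetical DRY sketch: mark endpoint methods with a decorator and let a
# helper collect them, instead of wiring each endpoint by hand per component.
import inspect

_ENDPOINT_ATTR = "_dynamo_endpoint_name"


def dynamo_endpoint(name=None):
    """Mark a method as an endpoint; a launcher can discover it later."""
    def mark(fn):
        setattr(fn, _ENDPOINT_ATTR, name or fn.__name__)
        return fn
    return mark


def discover_endpoints(obj):
    """Return {endpoint_name: bound_method} for all marked methods."""
    found = {}
    for _, member in inspect.getmembers(obj, callable):
        endpoint_name = getattr(member, _ENDPOINT_ATTR, None)
        if endpoint_name:
            found[endpoint_name] = member
    return found


class EchoWorker:
    @dynamo_endpoint()
    def generate(self, request):
        return f"echo:{request}"
```

A launcher could then iterate over `discover_endpoints(worker)` instead of repeating the endpoint setup in every `component.py`.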
```python
asyncio.run(worker(engine_args))
```

### FlexibleArgumentParser for Dynamo
Can you show an example of this?
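Not speaking for the proposal author, but one plausible reading (modeled loosely on vLLM's `FlexibleArgumentParser`) is an argparse subclass that merges a config file with CLI flags, with the CLI winning. The `--config-file` flag and the JSON format are assumptions for illustration; the discussion above mentions YAML as well.

```python
# Hedged sketch of a "FlexibleArgumentParser": plain CLI flags plus an
# optional config file whose values fill in any flags left unset.
import argparse
import json


class FlexibleArgumentParser(argparse.ArgumentParser):
    def parse_args(self, args=None, namespace=None):
        ns = super().parse_args(args, namespace)
        config_path = getattr(ns, "config_file", None)
        if config_path:
            with open(config_path) as f:
                defaults = json.load(f)  # could equally be yaml.safe_load
            for key, value in defaults.items():
                # Explicit CLI values win; config only fills unset options
                if getattr(ns, key, None) is None:
                    setattr(ns, key, value)
        return ns
```

With something like this, `python3 prefill_worker.py -f config.json --model foo` would override the config's `model` while taking everything else from the file (assuming `-f` is registered as an alias for `--config-file`).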
```
│ └── utils.py
```

Each engine's `utils.py` would contain things like argument parsing/validation and any other helper functions. The `common/base_classes.py` could contain abstract classes for the `BaseDecodeWorker` or the `BaseWorker` class.
Thanks for adding this. I think we need to find a way where the "common" path (i.e. running one of these engines) is not dependent on "examples", but rather properly in the product surface area. But likely a separate discussion.
Once we establish a direction here in terms of how to define a component, or any changes we want to make there, the goal is to move production components to the top level:

```
components/engines
components/engines/vllm
components/common/
components/
```

This can then be a sub-package, `dynamo.components`, alongside `dynamo.runtime` and `dynamo.llm`.
- Remove `async_on_start`
- Remove `dynamo_endpoint`
- Remove runtime injection via `dynamo_context`
- We can start using abstract classes for our request handlers to address the issue of repeated code for things like `Router` and `Frontend`
I think the issue for abstraction is not the definition of the request handlers, but the proxy/use objects; that is, how to abstract the component and namespace from the caller in a typed way. Abstraction for the request handlers is great for reuse at the implementation of the request handler, but I think it doesn't cover the 'proxy objects' on the client side.
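A sketch of what a typed client-side proxy could look like, using `typing.Protocol`; the worker classes and registry are hypothetical, the point being that the caller sees only the `LLMWorker` type while the concrete backend stays hidden.

```python
# Hedged sketch: the client programs against an LLMWorker Protocol; which
# backend actually serves the calls is resolved at runtime.
from typing import Protocol


class LLMWorker(Protocol):
    def generate(self, prompt: str) -> str: ...


class SGLangWorker:
    """One possible backend (illustrative)."""
    def generate(self, prompt: str) -> str:
        return f"sglang:{prompt}"


class VllmWorker:
    """Another backend with the same interface (illustrative)."""
    def generate(self, prompt: str) -> str:
        return f"vllm:{prompt}"


def get_worker(backend: str) -> LLMWorker:
    """Resolve a concrete worker; callers only ever see LLMWorker."""
    registry = {"sglang": SGLangWorker, "vllm": VllmWorker}
    return registry[backend]()
```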
> That is, how to abstract the component, and namespace from the caller in a typed way.

I think this is addressed in the Deployment class, no? I also don't think I fully understand what you mean by proxy/use objects.
Something like: the specific implementation of the "LLMWorker" type is hidden from the client but still typed. To me, that was the key point of the abstraction. Does the Deployment class do that?
> The specific implementation of "LLMWorker" type is hidden from the client but still typed ....

This `depends` is not even used for that purpose. It's a remnant of BentoML; it's only there so that the LLMWorker will spin up using `dynamo serve`.
As others have said, the point of having the `LLMWorker` interface was to define types, i.e. so I can swap out a different LLM backend with the same interface. In addition, I think we can remove a lot of boilerplate in setting up the endpoints if we have some base component or decorators.

What would BaseDeployment do?
I feel as if this combines a few different issues into one:
- how dynamo serve and dynamo run relate to each other
- how to launch a graph locally in multi-node settings
- how to define a component in python and rust
- how to define a graph in python and rust - and what the semantics of a graph are
The proposal sets out to "consolidate dynamo serve and dynamo run" - but really lays out how to define a component in a way that it can be launched independently of either.
I'm still in favor of separating these out into separate proposals:

- component object model:
- graphs: https://github.com/ai-dynamo/enhancements/blob/graphs/deps/NNNN-dynamo-graphs.md
- dynamo run / dynamo serve / dynamo deploy: https://github.com/ai-dynamo/enhancements/blob/deploy/deps/NNNN-run-and-deploy.md
I think the basic reason we have the particular issues you point out is precisely that we've coupled how to launch a graph (dynamo serve) with how you define a component (the decorators that dynamo serve uses). These should be defined separately: how you define a component and how you define a graph should be loosely coupled to how you launch them.
We also need to separate `dynamo-run` as an application from `dynamo run` as a workflow. `dynamo-run` as an application is the HTTP frontend, pre-processor, and router as a component.
If BentoML were not a part of our codebase, I have a strong feeling we would not have used the class decorator. According to my chat with Mohammed, that decorator is used to make the Bento-style K8s deployment functional (also something that might not have been done had we not used Bento in the first place). Everything else the class decorator provides can be accomplished with the Deployment class.

Today I worked on adding an embedding component to SGLang. One thing I realized is that we cannot run a new graph without creating a new file. If I just want to have dynamo-run started and quickly iterate on the embedding code, it's not possible without continuously starting and restarting dynamo serve.
```python
@dynamo_worker()
async def worker(runtime: DistributedRuntime, engine_args):
    worker = SGLangDecodeWorker(engine_args)
    deploy = Deployment()

    component = runtime.namespace(deploy.namespace).component(deploy.name)
    await component.create_service()

    # serve endpoint
    endpoint = component.endpoint("generate")
    await endpoint.serve_endpoint(worker.generate)
```
If we have the correct abstractions this boilerplate can completely go away.
This gets worse if I want to have multiple endpoints in a component / service.
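For instance (a hedged sketch, assuming the runtime exposes the namespace/component/endpoint API shown in the snippet above), a single helper could absorb the per-component setup so that a second endpoint is one more dict entry rather than more wiring:

```python
# Hypothetical helper that replaces the create_service/serve_endpoint
# boilerplate: serve every (endpoint_name, handler) pair concurrently.
import asyncio


async def serve_component(runtime, namespace, name, handlers):
    """Create the service once, then attach each handler to its endpoint.

    `runtime` is assumed to expose the namespace/component/endpoint API
    from the snippet above; everything else here is identical across
    components.
    """
    component = runtime.namespace(namespace).component(name)
    await component.create_service()
    await asyncio.gather(
        *(component.endpoint(ep).serve_endpoint(fn) for ep, fn in handlers.items())
    )
```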
Summary of discussion 6/12

Members present: @ishandhanani @nnshah1 @mohammedabdulwahhab @alec-flowers @rmccorm4 @messiaen @biswapanda

Summary: @mohammedabdulwahhab shared his proposal. By the end of our conversation, we came to the following agreement.

Next steps:

Open Questions