Add ServerlessLLM Support #1
Conversation
if not os.path.exists(rank_path):
    os.makedirs(rank_path)
# save tensors
tensor_offsets = save_tensors(tensor_names, tensor_data_index, rank_path)
Reviewer: Can we call save_model here instead of save_tensors?
Author: We are saving/loading the state dict, not the whole model, so we need to use save_tensors directly.
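For context, here is a hedged sketch of how the state dict might be flattened into the inputs that save_tensors expects. The names save_tensors, tensor_names, and tensor_data_index come from the diff above; the construction details and the exact signature are assumptions.

# Hedged sketch, not the PR's exact code: build the per-tensor index
# that save_tensors (defined elsewhere in this PR) appears to consume.
import os

def save_state_dict_tensors(state_dict, rank_path):
    tensor_names = list(state_dict.keys())
    tensor_data_index = {}
    for name, param in state_dict.items():
        # Raw host pointer and byte size of each tensor's storage.
        data_ptr = param.untyped_storage().data_ptr()
        size = param.untyped_storage().size()
        tensor_data_index[name] = (data_ptr, size)
    os.makedirs(rank_path, exist_ok=True)
    # save_tensors returns the file offsets of each saved tensor.
    return save_tensors(tensor_names, tensor_data_index, rank_path)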
model = _initialize_model(model_config, self.load_config,
                          lora_config, vision_language_config,
                          cache_config)
state_dict = self._filter_subtensors(model.state_dict())
Reviewer: Will this create another copy of the model parameters?
Author: This is copied exactly from the vLLM implementation; we inherit its save behaviour.
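For reference, a simplified sketch (an assumption, not vLLM's exact code) of what a filter like _filter_subtensors does: when several state-dict entries share one underlying storage, keep only the largest view so each buffer is saved once. Because it returns references to the existing tensors, it does not copy parameter data.

# Simplified sketch (assumption) of a subtensor filter: deduplicate
# state-dict entries that alias the same storage, keeping references
# only, so no parameter data is copied at this point.
from collections import defaultdict

def filter_subtensors(state_dict):
    by_storage = defaultdict(list)
    for name, tensor in state_dict.items():
        key = (tensor.device, tensor.untyped_storage().data_ptr())
        by_storage[key].append((name, tensor))
    filtered = {}
    for entries in by_storage.values():
        # Keep the view covering the most elements of the storage.
        name, tensor = max(entries, key=lambda kv: kv[1].numel())
        filtered[name] = tensor
    return filtered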
tensor_copy_chunks = {rank: []}
for idx, (name, param) in enumerate(state_dict.items()):
    data_ptr = param.untyped_storage().data_ptr()
    memory_ptrs[rank].append(data_ptr)
Reviewer: Essentially, we have only one rank here, right? So why do we need this rank?
Author: It is used with the get_cuda_memory_handles API.
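In other words, even with a single local GPU the pointers are collected under a rank key because the downstream API apparently expects a rank-to-pointers mapping. A hedged sketch, with the call shape being an assumption:

# Hedged sketch: collect device pointers under the one local rank,
# since get_cuda_memory_handles (per the reply above) takes a
# rank -> pointers mapping. The exact signature is unverified.
memory_ptrs = {rank: []}
for name, param in state_dict.items():
    memory_ptrs[rank].append(param.untyped_storage().data_ptr())
# memory_handles = get_cuda_memory_handles(memory_ptrs)  # hypothetical call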
offset, size, _, _, _ = tensor_index[name]

tensor_copy_chunks[rank].append(
    (offset, size, 0, idx))
Reviewer: What is this 0?
Author: Every tensor has its own base address, so the GPU offset is always 0.
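Putting the two snippets together, a worked illustration of the chunk layout (the field order is inferred from the diff and the reply, so treat it as an assumption):

# Illustration (assumption): each copy chunk is
# (file_offset, size_in_bytes, gpu_offset, tensor_idx).
# Each tensor is registered with its own base address, so the copy
# into device memory always starts at GPU offset 0.
tensor_copy_chunks = {rank: []}
for idx, (name, param) in enumerate(state_dict.items()):
    offset, size, _, _, _ = tensor_index[name]
    tensor_copy_chunks[rank].append((offset, size, 0, idx))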
Reviewer: Could the save_serverless_llm_state method be copied into the normal GPUExecutor, so that single-GPU users do not need to specify a backend?
Author: I think neither our save method nor our load method relies on a specific executor, right?
local_model_path = model_config.model
local_model_path = os.path.join(local_model_path, f"rank_{rank}")
model_name = local_model_path.split("/")[-2:][0] + "/" + local_model_path.split("/")[-1]
ret = client.load_into_cpu(model_name)
Reviewer: client is not defined here yet.
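One possible shape of the fix, heavily hedged: a store client would have to be constructed first. The client class name and address below are hypothetical, not taken from this PR; only load_into_cpu appears in the diff. The os.path form of model_name is a suggested cleanup of the manual splits.

import os

# Hypothetical: obtain a ServerlessLLM store client first; the class
# name and address are assumptions, not taken from this PR.
# client = SllmStoreClient("127.0.0.1:8073")

local_model_path = os.path.join(model_config.model, f"rank_{rank}")
# Equivalent to the manual split above ("<model>/rank_<rank>"):
model_name = os.path.join(
    os.path.basename(os.path.dirname(local_model_path)),
    os.path.basename(local_model_path))
ret = client.load_into_cpu(model_name)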
tensor_data_index[name] = (data_ptr, size)

print(tensor_data_index)
rank_path = path + f"rank_{rank}"
Reviewer: The local path here seems different from the load part. I guess it should be os.path.join(path, f"rank_{rank}")?
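A quick demonstration of why the string concatenation is wrong: without a trailing separator, the rank suffix fuses with the last path component.

import os

path = "./models/opt-125m"
rank = 0
print(path + f"rank_{rank}")               # ./models/opt-125mrank_0
print(os.path.join(path, f"rank_{rank}"))  # ./models/opt-125m/rank_0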
with open(os.path.join(rank_path, "tensor_index.json"), "w") as f:
    json.dump(tensor_index, f)

save_dict(state_dict, os.path.join(path, f"rank_{rank}"))
Reviewer: I didn't find the save_dict function under serverless_llm_store. Where can I build the latest version with the save_dict function?
Author: Use the latest xly/fix-docker-build branch from ServerlessLLM.
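Assembled from the snippets above, a hedged sketch of the save flow. The import location of save_dict is an assumption based on this thread; the variables path, rank, tensor_index, and state_dict come from the surrounding diff.

import json
import os
# from serverless_llm_store import save_dict  # assumed module layout

rank_path = os.path.join(path, f"rank_{rank}")
os.makedirs(rank_path, exist_ok=True)
# Persist the name -> (offset, size, ...) index next to the tensor data.
with open(os.path.join(rank_path, "tensor_index.json"), "w") as f:
    json.dump(tensor_index, f)
save_dict(state_dict, rank_path)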
# move all tensors to CPU
for key, tensor in state_dict.items():
    state_dict[key] = tensor.cpu().contiguous()
Reviewer: We may need to add os.makedirs(os.path.join(path, f"rank_{rank}"), exist_ok=True) here, or it may fail with a file-open error: Failed to open file ./models/opt-125m/rank_0/tensor.data_0
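A minimal sketch of the suggested fix, assuming the store writes files like tensor.data_0 into the rank directory and therefore needs it to exist first:

import os

# The store reads from host memory, so tensors must be on CPU and
# contiguous before saving.
for key, tensor in state_dict.items():
    state_dict[key] = tensor.cpu().contiguous()
# Create the rank directory before the store creates tensor.data_* in it.
os.makedirs(os.path.join(path, f"rank_{rank}"), exist_ok=True)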
This PR creates a new loader that uses the ServerlessLLM interface.