Support InternLM #4137
Conversation
@microsoft-github-policy-service agree

(branch force-pushed from 954dac5 to bdf8a90)
Hi @wangruohui, Thank you for expanding our inference capabilities and adding the container for InternLM. Can you please provide an example of a model being used with our inference test? For example, for the BLOOM model, a command may look like this:
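For illustration, a hypothetical BLOOM invocation, assuming DeepSpeed's unit-test layout (`unit/inference/test_inference.py`, run from the `tests/` directory) and a pytest `-k` filter on the model name:

```
pytest -v unit/inference/test_inference.py -k "bloom-560m"
```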
It would be nice to have a similar command for InternLM so we can do some testing on our side as well. Thanks!
Hello @lekurile, I am working with the test script you provided. A few notes:
All test commands below are based on this directory.
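For example, a hypothetical InternLM run from that directory, assuming the same `test_inference.py` suite and a `-k` filter (the model id `internlm/internlm-chat-7b` is an assumption, not taken from the original commands):

```
pytest -v unit/inference/test_inference.py -k "internlm"
```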
Test results:
- HF Baseline
- Use Kernels
- Tensor Parallel
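For reference, a minimal sketch of these three configurations, assuming the HuggingFace model id `internlm/internlm-chat-7b`; the dtype, `mp_size`, and prompt are illustrative, not the exact values behind these results:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/internlm-chat-7b"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
)

# HF Baseline: use `model` directly on one GPU, no DeepSpeed involved.

# Use Kernels: single GPU with DeepSpeed kernel injection.
# Tensor Parallel: set mp_size=2 and launch with `deepspeed --num_gpus 2 ...`.
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
print(tokenizer.decode(engine.generate(**inputs, max_new_tokens=32)[0]))
```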
@wangruohui Appreciate the very detailed testing and reproduction notes. I've merged the latest master and kicked off some tests. I'll also try getting the model running on my end. Thanks!
Hello, any updates?
Hi @wangruohui, it looks like the changes in this PR may have caused issues with other models, specifically in the following unit tests:
I'm suspecting this is due to changes in
Can you kindly test these models for compatibility with the InternLM changes on your side? I can try testing as well. Thanks!
Hello @lekurile, I made some modifications to make GPTNeo compatible, but I cannot set up exactly the same environment to run all the tests. Could you please allow the CI to run to check whether the problem is solved?
* origin/master:
  Allow multiple inference engines in single script (microsoft#4384)
  adds triton flash attention2 kernel (microsoft#4337)
  Fix llama meta tensor loading in AutoTP and kernel injected inference (microsoft#3608)
  Fix min torch version (microsoft#4375)
  Fix multinode runner to properly append to PDSH_SSH_ARGS_APPEND (microsoft#4373)
  add the missing method (microsoft#4363)
  Openfold fix (microsoft#4368)
  deepspeed4science japanese blog (microsoft#4369)
  deepspeed4science chinese blog (microsoft#4366)
  Enable workflow dispatch on Torch 1.10 CI tests (microsoft#4361)
  Update conda env to have max pydantic version (microsoft#4362)
  add deepspeed4science blog link (microsoft#4364)
  added check to avoid undefined behavior when the input_id length is greater than max_tokens (microsoft#4349)
  Add the policy to run llama model from the official repo (microsoft#4313)
  fix deepspeed4science links (microsoft#4358)
  DeepSpeed4Science (microsoft#4357)
  Support InternLM (microsoft#4137)
  Pass base_dir to model files can be loaded for auto-tp/meta-tensor. (microsoft#4348)
This PR adds support for a new model, InternLM.
The model is similar to Llama but has bias terms on the Q/K/V/O matmuls, so I primarily duplicated the Llama model code and added bias support in the attention's Python code.
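A minimal sketch of the difference (the hidden size is hypothetical, not InternLM's actual config):

```python
import torch.nn as nn

hidden = 4096  # illustrative hidden size, not InternLM's real value

# Llama-style attention projection: no bias term.
q_proj_llama = nn.Linear(hidden, hidden, bias=False)

# InternLM-style: identical shape, but with bias on the Q/K/V/O matmuls,
# so the injected attention path must also accept and apply these biases.
q_proj_internlm = nn.Linear(hidden, hidden, bias=True)
```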
This branch was originally checked out from tag v0.10.0, and I have tested it locally on a single GPU and on multiple GPUs with tensor parallelism.