Add .readthedocs.yaml #136
Merged
Conversation
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request on Feb 13, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request on Jul 3, 2024
SUMMARY: Miscellaneous changes to fix nightly:
* Benchmarks:
  - Add a benchmark name so the alert-triggering is correct
  - Don't skip `github-action-benchmark` failure based on previous failure
  - Shorten metric names so `github-action-benchmark` doesn't hit the GitHub comment size threshold
* Nightly-SOLO:
  - Fix code-coverage artifact name
* Misc:
  - Add extra information to the `github-action-benchmark` JSON that is useful in the UI

TEST PLAN: Manual testing

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Xaenalt pushed a commit to Xaenalt/vllm that referenced this pull request on Aug 15, 2024
* Add HPU platform and HpuCommunicator for TP
* Remove a leftover print
* Add the missing vllm/platforms/__init__.py
* Run format.sh
mht-sharma added a commit to mht-sharma/vllm that referenced this pull request on Aug 21, 2024
* Fixed single GPU issue without setting up mp. Added toggles for server request batching parameters (vllm-project#114)
  * Fixed single GPU issue without setting up mp; added toggles for server request batching parameters
  * Adding HTTP headers
* Add distributed executor backend to benchmark scripts (vllm-project#118)
* Add weight padding for moe (vllm-project#119)
  * add weight padding for moe
  * enable padding by default
  * using envs.py
  * fix linter
* [BugFix] Fix navi build after many custom kernels for MI were added (vllm-project#116)
  * fix navi build
  * created dummy kernels for ops unsupported on Navi to avoid function-not-found crashes at runtime
  * replacing ifdefs on host code with those on kernels
  * refactoring code to avoid unsupported calls on Navi
  * syntactic change
  * import statements fix
  * moving env variables to envs.py
  * style fixes and cosmetic changes for isort
  * removed extra include
  * moving use_skinny to be a member
  Co-authored-by: lcskrishna <lollachaitanya@gmail.com>
  Co-authored-by: maleksan85 <maleksan@amd.com>
  Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
* add empty_cache() after each padding (vllm-project#120)
* [FIX] Gradlib OOM on Navi and sometimes on MI (vllm-project#124)
  * add memory clean-up after every shape and parameter to reduce cache-invalidation buffers
  * small typo and syntax fixes
  Co-authored-by: maleksan85 <maleksan@amd.com>
* save shape when fp8 solution not found (vllm-project#123)
  Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
* Fix unit test for moe by adding padding (vllm-project#128)
  * fix test_moe
  * fix linter
* Llama3.1 (vllm-project#129)
  * Add support for a rope extension method (vllm-project#6553)
  * [BugFix] Fix RoPE error in Llama 3.1 (vllm-project#6693)
  Co-authored-by: Simon Mo <simon.mo@hey.com>
  Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
* chat/completions endpoint (vllm-project#121)
  * Initial implementation of chat/completions endpoint and its streaming variant
  * Reusing datatypes from the openai entrypoints
  * Response role from arg
  * Added models endpoint and model validation from the request
* Optimize custom all reduce (vllm-project#130)
  * First version
  * Revert error; while there, add missing finalize
  * Use the correct defaults for ROCm; increase sampling area to capture crossover
  * Scope end_sync as well
  * Guard only the volatile keyword for ifndef USE_ROCM
  * Document crossover
* Add BF16 support to custom PA (vllm-project#133)
  * tightened atol for custom PA; enable supported head sizes and block sizes in testing
  * update num_blocks and num_iters in benchmark PA to realistic settings
  * move to generic b16 type; bf16 first port
  * enabled all bf16 tests, set atol for bf16
  * enable custom PA for bf16 as well as block size 32 and head size 64
  * fix cast to zero in custom PA reduce
  * py linter and clang-format fixes
  Co-authored-by: Charlie Fu <Charlie.Fu@amd.com>
  Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
* Making check for output match in original types; it saves some memory (vllm-project#135)
  Co-authored-by: maleksan85 <maleksan@amd.com>
* Make CAR ROCm 6.1 compatible (vllm-project#137)
  * remove scoping
  * while there, fix a typo and remove an unused variable
* Car revert (vllm-project#140)
  * Per @iotamudelta's suggestion, until the deadlocks issue is better understood, revert "Make CAR ROCm 6.1 compatible (vllm-project#137)"; this reverts commit 4d2dda6.
  * Per @iotamudelta's suggestion, until the deadlocks issue is better understood, revert "Optimize custom all reduce (vllm-project#130)"; this reverts commit 636ff01.
* Using the correct datatypes for streaming non-chat completions (vllm-project#134)
* Adding UNREACHABLE_CODE macro for non-MI300 and non-MI250 cards (vllm-project#138)
  * Adding UNREACHABLE_CODE macro
  * clang-format fixes (several iterations) and minor syntax updates
  Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
* gfx90a typo fix (vllm-project#142)
  Co-authored-by: maleksan85 <maleksan@amd.com>
* wvsplitk templatized and better tuned for MI300 (vllm-project#132)
  * improvements to wvSpltK
  * wvsplt gemm; better handle MI300 and large A[] sizes
  * lint fix
  * adjustments to better handle small weights in TP8
  * early-out bug fix
  * better wave load balancing in wvSplt
  * add missing skip for wvsplt_big
  * bug fix for wvSplt_big in load balancing at M4, lint fix
* [Bugfix] Dockerfile.rocm (vllm-project#141)
  * Dockerfile.rocm bug fix
  * naming preference
  Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
* Update test-template.j2 (vllm-project#145)
* Adding Triton implementations awq_dequantize and awq_gemm to ROCm (vllm-project#136)
  * basic support for AWQ added
  * awq_dequantize implementation in Triton
  * awq_gemm implementation in Triton
  * unit tests in tests/kernels/test_awq_triton.py

Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Matt Wong <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: Charlie Fu <Charlie.Fu@amd.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: lcskrishna <lollachaitanya@gmail.com>
Co-authored-by: maleksan85 <maleksan@amd.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: iotamudelta <dieterich@ogolem.org>
Co-authored-by: sanyalington <shomy.sanyal@amd.com>
Co-authored-by: Hashem Hashemi <159079214+amd-hhashemi@users.noreply.github.com>
Co-authored-by: Zachary Streeter <90640993+zstreet87@users.noreply.github.com>
Co-authored-by: omkar kakarparthi <75638701+okakarpa@users.noreply.github.com>
Co-authored-by: rasmith <Randall.Smith@amd.com>
mht-sharma pushed a commit to mht-sharma/vllm that referenced this pull request on Oct 30, 2024
Adding Triton implementations awq_dequantize and awq_gemm to ROCm (vllm-project#136)
* basic support for AWQ added
* awq_dequantize implementation in Triton
* awq_gemm implementation in Triton
* unit tests in tests/kernels/test_awq_triton.py
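For context, AWQ stores weights as packed 4-bit integers with per-group scales and zero points, and dequantization expands them back to floating point as roughly `(q - zero) * scale`. Below is a minimal PyTorch reference sketch of that idea only, not the Triton kernels from this commit; the packing layout, nibble order, and tensor shapes are assumptions for illustration.

```python
import torch

# Hypothetical reference for AWQ-style dequantization (illustrative only; not
# vLLM's Triton kernel). Assumes each int32 packs 8 consecutive 4-bit values
# along the output dimension, with one (scale, zero) pair per group of
# `group_size` input rows.
def awq_dequantize_ref(qweight: torch.Tensor,   # [K, N // 8] int32, packed 4-bit weights
                       scales: torch.Tensor,    # [K // group_size, N] float16
                       qzeros: torch.Tensor,    # [K // group_size, N // 8] int32, packed 4-bit zeros
                       group_size: int = 128) -> torch.Tensor:
    shifts = torch.arange(0, 32, 4, device=qweight.device)  # 8 nibbles per int32
    # Unpack 4-bit weights and zero points into [K, N] and [K // group_size, N].
    w = ((qweight.unsqueeze(-1) >> shifts) & 0xF).reshape(qweight.shape[0], -1)
    z = ((qzeros.unsqueeze(-1) >> shifts) & 0xF).reshape(qzeros.shape[0], -1)
    # Broadcast each group's scale/zero over its group_size rows, then dequantize.
    z = z.repeat_interleave(group_size, dim=0).to(scales.dtype)
    s = scales.repeat_interleave(group_size, dim=0)
    return (w.to(scales.dtype) - z) * s
```

An `awq_gemm` built on this would multiply the activation matrix by the dequantized weight (or fuse the dequantization into the matmul, as a Triton kernel typically does).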
wuhuikx pushed a commit to wuhuikx/vllm that referenced this pull request on Mar 27, 2025
### What this PR does / why we need it?
In the case where `backend = ray`, only the main process completes the `forward_oot` call, while the other worker processes call `forward_native`. (This bug should also exist when `backend = mp`.)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
**Environment:**
- CANN: 8.0.0
- PyTorch: 2.5.1
- Torch: 2.5.1rc1
- Python: 3.10
- vllm: branch main
- vllm-ascend: branch main

The current implementation avoids the Ray Worker initialization issue addressed in the [PR](vllm-project/vllm-ascend#92). Then, during the `forward_oot` call, logging is performed.

**Script:**
```bash
python examples/offline_distributed_inference_npu.py
```

**Result:**
```bash
(NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
(NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
(NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
(NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
(NPURayWorkerWrapper pid=3984223) forward_oot run. #############################################
forward_oot run. #############################################
forward_oot run. #############################################
Processed prompts: 100%|████████████| 4/4 [00:07<00:00, 1.96s/it, est. speed input: 2.80 toks/s, output: 51.00 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Alex and I am a 16 year old male. I have been diagnosed with a rare genetic disorder called X-linked recessive. I have been told that I will not be able to have children. I have been told that I will not be able to have children because of the X-linked recessive disorder. I have been told that I will not be able to have children because of the X-linked recessive disorder. I have been told that I will not be able to have children because of'
Prompt: 'The president of the United States is', Generated text: ' Statesman. He is the leader of the country. He is the one who makes the decisions. He is the one who makes the laws. He is the one who makes the rules. He is the one who makes the country strong. He is the one who makes the country happy. He is the one who makes the country safe. He is the one who makes the country free. He is the one who makes the country beautiful. He is the one who makes the country great. He is'
Prompt: 'The capital of France is', Generated text: ' the city of Paris. It is the largest city in France and the second largest city in Europe. It is located in the center of the country, in the south of the country. It is situated on the banks of the Seine River, which flows through the city. The city is surrounded by the Alps and the Pyrenees mountains. The city is also surrounded by the Mediterranean Sea. The city is known for its beautiful architecture, its museums, its parks, and its food. Paris is'
Prompt: 'The future of AI is', Generated text: ' following the path of the internet, and the internet is following the path of the web. The web is a network of interconnected web pages, and the internet is a network of interconnected computers. The web is a network of interconnected computers, and the internet is a network of interconnected computers. The web is a network of interconnected computers, and the internet is a network of interconnected computers. The web is a network of interconnected computers, and the internet is a network of interconnected computers. The web is a network'
```

Signed-off-by: Chenguang Li <757486878@qq.com>
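To illustrate the dispatch behavior this fix concerns, here is a minimal, hypothetical sketch of a custom op that prefers an out-of-tree (`forward_oot`) path when the target platform is available and falls back to `forward_native` otherwise. It is not vllm-ascend's actual class hierarchy, and the availability flag stands in for whatever platform detection the real code performs; the bug above amounts to worker processes resolving this dispatch differently from the driver process.

```python
import torch

class MyCustomOp(torch.nn.Module):
    """Hypothetical sketch of a forward_native / forward_oot dispatch pattern."""

    def __init__(self, oot_available: bool):
        super().__init__()
        # In a real system this flag would come from platform detection
        # (e.g. an NPU backend being importable), not a constructor argument.
        self._impl = self.forward_oot if oot_available else self.forward_native

    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        # Portable PyTorch fallback.
        return torch.nn.functional.silu(x)

    def forward_oot(self, x: torch.Tensor) -> torch.Tensor:
        # Out-of-tree, platform-optimized path; here it just logs and reuses
        # the native math so the sketch stays runnable anywhere.
        print("forward_oot run.")
        return self.forward_native(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self._impl(x)

# Every process (driver and workers) should take the same branch, which is
# what the fix above ensures.
op = MyCustomOp(oot_available=True)
print(op(torch.randn(4)))
```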
This PR adds `.readthedocs.yaml` for building our hosted documentation.
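For reference, a `.readthedocs.yaml` (Read the Docs v2 config) typically pins the build environment and points at the Sphinx configuration and docs requirements. The sketch below shows that general shape only; the paths and versions are assumptions, not necessarily what this PR committed.

```yaml
# Minimal Read the Docs v2 configuration sketch (illustrative; the actual
# file added by this PR may differ).
version: 2

build:
  os: ubuntu-22.04
  tools:
    python: "3.10"

sphinx:
  configuration: docs/source/conf.py   # assumed location of the Sphinx conf

python:
  install:
    - requirements: docs/requirements-docs.txt   # assumed docs requirements file
```

With a file like this in the repository root, Read the Docs builds the Sphinx documentation in the pinned environment on every push instead of relying on dashboard settings.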