multi-token-prediction

Here are 2 public repositories matching this topic...

theogravity / dual-rtx-6000-blackwell-qwen3.6-27b-fp8

Optimized vLLM setup for Qwen3.6-27B-FP8 on dual RTX PRO 6000 Blackwell (192 GB GDDR7, no NVLink): config, benchmark sweep results, and a custom chat template with thinking mode off by default.

benchmark blackwell fp8 vllm local-llm llm-inference speculative-decoding qwen3 multi-token-prediction rtx-pro-6000

Updated May 10, 2026
Shell

theogravity / dual-rtx-6000-blackwell-Gemma-4-31B-IT-NVFP4

Optimized vLLM and Docker setup for Gemma 4 31B NVFP4 with MTP on dual RTX PRO 6000 Blackwell: native FP4 Tensor Cores, Multi-Token Prediction (96.5% acceptance rate), and prefix caching. Includes benchmark results and replication scripts.

docker amd cuda gemma blackwell vllm llm-inference am5 speculative-decoding fp4 prefix-caching multi-token-prediction nvfp4 rtx-6000 gemma4 tensor-parallel

Updated May 10, 2026
Shell
