Support Micro Batching and Lossless Compression #39
Merged
HaibaraAiChan merged 24 commits on Feb 20, 2026
HaibaraAiChan approved these changes on Feb 20, 2026
JiuChen0 added a commit to JiuChen0/BloomBee that referenced this pull request on Mar 22, 2026
* micro batching slice
* cross stage
* cross stage overlap
* fix shape mismatch
* fix cross stage error
* cross stage pipeline
* micro batch size reuse
* pipeline
* overlap
* fix kvcache BH_dst
* fix batch mismatch
* delete debug print
* cross stage overlap
* fix micro batching
* micro batching
* add timer tracking
* finish micro batching
* merge code of sepc decoding
* spec decoding max length test
* add compression
* spec decoding token limit error
* fix disable micro batch error
* fix micro batching index unstable problem
Added micro-batching support: large batches can be split into micro-batches, with GPU slot reuse (multiplexing) on the KV cache.
Files: microbatch_config.py, memory_cache_manager.py, block_functions.py, handler.py

Added cross-stage overlap (compute/communication overlap): enables asynchronous push/consume pipelining at micro-batch granularity.
Files: handler.py, block_functions.py, microbatch_config.py

Added Zstd lossless compression for activation transport: activations are compressed before transfer and decompressed after it.
Files: lossless_transport.py, lossless_wrapper_config.py

Merged the speculative decoding path into the existing inference pipeline. For now, speculative decoding stays on the full-batch path (it does not use micro-batching) so that the tree, draft, and KV cache stay correctly aligned.
Files: speculative_model.py, inference_session.py, handler.py, block_functions.py
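
To make the first and third items concrete, here is a minimal sketch of splitting a batch into micro-batches and round-tripping each slice through a lossless codec. It is not the code from this PR: the function names (`split_into_micro_batches`, `pack_activations`, `unpack_activations`) are hypothetical, activations are modeled as plain lists of floats rather than GPU tensors, and the standard-library `zlib` stands in for Zstd so the example runs without extra dependencies.

```python
import struct
import zlib
from typing import List, Tuple


def split_into_micro_batches(batch_size: int, micro_batch_size: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) row ranges that cover the whole batch.

    Hypothetical helper: the last range is shorter when batch_size is not
    a multiple of micro_batch_size.
    """
    if batch_size < 0 or micro_batch_size <= 0:
        raise ValueError("batch_size must be >= 0 and micro_batch_size must be > 0")
    return [
        (start, min(start + micro_batch_size, batch_size))
        for start in range(0, batch_size, micro_batch_size)
    ]


def pack_activations(rows: List[List[float]]) -> bytes:
    """Serialize rows as little-endian float64 and compress them losslessly.

    zlib is a stand-in here; the PR description names Zstd as the codec.
    """
    flat = [value for row in rows for value in row]
    raw = struct.pack(f"<{len(flat)}d", *flat)
    return zlib.compress(raw)


def unpack_activations(payload: bytes, n_cols: int) -> List[List[float]]:
    """Invert pack_activations: decompress, then rebuild rows of n_cols values."""
    raw = zlib.decompress(payload)
    flat = struct.unpack(f"<{len(raw) // 8}d", raw)
    return [list(flat[i:i + n_cols]) for i in range(0, len(flat), n_cols)]


if __name__ == "__main__":
    batch = [[float(r), r * 0.5] for r in range(10)]  # 10 rows, 2 columns
    received: List[List[float]] = []
    for start, end in split_into_micro_batches(len(batch), 4):
        payload = pack_activations(batch[start:end])   # sender side
        received.extend(unpack_activations(payload, 2))  # receiver side
    assert received == batch  # lossless: every value survives the round trip
```

Because each micro-batch is packed and unpacked independently, the receiver can start consuming the first slice while later slices are still being produced, which is the property the cross-stage overlap item relies on.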