DJLServing v0.23.0 release
Released by sindhuvahinis on 18 Jul 15:22 · 1334 commits to master since this release
Key Features
- Introduces Rolling Batch (see the configuration sketch after this list)
- SeqBatchScheduler with rolling batch #803
- Sampling SeqBatcher design #842
- Max Seqbatcher number threshold api #843
- Adds rolling batch support #828
- Max new length #845
- Rolling batch for huggingface handler #857
- Compute kv cache utility function #863
- Sampling decoding implementation #878
- Uses multinomial to choose from topK samples and improve topP sampling #891
- Falcon support #890
- Unit test with random seed failure #909
- KV cache support in default handler #929
- Introduces LMI Dist library for rolling batch
- Introduces vLLM library for rolling batch
- [VLLM] add vllm rolling batch and add hazard handling #877
- Introduces PEFT and LoRA support in handlers
- Introduces streaming support to FasterTransformer
- Add Streaming support #820
- Introduces S3 Cache Engine
- S3 Cache Engine #719
- Upgrades component versions
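The rolling batch support above is enabled through model properties rather than code changes. Below is a minimal sketch of a `serving.properties`, assuming the property names used by this release's LMI handlers (`option.tensor_parallel_degree` from #847, `option.max_rolling_batch_size` from #893); the model id and the exact backend values are illustrative only, so check the DJLServing/LMI documentation for the authoritative keys.

```properties
# Minimal sketch of a serving.properties enabling rolling batch.
# Property names follow this release's handlers (#828, #847, #893);
# the model id and values below are illustrative, not prescriptive.
engine=MPI
option.model_id=tiiuae/falcon-7b
option.tensor_parallel_degree=1
option.rolling_batch=lmi-dist        # or vllm, per the new backends in this release
option.max_rolling_batch_size=32     # default raised to 32 in #893
```

`option.rolling_batch` selects which of the new continuous-batching backends handles requests; omitting it keeps the existing static/server-side batching path (exact fallback behavior may vary by container).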
Enhancements
Serving and python engine enhancements
- Adds workflow model loading for SageMaker #661
- Allows model being shared between workflows #665
- Prints out error message if pip install failed #666
- Install fixed version for transformers and accelerate #672
- Add numpy fix #674
- SM Training job changes for AOT #667
- Creates model dir to prevent issues with no code experience in SageMaker #675
- Don't mount model dir for no code tests #676
- AOT upload checkpoints tests #678
- Add stable diffusion support on INF2 #683
- Unset omp thread to prevent CLIP model delay #688
- Update ChunkedBytesSupplier API #692
- Fixes log file charset issue in management console #693
- Adds neuronx new feature for generation #694
- [INF2] adding clip model support #696
- [plugin] Include djl s3 extension in djl-serving distribution #699
- [INF2] add bf16 support to SD #700
- Adds support for streaming Seq2Seq models #698
- Add SageMaker MCE support #706
- [INF2] give better room for more tokens #710
- [INF2] Bump up n positions #713
- Refactor logic for supporting HF_MODEL_ID to support MME use case #712
- Support load model from workflow directory #714
- Add support for se2seq model loading in HF handler #715
- Load function from workflow directory #718
- Add vision components for DeepSpeed and inf2 #725
- Support pip install in offline mode #729
- Add --no-index to pip install in offline mode #731
- Adding llama model support #727
- Change the dependencies for FasterTransformer #734
- Adds text/plain content-type support #741
- Skeleton structure for sequence batch scheduler #745
- Handles torch.cuda.OutOfMemoryError #749
- Improves model loading logging #750
- Asynchronous with PublisherBytesSupplier #730
- Renames env var DDB_TABLE_NAME to SERVING_DDB_TABLE_NAME #753
- Sets default minWorkers to 1 for GPU python model #755
- Fixes log message #765
- Adds more logs to LMI engine detection #766
- Uses predictable model name for HF model #771
- Adds parallel loading support for Python engine #770
- Updates management console UI: file inputs are not required in form data #773
- Sets default maxWorkers based on OMP_NUM_THREADS #776
- Support non-gpu models for huggingface #772
- Use huggingface standard generation for tnx streaming #778
- Add trust remote code option #781
- Handles invalid return type case #790
- Add application/jsonlines as content-type for streaming #791 (see the client sketch after this list)
- Fixes trust_remote_code issue #793
- Add einops for supporting falcon models #792
- Adds content-type response for DeepSpeed and FasterTransformer handler #797
- Sets default maxWorkers the same as earlier version #799
- Add stream generation for huggingface streamer #801
- Add server side batching #795
- Add safetensors #808
- Improvements in AOT UX #787
- Add pytorch kernel cache default directory #810
- Improves partition script error message #826
- Add -XX:-UseContainerSupport flag only for SageMaker #868
- Move TP detection logic to PyModel from LmiUtils #840
- Set tensor_parallel_degree property when not specified #847
- Add workflow dispatch #870
- Create model level virtualenv #811
- Refactor createVirtualEnv() #875
- Add MPI Engine as generic name for distributed environment #882
- Raise inference failure exceptions in default handlers #883
- Increase default max_rolling_batch_size to 32 #893
- Reformat python code #895
- Add oom unit tests for load and invoke #898
- Reformat python code #917
- Refactor output formatter for rolling batch #916
- Temporary workaround for rolling batch #922
- Fixes huggingface logging bug #924
- Adds batch size metric #925
- Only override minWorkers when tp > 1 #930
- Set default maxWorkers to 1 if not configured for TP #934
- Send error message in json format #939
- Add null check for prefill batch #938
- Allow overriding truncate parameter in request #957
- Add revision as part of the model inputs #947
- Add revision in test #948
- Add model revision environment variable #949
- Disconnect client when streaming timed out #941
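Several of the changes above touch the streaming response path (#820, #791, #941). As a rough sketch of how a client might consume it, assuming a model registered under the hypothetical name `falcon-7b` and a handler that emits `application/jsonlines` when streaming is enabled; the endpoint path follows the standard DJLServing inference API, and the payload shape depends on the handler in use.

```python
# Hedged sketch of consuming a streamed (application/jsonlines) response.
# Model name, host, and payload shape are assumptions for illustration.
import json
import requests

url = "http://localhost:8080/predictions/falcon-7b"  # hypothetical model name
payload = {"inputs": "What is model serving?", "parameters": {"max_new_tokens": 64}}

with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            # each non-empty line is a standalone JSON object in jsonlines mode
            print(json.loads(line))
```

Each streamed line is parsed on its own; if the server hits the streaming timeout added in #941 it disconnects the client and the loop simply ends.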
Docker enhancements
- Fixes fastertransformer docker file #671
- update fastertransformers build instruction #722
- Uses the same convention as tritonserver #738
- Pin bitsandbytes version to 0.38.1 #754
- Avoid auto setting OMP_NUM_THREADS for GPU/INF docker images #774
- Add llama support and integration tests #844
- Add missing default argument to gpt2-xl sm endpoint test #846
- Add protobuf to FT and TNX #850
- Update netty for cve #859
- Add 4 bits loading #867
- Add flash attention installation and a few bug fixes #872
- Allows mpi model load multiple times on the same GPU #894
- Upgrade fastertransformers HF versions #911
- Update deepspeed docker to nightly wheel #915
- Bump bitsandbytes versions #936
- Bump up bitsandbytes on its fixes #944
- Update release version and wheels #956
- Adding back S3Url for backward compatibility in pysdk #838
- Add neuronx 2.11.0 support #848
Bug Fixes
- Fix the start gpu bug #709
- tokenizer bug fixes #732
- Fixes bug in fastertransformer built-in handler #736
- Fixes typo in fastertransformer handler #740
- bump versions for new deepspeed wheel #733
- Fix bitsandbytes pip install #758
- Fix the stream generation #794
- Fixes typo in transformers-neuronx.py #796
- Fixes device id mismatch issue for multiple GPU case #800
- Fixes device mismatch issue for streaming token #805
- Fixes typo in sm workflow inputs #807
- Fix input_data and device order for streaming #809
- Fixes retry_threshold bug #812
- Fixes huggingface device bugs #813
- Fixes huggingface handler typo #815
- Fixes invalid device issue #816
- Fixes WorkerThread name #817
- partition script: keep 'option' in properties #819
- Fixes streaming token device mismatch bug #822
- Extract .py files recursively #821
- Fixes the device mapping issue if visible devices is set #707
- Fixes efficiency issue #841
- Fix the type of max_seq_len #853
- Remove option prefix when auto setting tensor_parallel_degree in properties #854
- Fix T5 model not support INT8 issue on handler #856
- Fix a few pipeline issues #876
- Fix Kwargs in AutoConfig #885
- Fix lmi-dist batch handling #887
- Fix rolling batch type #897
- Fix PublisherBytesSupplier #905
- Fix skip_special_tokens flag #907
- Fix skip special tokens in lmi-dist #908
- Fix do sample type #920
- Add current device for tp > 1 scenario on huggingface handler #927
- Fix for empty tensor input #928
- Fix boolean kwargs and typo in load_in_4_bit assignment #946
- Fix some issues with remote code for lora #952
- Fixes unittest in multi-GPU case #874
- Fixes rolling batch error handling case #919
- Fixes MPI engine workers detection #886
- Fix repeated output for rolling batch #935
- Fix the default value for rolling batch request parameters #943
- Fixes logging bug #937
- Fix remove models in step #834
- Fix in runtime kv_cache #923
Documentation
- Adding project diagrams link to architecture.md #742
- Updates management api document #814
- OOM management doc #926
- Updates model configuration document #933
- Adds s5cmd feature in document #945
- Adds document about venv per model #951
- Update docs to djl 0.23.0 #955
CI improvements
- Fixes unit test for extra data type #673
- Adds performance testing #558
- Add small fixes #684
- Add HuggingFace TGI publish and test pipeline #650
- Add shared memory arg to docker launch command in README #685
- Update github-slug-action to v4.4.1 #686
- Change the bucket for different object #691
- make performance tests run in parallel #690
- Add more models to TGI test pipeline #695
- Upgrade spotbugs to 5.0.14 #704
- reconfigure performance test time and machines #711
- Add unit test for empty model store initialization #716
- Fix no code tests in lmi test suite #717
- Refactor test code client.py #721
- Add seq2seq streaming integ test #724
- [test] Update transformers-neuronx gpt-j-b model options #723
- Remove TGI build and test pipeline #735
- Upgrade jacoco to 0.8.8 to support JDK17+ #739
- Avoid unit-test hang #744
- Update the wheel to fix the path #747
- Add SageMaker integration test #705
- fix permissions for sm pysdk install script #751
- SM AOT Tests #756
- Add mme tests to sagemaker tests #763
- add triton components in the nightly #767
- fix typos with get default bucket prefix for sm session #768
- Upload SM benchmark metrics to cloudwatch #769
- Fixes integration test #779
- [python] Adjusts mpi workers based on CUDA_VISIBLE_DEVICES #782
- Option to run only the lmi tests needed #786
- remove inf1 support and upgrade some package versions #785
- Remove hardcoded version in Assertion error #789
- Add support for testing nightly images in sagemaker endpoint tests #788
- Check if input is empty #798
- Update gpu memory consumption and add GPTNeoX, GPTJ #818
- Remove flan-t5-xxl #829
- Migrate sagemaker endpoint tests to us-west-2 #837
- Rolling batch integration tests #866
- Add lmi dist tests pipeline #869
- Add deepspeed cpu build in the pipeline #873
- Give longer time for building DeepSpeed container #880
- Add lmi-dist integration tests #892
- Add integration test for lmi-dist AutoModel #904
- Add llama to performance testing #921
- Lmi-dist model tests updates #918
- Adding gpt-neox-20b-quantized to workflow #931
- Remove oom tests for hf accelerate performance #940
- Add lora tests for fastertransformer #942
Contributors
@alexkarezin
@frankfliu
@LanKing
@siddvenk
@xyang16
@tosterberg
@maaquib
@sindhuvahinis
@KexinFeng
@rohithkrn
@zachgk
New Contributors
- @alexkarezin made their first contribution in #742
- @bryanktliu made their first contribution in #891
Full Changelog: v0.22.1...v0.23.0