DJLServing v0.23.0 release
Released by sindhuvahinis on 18 Jul 15:22 · 1334 commits to master since this release
Key Features
- Introduces Rolling Batch (see the configuration sketch after this list)
- SeqBatchScheduler with rolling batch #803
- Sampling SeqBatcher design #842
- Max Seqbatcher number threshold api #843
- Adds rolling batch support #828
- Max new length #845
- Rolling batch for huggingface handler #857
- Compute kv cache utility function #863
- Sampling decoding implementation #878
- Uses multinomial to choose from topK samples and improve topP sampling #891
- Falcon support #890
- Unit test with random seed failure #909
- KV cache support in default handler #929
- Introduces LMI Dist library for rolling batch
- Introduces vLLM library for rolling batch
- [VLLM] add vllm rolling batch and add hazard handling #877
- Introduces PEFT and LoRA support in handlers
- Introduces streaming support to FasterTransformer
- Add Streaming support #820
- Introduces S3 Cache Engine
- S3 Cache Engine #719
- Upgrades component versions
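The rolling batch support above is enabled through model properties rather than code changes. Below is a minimal sketch of a `serving.properties`, assuming the property names used by this release's LMI handlers (`option.tensor_parallel_degree` from #847, `option.max_rolling_batch_size` from #893); the model id and the exact backend values are illustrative only, so check the DJLServing/LMI documentation for the authoritative keys.

```properties
# Minimal sketch of a serving.properties enabling rolling batch.
# Property names follow this release's handlers (#828, #847, #893);
# the model id and values below are illustrative, not prescriptive.
engine=MPI
option.model_id=tiiuae/falcon-7b
option.tensor_parallel_degree=1
option.rolling_batch=lmi-dist        # or vllm, per the new backends in this release
option.max_rolling_batch_size=32     # default raised to 32 in #893
```

`option.rolling_batch` selects which of the new continuous-batching backends handles requests; omitting it keeps the existing static/server-side batching path (exact fallback behavior may vary by container).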
Enhancements
Serving and python engine enhancements
- Adds workflow model loading for SageMaker #661
- Allows model being shared between workflows #665
- Prints out error message if pip install failed #666
- Install fixed version for transformers and accelerate #672
- Add numpy fix #674
- SM Training job changes for AOT #667
- Creates model dir to prevent issues with no code experience in SageMaker #675
- Don't mount model dir for no code tests #676
- AOT upload checkpoints tests #678
- Add stable diffusion support on INF2 #683
- Unset omp thread to prevent CLIP model delay #688
- Update ChunkedBytesSupplier API #692
- Fixes log file charset issue in management console #693
- Adds neuronx new feature for generation #694
- [INF2] adding clip model support #696
- [plugin] Include djl s3 extension in djl-serving distribution #699
- [INF2] add bf16 support to SD #700
- Adds support for streaming Seq2Seq models #698
- Add SageMaker MCE support #706
- [INF2] give better room for more tokens #710
- [INF2] Bump up n positions #713
- Refactor logic for supporting HF_MODEL_ID to support MME use case #712
- Support load model from workflow directory #714
- Add support for se2seq model loading in HF handler #715
- Load function from workflow directory #718
- Add vision components for DeepSpeed and inf2 #725
- Support pip install in offline mode #729
- Add --no-index to pip install in offline mode #731
- Adding llama model support #727
- Change the dependencies for FasterTransformer #734
- Adds text/plain content-type support #741
- Skeleton structure for sequence batch scheduler #745
- Handles torch.cuda.OutOfMemoryError #749
- Improves model loading logging #750
- Asynchronous with PublisherBytesSupplier #730
- Renames env var DDB_TABLE_NAME to SERVING_DDB_TABLE_NAME #753
- Sets default minWorkers to 1 for GPU python model #755
- Fixes log message #765
- Adds more logs to LMI engine detection #766
- Uses predictable model name for HF model #771
- Adds parallel loading support for Python engine #770
- Updates management console UI: file inputs are not required in form data #773
- Sets default maxWorkers based on OMP_NUM_THREADS #776
- Support non-gpu models for huggingface #772
- Use huggingface standard generation for tnx streaming #778
- Add trust remote code option #781
- Handles invalid return type case #790
- Add application/jsonlines as content-type for streaming #791 (see the client sketch after this list)
- Fixes trust_remote_code issue #793
- Add einops for supporting falcon models #792
- Adds content-type response for DeepSpeed and FasterTransformer handler #797
- Sets default maxWorkers the same as earlier version #799
- Add stream generation for huggingface streamer #801
- Add server side batching #795
- Add safetensors #808
- Improvements in AOT UX #787
- Add pytorch kernel cache default directory #810
- Improves partition script error message #826
- Add -XX:-UseContainerSupport flag only for SageMaker #868
- Move TP detection logic to PyModel from LmiUtils #840
- Set tensor_parallel_degree property when not specified #847
- Add workflow dispatch #870
- Create model level virtualenv #811
- Refactor createVirtualEnv() #875
- Add MPI Engine as generic name for distributed environment #882
- Raise inference failure exceptions in default handlers #883
- Increase default max_rolling_batch_size to 32 #893
- Reformat python code #895
- Add oom unit tests for load and invoke #898
- Reformat python code #917
- Refactor output formatter for rolling batch #916
- Temporary workaround for rolling batch #922
- Fixes huggingface logging bug #924
- Adds batch size metric #925
- Only override minWorkers when tp > 1 #930
- Set default maxWorkers to 1 if not configured for TP #934
- Send error message in json format #939
- Add null check for prefill batch #938
- Allow overriding truncate parameter in request #957
- Add revision as part of the model inputs #947
- Add revision in test #948
- Add model revision environment variable #949
- Disconnect client when streaming timed out #941
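Several of the changes above touch the streaming response path (#820, #791, #941). As a rough sketch of how a client might consume it, assuming a model registered under the hypothetical name `falcon-7b` and a handler that emits `application/jsonlines` when streaming is enabled; the endpoint path follows the standard DJLServing inference API, and the payload shape depends on the handler in use.

```python
# Hedged sketch of consuming a streamed (application/jsonlines) response.
# Model name, host, and payload shape are assumptions for illustration.
import json
import requests

url = "http://localhost:8080/predictions/falcon-7b"  # hypothetical model name
payload = {"inputs": "What is model serving?", "parameters": {"max_new_tokens": 64}}

with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            # each non-empty line is a standalone JSON object in jsonlines mode
            print(json.loads(line))
```

Each streamed line is parsed on its own; if the server hits the streaming timeout added in #941 it disconnects the client and the loop simply ends.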
Docker enhancements
- Fixes fastertransformer docker file #671
- update fastertransformers build instruction #722
- Uses the same convention as tritonserver #738
- Pin bitsandbytes version to 0.38.1 #754
- Avoid auto setting OMP_NUM_THREADS for GPU/INF docker images #774
- Add llama support and integration tests #844
- Add missing default argument to gpt2-xl sm endpoint test #846
- Add protobuf to FT and TNX #850
- Update netty for cve #859
- Add 4 bits loading #867
- Add flash attention installation and a few bug fixes #872
- Allows mpi model load multiple times on the same GPU #894
- Upgrade fastertransformers HF versions #911
- Update deepspeed docker to nightly wheel #915
- Bump bitsandbytes versions #936
- Bump up bitsandbytes on its fixes #944
- Update release version and wheels #956
- Adding back S3Url for backward compatibility in pysdk #838
- Add neuronx 2.11.0 support #848
Bug Fixes
- Fix the start gpu bug #709
- tokenizer bug fixes #732
- Fixes bug in fastertransformer built-in handler #736
- Fixes typo in fastertransformer handler #740
- bump versions for new deepspeed wheel #733
- Fix bitsandbytes pip install #758
- Fix the stream generation #794
- Fixes typo in transformers-neuronx.py #796
- Fixes device id mismatch issue for multiple GPU case #800
- Fixes device mismatch issue for streaming token #805
- Fixes typo in sm workflow inputs #807
- Fix input_data and device order for streaming #809
- Fixes retry_threshold bug #812
- Fixes huggingface device bugs #813
- Fixes huggingface handler typo #815
- Fixes invalid device issue #816
- Fixes WorkerThread name #817
- partition script: keep 'option' in properties #819
- Fixes streaming token device mismatch bug #822
- Extract .py files recursively #821
- Fixes the device mapping issue if visible devices is set #707
- Fixes efficiency issue #841
- Fix the type of max_seq_len #853
- Remove option prefix when auto setting tensor_parallel_degree in properties #854
- Fix T5 model not support INT8 issue on handler #856
- Fix a few pipeline issues #876
- Fix Kwargs in AutoConfig #885
- Fix lmi-dist batch handling #887
- Fix rolling batch type #897
- Fix PublisherBytesSupplier #905
- Fix skip_special_tokens flag #907
- Fix skip special tokens in lmi-dist #908
- Fix do sample type #920
- Add current device for tp > 1 scenario on huggingface handler #927
- Fix for empty tensor input #928
- Fix boolean kwargs and typo in load_in_4_bit assignment #946
- Fix some issues with remote code for lora #952
- Fixes unittest in multi-GPU case #874
- Fixes rolling batch error handling case #919
- Fixes MPI engine workers detection #886
- Fix repeated output for rolling batch #935
- Fix the default value for rolling batch request parameters #943
- Fixes logging bug #937
- Fix remove models in step #834
- Fix in runtime kv_cache #923
Documentation
- Adding project diagrams link to architecture.md #742
- Updates management api document #814
- OOM management doc #926
- Updates model configuration document #933
- Adds s5cmd feature in document #945
- Adds document about venv per model #951
- Update docs to djl 0.23.0 #955
CI improvements
- Fixes unit test for extra data type #673
- Adds performance testing #558
- Add small fixes #684
- Add HuggingFace TGI publish and test pipeline #650
- Add shared memory arg to docker launch command in README #685
- Update github-slug-action to v4.4.1 #686
- Change the bucket for different object #691
- make performance tests run in parallel #690
- Add more models to TGI test pipeline #695
- Upgrade spotbugs to 5.0.14 #704
- reconfigure performance test time and machines #711
- Add unit test for empty model store initialization #716
- Fix no code tests in lmi test suite #717
- Refactor test code client.py #721
- Add seq2seq streaming integ test #724
- [test] Update transformers-neuronx gpt-j-b model options #723
- Remove TGI build and test pipeline #735
- Upgrade jacoco to 0.8.8 to support JDK17+ #739
- Avoid unit-test hang #744
- Update the wheel to fix the path #747
- Add SageMaker integration test #705
- fix permissions for sm pysdk install script #751
- SM AOT Tests #756
- Add mme tests to sagemaker tests #763
- add triton components in the nightly #767
- fix typos with get default bucket prefix for sm session #768
- Upload SM benchmark metrics to cloudwatch #769
- Fixes integration test #779
- [python] Adjusts mpi workers based on CUDA_VISIBLE_DEVICES #782
- Option to run only the lmi tests needed #786
- remove inf1 support and upgrade some package versions #785
- Remove hardcoded version in Assertion error #789
- Add support for testing nightly images in sagemaker endpoint tests #788
- Check if input is empty #798
- Update gpu memory consumption and add GPTNeoX, GPTJ #818
- Remove flan-t5-xxl #829
- Migrate sagemaker endpoint tests to us-west-2 #837
- Rolling batch integration tests #866
- Add lmi dist tests pipeline #869
- Add deepspeed cpu build in the pipeline #873
- Give longer time for building DeepSpeed container #880
- Add lmi-dist integration tests #892
- Add integration test for lmi-dist AutoModel #904
- Add llama to performance testing #921
- Lmi-dist model tests updates #918
- Adding gpt-neox-20b-quantized to workflow #931
- Remove oom tests for hf accelerate performance #940
- Add lora tests for fastertransformer #942
Contributors
@alexkarezin
@frankfliu
@LanKing
@siddvenk
@xyang16
@tosterberg
@maaquib
@sindhuvahinis
@KexinFeng
@rohithkrn
@zachgk
New Contributors
- @alexkarezin made their first contribution in #742
- @bryanktliu made their first contribution in #891
Full Changelog: v0.22.1...v0.23.0