# DJLServing v0.27.0 Release
## Key Changes
- Large Model Inference Containers 0.27.0 release
  - DeepSpeed container
    - Added DBRX and Gemma model support.
    - Provided general performance optimizations.
    - Added support for new performance-enhancing features such as speculative decoding.
  - TensorRT-LLM container
    - Upgraded to TensorRT-LLM 0.8.0.
  - Transformers NeuronX container
    - Upgraded to Transformers NeuronX 2.18.0.
- Multi-Adapter LoRA Support
  - Provided multi-adapter inference functionality in LMI DLCs.
- CX Usability Enhancements
  - Provided a seamless migration experience across different LMI DLCs.
  - Implemented the low-code/no-code (LCNC) experience.
  - Supported an OpenAI-compatible chat completions API (see the request sketch after this list).
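
The chat completions support follows the OpenAI request/response schema. Below is a minimal client sketch, not an official DJLServing example: it assumes an LMI container running locally on port 8080 that exposes an OpenAI-style `/v1/chat/completions` route; the host, port, route, and response field layout are assumptions to adapt to your deployment.

```python
# Minimal sketch of an OpenAI-style chat completions request against an LMI
# container. Host, port, route, and response layout are assumptions based on
# the standard OpenAI schema, not an official DJLServing example.
import json
import urllib.request

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize speculative decoding in one sentence."},
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

request = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed endpoint route
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    body = json.loads(response.read())
    # OpenAI-style responses place the generated text under choices[0].message.content.
    print(body["choices"][0]["message"]["content"])
```

Setting `"stream": true` in the payload should exercise the streaming chat completions path added in this release (#1674, #1761), assuming your deployment supports server-side streaming responses.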
## Enhancements
- [LMIDist] Allow passing in ignore_eos_token param by @xyang16 in #1489
- [LMIDist] Make ignore_eos_token default to false by @xyang16 in #1492
- Makes netty buffer size configurable by @frankfliu in #1494
- translate few vllm and trtllm params to HF format by @sindhuvahinis in #1500
- Align properties parsing to be similar to java by @ydm-amazon in #1502
- check for rolling batch "disable" value by @sindhuvahinis in #1506
- add max model length support on vLLM by @lanking520 in #1510
- Creates auto increment ID for models by @zachgk in #1109
- making default dtype to fp16 for compilation by @lanking520 in #1512
- Use server-provided seed if not in request params in deepspeed handler by @davidthomas426 in #1520
- [Neuron][WIP] add log probs calculation in Neuron by @lanking520 in #1516
- remove a bit of unused code by @ydm-amazon in #1536
- Remove unused device parameter from all rolling batch classes by @ydm-amazon in #1538
- dump OPTION_ env vars at startup by @siddvenk in #1541
- [vLLM] add enforce eager as an option by @lanking520 in #1547
- [awscurl] Prints invalid response by @frankfliu in #1550
- roll back enum changes by @ydm-amazon in #1551
- [awscurl] Refactor time to first byte calculation by @frankfliu in #1557
- [serving] Allows configure log level at runtime by @frankfliu in #1560
- Stream returns only after putting placeholder finishes by @zachgk in #1567
- type hints and comments for trtllm handler by @ydm-amazon in #1558
- [vLLM] add speculative configs by @lanking520 in #1553
- Add rolling batch type hints: Part 1 by @ydm-amazon in #1564
- Remove adapters preview flag by @zachgk in #1573
- [serving] Adds model event listener by @frankfliu in #1570
- [serving] Add model loading metrics by @frankfliu in #1576
- Speculative decoding in LMI-Dist by @KexinFeng in #1505
- type hints for scheduler rolling batch by @ydm-amazon in #1577
- [serving] Uses model.intProperty() api by @frankfliu in #1582
- [serving] Ignore CUDA OOM when collecting metrics by @frankfliu in #1581
- Remove test as the model is incompatible with transformers upgrade by @rohithkrn in #1575
- [serving] Adds rolling batch metrics by @frankfliu in #1583
- [serving] Uses dimension for model metric by @frankfliu in #1587
- rolling batch type hints part 3 by @ydm-amazon in #1584
- [SD][vLLM] record acceptance by @lanking520 in #1586
- [serving] Adds Prometheus metrics support by @frankfliu in #1593
- [feat] Benchmark code for speculative decoding in lmi-dist by @KexinFeng in #1591
- [lmi] add generated token count to details by @siddvenk in #1600
- [console] use StandardCharset instead of deprecated Charset by @siddvenk in #1601
- [awscurl] add download steps to README.md by @siddvenk in #1605
- [lmi][deprecated] remove option.s3url since it has been deprecated fo… by @siddvenk in #1610
- [serving] Skip testPrometheusMetrics when run in IDE by @frankfliu in #1611
- Use workflow template for workflow model_dir by @zachgk in #1612
- [Partition] Remove redundant model splitting, Improve Input Model Parsing by @a-ys in #1609
- Add handler for new lmi-dist by @rohithkrn in #1595
- [lmi] add parameter to allow full text including prompt to be returne… by @siddvenk in #1602
- support cuda driver on sagemaker by @lanking520 in #1618
- remove checker for awq with enforce eager by @lanking520 in #1620
- Add pytorch-gpu for security patching by @maaquib in #1621
- Refactor vllm and rubikon engine rolling batch by @rohithkrn in #1623
- Update TRT-LLM Dockerfile for v0.8.0 by @nskool in #1622
- [UX] sampling with vllm by @sindhuvahinis in #1624
- [vLLM] reduce speculative decoding gpu util to leave room for draft model by @lanking520 in #1628
- [lmi] update auto engine logic for vllm and lmi-dist by @siddvenk in #1617
- [python] Encode error in single line for jsonlines case. by @frankfliu in #1630
- Single model adapter API by @zachgk in #1616
- remove all current no-code test cases by @siddvenk in #1635
- Update the build script to use vLLM 0.3.3 by @lanking520 in #1637
- Update lmi-dist rolling batch to use rubikon engine by @rohithkrn in #1639
- Adds adapter registration options by @zachgk in #1634
- Supports vLLM LoRA adapters by @zachgk in #1633
- add customer required field by @lanking520 in #1640
- [tnx] bump optimum version by @tosterberg in #1632
- Updates dependencies version to latest by @frankfliu in #1647
- updated dependencies for LMI by @lanking520 in #1648
- [DO NOT MERGE][CAN APPROVE]change flash attn url by @lanking520 in #1650
- [cache] Remove gson from fatjar of cache by @frankfliu in #1649
- [python] Move output formatter to request level by @xyang16 in #1644
- [tnx] improve model partitioning time by @tosterberg in #1652
- [tnx] support codellama 70b instruct tokenizer by @tosterberg in #1653
- [python] Remove output_formatter from vllm and lmi-dist sampling para… by @xyang16 in #1654
- [wlm] Makes generateHuggingFaceConfigUri public by @frankfliu in #1656
- [tnx] fix output formatter as param implementation by @tosterberg in #1657
- [lmi] use hf token to get model config for gated/private models by @siddvenk in #1658
- [UX] Changing some default parameters by @sindhuvahinis in #1659
- add parameters to part of the field by @lanking520 in #1655
- Support max_dynamic_batch_size property name by @sindhuvahinis in #1662
- [lmi] add stream parameter to enable per request streaming config by @siddvenk in #1666
- [Optimum] Change the default value of temperature by @sindhuvahinis in #1668
- [TRTLLM] Add entrypoint for SM Neo AOT compilation by @ethnzhng in #1665
- Support chat completions API by @xyang16 in #1604
- [lmi][lcnc] fallback to accelerate backend when non text-generation m… by @siddvenk in #1667
- support streaming for chat completions, refactor chat parsing to stan… by @siddvenk in #1674
- Increase test timeout for TRT-LLM compilation and tokenizer fix by @nskool in #1672
- Add lora support in lmi-dist by @rohithkrn in #1664
- install huggingface_hub for lmi-dist lora test by @rohithkrn in #1679
- [serving] Allows use OPTION_ENGINE env var by @frankfliu in #1676
- [lmi-dist]validate usage lora and speculative together by @rohithkrn in #1684
- [lmi] log warnings for unused generation parameters across all rollin… by @siddvenk in #1686
- [chat][lmi] restrict chat completions support to rolling batch use-cases by @siddvenk in #1687
- [TRTLLM] Add env var for Neo cache by @ethnzhng in #1683
- deprecate usage of HF_TRUST_REMOTE_CODE by @siddvenk in #1691
- [ci][lcnc] add trt-llm container to no code tests by @siddvenk in #1692
- [serving] Separate download draft model from downloadS3() by @frankfliu in #1693
- [TrtLLM] Python backend support for T5 model by @sindhuvahinis in #1680
- [TrtLLM] Support JIT compilation and dynamic batch for TrtLLM python backend by @sindhuvahinis in #1678
- [lcnc] support dbrx model by @siddvenk in #1695
- add get_tokenizer for ds and scheduler rolling batch by @siddvenk in #1701
- pass trust_remote_code and revision to from_pretrained methods that a… by @siddvenk in #1698
- Adds additional adapter tests by @zachgk in #1696
- [python] Update chat properties by @xyang16 in #1709
- [python] Set do_sample from temperature in chat completions by @xyang16 in #1713
- [ci][lcnc] do not skip jobs if previous job fails by @siddvenk in #1715
- Supports adapters in download dir by @zachgk in #1690
- [python] Set repetition_penalty from presence_penalty in chat completions by @xyang16 in #1717
- [feat] Build flash-attn from fork which allows smaller block-sizes for paged-attn by @maaquib in #1689
- [TRTLLM] T5 python backend log_probs support by @sindhuvahinis in #1697
- [python] Allows send more than 32k utf-8 string by @frankfliu in #1719
- [lcnc] add starcoder2 model type to auto configure list by @siddvenk in #1722
- [WIP][Do Not Merge]check for machine type from given information by @lanking520 in #1721
- [TRTLLM] Remove gptneox from LCNC trtllm tests by @nskool in #1725
- [trtllm] Do not translate presence_penalty in TRTLLM by @xyang16 in #1730
- [awscurl] Add p50 to TTFB by @frankfliu in #1745
- ensure the parameters returned in details are the same parameters use… by @siddvenk in #1748
- [Neo] Create neo_utils file for shared functions by @a-ys in #1726
- [tnx] add safetensor loading for aot compilation by @tosterberg in #1744
- [lcnc] make default rolling batch size 256 for vllm, lmi-dist by @siddvenk in #1750
- [lcnc] make lmi-dist the default for all supported architectures by @siddvenk in #1751
- [Neo] Add Neo Neuron entrypoint script by @a-ys in #1752
- [awscurl] Support OpenAI style response by @frankfliu in #1754
- [tnx] support parse input func and seed param by @tosterberg in #1757
- [chat] Add role for the first token in the chat completions streaming response by @xyang16 in #1761
## Known Issues
- TensorRT-LLM container
  - CodeLlama TP8 compilation sometimes fails.
  - Mistral 7B and Mixtral 8x7B have correctness issues when compiled with TP4 or TP8; use TP1 or TP2 to mitigate (a mitigation sketch follows this list).
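
As a hedged mitigation sketch for the TP4/TP8 issue: the variable names below follow the LMI `OPTION_` environment variable convention referenced elsewhere in these notes, but the model id and values are placeholders; check the exact property names against the LMI configuration docs for your container.

```python
# Hypothetical mitigation sketch: pin tensor parallelism to 1 or 2 when serving
# Mistral 7B / Mixtral 8x7B on the TensorRT-LLM container. Variable names follow
# the LMI OPTION_ convention; the model id and values are placeholders.
lmi_container_env = {
    "HF_MODEL_ID": "mistralai/Mistral-7B-v0.1",    # placeholder model id
    "OPTION_ROLLING_BATCH": "trtllm",              # assumed TRT-LLM rolling batch setting
    "OPTION_TENSOR_PARALLEL_DEGREE": "2",          # avoid TP4/TP8 for these models
}

# Pass `lmi_container_env` as the container environment, for example via the
# `env` argument of a SageMaker Model, when deploying the TensorRT-LLM DLC.
```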
## Bug Fixes
- Fix trtllm check logic by @rohithkrn in #1501
- [serving] Fixes flaky unittest by @frankfliu in #1508
- [test] Fixes model server flaky test by @frankfliu in #1523
- [fix][ci] avoid early exit of script for failure case by @siddvenk in #1530
- [serving] Fixes testThrottle() flaky test by @frankfliu in #1540
- [fix] NeuronRollingBatch jsonlines integration by @tosterberg in #1548
- [serving] Fixes HF_REVISION env variable by @frankfliu in #1572
- [vLLM] fix awq OOM issue due to CUDA compat by @lanking520 in #1592
- [specdec] Fix of lmi-dist-rollingbatch by @KexinFeng in #1590
- Workflow Templates by @zachgk in #1594
- add type hints neuron part 1 by @ydm-amazon in #1598
- [awscurl] Fixes aws coral stream issue by @frankfliu in #1613
- [partition] fix checking for model files by @a-ys in #1636
- [ci][fix] remove no code test from stop runner condition by @siddvenk in #1641
- [fix] Update lmi_dist handler to account for engine's Request api change, adapters fix and temperature param fix by @maaquib in #1645
- fix a bug with output formatter by @lanking520 in #1661
- output_formatter bug fix by @xyang16 in #1663
- [Fix] set batch size property for handlers by @sindhuvahinis in #1669
- [fix] restrict per request streaming to rolling batch use-cases by @siddvenk in #1670
- [fix][partition] fix partition script due to missing python files by @siddvenk in #1677
- fix a bug in partition by @lanking520 in #1682
- Fix incorrect assignment in shell script by @nskool in #1685
- fix typo in ds handler for lora models by @siddvenk in #1707
- fix params for from_config model instantiation by @siddvenk in #1708
- [fix] install huggingface_hub for vllm lora test by @maaquib in #1711
- [python] Fix chat completions logprobs output by @xyang16 in #1712
- [fix] Fix logprobs format in chat completions by @xyang16 in #1714
- [lmi-dist] fix validation of mpi in properties validation by @siddvenk in #1723
- [fix] fix custom input and output formatting by @siddvenk in #1728
- [fix] Fix integration test for chat completions by @xyang16 in #1731
- [TRTLLM Python backend]Fix the output format for client side batching in dynamic batch by @sindhuvahinis in #1718
- fix parse_input signature for backward compatibility by @sindhuvahinis in #1733
- fix naming of some inconsistent fields in output formatters by @siddvenk in #1747
- fix issue with duplicate model loading for local hf model on sagemaker by @siddvenk in #1753
## Documentation
- Update trtllm_manual_convert_tutorial.md by @marckarp in #1498
- [doc] add lmi input output schema doc by @sindhuvahinis in #1499
- [doc] fix vllm engine by @sindhuvahinis in #1511
- change line in documentation examples (vllm serving.properties) by @ydm-amazon in #1543
- update random doc by @ydm-amazon in #1544
- [docs][lmi] updating structure for lmi docs by @siddvenk in #1533
- [Docs] vllm docs by @lanking520 in #1534
- [Docs]TRT-LLM user guide by @rohithkrn in #1546
- Update trt-llm docs by @rohithkrn in #1562
- [docs] Remove model_dir from document by @frankfliu in #1561
- [docs] add deepspeed user guide by @siddvenk in #1563
- [docs] Add lmi-dist user guide by @maaquib in #1569
- [docs] Add TNX AOT docs for Llama-2-70B by @tosterberg in #1545
- [docs] Updates docker readme. by @frankfliu in #1571
- [docs] TNX user guide by @tosterberg in #1555
- [docs] add readme for user guides, remove unused user guides by @siddvenk in #1574
- add test model tutorials by @lanking520 in #1585
- [docs] add deployment guide section for deploying model to endpoint by @siddvenk in #1568
- add LMI feature matrix by @lanking520 in #1559
- [doc] adding backend selection guide by @siddvenk in #1588
- [docs] add configuration type to advanced configurations to inform us… by @siddvenk in #1589
- [docs] add configuration doc to deployment guide by @siddvenk in #1578
- [docs] Updates offline mode document by @frankfliu in #1596
- [docs][lmi] fix doc links and update formatting for lists to be compa… by @siddvenk in #1599
- bad folder name by @jimburtoft in #1607
- [docs] Update OPTION_MODEL_ID usage by @frankfliu in #1608
- [docs] add benchmarking guide for lmi by @siddvenk in #1606
- [docs][lmi] standardize structure of backend user guides by @siddvenk in #1625
- [lmi][docs] replace old lmi docs with new lmi docs by @siddvenk in #1626
- [docs] fix some links that do not work on mkdocs site by @siddvenk in #1627
- [docs][lmi] update landing page sample notebook links by @siddvenk in #1660
- [docs] Update docs to DJL 0.27.0 by @xyang16 in #1705
- add mpi conceptual guide by @lanking520 in #1675
- Update lmi-dist docs v9 by @rohithkrn in #1706
- [docs][lmi] update guidance on advanced configurations by @siddvenk in #1716
- [docs] Minor updates to instance-type-selection doc by @maaquib in #1737
- nits by @ydm-amazon in #1741
- [docs] Update LMI conceptual guide by @xyang16 in #1736
- [doc] Add snapshot download to storing models in s3 doc by @tosterberg in #1739
- [Docs] Update Endpoint Deployment guide to specify advanced config options by @nskool in #1740
- [docs][lmi] update input/output schema doc by @siddvenk in #1743
- [doc] update testing_custom_script.md by @sindhuvahinis in #1742
- [Docs]Add link to configuration in benchmarking doc by @rohithkrn in #1738
- [docs][lmi] add huggingface accelerate user guide by @siddvenk in #1755
- [docs][lcnc] update documentation with lcnc user journey by @siddvenk in #1756
## CI/CD
- [docker] Upgrade aiccl version to 1.1 by @xyang16 in #1491
- Updates DJL version to 0.27.0 by @siddvenk in #1493
- [console] Updates axios to 1.6.5 by @frankfliu in #1496
- test different options of rolling batch option for trt-llm by @rohithkrn in #1504
- [test] print http message when assertion fail by @frankfliu in #1503
- [CI][IB] Supports cloudwatch saving by @zachgk in #1474
- [ci] publish scheduled workflow failures to cloudwatch for monitoring by @siddvenk in #1513
- [ci] fix action metric publish on failure by @siddvenk in #1514
- upgrade torch neuronx following the 2.16.0 guideline by @lanking520 in #1517
- [ci] Upgrade github actions nodejs 16 to nodejs 20 by @frankfliu in #1522
- [ci] remove partition steps for deepspeed/hf based models by @siddvenk in #1524
- [CI][IB] Specify container through template by @zachgk in #1519
- [ci] Upgrade codeql-actions to v3 by @frankfliu in #1526
- [ci] Upgrade aws-actions/configure-aws-credentials to v4 by @frankfliu in #1525
- [ci] refactor cloudwatch metric publishing to avoid needing changes i… by @siddvenk in #1527
- [ci] move cw publish step to github hosted runner by @siddvenk in #1528
- [CI][IB] Benchmark TGI Models by @zachgk in #1529
- [docker] Adds performance tuning env var for aarch64 by @frankfliu in #1531
- [ci] Fix gpt-j timeout issues in inf2 integration by @tosterberg in #1535
- [docker] HUGGINGFACE_HUB_CACHE is deprecated by @frankfliu in #1542
- [CI] add vllm 0.3.1 into deps build by @lanking520 in #1549
- [Test] build test handler by @lanking520 in #1537
- [CI][Deps build] upgrade torch to 2.1.2 by @lanking520 in #1552
- [docker][TNX] Upgrade to 2.17.0 SDK by @tosterberg in #1556
- [docker] map select tgi env vars to lmi env vars by @siddvenk in #1554
- Updates vllm to 0.3.1 by @zachgk in #1539
- [docker] Fixes typo in java properties by @frankfliu in #1580
- update vllm to 0.3.2 by @lanking520 in #1579
- [CI] replace chatglm with s3 model by @ydm-amazon in #1597
- [ci] Publish prometheus to maven by @frankfliu in #1643
- [ci][lmi] add new no code tests by @siddvenk in #1642
- [cache] Updates DynamoDBLocal to 2.3.0 by @frankfliu in #1646
- [ci] Updating lmi-dist ci tests for rubikon-engine by @maaquib in #1651
- [CI] Fixes vllm unmerged LoRA tests by @zachgk in #1673
- [docker] translate HF_MODEL_TRUST_REMOTE_CODE to OPTION_TRUST_REMOTE_CODE by @siddvenk in #1688
- Upgrade dependency version by @xyang16 in #1694
- Upgrade to DJL 0.27.0 by @xyang16 in #1702
- [docker] Updates version to 0.27.0 by @xyang16 in #1710
- [neuron] Update to 2.18.0 SDK by @tosterberg in #1729
- [T5] Add integration test cases by @sindhuvahinis in #1732
- test lcnc on g6 by @lanking520 in #1746
- Test t5-xl for lcnc by @sindhuvahinis in #1749
- add blobfile as dependency by @lanking520 in #1758
- [CI] add extra wait time for TRTLLM conversion by @lanking520 in #1759
## New Contributors
- @marckarp made their first contribution in #1498
- @jimburtoft made their first contribution in #1607
- @nskool made their first contribution in #1622
- @ethnzhng made their first contribution in #1665
**Full Changelog**: v0.26.0...v0.27.0