# DJLServing v0.27.0 Release
## Key Changes
- Large Model Inference Containers 0.27.0 release
  - DeepSpeed container
    - Added DBRX and Gemma model support.
    - Provided general performance optimizations.
    - Added support for new performance-enhancing features such as speculative decoding.
  - TensorRT-LLM container
    - Upgraded to TensorRT-LLM 0.8.0.
  - Transformers NeuronX container
    - Upgraded to Transformers NeuronX 2.18.0.
- Multi-Adapter LoRA Support
  - Provided multi-adapter inference functionality in LMI DLCs.
- CX Usability Enhancements
  - Provided a seamless migration experience across different LMI DLCs.
  - Implemented the low-code/no-code (LCNC) experience.
  - Supported an OpenAI-compatible chat completions API (see the request sketch after this list).
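
The chat completions support follows the OpenAI request/response schema. Below is a minimal client sketch, not an official DJLServing example: it assumes an LMI container running locally on port 8080 that exposes an OpenAI-style `/v1/chat/completions` route; the host, port, route, and response field layout are assumptions to adapt to your deployment.

```python
# Minimal sketch of an OpenAI-style chat completions request against an LMI
# container. Host, port, route, and response layout are assumptions based on
# the standard OpenAI schema, not an official DJLServing example.
import json
import urllib.request

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize speculative decoding in one sentence."},
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

request = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed endpoint route
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    body = json.loads(response.read())
    # OpenAI-style responses place the generated text under choices[0].message.content.
    print(body["choices"][0]["message"]["content"])
```

Setting `"stream": true` in the payload should exercise the streaming chat completions path added in this release (#1674, #1761), assuming your deployment supports server-side streaming responses.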
## Enhancements
- [LMIDist] Allow passing in ignore_eos_token param by @xyang16 in #1489
- [LMIDist] Make ignore_eos_token default to false by @xyang16 in #1492
- Makes netty buffer size configurable by @frankfliu in #1494
- translate few vllm and trtllm params to HF format by @sindhuvahinis in #1500
- Align properties parsing to be similar to java by @ydm-amazon in #1502
- check for rolling batch "disable" value by @sindhuvahinis in #1506
- add max model length support on vLLM by @lanking520 in #1510
- Creates auto increment ID for models by @zachgk in #1109
- making default dtype to fp16 for compilation by @lanking520 in #1512
- Use server-provided seed if not in request params in deepspeed handler by @davidthomas426 in #1520
- [Neuron][WIP] add log probs calculation in Neuron by @lanking520 in #1516
- remove a bit of unused code by @ydm-amazon in #1536
- Remove unused device parameter from all rolling batch classes by @ydm-amazon in #1538
- dump OPTION_ env vars at startup by @siddvenk in #1541
- [vLLM] add enforce eager as an option by @lanking520 in #1547
- [awscurl] Prints invalid response by @frankfliu in #1550
- roll back enum changes by @ydm-amazon in #1551
- [awscurl] Refactor time to first byte calculation by @frankfliu in #1557
- [serving] Allows configure log level at runtime by @frankfliu in #1560
- Stream returns only after putting placeholder finishes by @zachgk in #1567
- type hints and comments for trtllm handler by @ydm-amazon in #1558
- [vLLM] add speculative configs by @lanking520 in #1553
- Add rolling batch type hints: Part 1 by @ydm-amazon in #1564
- Remove adapters preview flag by @zachgk in #1573
- [serving] Adds model event listener by @frankfliu in #1570
- [serving] Add model loading metrics by @frankfliu in #1576
- Speculative decoding in LMI-Dist by @KexinFeng in #1505
- type hints for scheduler rolling batch by @ydm-amazon in #1577
- [serving] Uses model.intProperty() api by @frankfliu in #1582
- [serving] Ignore CUDA OOM when collecting metrics by @frankfliu in #1581
- Remove test as the model is incompatible with transformers upgrade by @rohithkrn in #1575
- [serving] Adds rolling batch metrics by @frankfliu in #1583
- [serving] Uses dimension for model metric by @frankfliu in #1587
- rolling batch type hints part 3 by @ydm-amazon in #1584
- [SD][vLLM] record acceptance by @lanking520 in #1586
- [serving] Adds Prometheus metrics support by @frankfliu in #1593
- [feat] Benchmark code for speculative decoding in lmi-dist by @KexinFeng in #1591
- [lmi] add generated token count to details by @siddvenk in #1600
- [console] use StandardCharset instead of deprecated Charset by @siddvenk in #1601
- [awscurl] add download steps to README.md by @siddvenk in #1605
- [lmi][deprecated] remove option.s3url since it has been deprecated fo… by @siddvenk in #1610
- [serving] Skip testPrometheusMetrics when run in IDE by @frankfliu in #1611
- Use workflow template for workflow model_dir by @zachgk in #1612
- [Partition] Remove redundant model splitting, Improve Input Model Parsing by @a-ys in #1609
- Add handler for new lmi-dist by @rohithkrn in #1595
- [lmi] add parameter to allow full text including prompt to be returne… by @siddvenk in #1602
- support cuda driver on sagemaker by @lanking520 in #1618
- remove checker for awq with enforce eager by @lanking520 in #1620
- Add pytorch-gpu for security patching by @maaquib in #1621
- Refactor vllm and rubikon engine rolling batch by @rohithkrn in #1623
- Update TRT-LLM Dockerfile for v0.8.0 by @nskool in #1622
- [UX] sampling with vllm by @sindhuvahinis in #1624
- [vLLM] reduce speculative decoding gpu util to leave room for draft model by @lanking520 in #1628
- [lmi] update auto engine logic for vllm and lmi-dist by @siddvenk in #1617
- [python] Encode error in single line for jsonlines case. by @frankfliu in #1630
- Single model adapter API by @zachgk in #1616
- remove all current no-code test cases by @siddvenk in #1635
- Update the build script to use vLLM 0.3.3 by @lanking520 in #1637
- Update lmi-dist rolling batch to use rubikon engine by @rohithkrn in #1639
- Adds adapter registration options by @zachgk in #1634
- Supports vLLM LoRA adapters by @zachgk in #1633
- add customer required field by @lanking520 in #1640
- [tnx] bump optimum version by @tosterberg in #1632
- Updates dependencies version to latest by @frankfliu in #1647
- updated dependencies for LMI by @lanking520 in #1648
- [DO NOT MERGE][CAN APPROVE]change flash attn url by @lanking520 in #1650
- [cache] Remove gson from fatjar of cache by @frankfliu in #1649
- [python] Move output formatter to request level by @xyang16 in #1644
- [tnx] improve model partitioning time by @tosterberg in #1652
- [tnx] support codellama 70b instruct tokenizer by @tosterberg in #1653
- [python] Remove output_formatter from vllm and lmi-dist sampling para… by @xyang16 in #1654
- [wlm] Makes generateHuggingFaceConfigUri public by @frankfliu in #1656
- [tnx] fix output formatter as param implementation by @tosterberg in #1657
- [lmi] use hf token to get model config for gated/private models by @siddvenk in #1658
- [UX] Changing some default parameters by @sindhuvahinis in #1659
- add parameters to part of the field by @lanking520 in #1655
- Support max_dynamic_batch_size property name by @sindhuvahinis in #1662
- [lmi] add stream parameter to enable per request streaming config by @siddvenk in #1666
- [Optimum] Change the default value of temperature by @sindhuvahinis in #1668
- [TRTLLM] Add entrypoint for SM Neo AOT compilation by @ethnzhng in #1665
- Support chat completions API by @xyang16 in #1604
- [lmi][lcnc] fallback to accelerate backend when non text-generation m… by @siddvenk in #1667
- support streaming for chat completions, refactor chat parsing to stan… by @siddvenk in #1674
- Increase test timeout for TRT-LLM compilation and tokenizer fix by @nskool in #1672
- Add lora support in lmi-dist by @rohithkrn in #1664
- install huggingface_hub for lmi-dist lora test by @rohithkrn in #1679
- [serving] Allows use OPTION_ENGINE env var by @frankfliu in #1676
- [lmi-dist]validate usage lora and speculative together by @rohithkrn in #1684
- [lmi] log warnings for unused generation parameters across all rollin… by @siddvenk in #1686
- [chat][lmi] restrict chat completions support to rolling batch use-cases by @siddvenk in #1687
- [TRTLLM] Add env var for Neo cache by @ethnzhng in #1683
- deprecate usage of HF_TRUST_REMOTE_CODE by @siddvenk in #1691
- [ci][lcnc] add trt-llm container to no code tests by @siddvenk in #1692
- [serving] Separate download draft model from downloadS3() by @frankfliu in #1693
- [TrtLLM] Python backend support for T5 model by @sindhuvahinis in #1680
- [TrtLLM] Support JIT compilation and dynamic batch for TrtLLM python backend by @sindhuvahinis in #1678
- [lcnc] support dbrx model by @siddvenk in #1695
- add get_tokenizer for ds and scheduler rolling batch by @siddvenk in #1701
- pass trust_remote_code and revision to from_pretrained methods that a… by @siddvenk in #1698
- Adds additional adapter tests by @zachgk in #1696
- [python] Update chat properties by @xyang16 in #1709
- [python] Set do_sample from temperature in chat completions by @xyang16 in #1713
- [ci][lcnc] do not skip jobs if previous job fails by @siddvenk in #1715
- Supports adapters in download dir by @zachgk in #1690
- [python] Set repetition_penalty from presence_penalty in chat completions by @xyang16 in #1717
- [feat] Build flash-attn from fork which allows smaller block-sizes for paged-attn by @maaquib in #1689
- [TRTLLM] T5 python backend log_probs support by @sindhuvahinis in #1697
- [python] Allows send more than 32k utf-8 string by @frankfliu in #1719
- [lcnc] add starcoder2 model type to auto configure list by @siddvenk in #1722
- [WIP][Do Not Merge]check for machine type from given information by @lanking520 in #1721
- [TRTLLM] Remove gptneox from LCNC trtllm tests by @nskool in #1725
- [trtllm] Do not translate presence_penalty in TRTLLM by @xyang16 in #1730
- [awscurl] Add p50 to TTFB by @frankfliu in #1745
- ensure the parameters returned in details are the same parameters use… by @siddvenk in #1748
- [Neo] Create neo_utils file for shared functions by @a-ys in #1726
- [tnx] add safetensor loading for aot compilation by @tosterberg in #1744
- [lcnc] make default rolling batch size 256 for vllm, lmi-dist by @siddvenk in #1750
- [lcnc] make lmi-dist the default for all supported architectures by @siddvenk in #1751
- [Neo] Add Neo Neuron entrypoint script by @a-ys in #1752
- [awscurl] Support OpenAI style response by @frankfliu in #1754
- [tnx] support parse input func and seed param by @tosterberg in #1757
- [chat] Add role for the first token in the chat completions streaming response by @xyang16 in #1761
## Known Issues
- TensorRT-LLM container
  - CodeLlama TP8 compilation sometimes fails.
  - Mistral 7B and Mixtral 8x7B have correctness issues when compiled with TP4 or TP8; use TP1 or TP2 to mitigate (a mitigation sketch follows this list).
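
As a hedged mitigation sketch for the TP4/TP8 issue: the variable names below follow the LMI `OPTION_` environment variable convention referenced elsewhere in these notes, but the model id and values are placeholders; check the exact property names against the LMI configuration docs for your container.

```python
# Hypothetical mitigation sketch: pin tensor parallelism to 1 or 2 when serving
# Mistral 7B / Mixtral 8x7B on the TensorRT-LLM container. Variable names follow
# the LMI OPTION_ convention; the model id and values are placeholders.
lmi_container_env = {
    "HF_MODEL_ID": "mistralai/Mistral-7B-v0.1",    # placeholder model id
    "OPTION_ROLLING_BATCH": "trtllm",              # assumed TRT-LLM rolling batch setting
    "OPTION_TENSOR_PARALLEL_DEGREE": "2",          # avoid TP4/TP8 for these models
}

# Pass `lmi_container_env` as the container environment, for example via the
# `env` argument of a SageMaker Model, when deploying the TensorRT-LLM DLC.
```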
## Bug Fixes
- Fix trtllm check logic by @rohithkrn in #1501
- [serving] Fixes flaky unittest by @frankfliu in #1508
- [test] Fixes model server flaky test by @frankfliu in #1523
- [fix][ci] avoid early exit of script for failure case by @siddvenk in #1530
- [serving] Fixes testThrottle() flaky test by @frankfliu in #1540
- [fix] NeuronRollingBatch jsonlines integration by @tosterberg in #1548
- [serving] Fixes HF_REVISION env variable by @frankfliu in #1572
- [vLLM] fix awq OOM issue due to CUDA compat by @lanking520 in #1592
- [specdec] Fix of lmi-dist-rollingbatch by @KexinFeng in #1590
- Workflow Templates by @zachgk in #1594
- add type hints neuron part 1 by @ydm-amazon in #1598
- [awscurl] Fixes aws coral stream issue by @frankfliu in #1613
- [partition] fix checking for model files by @a-ys in #1636
- [ci][fix] remove no code test from stop runner condition by @siddvenk in #1641
- [fix] Update lmi_dist handler to account for engine's Request api change, adapters fix and temperature param fix by @maaquib in #1645
- fix a bug with output formatter by @lanking520 in #1661
- output_formatter bug fix by @xyang16 in #1663
- [Fix] set batch size property for handlers by @sindhuvahinis in #1669
- [fix] restrict per request streaming to rolling batch use-cases by @siddvenk in #1670
- [fix][partition] fix partition script due to missing python files by @siddvenk in #1677
- fix a bug in partition by @lanking520 in #1682
- Fix incorrect assignment in shell script by @nskool in #1685
- fix typo in ds handler for lora models by @siddvenk in #1707
- fix params for from_config model instantiation by @siddvenk in #1708
- [fix] install huggingface_hub for vllm lora test by @maaquib in #1711
- [python] Fix chat completions logprobs output by @xyang16 in #1712
- [fix] Fix logprobs format in chat completions by @xyang16 in #1714
- [lmi-dist] fix validation of mpi in properties validation by @siddvenk in #1723
- [fix] fix custom input and output formatting by @siddvenk in #1728
- [fix] Fix integration test for chat completions by @xyang16 in #1731
- [TRTLLM Python backend]Fix the output format for client side batching in dynamic batch by @sindhuvahinis in #1718
- fix parse_input signature for backward compatibility by @sindhuvahinis in #1733
- fix naming of some inconsistent fields in output formatters by @siddvenk in #1747
- fix issue with duplicate model loading for local hf model on sagemaker by @siddvenk in #1753
## Documentation
- Update trtllm_manual_convert_tutorial.md by @marckarp in #1498
- [doc] add lmi input output schema doc by @sindhuvahinis in #1499
- [doc] fix vllm engine by @sindhuvahinis in #1511
- change line in documentation examples (vllm serving.properties) by @ydm-amazon in #1543
- update random doc by @ydm-amazon in #1544
- [docs][lmi] updating structure for lmi docs by @siddvenk in #1533
- [Docs] vllm docs by @lanking520 in #1534
- [Docs]TRT-LLM user guide by @rohithkrn in #1546
- Update trt-llm docs by @rohithkrn in #1562
- [docs] Remove model_dir from document by @frankfliu in #1561
- [docs] add deepspeed user guide by @siddvenk in #1563
- [docs] Add lmi-dist user guide by @maaquib in #1569
- [docs] Add TNX AOT docs for Llama-2-70B by @tosterberg in #1545
- [docs] Updates docker readme. by @frankfliu in #1571
- [docs] TNX user guide by @tosterberg in #1555
- [docs] add readme for user guides, remove unused user guides by @siddvenk in #1574
- add test model tutorials by @lanking520 in #1585
- [docs] add deployment guide section for deploying model to endpoint by @siddvenk in #1568
- add LMI feature matrix by @lanking520 in #1559
- [doc] adding backend selection guide by @siddvenk in #1588
- [docs] add configuration type to advanced configurations to inform us… by @siddvenk in #1589
- [docs] add configuration doc to deployment guide by @siddvenk in #1578
- [docs] Updates offline mode document by @frankfliu in #1596
- [docs][lmi] fix doc links and update formatting for lists to be compa… by @siddvenk in #1599
- bad folder name by @jimburtoft in #1607
- [docs] Update OPTION_MODEL_ID usage by @frankfliu in #1608
- [docs] add benchmarking guide for lmi by @siddvenk in #1606
- [docs][lmi] standardize structure of backend user guides by @siddvenk in #1625
- [lmi][docs] replace old lmi docs with new lmi docs by @siddvenk in #1626
- [docs] fix some links that do not work on mkdocs site by @siddvenk in #1627
- [docs][lmi] update landing page sample notebook links by @siddvenk in #1660
- [docs] Update docs to DJL 0.27.0 by @xyang16 in #1705
- add mpi conceptual guide by @lanking520 in #1675
- Update lmi-dist docs v9 by @rohithkrn in #1706
- [docs][lmi] update guidance on advanced configurations by @siddvenk in #1716
- [docs] Minor updates to instance-type-selection doc by @maaquib in #1737
- nits by @ydm-amazon in #1741
- [docs] Update LMI conceptual guide by @xyang16 in #1736
- [doc] Add snapshot download to storing models in s3 doc by @tosterberg in #1739
- [Docs] Update Endpoint Deployment guide to specify advanced config options by @nskool in #1740
- [docs][lmi] update input/output schema doc by @siddvenk in #1743
- [doc] update testing_custom_script.md by @sindhuvahinis in #1742
- [Docs]Add link to configuration in benchmarking doc by @rohithkrn in #1738
- [docs][lmi] add huggingface accelerate user guide by @siddvenk in #1755
- [docs][lcnc] update documentation with lcnc user journey by @siddvenk in #1756
## CI/CD
- [docker] Upgrade aiccl version to 1.1 by @xyang16 in #1491
- Updates DJL version to 0.27.0 by @siddvenk in #1493
- [console] Updates axios to 1.6.5 by @frankfliu in #1496
- test different options of rolling batch option for trt-llm by @rohithkrn in #1504
- [test] print http message when assertion fail by @frankfliu in #1503
- [CI][IB] Supports cloudwatch saving by @zachgk in #1474
- [ci] publish scheduled workflow failures to cloudwatch for monitoring by @siddvenk in #1513
- [ci] fix action metric publish on failure by @siddvenk in #1514
- upgrade torch neuronx following the 2.16.0 guideline by @lanking520 in #1517
- [ci] Upgrade github actions nodejs 16 to nodejs 20 by @frankfliu in #1522
- [ci] remove partition steps for deepspeed/hf based models by @siddvenk in #1524
- [CI][IB] Specify container through template by @zachgk in #1519
- [ci] Upgrade codeql-actions to v3 by @frankfliu in #1526
- [ci] Upgrade aws-actions/configure-aws-credentials to v4 by @frankfliu in #1525
- [ci] refactor cloudwatch metric publishing to avoid needing changes i… by @siddvenk in #1527
- [ci] move cw publish step to github hosted runner by @siddvenk in #1528
- [CI][IB] Benchmark TGI Models by @zachgk in #1529
- [docker] Adds performance tuning env var for aarch64 by @frankfliu in #1531
- [ci] Fix gpt-j timeout issues in inf2 integration by @tosterberg in #1535
- [docker] HUGGINGFACE_HUB_CACHE is deprecated by @frankfliu in #1542
- [CI] add vllm 0.3.1 into deps build by @lanking520 in #1549
- [Test] build test handler by @lanking520 in #1537
- [CI][Deps build] upgrade torch to 2.1.2 by @lanking520 in #1552
- [docker][TNX] Upgrade to 2.17.0 SDK by @tosterberg in #1556
- [docker] map select tgi env vars to lmi env vars by @siddvenk in #1554
- Updates vllm to 0.3.1 by @zachgk in #1539
- [docker] Fixes typo in java properties by @frankfliu in #1580
- update vllm to 0.3.2 by @lanking520 in #1579
- [CI] replace chatglm with s3 model by @ydm-amazon in #1597
- [ci] Publish prometheus to maven by @frankfliu in #1643
- [ci][lmi] add new no code tests by @siddvenk in #1642
- [cache] Updates DynamoDBLocal to 2.3.0 by @frankfliu in #1646
- [ci] Updating lmi-dist ci tests for rubikon-engine by @maaquib in #1651
- [CI] Fixes vllm unmerged LoRA tests by @zachgk in #1673
- [docker] translate HF_MODEL_TRUST_REMOTE_CODE to OPTION_TRUST_REMOTE_CODE by @siddvenk in #1688
- Upgrade dependency version by @xyang16 in #1694
- Upgrade to DJL 0.27.0 by @xyang16 in #1702
- [docker] Updates version to 0.27.0 by @xyang16 in #1710
- [neuron] Update to 2.18.0 SDK by @tosterberg in #1729
- [T5] Add integration test cases by @sindhuvahinis in #1732
- test lcnc on g6 by @lanking520 in #1746
- Test t5-xl for lcnc by @sindhuvahinis in #1749
- add blobfile as dependency by @lanking520 in #1758
- [CI] add extra wait time for TRTLLM conversion by @lanking520 in #1759
## New Contributors
- @marckarp made their first contribution in #1498
- @jimburtoft made their first contribution in #1607
- @nskool made their first contribution in #1622
- @ethnzhng made their first contribution in #1665
**Full Changelog**: v0.26.0...v0.27.0