Commit e4eabad

Merge branch 'main_307c5238' into remove-attributes-from-processors-ydshieh
2 parents 93d2c4d + 307c523

225 files changed (+504 / -2830 lines)


.github/workflows/check_failed_tests.yml

Lines changed: 59 additions & 11 deletions

```diff
@@ -41,9 +41,14 @@ env:

 jobs:
   check_new_failures:
-    name: " "
+    name: "Find commits for new failing tests"
+    strategy:
+      matrix:
+        run_idx: [1]
     runs-on:
       group: aws-g5-4xlarge-cache
+    outputs:
+      process: ${{ steps.check_file.outputs.process }}
     container:
       image: ${{ inputs.docker }}
       options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
@@ -54,14 +59,17 @@ jobs:
         path: /transformers/ci_results_${{ inputs.job }}

       - name: Check file
+        id: check_file
         working-directory: /transformers
         run: |
           if [ -f ci_results_${{ inputs.job }}/new_failures.json ]; then
             echo "`ci_results_${{ inputs.job }}/new_failures.json` exists, continue ..."
             echo "process=true" >> $GITHUB_ENV
+            echo "process=true" >> $GITHUB_OUTPUT
           else
             echo "`ci_results_${{ inputs.job }}/new_failures.json` doesn't exist, abort."
             echo "process=false" >> $GITHUB_ENV
+            echo "process=false" >> $GITHUB_OUTPUT
           fi

       - uses: actions/download-artifact@v4
@@ -118,6 +126,10 @@ jobs:
         run: |
           python3 utils/print_env.py

+      - name: Install pytest-flakefinder
+        if: ${{ env.process == 'true' }}
+        run: python3 -m pip install pytest-flakefinder
+
       - name: Show installed libraries and their versions
         working-directory: /transformers
         if: ${{ env.process == 'true' }}
@@ -126,25 +138,63 @@
       - name: Check failed tests
         working-directory: /transformers
         if: ${{ env.process == 'true' }}
-        run: python3 utils/check_bad_commit.py --start_commit ${{ inputs.start_sha }} --end_commit ${{ env.END_SHA }} --file ci_results_${{ inputs.job }}/new_failures.json --output_file new_failures_with_bad_commit.json
+        run: python3 utils/check_bad_commit.py --start_commit ${{ inputs.start_sha }} --end_commit ${{ env.END_SHA }} --file ci_results_${{ inputs.job }}/new_failures.json --output_file new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json

       - name: Show results
         working-directory: /transformers
         if: ${{ env.process == 'true' }}
         run: |
-          ls -l new_failures_with_bad_commit.json
-          cat new_failures_with_bad_commit.json
+          ls -l new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
+          cat new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
+
+      - name: Upload artifacts
+        uses: actions/upload-artifact@v4
+        with:
+          name: new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}
+          path: /transformers/new_failures_with_bad_commit_${{ inputs.job }}_${{ matrix.run_idx }}.json
+
+  process_new_failures_with_commit_info:
+    name: "process bad commit reports"
+    needs: check_new_failures
+    if: needs.check_new_failures.outputs.process == 'true'
+    runs-on:
+      group: aws-g5-4xlarge-cache
+    container:
+      image: ${{ inputs.docker }}
+      options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
+    steps:
+      - uses: actions/download-artifact@v4
+        with:
+          name: ci_results_${{ inputs.job }}
+          path: /transformers/ci_results_${{ inputs.job }}

-      - name: Checkout back
+      - uses: actions/download-artifact@v4
+        with:
+          pattern: new_failures_with_bad_commit_${{ inputs.job }}*
+          path: /transformers/new_failures_with_bad_commit_${{ inputs.job }}
+          merge-multiple: true
+
+      - name: Check files
+        working-directory: /transformers
+        run: |
+          ls -la /transformers
+          ls -la /transformers/new_failures_with_bad_commit_${{ inputs.job }}
+
+      # Currently, we only run with a single runner by using `run_idx: [1]`. We might try to run with multiple runners
+      # to further reduce the false positive caused by flaky tests, which requires further processing to merge reports.
+      - name: Merge files
+        shell: bash
         working-directory: /transformers
-        if: ${{ env.process == 'true' }}
         run: |
-          git checkout ${{ inputs.start_sha }}
+          cp /transformers/new_failures_with_bad_commit_${{ inputs.job }}/new_failures_with_bad_commit_${{ inputs.job }}_1.json new_failures_with_bad_commit.json
+
+      - name: Update clone
+        working-directory: /transformers
+        run: git fetch && git checkout ${{ inputs.commit_sha || github.sha }}

       - name: Process report
         shell: bash
         working-directory: /transformers
-        if: ${{ env.process == 'true' }}
         env:
           ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
           TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN: ${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}
@@ -156,7 +206,6 @@
       - name: Process report
         shell: bash
         working-directory: /transformers
-        if: ${{ env.process == 'true' }}
         env:
           ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }}
           TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN: ${{ secrets.TRANSFORMERS_CI_RESULTS_UPLOAD_TOKEN }}
@@ -171,13 +220,12 @@

       - name: Prepare Slack report title
         working-directory: /transformers
-        if: ${{ env.process == 'true' }}
         run: |
           pip install slack_sdk
           echo "title=$(python3 -c 'import sys; sys.path.append("utils"); from utils.notification_service import job_to_test_map; ci_event = "${{ inputs.ci_event }}"; job = "${{ inputs.job }}"; test_name = job_to_test_map[job]; title = f"New failed tests of {ci_event}" + ":" + f" {test_name}"; print(title)')" >> $GITHUB_ENV

       - name: Send processed report
-        if: ${{ env.process == 'true' && !endsWith(env.REPORT_TEXT, '{}') }}
+        if: ${{ !endsWith(env.REPORT_TEXT, '{}') }}
         uses: slackapi/slack-github-action@6c661ce58804a1a20f6dc5fbee7f0381b469e001
         with:
           # Slack channel id, channel name, or user id to post message.
```
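Two details of this workflow change are worth spelling out. The `process` flag is now written to `$GITHUB_OUTPUT` as well as `$GITHUB_ENV`, which lets `check_new_failures` expose it through `outputs:` so the new `process_new_failures_with_commit_info` job can be gated with `needs.check_new_failures.outputs.process`. And the `Check failed tests` step calls `utils/check_bad_commit.py`, which searches the `start_sha..END_SHA` range for the commit that introduced each newly failing test. Below is a minimal sketch of that find-first-bad-commit idea; this is hypothetical code for illustration only, the actual implementation lives in `utils/check_bad_commit.py`.

```py
# Hypothetical sketch of finding the first commit where a test starts failing.
# Assumes the test passes at `start` and fails at `end`; illustration only,
# not the logic of utils/check_bad_commit.py.
import subprocess


def test_passes_at(commit: str, test_id: str) -> bool:
    """Check out `commit` and return True if the given test passes there."""
    subprocess.run(["git", "checkout", "-q", commit], check=True)
    return subprocess.run(["python3", "-m", "pytest", "-q", test_id]).returncode == 0


def find_bad_commit(start: str, end: str, test_id: str) -> str:
    """Binary-search the commits in (start, end] for the first one that fails."""
    commits = subprocess.run(
        ["git", "rev-list", "--reverse", f"{start}..{end}"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if test_passes_at(commits[mid], test_id):
            lo = mid + 1  # the failure was introduced after this commit
        else:
            hi = mid  # this commit already fails; culprit is here or earlier
    return commits[lo]
```

The newly installed `pytest-flakefinder` serves the same goal as the `run_idx` matrix: rerunning a test several times before blaming a commit reduces false positives from flaky tests, as the inline comment in the diff notes.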

docker/transformers-all-latest-gpu/Dockerfile

Lines changed: 2 additions & 1 deletion

```diff
@@ -24,7 +24,8 @@ RUN git clone https://github.com/huggingface/transformers && cd transformers &&
 # 1. Put several commands in a single `RUN` to avoid image/layer exporting issue. Could be revised in the future.
 # 2. Regarding `torch` part, We might need to specify proper versions for `torchvision` and `torchaudio`.
 #    Currently, let's not bother to specify their versions explicitly (so installed with their latest release versions).
-RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime] && [ ${#PYTORCH} -gt 0 -a "$PYTORCH" != "pre" ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile && echo torch=$VERSION && [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio torchcodec --extra-index-url https://download.pytorch.org/whl/$CUDA || python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio torchcodec --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA
+# 3. For `torchcodec<0.8`: this is quickly added as torch 2.9.0 + torchcodec 0.8.0 fails on our CI env. Need to remove later once they work.
+RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime] && [ ${#PYTORCH} -gt 0 -a "$PYTORCH" != "pre" ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile && echo torch=$VERSION && [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio "torchcodec<0.8" --extra-index-url https://download.pytorch.org/whl/$CUDA || python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio torchcodec --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA

 RUN python3 -m pip install --no-cache-dir -U timm
```
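The `"torchcodec<0.8"` pin is the entire fix here. To make the constraint it encodes concrete, a hypothetical sanity check (not part of the Dockerfile or the CI) might look like:

```py
# Hypothetical check of the torch/torchcodec pairing; illustration only.
# The Dockerfile pin works around torch 2.9.0 + torchcodec 0.8.0 failing
# together on the CI environment.
from importlib.metadata import version

from packaging.version import Version

torch_v = Version(version("torch"))
torchcodec_v = Version(version("torchcodec"))

if torch_v >= Version("2.9.0") and torchcodec_v >= Version("0.8.0"):
    raise RuntimeError(
        f"torch {torch_v} + torchcodec {torchcodec_v} is a known-bad pairing on CI; "
        "pin torchcodec<0.8 until the combination works."
    )
```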

docs/source/en/perf_infer_gpu_multi.md

Lines changed: 33 additions & 4 deletions

````diff
@@ -45,17 +45,21 @@ This guide shows how to enable tensor parallelism with Transformers and differen

 ## Partitioning a model

-Transformers supports tensor parallelism if a model has a `tp_plan`. Set `tp_plan="auto"` to automatically use a tensor parallelism plan based on a model's predefined configuration.
+Transformers supports tensor parallelism if a model has a `tp_plan`. There are two ways to partition a model.
+
+- Set `tp_plan="auto"` to automatically use a tensor parallelism plan based on a model's predefined configuration.
+- Define and pass a manual `tp_plan`.
+
+<hfoptions id="tp_plan">
+<hfoption id="auto plan">

 ```py
 import os
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer

 # model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct" # better to visualize all the possible strategies
-model_id = "meta-llama/Meta-Llama-3-8B-Instruct" # better for smaller number of GPUs
-
-model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, tp_plan="auto")
+model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct" , dtype=torch.bfloat16, tp_plan="auto")
 print(model._tp_plan)

 tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
@@ -72,6 +76,31 @@ Launch the inference script above on [torchrun](https://pytorch.org/docs/stable/
 torchrun --nproc-per-node 4 demo.py
 ```

+</hfoption>
+<hfoption id="manual plan">
+
+Define a tensor parallel plan for each layer in `tp_plan` and pass it to [`~PreTrainedModel.from_pretrained`]. The example below uses column and row partitioning. See the [Partitioning strategies](#partitioning-strategies) section for other supported strategies.
+
+Manual partitioning requires deep understanding of model architecture and strategy interactions. Poor partitioning choices create slow models that fail or produce incorrect results. The [Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=tensor_parallelism) explains partitioning strategies in detail.
+
+```py
+from transformers import AutoModelForCausalLM
+
+tp_plan = {
+    "model.layers.*.self_attn.q_proj": "colwise",
+    "model.layers.*.self_attn.k_proj": "colwise",
+    "model.layers.*.self_attn.v_proj": "colwise",
+    "model.layers.*.self_attn.o_proj": "rowwise",
+    ...
+}
+
+model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto", tp_plan=tp_plan)
+print(model.tp_plan)
+```
+
+</hfoption>
+</hfoptions>
+
 ## Partitioning strategies

 All partitioning strategies are defined in the [`ParallelInterface`] class which maps a string to the strategy implementation. You don't need to interact with this class directly since all the strategies are set with `tp_plan` in [`~PreTrainedModel.from_pretrained`], but it is useful for checking what strategies are available.
````
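For reference, the snippets in this diff combine into a runnable script. The sketch below assumes 4 GPUs, access to the gated Llama 3 checkpoint, and an illustrative prompt and generation settings not taken from the doc; launch it with `torchrun --nproc-per-node 4 demo.py` as described above.

```py
# demo.py, a minimal end-to-end sketch of the auto tp_plan path.
# Assumptions: 4 GPUs, access to meta-llama/Meta-Llama-3-8B-Instruct,
# and illustrative prompt/generation settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, tp_plan="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tensor parallelism shards the layers, not the batch, so every rank
# tokenizes and sees the full input.
inputs = tokenizer("Tensor parallelism shards a model across", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```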

docs/source/ja/perf_train_gpu_many.md

Lines changed: 0 additions & 2 deletions

```diff
@@ -472,8 +472,6 @@ FlexFlow performs 4D parallelization over Sample-Operator-Attribute-Parameter

 Therefore, the promise of this framework is very appealing: it runs a 30-minute simulation on the cluster of your choice and produces the best strategy for making optimal use of that specific environment. If you add, remove, or replace parts, it reruns and re-optimizes the plan for the new setup, after which you can train. Different setups get their own optimizations.

-🤗 Transformers status: not yet integrated. Models are already FX-traceable via [transformers.utils.fx](https://github.com/huggingface/transformers/blob/master/src/transformers/utils/fx.py), so someone would need to figure out the steps required to make FlexFlow work.
-
 ## Which Strategy To Use When

 Here is a very rough outline of which parallelization strategy to use when. The first option on each list is typically faster.
```

docs/source/ko/_toctree.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -1057,7 +1057,7 @@
       title: FLAVA
     - local: model_doc/gemma3
       title: Gemma3
-    - local: in_translation
+    - local: model_doc/gemma3n
       title: Gemma3n
     - local: in_translation
       title: GIT
```
