Skip to content

Conversation

@Chao1Han
Copy link
Owner

Fixes #ISSUE_NUMBER

zhxchen17 and others added 26 commits August 6, 2025 15:00
…ch#158926)

Summary:
The following type of objects don't need to be serialized for precompile:
1. PyCapsule because we don't guard on C binding objects in meaningful ways.
2. Code object because we only id matching on these but id matches will always be dropped for precompile.
3. Nested function objects since we also ban CLOSURE_MATCH.

Test Plan:
buck run mode/opt test/dynamo:test_dynamo -- -k test_skipped_objects

Rollback Plan:

Differential Revision: D78816888

Pull Request resolved: pytorch#158926
Approved by: https://github.com/jamesjwu
Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable.

Pull Request resolved: pytorch#158560
Approved by: https://github.com/yushangdi
…) (pytorch#159801)

Summary:

### PR Context

Introduce simple replication logic via PGTransport. The goal is to showcase a working prototype of replication via PGTransport, in this impl we assume world_sizes are equal allowing us to create perfect bi-directional pairs for the purpose of choosing replica "partners".

Test Plan:
CI

Rollback Plan:

Differential Revision: D79590797

Pull Request resolved: pytorch#159801
Approved by: https://github.com/saumishr
…es (pytorch#159957)

After pytorch#157905 started using cuBLAS for row-wise scaling on CUDA 12.9+, this broke some downstream tests for fp8 which were testing "odd" shapes. After checking in with the cuBLAS team this turned out to be due to the scale tensors' starting addresses not being aligned to 16 bytes. PyTorch storages are always aligned at 256 bytes, hence this came from a "slicing" of the scale tensor being done inside async-TP when chunking a matmul in order to overlap it with reduce-scatter.

Pull Request resolved: pytorch#159957
Approved by: https://github.com/vkuzo, https://github.com/danielvegamyhre
Summary: This fixes a bug in the execution fram cleanup logic - previously, whenever we hit the time interval to clear out the frames, we were removing any cached execution frames beyond the configured minimum number (frameEntry.used was unused). Instead, we only want to clear frames that were NOT USED in during the last time interval. This diff refactors the executor to have the correct logic.

Test Plan:
```
buck2 test 'mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:model_runner_test -- ModelRunnerTest.Basic_InterpreterCuda_Multithread_Cleanup --run-disabled --print-passing-details
```

Rollback Plan:

Differential Revision: D78621408

Pull Request resolved: pytorch#158717
Approved by: https://github.com/dolpm
Summary: This PR solves two issues:

1. When lowering the all_reduce op, Inductor expects to convert it to the in-place version, all_reduce_, but it was calling ir._AllReduceKernel.create_inplace instead of ir._AllReduce_Kernel.create_inplace. This triggers a tricky bug in AOIT because it generates cpp call to the functional version aoti_torch_cpu__c10d_functional_all_reduce, but later corresponding wait operation will still wait on the input to aoti_torch_cpu__c10d_functional_all_reduce instead of the output from aoti_torch_cpu__c10d_functional_all_reduce. This causes unwaited tensor leading to memory leak.

2. Since AOTI generates the inplace version aoti_torch_cpu__c10d_functional_all_reduce_ now. The return tensor from aoti_torch_cpu__c10d_functional_all_reduce_ doesn't get used. It will be released when the program exists, so it's not a memory leak but it will unnecessarily hold that tensor which causes high memory water mark. This PR generates tensor delete operation right after calling aoti_torch_cpu__c10d_functional_all_reduce_.

Pull Request resolved: pytorch#159818
Approved by: https://github.com/henryhu6, https://github.com/yushangdi
This script adds a simple dataloading benchmark tracking throughput and memory.

The output looks like this
```
System Information:
  PyTorch version: 2.9.0a0+gitf87d117
  PyTorch location: /home/divyanshkhanna/pytorch/torch/__init__.py
  Torchvision version: 0.24.0a0+f52c4f1
  Torchvision location: /home/divyanshkhanna/pytorch/vision/torchvision/__init__.py
  CUDA available: True
  CUDA device: NVIDIA PG509-210
  CPU count: 192
  Physical CPU cores: 96
  Total system memory: 1510.11 GB

Loading dataset from imagenet/val (1 copies)
Dataset size: 50000

--- Benchmarking DataLoader with worker_method=multiprocessing ---
Memory before DataLoader creation: 500.59 MB

Detailed memory information:
  USS (Unique Set Size): 499.00 MB
  PSS (Proportional Set Size): 500.74 MB
  RSS (Resident Set Size): 497.39 MB
Memory after DataLoader creation: 1127.61 MB
Memory increase: 627.02 MB
Starting training loop with 1 epochs (max 100 batches per epoch)
Epoch 1, Batch 10, Time: 0.2910s, Memory: 12044.50 MB
Epoch 1, Batch 20, Time: 0.2909s, Memory: 12185.71 MB
Epoch 1, Batch 30, Time: 0.2909s, Memory: 10654.93 MB
Epoch 1, Batch 40, Time: 0.2909s, Memory: 12378.26 MB
Epoch 1, Batch 50, Time: 0.2907s, Memory: 12402.28 MB
Epoch 1, Batch 60, Time: 0.2909s, Memory: 10559.35 MB
Epoch 1, Batch 70, Time: 0.2907s, Memory: 12644.69 MB
Epoch 1, Batch 80, Time: 0.2909s, Memory: 12654.65 MB
Epoch 1, Batch 90, Time: 0.2909s, Memory: 12727.20 MB
Epoch 1, Batch 100, Time: 0.2908s, Memory: 12722.09 MB

Results:
  Worker method: multiprocessing
  DataLoader init time: 0.1553 seconds
  Average batch time: 0.3408 seconds
  Samples per second: 375.53
  Peak memory usage: 12738.76 MB
  Memory increase: 12238.17 MB
```

> TODO: This script right now is CPU-only friendly and GPU friendly. But it might be worth upgrading it to test against a canonical DistributedDataParallel setup on say a 1x8 node. Or maybe we can keep that as a separate script inside `benchmarks`
Pull Request resolved: pytorch#159432
Approved by: https://github.com/ramanishsingh
…trol. (pytorch#159938)

We need to add inductor debug symbol support for crash case debug. When we turn on generate debug symbol.
On Windows, it should create a [module_name].pdb file. It helps debug by WinDBG.
On Linux, it should create some debug sections in binary file.

I added UT for it also.

It works well on Windows inductor debug.
<img width="1648" height="833" alt="image" src="https://github.com/user-attachments/assets/5282a7de-cef3-4a38-9cd4-a0e63482c8b6" />

Pull Request resolved: pytorch#159938
Approved by: https://github.com/jansel, https://github.com/angelayi
Delete older enums, checks for MacOS-13.3+ for int64 support, etc

Fixes pytorch#159275
Pull Request resolved: pytorch#159912
Approved by: https://github.com/manuelcandales
Before this change there were build+test jobs:
 - s89 build+tests
 -  sm75 build+distributed_test
 - sm_75 build+pr_time_benchmark test
This change compiles all 3 builds into one (for 2 architectures) and skips testing sm86 as it never found any new regressions that were not found at the same time on sm89
Pull Request resolved: pytorch#159890
Approved by: https://github.com/clee2000, https://github.com/seemethere
Summary:
This fixes the build of TorchScript's XNNPACK dependency for our aarch64 device.

Thanks to andrewjcg for proposing this fix.

Rollback Plan:

Reviewed By: andrewjcg

Differential Revision: D79497613

Pull Request resolved: pytorch#159762
Approved by: https://github.com/frankseide, https://github.com/malfet

Co-authored-by: Frank Seide <seide@meta.com>
…ch#159662)

This only works for the jagged layout and for the non-batch and non-jagged dimensions.

I did this mostly by copy-pasting from the existing softmax implementation, but it seems fairly straightforward and I think it should work.
Pull Request resolved: pytorch#159662
Approved by: https://github.com/jbschlosser
The current implementation assumes test functions are resolved as test_module.TestClass.test_fn, however this would not work for modules nested in directories e.g. inductor.test_torchinductor.TestClass.test_fn
Pull Request resolved: pytorch#158637
Approved by: https://github.com/jbschlosser
**Summary**
Add macros for brgemm, so that callers (e.g., Torchao's cpp kernels) know which APIs are available. It is useful when callers need to co-work with old versions of PyTorch.
Pull Request resolved: pytorch#158629
Approved by: https://github.com/CaoE, https://github.com/Valentine233, https://github.com/ezyang
As we don't have any Intel Mac runners in CI for last 2+ years
Pull Request resolved: pytorch#159986
Approved by: https://github.com/atalman
Summary:

This reverts the part of pytorch#159383 for scaled_mm where now, like before,
we pass through the normal input_nodes (not the triton_input_nodes)
to select_algorithm

- pytorch#159383 refactored how kwargs are retrieved
- it introduced this notion of KernelInputs that wrap input_nodes
- scaled_mm uses unsqueezed input nodes for triton to retrieve params
- the issue: it uses a squeezed (regular) bias for select_algorithm
  instead

This fixes that by passing the original input nodes rather
than the triton input nodes.

Test Plan:

```
buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_rowwise_scaling_shape_1024,1024,512_has_bias_True_use_fast_accum_True_persistent_matmul_False (caffe2.test.inductor.test_fp8.TestFP8Lowering)'
buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_rowwise_scaling_shape_1024,1024,512_has_bias_True_use_fast_accum_True_persistent_matmul_True (caffe2.test.inductor.test_fp8.TestFP8Lowering)'
```

This set of tests was failing, and is passing now

Side note: these tests were failing I believe because the unsqueezed
bias made the ATEN choice no longer eligible, and there is some minor
numerical discrepancy between ATEN and Triton for this. I'm not sure
the test should be written like that, as we're implicitly relying on
ATEN being the choice here.

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D79717654](https://our.internmc.facebook.com/intern/diff/D79717654)
Pull Request resolved: pytorch#159948
Approved by: https://github.com/izaitsevfb, https://github.com/eellison
…h#159315)

As part of better engineering effort, we would like to improve out type support to improve dev experience in dynamo

This PR adds strict typing support to `torch/_dynamo/guards.py`

Running
```
mypy torch/_dynamo/guards.py --linecount-report /tmp/coverage_log
```

| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main  |  2030 | 3945 | 51.46% | 70 | 138 | 50.72% |
| This PR | 4055 | 4055 | 100.00% | 138 | 138 | 100.00% |
| Delta    | +2025 | +90 | +48.54% | +68 | 0 | +49.28% |

Pull Request resolved: pytorch#159315
Approved by: https://github.com/williamwen42, https://github.com/Skylion007
…k based log directory (pytorch#159874)

Summary: Writing torch.compile worked logs to dedicated_log_rank{RANK} if we're running on mast.

Test Plan:
See: D79456310

Pull Request resolved: pytorch#159874
Approved by: https://github.com/c00w
…s tests from pytorch#125438 (pytorch#157786)

These tests now pass on AArch64 in our downstream CI.

`test_quantization.py::TestNumericSuiteEager::test_mobilenet_v2 <- test/quantization/eager/test_numeric_suite_eager.py PASSED [2.4434s] [ 35%]`

Pull Request resolved: pytorch#157786
Approved by: https://github.com/jerryzh168, https://github.com/malfet
Fix incorrect linking of Gloo's libraries when building with system Gloo. Previously, either Gloo's native library or Gloo's CUDA library were linked. However, Gloo had changed such that all users of Gloo must link the native library, and can optionally link the CUDA or HIP library for Gloo + CUDA/HIP support.
This had been updated when building/linking with vendored Gloo, but not when using system Gloo.

Fixes: pytorch#146239

Reported-by: Adam J Stewart <ajstewart426@gmail.com>

Pull Request resolved: pytorch#146637
Approved by: https://github.com/malfet
This PR reworks the current autograd implementation of map to the new interface.

@pytorchbot label "topic: not user facing"

Pull Request resolved: pytorch#153343
Approved by: https://github.com/ydwu4
eqy and others added 26 commits August 12, 2025 18:07
…h condition to match native kernel (pytorch#156140)

The native kernel doesn't support batch splitting so the previous check wasn't aggressive enough in dispatching to cuDNN

pytorch#155225

Pull Request resolved: pytorch#156140
Approved by: https://github.com/ngimel, https://github.com/atalman
…ch#160398)

Per title, should fix capture errors that happen because nccl watchdog races with capture start.

Pull Request resolved: pytorch#160398
Approved by: https://github.com/aorenste
# Context
This is an extension of pytorch#149334.

# This PR
Add support for NUMA bindings with Callable entrypoints, such as `do_train` instead of `/usr/local/bin/python`.

Most notably, we utilize a hack in order to force `Process.start()` to use custom NUMA bindings for each subprocess. Please search for `HACK:` in the code to see a description of the implementation we chose, and pytorch#160006 for discussion of alternatives and why this is necessary.

Other changes:
* Remove unnecessary `--preferred` option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See [MPOL_LOCAL](https://man7.org/linux/man-pages/man2/set_mempolicy.2.html).)
* Refactor so that the main API is `maybe_wrap_command_with_numa_bindings`, which computes bindings for a single rank at a time, rather than `maybe_wrap_with_numa_bindings` which computed bindings for all ranks at once. This allowed for more code sharing between `Callable` and `str` entrypoints.

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
Using [this benchmark,](https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), ran

```
$ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 | tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 | tee none_callable.txt
```

and observed
* 6.6% remote memory accesses with 'node' bindings
* 11.6% remote without bindings

I also ran similar with `str` entrypoints as before just to be sure it's still working.

NOTE: [--run-path triggers the code to be run inside a `Callable`.](https://github.com/pytorch/pytorch/blob/017259f9c65b6fad55fb9597d7077e2543eaae46/torch/distributed/run.py#L870)

Pull Request resolved: pytorch#160163
Approved by: https://github.com/d4l3k
We can avoid the token by introducing PyObject preservation for THPFunction. But I think it will be too much complexity given that this kind of issue is very rare.
Happy to be talked into doing it though if someone really wants to.
Pull Request resolved: pytorch#160098
Approved by: https://github.com/ezyang, https://github.com/soulitzer
Moving towards just supporting local storage to take advantage of HF apis such as safe_open. This was already done in Storage component in pytorch#159405. This PR removes fsspec usages in consolidation script and relies on local storage only

Differential Revision: [D78997975](https://our.internmc.facebook.com/intern/diff/D78997975/)

Pull Request resolved: pytorch#159392
Approved by: https://github.com/sibuachu
…ytorch#159935)

Fixes pytorch#152985
In pytorch#152985, users are confused why weights-only load failed even though functions were registered in safe_globals.
Because the error message doesn't make the critical failure reason clear, they couldn't figure out only some functions are missing from safe_globals registration.
This fix is to make that point more clear.

Here's the new errror message, the blocked function information will be following the warning message with a line breaker to make it stand out.
```
_pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error:

Trying to call reduce for unrecognized function <built-in method _unpickle of type object at 0x641e8a57d1f0> which belongs to <class 'zoneinfo.ZoneInfo'>

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

To execute this test, run the following from the base repo dir:
    python test/test_serialization.py TestSerialization.test_weights_only_with_safe_zoneinfo_unpickle_registration_success

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

```

Pull Request resolved: pytorch#159935
Approved by: https://github.com/mikaylagawarecki
Reduces collective calls in the forward pass from 2 to 1

In pytorch#158716 I added the sharding rule for the backward pass but didn't add the forward pass as it didn't get dispatched. After pytorch#159324 this should get properly dispatched hence I am adding it now.

Pull Request resolved: pytorch#159692
Approved by: https://github.com/tianyu-l
…160135)

**Summary:** In its current state, FSDP collectives uses cuda synchronizations and communication ops regardless of what the world size is. However, now that replicate will use FSDP, there will be instances where group size = 1 and these synchronizations and ops will be used needlessly. I have updated fsdp_params group to skip the foreach_all_gather and foreach_all_gather_copy_out APIs when world_size ‎ = 1. I have created a test that uses CommDebugMode to verify that the all gather comm has been removed. I also edited an affected test which used 1-way FSDP by verifying and changing its assert statements for CommDebugMode. Below, I have included the link to the profile trace verifying these two APIs were skipped and two test commands.

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/anshulsi_f846ac3b-9467-4060-8e36-8cc3bc4449c3_devgpu263.prn2.facebook.com_652183.1753822140871934814.pt.trace.json

Pull Request resolved: pytorch#160135
Approved by: https://github.com/weifengpy
…nsions (pytorch#159652)

In the current implementation of reductions in three dimensions for AMD GPUs the number of values per thread is unbounded and can end up being in the hundreds of thousands for certain tensors. This of course is bad for performance. This patch fixes this issue by increasing the parallelism and thus lowering the number of value per thread to reasonable limits i.e. less than 2048 values per thread. The performance gains can be between 10x-17x for certain examples where the number of values per thread was originally very high.

Pull Request resolved: pytorch#159652
Approved by: https://github.com/jeffdaily
Allows things like

```cpp
Tensor cu_seqlens_q;
if (...) {
   cu_seqlens_q = ...
}
...
```

Also adds `torch::stable::Tensor.defined()`

Pull Request resolved: pytorch#159507
Approved by: https://github.com/janeyx99
Summary:
the condition
```
if config.is_fbcode() and (not self._aot_mode or self._use_relative_path):
    sources = [os.path.basename(i) for i in sources]
```
unintentionally (?) stripped paths even when use_relative_path was False (as long as aot_mode was False), breaking local tests that rely on absolute temp-file paths.

Fixes internal issue:
```

FAILED (errors=1)

CppCompileError: C++ compile error

Command:
/mnt/gvfs/third-party2/llvm-fb/0f1f083aa5508772f3db24bf4f697bc118ba0958/17/platform010/72a2ff8/bin/clang-17 czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp -shared -fPIC -O3 -DNDEBUG -fno-trapping-math -funsafe-math-optimizations -ffinite-math-only -fno-signed-zeros -fno-math-errno -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -Werror=ignored-optimization-argument -g -o /re_tmp/tmpsp58ya2h/zy/test_symbol.so

Output:
clang-17: error: no such file or directory: 'czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp'
clang-17: error: no input files
```

Reviewed By: clee2000

Differential Revision: D80025417

Pull Request resolved: pytorch#160354
Approved by: https://github.com/benjaminglass1, https://github.com/clee2000
Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which is used to present the corresponding user code in case TorchScript errors. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created and the CallStack guard pushes the frame information onto a thread_local callstack (a list of calls); and after exiting, the frame information is popped off the callstack. Note that the CallStack guards are also sometimes used in python via pybindings.

The problem is that sometimes another thread can obtain a reference to the CallStack guard (if it's a Python CallStack guard). **This means that the destructor for a CallStack guard can be called from a different thread than the constructor was called**. When this happens, it causes a segfault.

This PR makes the callstack vector thread-safe to access, and each CallStack guard will store a reference to the callstack vector onto which it pushed. When the CallStack guard is destructed, it pops off the appropriate callstack vector. Although this could potentially lead to mangled callstacks, it should prevent segfaults.

Added a test `test_thread_safe_error_stacks` which segfaults prior to these changes, and no longer segfaults.

Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972)
Pull Request resolved: pytorch#160386
Approved by: https://github.com/eellison
…torch#160357)

# Summary

More code motion, tldr is that install 'Better Jinja' in vscode and now you can get highlighting

Before
<img width="776" height="926" alt="Screenshot 2025-08-11 at 2 41 08 PM" src="https://github.com/user-attachments/assets/10868b31-f8ac-4cf5-99fe-19b8789ce06b" />

After:
<img width="1184" height="1299" alt="Screenshot 2025-08-11 at 2 40 27 PM" src="https://github.com/user-attachments/assets/45203765-589e-4d76-8196-d895a2f2fbf6" />

Pull Request resolved: pytorch#160357
Approved by: https://github.com/eellison
Differential Revision: [D79977408](https://our.internmc.facebook.com/intern/diff/D79977408/)

Context:
When testing cutlass backend and used autotune with subproc, sometimes I would see C++ compilation error (expected) followed by
```
Traceback (most recent call last):
  File "/torch/_inductor/autotune_process.py", line 175, in get
    result = TuningProcess.recv(self.read_pipe)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/torch/_inductor/autotune_process.py", line 99, in recv
    return pickle.load(read_pipe)
           ^^^^^^^^^^^^^^^^^^^^^^
TypeError: CppCompileError.__init__() missing 1 required positional argument: 'output'
```
which is unexpected. After asking claude, it seems

> Now I can see the issue. The `CppCompileError` class requires two arguments: `cmd` (a list of strings) and `output` (a string). However, when exceptions are being pickled and unpickled across process boundaries, the pickling process might not be preserving the constructor arguments correctly.
>
> The problem is likely that when a `CppCompileError` is raised in the subprocess and then pickled/unpickled through the `recv` function, the unpickling process is trying to reconstruct the exception but doesn't have the required constructor arguments.
>
> The issue is clear now. The `CppCompileError` class doesn't have custom pickle methods (`__reduce__`, `__getstate__`, `__setstate__`), so when it's pickled and unpickled across process boundaries, Python's default pickling mechanism tries to reconstruct it but fails because it doesn't preserve the constructor arguments properly.
>
> The solution is to add a `__reduce__` method to the `CppCompileError` class to ensure it can be properly pickled and unpickled. Let me implement this fix:

Adding these seem to help.

fbcode repro: [D79977541](https://www.internalfb.com/diff/D79977541)

Pull Request resolved: pytorch#160294
Approved by: https://github.com/masnesral
Summary:
Following up on Avik's doc https://docs.google.com/document/d/11RW0Bbkp1QwFbEu8rCNW5d7wUFaEkxbL0uLyqcc2jTk/edit?tab=t.0

We are experimenting with a new API which utilizes torch.compile(fullgraph=True) and intend to use it to replace the old dynamo.export() API.

This PR adds a prototype for the API described in the doc.

Test Plan:
test_misc -- -k test_aot_capture

Rollback Plan:

Differential Revision: D79534608

Pull Request resolved: pytorch#159749
Approved by: https://github.com/tugsbayasgalan
enable indices and values on sparse mps

Pull Request resolved: pytorch#160223
Approved by: https://github.com/malfet
…unners (pytorch#158882)

Updated .github/actionlint.yaml to replace linux.rocm.gpu.mi300.2 with linux.rocm.gpu.mi300.1 in the supported runner list

Modified all affected workflows (inductor-perf-test-nightly-rocm.yml, inductor-periodic.yml, inductor-rocm-mi300.yml, and rocm-mi300.yml) to run jobs on 1-GPU MI300 runners instead of 2-GPU runners

This should help increase available runners even with same number of CI nodes.

Pull Request resolved: pytorch#158882
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
This PR replaces all instances of 'pytorch-labs' with 'meta-pytorch' in this repository now that the 'pytorch-labs' org has been renamed to 'meta-pytorch'

## Changes Made
- Replaced all occurrences of 'pytorch-labs' with 'meta-pytorch'
- Only modified files with extensions: .py, .md, .sh, .rst, .cpp, .h, .txt, .yml
- Skipped binary files and files larger than 1MB due to GitHub api payload limits in the script to cover all repos in this org. Will do a more manual second pass later to cover any larger files

## Files Modified
This PR updates files that contained the target text.

Generated by automated script on 2025-08-12T20:41:29.888681+00:00Z
Pull Request resolved: pytorch#160459
Approved by: https://github.com/huydhn, https://github.com/clee2000, https://github.com/atalman, https://github.com/malfet
…ager init (pytorch#160145)

Instead of implicitly creating nccl comm inside mem pool registration for symmetric memory, we decide to error it out so that we only support eager init case when the nccl comm is already initiated.

Pull Request resolved: pytorch#160145
Approved by: https://github.com/kwen2501
…ytorch#160422)

Summary: A recent Triton commit changed `ASTSource.make_ir` to a 5-arg signature that includes a `GPUTarget`. We need to pass in this new argument.

Test Plan:
`buck2 test 'fbcode//mode/opt' -m ovr_config//triton:trunk  fbcode//caffe2/test/inductor:test_inductor_cuda -- triton_kernel`

Rollback Plan:

Reviewed By: davidberard98

Differential Revision: D80069909

Pull Request resolved: pytorch#160422
Approved by: https://github.com/davidberard98, https://github.com/mlazos
)

Summary: When the shape of the output tensor has a dynamic outer most dim, the stride can still be padded to conform to configured alignment if required.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79146886

Pull Request resolved: pytorch#159404
Approved by: https://github.com/blaine-rister, https://github.com/eellison
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.