support optim cases #21

Chao1Han · 2025-07-21T06:54:23Z

Fixes #ISSUE_NUMBER

…ch#158926) Summary: The following types of objects don't need to be serialized for precompile: 1. PyCapsule, because we don't guard on C binding objects in meaningful ways. 2. Code objects, because we only do id matching on these, and id matches are always dropped for precompile. 3. Nested function objects, since we also ban CLOSURE_MATCH. Test Plan: buck run mode/opt test/dynamo:test_dynamo -- -k test_skipped_objects Rollback Plan: Differential Revision: D78816888 Pull Request resolved: pytorch#158926 Approved by: https://github.com/jamesjwu

…ytorch#159904) Pull Request resolved: pytorch#159904 Approved by: https://github.com/janeyx99

Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable. Pull Request resolved: pytorch#158560 Approved by: https://github.com/yushangdi

…) (pytorch#159801) Summary: ### PR Context Introduce simple replication logic via PGTransport. The goal is to showcase a working prototype of replication via PGTransport, in this impl we assume world_sizes are equal allowing us to create perfect bi-directional pairs for the purpose of choosing replica "partners". Test Plan: CI Rollback Plan: Differential Revision: D79590797 Pull Request resolved: pytorch#159801 Approved by: https://github.com/saumishr

…es (pytorch#159957) After pytorch#157905 started using cuBLAS for row-wise scaling on CUDA 12.9+, this broke some downstream tests for fp8 which were testing "odd" shapes. After checking in with the cuBLAS team this turned out to be due to the scale tensors' starting addresses not being aligned to 16 bytes. PyTorch storages are always aligned at 256 bytes, hence this came from a "slicing" of the scale tensor being done inside async-TP when chunking a matmul in order to overlap it with reduce-scatter. Pull Request resolved: pytorch#159957 Approved by: https://github.com/vkuzo, https://github.com/danielvegamyhre
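The alignment problem described above comes down to address arithmetic. A minimal sketch, assuming a float32 scale tensor (4 bytes per element) and a made-up base address; the helper `slice_is_aligned` is hypothetical and is not part of PyTorch or cuBLAS:

```python
STORAGE_ALIGNMENT = 256   # storages start on 256-byte boundaries (per the text above)
REQUIRED_ALIGNMENT = 16   # alignment the scale tensor's start address has to satisfy


def slice_is_aligned(base_addr, elem_offset, itemsize=4):
    """Return True if a slice that starts `elem_offset` elements into the
    storage begins on a 16-byte boundary. itemsize=4 assumes float32 scales."""
    start = base_addr + elem_offset * itemsize
    return start % REQUIRED_ALIGNMENT == 0


base = 4 * STORAGE_ALIGNMENT          # hypothetical, 256-byte-aligned base address
print(slice_is_aligned(base, 0))      # unsliced tensor: starts at the base address
print(slice_is_aligned(base, 4))      # offset of 4 elements = 16 bytes
print(slice_is_aligned(base, 3))      # offset of 3 elements = 12 bytes
```

Under these assumptions, a chunk that starts at an element offset that is not a multiple of 4 begins at a misaligned address even though the underlying storage is aligned, which is consistent with only "odd" shapes being affected.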

Summary: This fixes a bug in the execution frame cleanup logic - previously, whenever we hit the time interval to clear out the frames, we were removing any cached execution frames beyond the configured minimum number (frameEntry.used was unused). Instead, we only want to clear frames that were NOT USED during the last time interval. This diff refactors the executor to have the correct logic. Test Plan: ``` buck2 test 'mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:model_runner_test -- ModelRunnerTest.Basic_InterpreterCuda_Multithread_Cleanup --run-disabled --print-passing-details ``` Rollback Plan: Differential Revision: D78621408 Pull Request resolved: pytorch#158717 Approved by: https://github.com/dolpm
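The intended cleanup policy can be sketched in a few lines. This is only an illustration in Python under assumed semantics, not the executor's actual C++ code; `FrameEntry`, `FrameCache`, `acquire`, and `cleanup` are hypothetical names, and only the `used` flag is taken from the description above:

```python
from dataclasses import dataclass, field


@dataclass
class FrameEntry:
    name: str
    used: bool = False  # set whenever the frame is handed out in the current interval


@dataclass
class FrameCache:
    min_frames: int
    frames: list = field(default_factory=list)

    def acquire(self, name):
        """Return the cached frame with this name, creating it if needed."""
        for entry in self.frames:
            if entry.name == name:
                entry.used = True
                return entry
        entry = FrameEntry(name, used=True)
        self.frames.append(entry)
        return entry

    def cleanup(self):
        """Run once per time interval: drop only frames that were not used
        since the previous cleanup, while keeping at least `min_frames`."""
        kept = [f for f in self.frames if f.used]
        idle = [f for f in self.frames if not f.used]
        shortfall = max(0, self.min_frames - len(kept))
        kept.extend(idle[:shortfall])   # retain idle frames only to reach the minimum
        for f in kept:
            f.used = False              # start a fresh interval
        self.frames = kept


cache = FrameCache(min_frames=1)
for name in ("a", "b", "c"):
    cache.acquire(name)
cache.cleanup()                 # all three were used, so all three survive
cache.acquire("a")
cache.cleanup()                 # only "a" was used in this interval
print([f.name for f in cache.frames])
```

The buggy behavior described above would correspond to ignoring `used` and trimming the cache down to `min_frames` on every tick.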

Summary: This PR solves two issues:

1. When lowering the all_reduce op, Inductor expects to convert it to the in-place version, all_reduce_, but it was calling ir._AllReduceKernel.create_inplace instead of ir._AllReduce_Kernel.create_inplace. This triggers a tricky bug in AOTI: it generates a cpp call to the functional version aoti_torch_cpu__c10d_functional_all_reduce, but the corresponding wait operation later still waits on the input to aoti_torch_cpu__c10d_functional_all_reduce instead of its output. This leaves an unwaited tensor, which leads to a memory leak.
2. Now that AOTI generates the in-place version aoti_torch_cpu__c10d_functional_all_reduce_, the tensor returned from aoti_torch_cpu__c10d_functional_all_reduce_ doesn't get used. It is released when the program exits, so it's not a memory leak, but holding that tensor unnecessarily causes a high memory water mark. This PR generates a tensor delete operation right after calling aoti_torch_cpu__c10d_functional_all_reduce_.

Pull Request resolved: pytorch#159818 Approved by: https://github.com/henryhu6, https://github.com/yushangdi

…dVariable (pytorch#159696)" This reverts commit ee62177. Reverted pytorch#159696 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](pytorch#159696 (comment)))

This script adds a simple dataloading benchmark tracking throughput and memory. The output looks like this:

```
System Information:
PyTorch version: 2.9.0a0+gitf87d117
PyTorch location: /home/divyanshkhanna/pytorch/torch/__init__.py
Torchvision version: 0.24.0a0+f52c4f1
Torchvision location: /home/divyanshkhanna/pytorch/vision/torchvision/__init__.py
CUDA available: True
CUDA device: NVIDIA PG509-210
CPU count: 192
Physical CPU cores: 96
Total system memory: 1510.11 GB
Loading dataset from imagenet/val (1 copies)
Dataset size: 50000
--- Benchmarking DataLoader with worker_method=multiprocessing ---
Memory before DataLoader creation: 500.59 MB
Detailed memory information:
USS (Unique Set Size): 499.00 MB
PSS (Proportional Set Size): 500.74 MB
RSS (Resident Set Size): 497.39 MB
Memory after DataLoader creation: 1127.61 MB
Memory increase: 627.02 MB
Starting training loop with 1 epochs (max 100 batches per epoch)
Epoch 1, Batch 10, Time: 0.2910s, Memory: 12044.50 MB
Epoch 1, Batch 20, Time: 0.2909s, Memory: 12185.71 MB
Epoch 1, Batch 30, Time: 0.2909s, Memory: 10654.93 MB
Epoch 1, Batch 40, Time: 0.2909s, Memory: 12378.26 MB
Epoch 1, Batch 50, Time: 0.2907s, Memory: 12402.28 MB
Epoch 1, Batch 60, Time: 0.2909s, Memory: 10559.35 MB
Epoch 1, Batch 70, Time: 0.2907s, Memory: 12644.69 MB
Epoch 1, Batch 80, Time: 0.2909s, Memory: 12654.65 MB
Epoch 1, Batch 90, Time: 0.2909s, Memory: 12727.20 MB
Epoch 1, Batch 100, Time: 0.2908s, Memory: 12722.09 MB
Results:
Worker method: multiprocessing
DataLoader init time: 0.1553 seconds
Average batch time: 0.3408 seconds
Samples per second: 375.53
Peak memory usage: 12738.76 MB
Memory increase: 12238.17 MB
```

> TODO: This script right now is CPU-only friendly and GPU friendly. But it might be worth upgrading it to test against a canonical DistributedDataParallel setup on say a 1x8 node. Or maybe we can keep that as a separate script inside `benchmarks`

Pull Request resolved: pytorch#159432 Approved by: https://github.com/ramanishsingh

…trol. (pytorch#159938) We need Inductor debug symbol support for debugging crashes. When debug symbol generation is turned on, on Windows it should create a [module_name].pdb file, which helps debugging with WinDBG; on Linux it should create debug sections in the binary file. I also added a UT for it. It works well for Windows Inductor debugging. <img width="1648" height="833" alt="image" src="https://github.com/user-attachments/assets/5282a7de-cef3-4a38-9cd4-a0e63482c8b6" /> Pull Request resolved: pytorch#159938 Approved by: https://github.com/jansel, https://github.com/angelayi

Delete older enums, checks for MacOS-13.3+ for int64 support, etc Fixes pytorch#159275 Pull Request resolved: pytorch#159912 Approved by: https://github.com/manuelcandales

…ytorch#159759) Fixes pytorch#159631 Pull Request resolved: pytorch#159759 Approved by: https://github.com/EikanWang, https://github.com/jansel

Before this change there were build+test jobs:

- s89 build+tests
- sm75 build+distributed_test
- sm_75 build+pr_time_benchmark test

This change compiles all 3 builds into one (for 2 architectures) and skips testing sm86 as it never found any new regressions that were not found at the same time on sm89

Pull Request resolved: pytorch#159890 Approved by: https://github.com/clee2000, https://github.com/seemethere

Summary: This fixes the build of TorchScript's XNNPACK dependency for our aarch64 device. Thanks to andrewjcg for proposing this fix. Rollback Plan: Reviewed By: andrewjcg Differential Revision: D79497613 Pull Request resolved: pytorch#159762 Approved by: https://github.com/frankseide, https://github.com/malfet Co-authored-by: Frank Seide <seide@meta.com>

…ch#159662) This only works for the jagged layout and for the non-batch and non-jagged dimensions. I did this mostly by copy-pasting from the existing softmax implementation, but it seems fairly straightforward and I think it should work. Pull Request resolved: pytorch#159662 Approved by: https://github.com/jbschlosser

The current implementation assumes test functions are resolved as test_module.TestClass.test_fn, however this would not work for modules nested in directories e.g. inductor.test_torchinductor.TestClass.test_fn Pull Request resolved: pytorch#158637 Approved by: https://github.com/jbschlosser
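The difference between the two forms can be shown with a small parsing sketch. This is an illustration only, not the code changed by the PR; `split_test_id` is a hypothetical helper that splits from the right so that any number of leading package components stays in the module path:

```python
def split_test_id(test_id):
    """Split 'pkg.module.TestClass.test_fn' into (module, class, function).

    Splitting from the right keeps nested directories in the module part.
    """
    module, cls, fn = test_id.rsplit(".", 2)
    return module, cls, fn


print(split_test_id("test_module.TestClass.test_fn"))
print(split_test_id("inductor.test_torchinductor.TestClass.test_fn"))

# A naive left-to-right split assumes exactly three components and breaks
# on the nested form:
parts = "inductor.test_torchinductor.TestClass.test_fn".split(".")
print(len(parts))  # four components, so unpacking into three names would fail
```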

**Summary** Add macros for brgemm, so that callers (e.g., Torchao's cpp kernels) know which APIs are available. This is useful when callers need to work with older versions of PyTorch. Pull Request resolved: pytorch#158629 Approved by: https://github.com/CaoE, https://github.com/Valentine233, https://github.com/ezyang

As we don't have any Intel Mac runners in CI for last 2+ years Pull Request resolved: pytorch#159986 Approved by: https://github.com/atalman

Summary: This reverts the part of pytorch#159383 for scaled_mm where now, like before, we pass through the normal input_nodes (not the triton_input_nodes) to select_algorithm - pytorch#159383 refactored how kwargs are retrieved - it introduced this notion of KernelInputs that wrap input_nodes - scaled_mm uses unsqueezed input nodes for triton to retrieve params - the issue: it uses a squeezed (regular) bias for select_algorithm instead This fixes that by passing the original input nodes rather than the triton input nodes. Test Plan: ``` buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_rowwise_scaling_shape_1024,1024,512_has_bias_True_use_fast_accum_True_persistent_matmul_False (caffe2.test.inductor.test_fp8.TestFP8Lowering)' buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:fp8 -- --exact 'caffe2/test/inductor:fp8 - test_rowwise_scaling_shape_1024,1024,512_has_bias_True_use_fast_accum_True_persistent_matmul_True (caffe2.test.inductor.test_fp8.TestFP8Lowering)' ``` This set of tests was failing, and is passing now Side note: these tests were failing I believe because the unsqueezed bias made the ATEN choice no longer eligible, and there is some minor numerical discrepancy between ATEN and Triton for this. I'm not sure the test should be written like that, as we're implicitly relying on ATEN being the choice here. Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D79717654](https://our.internmc.facebook.com/intern/diff/D79717654) Pull Request resolved: pytorch#159948 Approved by: https://github.com/izaitsevfb, https://github.com/eellison

Pull Request resolved: pytorch#159467 Approved by: https://github.com/eellison

…h#159315) As part of the better engineering effort, we would like to improve our type support to improve the dev experience in dynamo. This PR adds strict typing support to `torch/_dynamo/guards.py`.

Running
```
mypy torch/_dynamo/guards.py --linecount-report /tmp/coverage_log
```

| | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 2030 | 3945 | 51.46% | 70 | 138 | 50.72% |
| This PR | 4055 | 4055 | 100.00% | 138 | 138 | 100.00% |
| Delta | +2025 | +90 | +48.54% | +68 | 0 | +49.28% |

Pull Request resolved: pytorch#159315 Approved by: https://github.com/williamwen42, https://github.com/Skylion007

…ytorch#157892) Fixes pytorch#157891 Pull Request resolved: pytorch#157892 Approved by: https://github.com/ezyang

…k based log directory (pytorch#159874) Summary: Write torch.compile worker logs to dedicated_log_rank{RANK} if we're running on mast. Test Plan: See: D79456310 Pull Request resolved: pytorch#159874 Approved by: https://github.com/c00w

…s tests from pytorch#125438 (pytorch#157786) These tests now pass on AArch64 in our downstream CI. `test_quantization.py::TestNumericSuiteEager::test_mobilenet_v2 <- test/quantization/eager/test_numeric_suite_eager.py PASSED [2.4434s] [ 35%]` Pull Request resolved: pytorch#157786 Approved by: https://github.com/jerryzh168, https://github.com/malfet

Fix incorrect linking of Gloo's libraries when building with system Gloo. Previously, either Gloo's native library or Gloo's CUDA library were linked. However, Gloo had changed such that all users of Gloo must link the native library, and can optionally link the CUDA or HIP library for Gloo + CUDA/HIP support. This had been updated when building/linking with vendored Gloo, but not when using system Gloo. Fixes: pytorch#146239 Reported-by: Adam J Stewart <ajstewart426@gmail.com> Pull Request resolved: pytorch#146637 Approved by: https://github.com/malfet

@pytorchbot

This PR reworks the current autograd implementation of map to the new interface. @pytorchbot label "topic: not user facing" Pull Request resolved: pytorch#153343 Approved by: https://github.com/ydwu4

…h condition to match native kernel (pytorch#156140) The native kernel doesn't support batch splitting so the previous check wasn't aggressive enough in dispatching to cuDNN pytorch#155225 Pull Request resolved: pytorch#156140 Approved by: https://github.com/ngimel, https://github.com/atalman

…ch#160398) Per title, should fix capture errors that happen because nccl watchdog races with capture start. Pull Request resolved: pytorch#160398 Approved by: https://github.com/aorenste

See https://hud.pytorch.org/pytorch/pytorch/commit/6b414f56a4a133a428af618d8ed1553849341497 Pull Request resolved: pytorch#159961 Approved by: https://github.com/eellison

# Context

This is an extension of pytorch#149334.

# This PR

Add support for NUMA bindings with Callable entrypoints, such as `do_train` instead of `/usr/local/bin/python`.

Most notably, we utilize a hack in order to force `Process.start()` to use custom NUMA bindings for each subprocess. Please search for `HACK:` in the code to see a description of the implementation we chose, and pytorch#160006 for discussion of alternatives and why this is necessary.

Other changes:
* Remove unnecessary `--preferred` option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See [MPOL_LOCAL](https://man7.org/linux/man-pages/man2/set_mempolicy.2.html).)
* Refactor so that the main API is `maybe_wrap_command_with_numa_bindings`, which computes bindings for a single rank at a time, rather than `maybe_wrap_with_numa_bindings` which computed bindings for all ranks at once. This allowed for more code sharing between `Callable` and `str` entrypoints.

# Test Plan

## Automated

`$ pytest test/test_numa_binding.py`

## Manual

Using [this benchmark](https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), ran

```
$ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 | tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 | tee none_callable.txt
```

and observed
* 6.6% remote memory accesses with 'node' bindings
* 11.6% remote without bindings

I also ran similar with `str` entrypoints as before just to be sure it's still working.

NOTE: [--run-path triggers the code to be run inside a `Callable`.](https://github.com/pytorch/pytorch/blob/017259f9c65b6fad55fb9597d7077e2543eaae46/torch/distributed/run.py#L870)

Pull Request resolved: pytorch#160163 Approved by: https://github.com/d4l3k

We can avoid the token by introducing PyObject preservation for THPFunction. But I think it will be too much complexity given that this kind of issue is very rare. Happy to be talked into doing it though if someone really wants to. Pull Request resolved: pytorch#160098 Approved by: https://github.com/ezyang, https://github.com/soulitzer

Moving towards just supporting local storage to take advantage of HF apis such as safe_open. This was already done in Storage component in pytorch#159405. This PR removes fsspec usages in consolidation script and relies on local storage only Differential Revision: [D78997975](https://our.internmc.facebook.com/intern/diff/D78997975/) Pull Request resolved: pytorch#159392 Approved by: https://github.com/sibuachu

…ytorch#159935) Fixes pytorch#152985 In pytorch#152985, users were confused about why a weights-only load failed even though functions were registered in safe_globals. Because the error message didn't make the critical failure reason clear, they couldn't figure out that only some functions were missing from the safe_globals registration. This fix makes that point clearer. Here's the new error message; the blocked function information follows the warning message after a line break to make it stand out. ``` _pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source. Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Trying to call reduce for unrecognized function <built-in method _unpickle of type object at 0x641e8a57d1f0> which belongs to <class 'zoneinfo.ZoneInfo'> Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html. To execute this test, run the following from the base repo dir: python test/test_serialization.py TestSerialization.test_weights_only_with_safe_zoneinfo_unpickle_registration_success This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ``` Pull Request resolved: pytorch#159935 Approved by: https://github.com/mikaylagawarecki

…h#160403) Fixes pytorch#160243, Fixes pytorch#160244, Fixes pytorch#160245 Pull Request resolved: pytorch#160403 Approved by: https://github.com/janeyx99

Reduces collective calls in the forward pass from 2 to 1 In pytorch#158716 I added the sharding rule for the backward pass but didn't add the forward pass as it didn't get dispatched. After pytorch#159324 this should get properly dispatched hence I am adding it now. Pull Request resolved: pytorch#159692 Approved by: https://github.com/tianyu-l

…160135) **Summary:** In its current state, FSDP collectives use CUDA synchronizations and communication ops regardless of what the world size is. However, now that replicate will use FSDP, there will be instances where group size = 1 and these synchronizations and ops will be used needlessly. I have updated fsdp_params group to skip the foreach_all_gather and foreach_all_gather_copy_out APIs when world_size = 1. I have created a test that uses CommDebugMode to verify that the all gather comm has been removed. I also edited an affected test which used 1-way FSDP by verifying and changing its assert statements for CommDebugMode. Below, I have included the link to the profile trace verifying these two APIs were skipped and two test commands. https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/anshulsi_f846ac3b-9467-4060-8e36-8cc3bc4449c3_devgpu263.prn2.facebook.com_652183.1753822140871934814.pt.trace.json Pull Request resolved: pytorch#160135 Approved by: https://github.com/weifengpy

…nsions (pytorch#159652) In the current implementation of reductions in three dimensions for AMD GPUs, the number of values per thread is unbounded and can end up being in the hundreds of thousands for certain tensors. This is bad for performance. This patch fixes the issue by increasing the parallelism and thus lowering the number of values per thread to reasonable limits, i.e., fewer than 2048 values per thread. The performance gains can be between 10x and 17x for certain examples where the number of values per thread was originally very high. Pull Request resolved: pytorch#159652 Approved by: https://github.com/jeffdaily

Allows things like ```cpp Tensor cu_seqlens_q; if (...) { cu_seqlens_q = ... } ... ``` Also adds `torch::stable::Tensor.defined()` Pull Request resolved: pytorch#159507 Approved by: https://github.com/janeyx99

Pull Request resolved: pytorch#159328 Approved by: https://github.com/janeyx99 ghstack dependencies: pytorch#159507

Summary: the condition ``` if config.is_fbcode() and (not self._aot_mode or self._use_relative_path): sources = [os.path.basename(i) for i in sources] ``` unintentionally (?) stripped paths even when use_relative_path was False (as long as aot_mode was False), breaking local tests that rely on absolute temp-file paths. Fixes internal issue: ``` FAILED (errors=1) CppCompileError: C++ compile error Command: /mnt/gvfs/third-party2/llvm-fb/0f1f083aa5508772f3db24bf4f697bc118ba0958/17/platform010/72a2ff8/bin/clang-17 czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp -shared -fPIC -O3 -DNDEBUG -fno-trapping-math -funsafe-math-optimizations -ffinite-math-only -fno-signed-zeros -fno-math-errno -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -Werror=ignored-optimization-argument -g -o /re_tmp/tmpsp58ya2h/zy/test_symbol.so Output: clang-17: error: no such file or directory: 'czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp' clang-17: error: no input files ``` Reviewed By: clee2000 Differential Revision: D80025417 Pull Request resolved: pytorch#160354 Approved by: https://github.com/benjaminglass1, https://github.com/clee2000
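The case described above can be checked by enumerating the quoted condition. A small sketch that reproduces only the boolean structure; `strips_paths` is a hypothetical stand-in, and the corrected condition from the PR is not shown here because the text does not state it:

```python
from itertools import product


def strips_paths(is_fbcode, aot_mode, use_relative_path):
    """Same boolean structure as the quoted condition:
    config.is_fbcode() and (not self._aot_mode or self._use_relative_path)."""
    return is_fbcode and (not aot_mode or use_relative_path)


# Enumerate the fbcode cases to see when basenames replace full paths.
for aot_mode, use_relative_path in product([False, True], repeat=2):
    result = strips_paths(True, aot_mode, use_relative_path)
    print(f"aot_mode={aot_mode!s:5} use_relative_path={use_relative_path!s:5} -> strip={result}")
```

The row with `aot_mode=False` and `use_relative_path=False` still evaluates to stripping, which is the combination the summary calls out as breaking tests that rely on absolute temp-file paths.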

Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which is used to present the corresponding user code in case TorchScript errors. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created and the CallStack guard pushes the frame information onto a thread_local callstack (a list of calls); and after exiting, the frame information is popped off the callstack. Note that the CallStack guards are also sometimes used in python via pybindings. The problem is that sometimes another thread can obtain a reference to the CallStack guard (if it's a Python CallStack guard). **This means that the destructor for a CallStack guard can be called from a different thread than the constructor was called**. When this happens, it causes a segfault. This PR makes the callstack vector thread-safe to access, and each CallStack guard will store a reference to the callstack vector onto which it pushed. When the CallStack guard is destructed, it pops off the appropriate callstack vector. Although this could potentially lead to mangled callstacks, it should prevent segfaults. Added a test `test_thread_safe_error_stacks` which segfaults prior to these changes, and no longer segfaults. Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972) Pull Request resolved: pytorch#160386 Approved by: https://github.com/eellison
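The idea of a guard that remembers which stack it pushed onto can be sketched in Python. This is an analogy under stated assumptions, not the C++ `ErrorReport::CallStack` implementation; `Calls`, `current_calls`, and `CallStackGuard` are hypothetical names:

```python
import threading


class Calls:
    """A call stack protected by a lock, so any thread may push or pop."""

    def __init__(self):
        self._lock = threading.Lock()
        self._frames = []

    def push(self, name):
        with self._lock:
            self._frames.append(name)

    def pop(self):
        with self._lock:
            if self._frames:
                self._frames.pop()

    def snapshot(self):
        with self._lock:
            return list(self._frames)


_tls = threading.local()


def current_calls():
    """Return the calling thread's own stack, creating it on first use."""
    if not hasattr(_tls, "calls"):
        _tls.calls = Calls()
    return _tls.calls


class CallStackGuard:
    def __init__(self, name):
        self._calls = current_calls()   # remember the stack we pushed onto
        self._calls.push(name)

    def close(self):
        # Pops from the originating stack, even when called from another thread.
        self._calls.pop()


guard = CallStackGuard("scripted_fn")
main_calls = current_calls()
print(main_calls.snapshot())

# Release the guard from a different thread than the one that created it.
worker = threading.Thread(target=guard.close)
worker.start()
worker.join()
print(main_calls.snapshot())
```

Because the guard holds a reference to the stack it pushed onto, releasing it from another thread touches the creating thread's stack rather than the releasing thread's own thread-local stack.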

…torch#160357) # Summary More code motion, tldr is that install 'Better Jinja' in vscode and now you can get highlighting Before <img width="776" height="926" alt="Screenshot 2025-08-11 at 2 41 08 PM" src="https://github.com/user-attachments/assets/10868b31-f8ac-4cf5-99fe-19b8789ce06b" /> After: <img width="1184" height="1299" alt="Screenshot 2025-08-11 at 2 40 27 PM" src="https://github.com/user-attachments/assets/45203765-589e-4d76-8196-d895a2f2fbf6" /> Pull Request resolved: pytorch#160357 Approved by: https://github.com/eellison

Differential Revision: [D79977408](https://our.internmc.facebook.com/intern/diff/D79977408/) Context: When testing cutlass backend and used autotune with subproc, sometimes I would see C++ compilation error (expected) followed by ``` Traceback (most recent call last): File "/torch/_inductor/autotune_process.py", line 175, in get result = TuningProcess.recv(self.read_pipe) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/torch/_inductor/autotune_process.py", line 99, in recv return pickle.load(read_pipe) ^^^^^^^^^^^^^^^^^^^^^^ TypeError: CppCompileError.__init__() missing 1 required positional argument: 'output' ``` which is unexpected. After asking claude, it seems > Now I can see the issue. The `CppCompileError` class requires two arguments: `cmd` (a list of strings) and `output` (a string). However, when exceptions are being pickled and unpickled across process boundaries, the pickling process might not be preserving the constructor arguments correctly. > > The problem is likely that when a `CppCompileError` is raised in the subprocess and then pickled/unpickled through the `recv` function, the unpickling process is trying to reconstruct the exception but doesn't have the required constructor arguments. > > The issue is clear now. The `CppCompileError` class doesn't have custom pickle methods (`__reduce__`, `__getstate__`, `__setstate__`), so when it's pickled and unpickled across process boundaries, Python's default pickling mechanism tries to reconstruct it but fails because it doesn't preserve the constructor arguments properly. > > The solution is to add a `__reduce__` method to the `CppCompileError` class to ensure it can be properly pickled and unpickled. Let me implement this fix: Adding these seem to help. fbcode repro: [D79977541](https://www.internalfb.com/diff/D79977541) Pull Request resolved: pytorch#160294 Approved by: https://github.com/masnesral
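The failure mode and the `__reduce__` fix can be reproduced with a standalone exception class. A minimal sketch, assuming an exception whose constructor takes `cmd` and `output` as described above; `BrokenCompileError` and `PicklableCompileError` are hypothetical stand-ins, not the real `CppCompileError`:

```python
import pickle


class BrokenCompileError(Exception):
    """Two required constructor arguments, but only one value ends up in self.args."""

    def __init__(self, cmd, output):
        super().__init__(f"Command: {' '.join(cmd)}\nOutput: {output}")
        self.cmd = cmd
        self.output = output


class PicklableCompileError(BrokenCompileError):
    def __reduce__(self):
        # Tell pickle to rebuild the exception from the original arguments.
        return (self.__class__, (self.cmd, self.output))


def round_trip(exc):
    return pickle.loads(pickle.dumps(exc))


try:
    round_trip(BrokenCompileError(["clang", "x.cpp"], "no input files"))
except TypeError as err:
    # Default exception pickling calls the class with self.args, i.e. only the message.
    print("default pickling failed:", err)

restored = round_trip(PicklableCompileError(["clang", "x.cpp"], "no input files"))
print(restored.cmd, restored.output)
```

By default an exception is rebuilt by calling its class with `self.args`, which here holds only the formatted message, so the second required argument is missing at unpickling time. Returning the original arguments from `__reduce__` avoids that.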

Summary: Following up on Avik's doc https://docs.google.com/document/d/11RW0Bbkp1QwFbEu8rCNW5d7wUFaEkxbL0uLyqcc2jTk/edit?tab=t.0 We are experimenting with a new API which utilizes torch.compile(fullgraph=True) and intend to use it to replace the old dynamo.export() API. This PR adds a prototype for the API described in the doc. Test Plan: test_misc -- -k test_aot_capture Rollback Plan: Differential Revision: D79534608 Pull Request resolved: pytorch#159749 Approved by: https://github.com/tugsbayasgalan

enable indices and values on sparse mps Pull Request resolved: pytorch#160223 Approved by: https://github.com/malfet

…unners (pytorch#158882) Updated .github/actionlint.yaml to replace linux.rocm.gpu.mi300.2 with linux.rocm.gpu.mi300.1 in the supported runner list Modified all affected workflows (inductor-perf-test-nightly-rocm.yml, inductor-periodic.yml, inductor-rocm-mi300.yml, and rocm-mi300.yml) to run jobs on 1-GPU MI300 runners instead of 2-GPU runners This should help increase available runners even with same number of CI nodes. Pull Request resolved: pytorch#158882 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>

This PR replaces all instances of 'pytorch-labs' with 'meta-pytorch' in this repository now that the 'pytorch-labs' org has been renamed to 'meta-pytorch' ## Changes Made - Replaced all occurrences of 'pytorch-labs' with 'meta-pytorch' - Only modified files with extensions: .py, .md, .sh, .rst, .cpp, .h, .txt, .yml - Skipped binary files and files larger than 1MB due to GitHub api payload limits in the script to cover all repos in this org. Will do a more manual second pass later to cover any larger files ## Files Modified This PR updates files that contained the target text. Generated by automated script on 2025-08-12T20:41:29.888681+00:00Z Pull Request resolved: pytorch#160459 Approved by: https://github.com/huydhn, https://github.com/clee2000, https://github.com/atalman, https://github.com/malfet

…ager init (pytorch#160145) Instead of implicitly creating nccl comm inside mem pool registration for symmetric memory, we decide to error it out so that we only support eager init case when the nccl comm is already initiated. Pull Request resolved: pytorch#160145 Approved by: https://github.com/kwen2501

Pull Request resolved: pytorch#160477 Approved by: https://github.com/atalman

Pull Request resolved: pytorch#160479 Approved by: https://github.com/izaitsevfb ghstack dependencies: pytorch#160477

…ytorch#160422) Summary: A recent Triton commit changed `ASTSource.make_ir` to a 5-arg signature that includes a `GPUTarget`. We need to pass in this new argument. Test Plan: `buck2 test 'fbcode//mode/opt' -m ovr_config//triton:trunk fbcode//caffe2/test/inductor:test_inductor_cuda -- triton_kernel` Rollback Plan: Reviewed By: davidberard98 Differential Revision: D80069909 Pull Request resolved: pytorch#160422 Approved by: https://github.com/davidberard98, https://github.com/mlazos

) Summary: When the shape of the output tensor has a dynamic outer most dim, the stride can still be padded to conform to configured alignment if required. Test Plan: CI Rollback Plan: Differential Revision: D79146886 Pull Request resolved: pytorch#159404 Approved by: https://github.com/blaine-rister, https://github.com/eellison

Chao1Han force-pushed the elastic branch from fc983a4 to f4826ef Compare July 28, 2025 02:45

pytorchmergebot force-pushed the elastic branch 3 times, most recently from 7e58bb3 to c7f895e Compare July 31, 2025 09:15

zhxchen17 and others added 26 commits August 6, 2025 15:00

[Easy] Fix wrong propagation of fallback_ops_dict in gen_aoti_c_shim (p…

d87161c

…ytorch#159904) Pull Request resolved: pytorch#159904 Approved by: https://github.com/janeyx99

Revert "[dynamo] Be consistent with storing func source for UserMetho…

ba37f58

…dVariable (pytorch#159696)" This reverts commit ee62177. Reverted pytorch#159696 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](pytorch#159696 (comment)))

[MPS] Remove all pre-MacOS14 logic (pytorch#159912)

d10e9e4

Delete older enums, checks for MacOS-13.3+ for int64 support, etc Fixes pytorch#159275 Pull Request resolved: pytorch#159912 Approved by: https://github.com/manuelcandales

[Inductor UT][Fix XPU CI] Fix case failures introduced by community. (p…

12a54e4

…ytorch#159759) Fixes pytorch#159631 Pull Request resolved: pytorch#159759 Approved by: https://github.com/EikanWang, https://github.com/jansel

[EZ] Remove useless cross_compile_arm64 (pytorch#159986)

512b473

As we don't have any Intel Mac runners in CI for last 2+ years Pull Request resolved: pytorch#159986 Approved by: https://github.com/atalman

Remove unnecessary "# noqa: set_linter" comments (pytorch#159467)

a572596

Pull Request resolved: pytorch#159467 Approved by: https://github.com/eellison

Partitioner: Fix to align partition node order with original graph (p…

2507ae6

…ytorch#157892) Fixes pytorch#157891 Pull Request resolved: pytorch#157892 Approved by: https://github.com/ezyang

[HOP, map] Rework of map autograd to the new interface (pytorch#153343)

64dc30c

This PR reworks the current autograd implementation of map to the new interface. @pytorchbot label "topic: not user facing" Pull Request resolved: pytorch#153343 Approved by: https://github.com/ydwu4

eqy and others added 26 commits August 12, 2025 18:07

move thread-local capture mode guard to include work.isStarted (pytor…

2d0cdee

…ch#160398) Per title, should fix capture errors that happen because nccl watchdog races with capture start. Pull Request resolved: pytorch#160398 Approved by: https://github.com/aorenste

[inductor] fix triton bucketize mask propagation (pytorch#159961)

89654db

See https://hud.pytorch.org/pytorch/pytorch/commit/6b414f56a4a133a428af618d8ed1553849341497 Pull Request resolved: pytorch#159961 Approved by: https://github.com/eellison

[Fix XPU CI][Inductor UT] Fix test cases broken by community. (pytorc…

5a9c4cf

…h#160403) Fixes pytorch#160243, Fixes pytorch#160244, Fixes pytorch#160245 Pull Request resolved: pytorch#160403 Approved by: https://github.com/janeyx99

Update torch::stable::Tensor() default constructor (pytorch#159507)

655137b

Allows things like ```cpp Tensor cu_seqlens_q; if (...) { cu_seqlens_q = ... } ... ``` Also adds `torch::stable::Tensor.defined()` Pull Request resolved: pytorch#159507 Approved by: https://github.com/janeyx99

Add pad and narrow to torch/csrc/stable/ops.h (pytorch#159328)

4d419a7

Pull Request resolved: pytorch#159328 Approved by: https://github.com/janeyx99 ghstack dependencies: pytorch#159507

[MPS] Add mps keys to indices and values ops (pytorch#160223)

2e4e5ab

enable indices and values on sparse mps Pull Request resolved: pytorch#160223 Approved by: https://github.com/malfet

[EZ][BE] Remove unused conda-env-macOS-ARM64 (pytorch#160477)

8d1cf52

Pull Request resolved: pytorch#160477 Approved by: https://github.com/atalman

[EZ] Delete CircleCI case (pytorch#160479)

3209996

Pull Request resolved: pytorch#160479 Approved by: https://github.com/izaitsevfb ghstack dependencies: pytorch#160477

pytorchmergebot force-pushed the elastic branch from c7f895e to ec4c619 Compare August 13, 2025 02:01

Chao1Han added 2 commits August 13, 2025 02:04

support optim cases

e992eaa

blacklist gloo

62eb2ca

pytorchmergebot force-pushed the elastic branch from ec4c619 to 62eb2ca Compare August 13, 2025 02:04
