Fix lint #10089


Closed
wants to merge 4,782 commits into from

Conversation

cccclai
Contributor

@cccclai cccclai commented Apr 10, 2025

Fix lint error from #10054

swolchok and others added 30 commits March 26, 2025 13:57
I have no idea what this file actually does, but it seems like we are
supposed to have this?
…ch#9509)

Disable one, fix the other.

Testing: built internally
pytorch#9511)

I planned to do this everywhere and forgot. Clean it all up, leave a
note, enforce the note with visibility. This makes sure everything in
buck-land gets ET_USE_THREADPOOL.

Test Plan: Profiled run on internal model, no longer seeing
parallel_for_no_threadpool
…AME_AS_COMPUTE (pytorch#9613)

As the title says, this is mostly a few related find-replaces, plus
marking SupportedTensorDtypes::SAME_AS_COMPUTE deprecated.
As the code comment says, these APIs are undergoing development (see
e.g. pytorch#9613) and it's pretty
inconvenient that they're incidentally committed to externally. Mark
them deprecated so we have the option to drop that commitment in (IIUC)
0.7.
The previous attempt to bump the HF transformers version to latest was
reverted due to llava model incompatibility. This PR just ensures the CI
is able to test `optimum-executorch` with the latest version of HF
transformers and the upcoming `executorch==0.6.0`.

Note: This change is purely on the CI side and only for optimum-executorch;
it should not affect other models like llava.

Co-authored-by: Guang Yang <guangyang@fb.com>
Differential Revision: D69994481

Pull Request resolved: pytorch#8703
### Summary
We now have CoreML support out of the box for macOS wheels. Let's test
it.

### Test plan
CI

cc @larryliu0820 @lucylq

---------

Co-authored-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: pytorch#9588

TSIA

ghstack-source-id: 274222179
@exported-using-ghexport

Differential Revision: [D70435293](https://our.internmc.facebook.com/intern/diff/D70435293/)
Pull Request resolved: pytorch#9589

TSIA

@pytorchbot label "topic: not user facing"
ghstack-source-id: 274222181
@exported-using-ghexport

Differential Revision: [D71825480](https://our.internmc.facebook.com/intern/diff/D71825480/)
TSIA

@pytorchbot label "topic: not user facing"

Differential Revision: [D71825477](https://our.internmc.facebook.com/intern/diff/D71825477/)
TSIA

@pytorchbot label "topic: not user facing"

Differential Revision: [D71825476](https://our.internmc.facebook.com/intern/diff/D71825476/)

[ghstack-poisoned]
## Context


Currently, for the `q_8w_linear` shader, both the texture and the buffer variants use the same global work group and local work group setting.

Specifically, the global work group is set to `{out.numel(), 1, 1}` and the local work group is set to `{64, 1, 1}`.

However, I believe this results in a very poor memory re-use for the texture shader. In this configuration:

* Within a work group each invocation will be requesting a different row of A - 64 rows of A requested in total
* All work groups will be requesting the same row of B
* One work group will load 65 unique rows from A and B

Compare this to a local work group size of `{8, 8, 1}`

* Across the work group, 8 rows will be loaded from A and 8 rows will be loaded from B
* One work group will load 16 unique rows total from A and B

Evidently, there is better memory re-use in the latter work group as fewer unique rows are loaded.
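
As a back-of-envelope check (a hypothetical sketch, assuming the work group's x axis maps to rows of A and its y axis to rows of B, as described above), the unique-row counts can be reproduced directly:

```cpp
#include <cassert>

// Each invocation along x needs its own row of A; each invocation along y
// needs its own row of B, so one work group touches x + y unique rows.
int unique_rows(int local_x, int local_y) {
  return local_x + local_y;
}

int main() {
  assert(unique_rows(64, 1) == 65);  // current {64, 1, 1} setting
  assert(unique_rows(8, 8) == 16);   // proposed {8, 8, 1} setting
  return 0;
}
```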

## Changes

Modify the `q_8w_linear` shader to use a `{8, 8, 1}` local work group if possible. If `M` is small, then instead use `{4, 16, 1}` or `{2, 32, 1}` to reduce the number of inactive invocations.
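
A minimal sketch of that selection logic (illustrative only; the function name and the exact `M` thresholds are assumptions, not the actual dispatch code):

```cpp
struct WorkGroupSize {
  int x;
  int y;
  int z;
};

// Prefer {8, 8, 1}; when M is small, trade x for y so fewer invocations
// fall outside the output and sit idle. Thresholds assumed for illustration.
WorkGroupSize pick_local_wg(int M) {
  if (M >= 8) {
    return {8, 8, 1};
  }
  if (M >= 4) {
    return {4, 16, 1};
  }
  return {2, 32, 1};
}
```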

Differential Revision: [D71706489](https://our.internmc.facebook.com/intern/diff/D71706489/)

[ghstack-poisoned]
…encelength

Following the previous diff, we can now utilize the entire kv cache to
generate more tokens than the max prompt length allowed.

Differential Revision: D69073908
This can cause issues with `disable_global_flags` and the internal state of the library; this is something which is set when importing it.

Differential Revision: [D70402061](https://our.internmc.facebook.com/intern/diff/D70402061/)

[ghstack-poisoned]
Differential Revision: D71901449

Pull Request resolved: pytorch#9646
Summary: As title

Reviewed By: larryliu0820, kirklandsign

Differential Revision: D71577157

Co-authored-by: Digant Desai <digantdesai@meta.com>
Differential Revision: D71901794

Pull Request resolved: pytorch#9668
Differential Revision: D71902542

Pull Request resolved: pytorch#9670
## Context

The generated operator benchmarks currently have a high amount of copy overhead:

1. Copy from CPU to staging
2. Copy from staging to GPU Buffer/Image

And this is done for both inputs and outputs.

Since benchmarks are not correctness tests, copying data in and out is not really necessary, especially if the compute shader's behaviour does not depend on the contents of the input/output tensors.

Make it so that by default, the benchmark will only execute the op without adding copy overhead. However, test cases can optionally specify that the copy overhead should be included in the benchmark.
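
A minimal sketch of that default-off/opt-in shape (hypothetical names; this is not the actual generated-benchmark API):

```cpp
#include <functional>

// Hypothetical benchmark case: the copy callbacks exist, but by default the
// timed loop dispatches only the op itself.
struct BenchmarkCase {
  std::function<void()> copy_in;       // CPU -> staging -> GPU buffer/image
  std::function<void()> run_op;        // dispatch the compute shader
  std::function<void()> copy_out;      // GPU -> staging -> CPU
  bool include_copy_overhead = false;  // test cases can opt back in
};

void run_benchmark(const BenchmarkCase& c, int iterations) {
  for (int i = 0; i < iterations; ++i) {
    if (c.include_copy_overhead) {
      c.copy_in();
    }
    c.run_op();
    if (c.include_copy_overhead) {
      c.copy_out();
    }
  }
}
```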

Differential Revision: [D71570143](https://our.internmc.facebook.com/intern/diff/D71570143/)
## Context

As title; similar to pytorch#9016 since the interface for `ComputePipeline` descriptor was reverted in pytorch#9405.

Differential Revision: [D71706868](https://our.internmc.facebook.com/intern/diff/D71706868/)

[ghstack-poisoned]
Differential Revision: D71902713

Pull Request resolved: pytorch#9672
…of a shape.

Differential Revision: D71903681

Pull Request resolved: pytorch#9673
Differential Revision: D71904351

Pull Request resolved: pytorch#9674
Differential Revision: D71905631

Pull Request resolved: pytorch#9675
Differential Revision: D71905971

Pull Request resolved: pytorch#9676
Differential Revision: D71906972

Pull Request resolved: pytorch#9677
Differential Revision: D71908831

Pull Request resolved: pytorch#9678
Differential Revision: D71909752

Pull Request resolved: pytorch#9679
Jack-Khuu and others added 27 commits April 9, 2025 16:38
### Summary

Pulling in the non 0.6 changes from:

pytorch#10006
pytorch#10016

### Test plan
md
…ch#9355)

- Add API to qnn quantizer for setting submodule quant config
- Refine QnnQuantizer setting functions

---------

Co-authored-by: Chun-I Tsai <chunit@qti.qualcomm.com>
Remove unnecessary line
Fix ETDump part
Differential Revision: D72616610

Pull Request resolved: pytorch#9960
Differential Revision: D72440313

Pull Request resolved: pytorch#9894
Differential Revision: D72754398

Pull Request resolved: pytorch#10032
Tests in test_sigmoid_16bit.py and test_sigmoid_32bit.py randomly fail
due to a Vela bug in the handling of the table op that sigmoid is
converted to. Set all affected tests to flaky until the bug is resolved.

Co-authored-by: Martin Lindström <Martin.Lindstroem@arm.com>
MobileNetV3 was sporadically failing with the previously set absolute
difference threshold. Raise it to prevent flaky test status.

Co-authored-by: Martin Lindström <Martin.Lindstroem@arm.com>
Summary:
## Context

pytorch#9938 made it so that
`linalg_vector_norm` is now decomposed when exporting to Edge. However,
this broke some tests in the arm delegate because export passes cannot
handle the decomposed operator sequence. To account for this, add
`xfail` for the failing tests since `linalg_vector_norm` is not
supported in TOSA yet.


## Changes

Add `xfail` for `norm` tests in `test_torch_functions.py`

Test Plan: Check in CI that the failing test is recovered.
Test currently xfails with key error relating to scalar_tensor

Signed-off-by: Ryan O'Shea <ryan.oshea3@arm.com>
…dim 1 or 2 (pytorch#10060)

For quantized SDPA we want to evaluate the performance impact of having seq at dim 1 as well as at dim 2.
This diff refactors the code to enable this.

The same should also be done for float SDPA, but that is left for the future.

Differential Revision: [D71833060](https://our.internmc.facebook.com/intern/diff/D71833060/)
…torch#10061)

Because the old name was a misnomer
ghstack-source-id: 277233486
@exported-using-ghexport
Differential Revision: [D71833067](https://our.internmc.facebook.com/intern/diff/D71833067/)
Enable leveraging the quantized sdpa op when the quantized kv cache is used. Instead of adding yet another arg, at the moment I have chosen to leverage the quantize_kv_cache option.

Differential Revision: [D71833064](https://our.internmc.facebook.com/intern/diff/D71833064/)
Propagating some changes made to the release/0.6 docs so that future
releases can get them too
…rch#9266)

### Summary
 - e2e script for https://github.com/yformer/EfficientSAM
 - FastViT breakage fix
 - Add support for cum_sum
 - Add bicubic interpolate transform pass
 - Fix stack op

### Test plan
``` bash
python ./examples/qualcomm/oss_scripts/efficientSAM/efficientSAM.py -m ${soc} -b build-android -H ${host_id} -s ${device_id} --oss_repo ${Path_to_oss_repo} --pretrained_weight ${Path_to_pretrained_weight} -d ${Path_to_dataset_dir}
```
… them themselves

Differential Revision: D72600295

Pull Request resolved: pytorch#9952
Differential Revision: D72796889

Pull Request resolved: pytorch#10067
…ar_type (pytorch#10076)

Following pytorch#9971

- Update the get_flatbuffer_scalar_type return type to Result<T>
- Iteratively update the functions that call functions whose result type changed (see the sketch below):
  - Check returns; if there is an error, pass the error up.
  - If unable to pass the error up, update the caller's return type to Result<T> as well.
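
A minimal, self-contained sketch of that propagation pattern (the `Result` stand-in and function names below are hypothetical and far simpler than the real ExecuTorch types):

```cpp
// Minimal stand-in for an error type and a Result<T> wrapper; the real
// ExecuTorch classes have more machinery.
enum class Error { Ok, InvalidArgument };

template <typename T>
struct Result {
  Error err;
  T val;
  bool ok() const { return err == Error::Ok; }
  Error error() const { return err; }
  const T& get() const { return val; }
};

// A function whose return type changed from T to Result<T>.
Result<int> get_scalar_type(int tag) {
  if (tag < 0) {
    return {Error::InvalidArgument, 0};
  }
  return {Error::Ok, tag};
}

// A caller now checks the return and passes the error up instead of
// crashing; its own return type becomes Result<T> in turn.
Result<int> element_size_for(int tag) {
  Result<int> st = get_scalar_type(tag);
  if (!st.ok()) {
    return {st.error(), 0};
  }
  return {Error::Ok, st.get() * 4};
}
```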

Differential Revision: [D72771753](https://our.internmc.facebook.com/intern/diff/D72771753/)
We have the recipe and .pte file on ExecuTorch-Community on HF. So let's
just use that.
…ytorch#10054)

Summary:

- Support the case where the rank of the input tensor is less than the rank of the output tensor.
- make_quantizer kwargs alignment.
- Remove module.eval(), since calling eval() is not supported for exported models.


### Test plan
``` bash
python -m backends.qualcomm.tests.test_qnn_delegate TestQNNQuantizedOperator.test_qnn_backend_expand -s ${device_id} -H ${host_id} -m ${soc} -b build-android
```
scripts/build_android_library.sh will no longer build the demo app.
…#9989)

Add README for Android

Make instrumentation tests easier for users to run during local development.

pytorch-bot bot commented Apr 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10089

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 10, 2025
@cccclai cccclai closed this Apr 10, 2025