@tushar00jain
Contributor

Summary:

  • Call the Flight Recorder (FR) API to reset the trace after every quorum, so that each quorum starts with a fresh FR trace: the process groups (PGs) may have changed, and the FR trace from previous errors has already been dumped.
  • Change the environment variable that determines the FR trace file after every quorum.

Differential Revision: D84260745

meta-codesync bot commented Oct 16, 2025

@tushar00jain has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84260745.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 16, 2025
tushar00jain added a commit to tushar00jain/torchft that referenced this pull request Oct 16, 2025
@tushar00jain tushar00jain force-pushed the export-D84260745 branch 2 times, most recently from 1d99280 to d048341 Compare October 16, 2025 16:31
tushar00jain added a commit to tushar00jain/torchft that referenced this pull request Oct 17, 2025
tushar00jain added a commit to tushar00jain/torchft that referenced this pull request Oct 21, 2025
Reviewed By: d4l3k
@tushar00jain tushar00jain force-pushed the export-D84260745 branch 2 times, most recently from ec50ef7 to 575d7c0 Compare October 21, 2025 17:30
tushar00jain added a commit to tushar00jain/torchft that referenced this pull request Oct 24, 2025
tushar00jain added a commit to tushar00jain/torchft that referenced this pull request Nov 3, 2025
tushar00jain added a commit to tushar00jain/torchft that referenced this pull request Nov 4, 2025
tushar00jain added a commit to tushar00jain/torchft that referenced this pull request Nov 6, 2025
@tushar00jain tushar00jain force-pushed the export-D84260745 branch 2 times, most recently from e603706 to 59b647a Compare November 6, 2025 19:04
tushar00jain added a commit to tushar00jain/torchft that referenced this pull request Nov 6, 2025
Summary:

- Call the Flight Recorder (FR) API to reset the trace after every quorum, so that each quorum starts with a fresh FR trace: the process groups (PGs) may have changed, and the FR trace from previous errors has already been dumped.
- Change the environment variable that determines the FR trace file after every quorum.
- Return replica IDs in the quorum response so that global ranks in the PG can be determined; these are used to set the metadata on the PG that Flight Recorder needs to work.

Reviewed By: d4l3k

Differential Revision: D84260745
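
The three changes in the summary above can be illustrated with a minimal, self-contained sketch. It does not use the actual torchft or PyTorch APIs: `FakeFlightRecorder`, `on_quorum`, `global_ranks`, and the environment variable name `EXAMPLE_FR_DUMP_FILE` are all hypothetical stand-ins, and the rank layout (replicas sorted by ID, each owning a contiguous block of ranks) is an assumption made only for illustration.

```python
import os
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical name: the PR summary does not say which variable is changed.
FR_DUMP_FILE_ENV = "EXAMPLE_FR_DUMP_FILE"


@dataclass
class FakeFlightRecorder:
    """Stand-in for a flight recorder: an in-memory list of trace entries."""

    entries: List[str] = field(default_factory=list)

    def record(self, event: str) -> None:
        self.entries.append(event)

    def reset(self) -> None:
        # Drop everything recorded under the previous quorum.
        self.entries.clear()


def global_ranks(replica_ids: List[str], group_world_size: int) -> Dict[str, List[int]]:
    """Map each replica ID to its global ranks.

    Assumed layout (not confirmed by the PR): replicas are sorted by ID and
    each replica owns a contiguous block of `group_world_size` ranks.
    """
    ordered = sorted(replica_ids)
    return {
        rid: list(range(i * group_world_size, (i + 1) * group_world_size))
        for i, rid in enumerate(ordered)
    }


def on_quorum(
    recorder: FakeFlightRecorder,
    quorum_id: int,
    replica_ids: List[str],
    group_world_size: int,
    dump_dir: str = "/tmp",
) -> Dict[str, List[int]]:
    """Run the per-quorum bookkeeping described in the summary."""
    # 1. Start a fresh trace, since the process groups may have changed.
    recorder.reset()
    # 2. Point the dump file at a per-quorum path so traces do not collide.
    os.environ[FR_DUMP_FILE_ENV] = os.path.join(dump_dir, f"fr_trace_quorum_{quorum_id}")
    # 3. Derive global ranks from the replica IDs in the quorum response.
    return global_ranks(replica_ids, group_world_size)
```

Calling `on_quorum(recorder, 3, ["replica_b", "replica_a"], 2)` empties the recorder, sets the hypothetical variable to a path ending in `fr_trace_quorum_3`, and returns the rank mapping for the two replicas under the assumed layout.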
@tushar00jain tushar00jain force-pushed the export-D84260745 branch 2 times, most recently from 189dbad to c5a3407 Compare November 6, 2025 23:12
Summary:

Remove device mesh since we don't really use it. Device mesh is undergoing a lot of changes, and using private APIs makes the subclass difficult to maintain. We will revisit device mesh integration using public APIs.

Differential Revision: D86466239
tushar00jain added a commit to tushar00jain/torchft that referenced this pull request Nov 7, 2025
@tushar00jain tushar00jain force-pushed the export-D84260745 branch 2 times, most recently from c28793a to 831683f Compare November 7, 2025 01:36
@meta-codesync meta-codesync bot closed this in 854fb2d Nov 7, 2025

meta-codesync bot commented Nov 7, 2025

This pull request has been merged in 854fb2d.

Labels

CLA Signed, fb-exported, Merged, meta-exported
