
Conversation

Contributor

@EddyLXJ EddyLXJ commented Nov 11, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2122

For the SilverTorch publish flow, we don't want to load the optimizer state into the backend, due to limited CPU memory on the publish host.
So we load the whole row into the state dict while loading the checkpoint during ST publish, then save only the weight into the backend; after that, the backend holds only metaheader + weight.
For the first load, we need to set the dimension to metaheader_dim + emb_dim + optimizer_state_dim, otherwise checkpoint loading will throw a size-mismatch error. After the first load, we only need metaheader + weight from the backend for the state dict, so we can set the dimension to metaheader_dim + emb_dim.

Differential Revision: D85830053
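The dimension switch described in the summary can be sketched as follows. This is an illustrative sketch only: the names `row_dim` and `strip_optimizer_state` are hypothetical helpers, not the actual TorchRec/FBGEMM API; only the `metaheader_dim + emb_dim + optimizer_state_dim` layout comes from the summary.

```python
import torch


def row_dim(metaheader_dim: int, emb_dim: int,
            optimizer_state_dim: int, first_load: bool) -> int:
    """Per-row dimension the state dict must expose.

    On the first load the checkpoint still carries the optimizer state,
    so the row covers metaheader + weight + optimizer state; after that,
    the backend stores only metaheader + weight.
    """
    if first_load:
        return metaheader_dim + emb_dim + optimizer_state_dim
    return metaheader_dim + emb_dim


def strip_optimizer_state(rows: torch.Tensor,
                          metaheader_dim: int, emb_dim: int) -> torch.Tensor:
    """Keep only the metaheader + weight columns before saving to the backend."""
    return rows[:, : metaheader_dim + emb_dim]
```

With e.g. `metaheader_dim=2`, `emb_dim=4`, `optimizer_state_dim=3`, the first load expects 9-wide rows and subsequent loads expect 6-wide rows, which is exactly why using the wrong dimension on the first load trips a size-mismatch error.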

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 11, 2025

meta-codesync bot commented Nov 11, 2025

@EddyLXJ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85830053.

EddyLXJ added a commit to EddyLXJ/FBGEMM-1 that referenced this pull request Nov 11, 2025, followed by further commits to EddyLXJ/FBGEMM-1 and EddyLXJ/torchrec on Nov 13–14, 2025, each carrying the same summary and Differential Revision D85830053.
@meta-codesync meta-codesync bot closed this in 22baa4f Nov 15, 2025
meta-codesync bot pushed a commit to pytorch/FBGEMM that referenced this pull request Nov 15, 2025
Summary:
Pull Request resolved: #5116

X-link: meta-pytorch/torchrec#3538

X-link: https://github.com/facebookresearch/FBGEMM/pull/2122

For the SilverTorch publish flow, we don't want to load the optimizer state into the backend, due to limited CPU memory on the publish host.
So we load the whole row into the state dict while loading the checkpoint during ST publish, then save only the weight into the backend; after that, the backend holds only metaheader + weight.
For the first load, we need to set the dimension to metaheader_dim + emb_dim + optimizer_state_dim, otherwise checkpoint loading will throw a size-mismatch error. After the first load, we only need metaheader + weight from the backend for the state dict, so we can set the dimension to metaheader_dim + emb_dim.

Reviewed By: emlin

Differential Revision: D85830053

fbshipit-source-id: 0eddbe9e69ea8271e8c77dc0147e87a08f0b3934
Labels

CLA Signed · fb-exported · meta-exported
