Skip to content

Conversation

@ydshieh
Copy link
Collaborator

@ydshieh ydshieh commented Feb 13, 2025

What does this PR do?

It takes 3 minutes on main. This PR simply uses multiprocessing to make it faster.

@gante
Copy link
Contributor

gante commented Feb 13, 2025

Other scripts in make fixup would benefit from it too 🙏

(For completeness: A few months ago we had a conversation somewhere about adding multiprocessing or not. I remember the general consensus was to leave it as a last resort, as it may lead to other issues. Personally, I think the benefits outweigh the risks :D)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ydshieh
Copy link
Collaborator Author

ydshieh commented Feb 13, 2025

Other scripts might be also addressed by multiprocessing, but it depends on if a script has a linear logic: check-copies might be a difficult one

@ydshieh
Copy link
Collaborator Author

ydshieh commented Feb 13, 2025

@gante I just made it as an argument num_workers , default to 1 , and set it to 4 in CircleCI's .circleci/config.yml

@ydshieh ydshieh merged commit bfe46c9 into main Feb 13, 2025
9 of 11 checks passed
@ydshieh ydshieh deleted the speed_check branch February 13, 2025 16:25
Comment on lines +166 to +169
with multiprocessing.Pool(4) as p:
outputs = p.map(compare_files, new_ordered_files)
for output in outputs:
non_matching_files += output
Copy link
Contributor

@qubvel qubvel Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious couldn't we get unexpected results with multiprocessing here, for example

new_ordered_files = [modular_B(A), modular_C(B)]

With multiprocessing both of them would be converted "at the same time", but should be converted sequentially because C depends on the converted B model.

Copy link
Collaborator Author

@ydshieh ydshieh Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new_ordered_files is computed sequentially. And the MP is only used (after new_ordered_files is computed) in the case where fix_and_overwrite=False, so there is no file change in this case :-)

ydshieh added a commit that referenced this pull request Feb 13, 2025
fix my bad

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Comment on lines +165 to +166

with multiprocessing.Pool(4) as p:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was just going though that quickly and noticed that 4 is hardcoded here. It is expected @ydshieh? Probably should use args.num_workers no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahhh.... I forget to update this line after introducing args.num_workers. Thank you for spotting !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants