Expert Parallelism #361

Open: wants to merge 95 commits into base: dev


xrsrke (Member) commented Apr 29, 2025

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guidelines?
  • Did you write any new necessary tests?
  • Did you log the throughput and loss you get to ensure the PR works as expected in actual training?
  • Did you log the memory usage? You can use this tool to understand the memory usage breakdown in nanotron.
  • If you modified anything related to checkpoints, did you verify that saving and reloading checkpoints still works correctly?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

NouamaneTazi and others added 30 commits April 14, 2025 16:17
* can only merge to main from dev

* Fix UnBoundLocalError in `clm_collator.py` (#339)

* Update clm_collator.py

* can only merge to main from dev (#348)

---------

Co-authored-by: Nouamane Tazi <nouamane@huggingface.co>

* fix init and init scaling factor and run evals in background (#349)

* InitScalingMethod

* InitScalingMethod

* run evals in background (#352)

* eval

* try adding lightevalrunner to trainer

* amend

* amend

* amend

* amend

* amend

* amend

* .

* amend

* amend

* .

* qos to low

* add nanotron_path

* some fix: logs, and config

* cp instead of sync

* eval_interval

* serialize sanity checks

* add output dir and s3_save path in the config

* fix s3 only if define

* fixes

---------

Co-authored-by: elie <97572401+eliebak@users.noreply.github.com>
Co-authored-by: “eliebak” <elie.bakouch@huggingface.co>

---------

Co-authored-by: elie <97572401+eliebak@users.noreply.github.com>
Co-authored-by: “eliebak” <elie.bakouch@huggingface.co>

---------

Co-authored-by: Connector Switch <c8ef@outlook.com>
Co-authored-by: elie <97572401+eliebak@users.noreply.github.com>
Co-authored-by: “eliebak” <elie.bakouch@huggingface.co>
inference seems good rn
3 participants