Abra merge test #2870

Merged
merged 73 commits into abra from abra-merge-test on Jan 17, 2024

Conversation

@cli99 (Contributor) commented Jan 17, 2024

This PR merges the dev (31ea664) into abra.

It has been tested with PyTorch 2.1.2 + llm-foundry, without DTensor.

j316chuck and others added 30 commits November 30, 2023 01:47
…anonical solution length (#2682)

* sequentialize generations_per_sample

* fix bug

* lower generation length

* lower generation length

* lower generation length

* fix gen len

* restore

* restore

* restore

* fix tests

* fix test
* remove flatten params

* simplify tests

* simplify tests

* clean

* fix more tests

* rerun tests

* speed up icl

* fix tests

* fix cpu tests

* add more fixtures

* fix tests

* token count

* fix vocab size

* remove logger

* remove clears

* fix mosaicml logger

* change codeowners

* clean up codeowners

* rerun tests

* shrink dataset

* fix tests

* fix test

* rerun tests

* fix tests

* fix tests

* fix seed

* set to 0

* rerun tests

* rerun tests

* change threshold

* rerun tests

* rerun tests

* logs

* remove changes

* logs

* logs

* remove logs

* rerun tests

* rerun tests

* logs

* rerun

* logs

* rerun

* rerun

* rerun tests

* many more logs

* rerun tests

* strip logs

* enable tests

* remove opt

* rerun tests

* add test

* lint

* rerun tests

* fix lint

* lint

* filter warnings

* rerun tests

* fixture

* add fixture

* change

* logs

* rerun tests

* add logs

* rerun tests

* fixture

* lint

* lint

* rerun tests

* fix ignore warning

* logs

* regex

* regex

* regex

* fix

* logs

* reformat
* change token math

* tokens

* add test

* fix tests
* time to clean up time parsing

* fix type error

* updates
* Upgrade RunConfig compute specification

* extra cluster
* async mlflow logging

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

* small fix

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

* clean up

* fix test

* fix tests

* deflake

* pin mlflow

---------

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>
…r. (#2771)

* v1

* fix issues

* add logs

* change names

* comment

* add device

* uncomment original trace

* add custom plot

* fix pyright

* Update composer/profiler/torch_profiler.py

Co-authored-by: Charles Tang <j316chuck@users.noreply.github.com>

* address comments

* fix code check

* fix formatting

* address comments

* add unit test

* fix check

* fix check

* fix check

* fix check

* fix print

* add test comment

* add test comment

---------

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
Co-authored-by: Charles Tang <j316chuck@users.noreply.github.com>
* improve torch profile args

* improve torch profile args

* change default torch_prof_memory_filename

* add memory profiling arg test

* fix check

* fix check

* fix check

* fix check

* fix check

* fix check
* fix checkpoint validation tests for torch 1.13

* Fix
* bump version

* 0.17.2

* update matrix
Bumps [sphinxext-opengraph](https://github.com/wpilibsuite/sphinxext-opengraph) from 0.9.0 to 0.9.1.
- [Release notes](https://github.com/wpilibsuite/sphinxext-opengraph/releases)
- [Commits](wpilibsuite/sphinxext-opengraph@v0.9.0...v0.9.1)

---
updated-dependencies:
- dependency-name: sphinxext-opengraph
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [coverage[toml]](https://github.com/nedbat/coveragepy) from 7.3.0 to 7.3.3.
- [Release notes](https://github.com/nedbat/coveragepy/releases)
- [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst)
- [Commits](nedbat/coveragepy@7.3.0...7.3.3)

---
updated-dependencies:
- dependency-name: coverage[toml]
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Updates the requirements on [torch](https://github.com/pytorch/pytorch) to permit the latest version.
- [Release notes](https://github.com/pytorch/pytorch/releases)
- [Changelog](https://github.com/pytorch/pytorch/blob/main/RELEASE.md)
- [Commits](pytorch/pytorch@v1.13.1...v2.1.2)

---
updated-dependencies:
- dependency-name: torch
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Enable system metrics in mosaic mlflow logger

* remove fixture

* Update composer/loggers/mlflow_logger.py

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

* Update composer/loggers/mlflow_logger.py

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

* Update composer/loggers/mlflow_logger.py

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

---------

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
* add custom gen kwargs and stopping on eos token

* modify test

* modify test

* finish

* finish

* finish

* finish
mvpatel2000 and others added 17 commits January 9, 2024 18:44
* fixes to get dtensor to work

* more fixes

* Change state dict materialization for new version of torch

* get load working for new set_state_dict api

* use device_mesh

* Add fsdp init monkeypatch for DTensor

* Add checkpoint profiling logs

* attempt

* working single node

* fix optimizer

* allow 3d device mesh

* attempt to use different pg during 3d mesh save

* undo 3d mesh changes

* load_state_dict -> load

* allow parent mesh in FSDP init

* allow override of force_sync_module_states

* remove unnecessary exit

* ignore _validate_and_get_shard_state()

* save/load hsdp-moe working

* remove prints

* v1

* v2

* lint

* add more tests

* switch to PRs

* ignore warning

* fix lint

* version error

* fix version

* fix state dict

* update versions

* lint

* lint

* disable lint for mosaic fsdp utils

* remove bad line

* move around for legacy

* device mesh

* ignore warning

* fix import

* always init

* fix error

* fix load planner

* remove

* fix lint

* lint

* delay state dict

* test checkpoint

* checkpoint

* fix cpu tests

* fix rotate tests

* fix precision

* lint

* fix alibi

* cleanup

* cleanup

* remove force sync

* fix type

* merge

* lint

* fix gpt

* comment

* fix test

* lint

* minor optimizations

* Update composer/core/state.py

Co-authored-by: Evan Racah <evan@mosaicml.com>

* revert tests

---------

Co-authored-by: Evan Racah <ejracah@gmail.com>
Co-authored-by: Abhinav Venigalla <abhi.venigalla@databricks.com>
Co-authored-by: root <23239305+b-chu@users.noreply.github.com>
Co-authored-by: Abhinav Venigalla <abhi@mosaicml.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Evan Racah <evan@mosaicml.com>
* first try

* add context

* lint

* more lint

* remove comment

---------

Co-authored-by: Daniel King <daniel@mosaicml.com>
Co-authored-by: Your Name <you@example.com>
* bump torch

* bump

* bump
* Support checkpoint uploads to MLFlow (untested)

Use MLFlow run tag for autoresume

Add MLFlowLogger test for existing composer run tag

* Try formatting mlflow save folder after INIT

Make MLFlow experiment and run ID available on all ranks

Fix path issue

Format mlflow placeholders in remote filenames

* Unit tests for partial_format

* Log mlflow info as hyperparams

* partial_format doc update

* Fix formatting

* Pull distributed logic out of MLFlowObjectStore

Add debug tracebacks

Bugfix

Add path to debug info

Try fixing RUD object store init

Pyright

* Partial format in format_name helpers

* Fix import

* Add extra partial_format test

* Fix mlflow RUD check

* Fix test

pyright

No longer expect KeyError for format_with_dist using partial_format

Refactor partial_format for readability

* Max iters on partial_format

* Fix partial_format

* Clean up

* fix test import

* Fix test
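
Note on `partial_format`: the commit bullets above mention a `partial_format` helper, a KeyError that is no longer expected, and a cap on iterations. The sketch below shows one way such a helper could work; it is an illustration under assumptions, not the code merged in this PR. The function name `partial_format_sketch`, the iteration cap, and the placeholder names in the usage lines are all hypothetical.

```python
def partial_format_sketch(template: str, max_iters: int = 1000, **values) -> str:
    """Fill the named placeholders that have values; leave the others as-is.

    Hypothetical helper for illustration only. Each time str.format reports
    a missing key, the key is mapped back to its own placeholder text and
    formatting is retried. max_iters bounds the number of retries.
    """
    filled = dict(values)
    for _ in range(max_iters):
        try:
            return template.format(**filled)
        except KeyError as err:
            missing = err.args[0]
            # Re-emit the unknown placeholder unchanged, e.g. 'epoch' -> '{epoch}'.
            filled[missing] = '{' + missing + '}'
    raise RuntimeError(f'Gave up after {max_iters} attempts to format {template!r}')


# First pass: only run_name is known, so the other placeholders survive.
first = partial_format_sketch('{run_name}/ep{epoch}-rank{rank}', run_name='demo')
# Second pass: the remaining placeholders are filled in later.
second = partial_format_sketch(first, epoch=2, rank=0)
```

A limitation of this sketch: a missing placeholder that carries a format spec (for example `{rank:03d}`) raises ValueError, because the stand-in value is a string. Missing positional fields are not handled either.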
* update nightly to torch 2.3

* tighten

---------

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
* pin release

* bump

* break pypi

* tighter pin

* pin

* pin

* pin
* add monkeypatch for verify_options

* patch

* fix

* fix

* partial precommit

* bit of cleanup

* doc

* debug

* fix version pinning

* precommit

* checkdown

* lint

---------

Co-authored-by: Evan Racah <ejracah@gmail.com>
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
…2866)

Updates the requirements on [mosaicml-cli](https://github.com/mosaicml/mosaicml-cli) to permit the latest version.
- [Commits](https://github.com/mosaicml/mosaicml-cli/commits)

---
updated-dependencies:
- dependency-name: mosaicml-cli
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* checkdown

* checkdown

* lint

* fix

* load ignore keys

* fix

* resolve comments

* fix load ignore keys

* offload

* fix gate

* merge

* lint

* use flag

* force true
* add custom gen kwargs and stopping on eos token

* modify test

* modify test

* finish

* finish

* finish

* finish

* finish pr

* implement early stop

* add test

* fix bug

* bug fix

* add keys

* diff split

* fix typo

* fix precommit

* fix precommit

* fix precommit

* fix precommit

* fix precommit

* fix precommit

* fix conditional import

* add nlp metrics

* remove code gen changes

* fix nits

---------

Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
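
Note on "stopping on eos token" and "implement early stop": the commit above does not show its code here, so the following is only a minimal sketch of the general idea, assuming generated outputs are plain lists of integer token ids. The function names `truncate_at_eos` and `all_finished` and the token id values are hypothetical and are not taken from this PR.

```python
from typing import List, Sequence


def truncate_at_eos(token_ids: Sequence[int], eos_token_id: int) -> List[int]:
    """Return the tokens that precede the first EOS token.

    If no EOS token is present, the whole sequence is returned.
    """
    kept: List[int] = []
    for token in token_ids:
        if token == eos_token_id:
            break
        kept.append(token)
    return kept


def all_finished(batch: Sequence[Sequence[int]], eos_token_id: int) -> bool:
    """True once every sequence in the batch contains an EOS token.

    A generation loop could check this after each step and stop early
    instead of running to the maximum generation length.
    """
    return all(eos_token_id in sequence for sequence in batch)


EOS = 2  # assumed EOS id, for illustration only
trimmed = truncate_at_eos([5, 6, EOS, 7, 8], EOS)
done = all_finished([[5, EOS], [9, 9, EOS]], EOS)
not_done = all_finished([[5, EOS], [9, 9, 9]], EOS)
```

Stopping as soon as every sequence has produced an EOS token avoids spending compute on tokens that would be discarded by the truncation step anyway.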
* comment

* add it

* debug

* add the keys

* debug

* debug

* remove print statement

* docs and tests

* fix tests

---------

Co-authored-by: Daniel King <daniel@mosaicml.com>
@cli99 cli99 requested review from b-chu and mvpatel2000 January 17, 2024 01:55
@cli99 cli99 marked this pull request as ready for review January 17, 2024 01:55
@b-chu (Contributor) left a comment

Thanks for doing this Cheng!

@cli99 cli99 merged commit 76a0e43 into abra Jan 17, 2024
5 of 14 checks passed
@cli99 cli99 deleted the abra-merge-test branch January 17, 2024 17:09