Enable model and data sharding #96

gianlucadetommaso · 2023-07-02T21:34:36Z

To scale to very large models, we need to enable model parallelism and fully sharded data parallelism (FSDP) in Fortuna. We can do so by exploiting JAX's sharding functionality. In this PR, we plan to make Fortuna's training and fine-tuning methods work with sharding. The user specifies how many GPUs to allocate to each type of parallelism (data, FSDP, and model parallelism) and a few simple model partitioning rules. Fortuna will do the rest.
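As a rough illustration of the idea described above, here is a minimal sketch using JAX's public sharding API (`Mesh`, `NamedSharding`, `PartitionSpec`). It is not Fortuna's implementation: the axis names `dp`, `fsdp` and `mp`, the helpers `make_mesh`, `spec_for` and `shard_params`, the `RULES` table, and the flat parameter paths are all hypothetical and chosen only for this example.

```python
import re

import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P


def make_mesh(dp: int, fsdp: int, mp: int) -> Mesh:
    """Arrange all available devices on a (dp, fsdp, mp) grid.

    The axis names are an assumption of this sketch, not Fortuna's API.
    """
    devices = jax.devices()
    if dp * fsdp * mp != len(devices):
        raise ValueError(
            f"dp * fsdp * mp = {dp * fsdp * mp}, "
            f"but {len(devices)} devices are available."
        )
    grid = np.array(devices).reshape(dp, fsdp, mp)
    return Mesh(grid, axis_names=("dp", "fsdp", "mp"))


# Hypothetical partitioning rules: (regex on the parameter path, spec).
# The first matching rule wins.
RULES = (
    # Split kernel rows over the FSDP axis and columns over the model axis.
    (r"dense/kernel$", P("fsdp", "mp")),
    # Everything else is replicated on every device.
    (r".*", P()),
)


def spec_for(path: str, rules=RULES) -> P:
    """Return the PartitionSpec of the first rule matching `path`."""
    for pattern, spec in rules:
        if re.search(pattern, path):
            return spec
    return P()  # no rule matched: replicate


def shard_params(params: dict, mesh: Mesh) -> dict:
    """Place each parameter on the mesh according to its matching rule."""
    return {
        path: jax.device_put(value, NamedSharding(mesh, spec_for(path)))
        for path, value in params.items()
    }


if __name__ == "__main__":
    # Put every device on the data-parallel axis so this runs anywhere,
    # including on a single CPU.
    mesh = make_mesh(dp=len(jax.devices()), fsdp=1, mp=1)
    params = {
        "encoder/dense/kernel": jnp.ones((4, 8)),
        "encoder/dense/bias": jnp.zeros((8,)),
    }
    for path, value in shard_params(params, mesh).items():
        print(path, value.sharding)
```

In this sketch the user only chooses the three axis sizes and the rule table; placement follows from matching each parameter path against the rules. Each sharded array dimension must be divisible by the size of the mesh axis it is split over.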

Pull request type

Please check the type of change your PR introduces:

- create partition manager object
- make MAP compatible
- migrate to Orbax checkpointing
- refactor predictive

gianlucadetommaso added 26 commits May 15, 2023 19:07

edit installation instructions in readme

52e96ea

Merge branch 'main' of https://github.com/awslabs/fortuna

5e0076d

Merge branch 'main' of https://github.com/awslabs/fortuna

4c7fd28

bump up version

6cb6581

Merge branch 'main' of https://github.com/awslabs/fortuna

1b39780

Merge branch 'main' of https://github.com/awslabs/fortuna

cb2b49a

Merge branch 'main' of https://github.com/awslabs/fortuna

14e3ca4

Merge branch 'main' of https://github.com/awslabs/fortuna

580067d

Merge branch 'main' of https://github.com/awslabs/fortuna

048ef09

Merge branch 'main' of https://github.com/awslabs/fortuna

ad542a4

Merge branch 'main' of https://github.com/awslabs/fortuna

41417c1

Merge branch 'main' of https://github.com/awslabs/fortuna

64be374

Merge branch 'main' of https://github.com/awslabs/fortuna

a2d0f34

Merge branch 'main' of https://github.com/awslabs/fortuna

66bba06

Merge branch 'main' of https://github.com/awslabs/fortuna

911aa82

Merge branch 'main' of https://github.com/awslabs/fortuna

01f959b

Merge branch 'main' of https://github.com/awslabs/fortuna

79f8dca

add sequence probit

99a3b78

add possibility to run sequential probit on last steps only

1c23a9e

Merge branch 'main' of https://github.com/awslabs/fortuna

4dea50f

Merge branch 'main' into seqprobit

915a1ea

refactor sequential probit implementation

e966745

add stop gradient flag

529f9aa

pre-commit

42d2117

add probit options in example script

734f597

mesh

404840e

gianlucadetommaso marked this pull request as draft July 2, 2023 21:34

enable model and data sharding

4444907

- create partition manager object
- make MAP compatible
- migrate to Orbax checkpointing
- refactor predictive

gianlucadetommaso force-pushed the mesh branch from 85303f3 to 4444907 Compare July 6, 2023 16:14

make further changes after training roberta

830fbe8

gianlucadetommaso added 8 commits July 16, 2023 22:15

further changes

e3e1c4f

refactoring laplace

6d47a47

start debugging swag

ed571de

Merge branch 'main' of https://github.com/awslabs/fortuna

1ced008

Merge branch 'main' of https://github.com/awslabs/fortuna

6992692

make small change in readme because of publish to pypi error

b2540c1

Merge branch 'main' of https://github.com/awslabs/fortuna

2362998

debug deep ensemble

ba52081

gianlucadetommaso mentioned this pull request Jul 18, 2023

bug: ConcretizationTypeError when trying to use prob_model.predictive() #101

Closed

fix sghmc and sgld

d2fc289
