Commit

Merge branch 'main' into robinholzi/feat/nbstripout
robinholzi authored Jul 24, 2024
2 parents 45045af + 2f29dc4 commit 6c4a11f
Showing 1 changed file with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -7,7 +7,7 @@
[![codecov](https://codecov.io/github/eth-easl/modyn/graph/badge.svg?token=KFDCE03SQ4)](https://codecov.io/github/eth-easl/modyn)
[![License](https://img.shields.io/github/license/eth-easl/modyn)](https://img.shields.io/github/license/eth-easl/modyn)

-Modyn is an open-source platform for model training on dynamic datasets, i.e., datasets where points get added or removed over time.
+Modyn is an open-source platform for model training on growing datasets, i.e., datasets where points get added over time.

</div>

@@ -60,16 +60,16 @@ Please reach out via Github, Twitter, E-Mail, or any other channel of communicat
How to [contribute](docs/CONTRIBUTING.md).

## 🔁 What are dynamic datasets and what is Modyn used for?
-ML is is often applied in use cases where training data evolves and/or grows over time, i.e., datasets are _dynamic_ instead of static.
+ML is is often applied in use cases where training data grows over time, i.e., datasets are _growing_ instead of static.
Training must incorporate data changes for high model quality, however this is often challenging and expensive due to large datasets and models.
-With Modyn, we are actively developing an open-source platform that manages dynamic datasets at scale and supports pluggable policies for when and what data to train on.
+With Modyn, we are actively developing an open-source platform that manages growing datasets at scale and supports pluggable policies for when and what data to train on.
Furthermore, we are developing a representative open-source benchmarking suite for ML training on dynamic datasets.

The unit of execution in Modyn is a _pipeline_.
-At minimum, a pipeline consists of (1) the model specification, (2) the training dataset, and a corresponding byte parsing function that defines how to convert raw sample bytes to model input, (3) the trigger policy, (4) the data selection policy, (5) training hyperparameters such as optimization criterion, optimizer, learning rate, batch size, and (6) training configuration such as data processing workers, whether to use automatic mixed precision, etc.
+At minimum, a pipeline consists of (1) the model specification, (2) the training dataset and a corresponding byte parsing function that defines how to convert raw sample bytes to model input, (3) the triggering policy, (4) the data selection policy, (5) training hyperparameters such as the the learning rate and batch size, (6) training configuration such as data processing workers and number of GPUs, and (7) the model storage policy, i.e., a definition how the models are compressed and stored.
Checkout our [Example Pipeline](docs/EXAMPLE.md) guide for an example on how to run a Modyn pipeline.

-Modyn allows researchers to explore training and data selection policies (see [Technical Guidelines](docs/TECHNICAL.md) on how to add new policies to Modyn), while alleviating the burdens of managing large dynamic datasets and orchestrating recurring training jobs.
+Modyn allows researchers to explore triggering and data selection policies (see [Technical Guidelines](docs/TECHNICAL.md) on how to add new policies to Modyn), while alleviating the burdens of managing large growing datasets and orchestrating recurring training jobs.
However, we strive towards usage of Modyn in practical environments as well.
We welcome input from both research and practice.
