diff --git a/README.md b/README.md
index 74e6a9a05..940d041cb 100644
--- a/README.md
+++ b/README.md
@@ -7,7 +7,7 @@
 [![codecov](https://codecov.io/github/eth-easl/modyn/graph/badge.svg?token=KFDCE03SQ4)](https://codecov.io/github/eth-easl/modyn)
 [![License](https://img.shields.io/github/license/eth-easl/modyn)](https://img.shields.io/github/license/eth-easl/modyn)
-Modyn is an open-source platform for model training on dynamic datasets, i.e., datasets where points get added or removed over time.
+Modyn is an open-source platform for model training on growing datasets, i.e., datasets where points get added over time.
@@ -60,16 +60,16 @@ Please reach out via Github, Twitter, E-Mail, or any other channel of communicat
 How to [contribute](docs/CONTRIBUTING.md).
 
 ## 🔁 What are dynamic datasets and what is Modyn used for?
-ML is is often applied in use cases where training data evolves and/or grows over time, i.e., datasets are _dynamic_ instead of static.
+ML is often applied in use cases where training data grows over time, i.e., datasets are _growing_ instead of static.
 Training must incorporate data changes for high model quality, however this is often challenging and expensive due to large datasets and models.
-With Modyn, we are actively developing an open-source platform that manages dynamic datasets at scale and supports pluggable policies for when and what data to train on.
+With Modyn, we are actively developing an open-source platform that manages growing datasets at scale and supports pluggable policies for when and what data to train on.
 Furthermore, we are developing a representative open-source benchmarking suite for ML training on dynamic datasets.
 
 The unit of execution in Modyn is a _pipeline_.
-At minimum, a pipeline consists of (1) the model specification, (2) the training dataset, and a corresponding byte parsing function that defines how to convert raw sample bytes to model input, (3) the trigger policy, (4) the data selection policy, (5) training hyperparameters such as optimization criterion, optimizer, learning rate, batch size, and (6) training configuration such as data processing workers, whether to use automatic mixed precision, etc.
+At minimum, a pipeline consists of (1) the model specification, (2) the training dataset and a corresponding byte parsing function that defines how to convert raw sample bytes to model input, (3) the triggering policy, (4) the data selection policy, (5) training hyperparameters such as the learning rate and batch size, (6) training configuration such as data processing workers and number of GPUs, and (7) the model storage policy, i.e., a definition of how the models are compressed and stored.
 Checkout our [Example Pipeline](docs/EXAMPLE.md) guide for an example on how to run a Modyn pipeline.
-Modyn allows researchers to explore training and data selection policies (see [Technical Guidelines](docs/TECHNICAL.md) on how to add new policies to Modyn), while alleviating the burdens of managing large dynamic datasets and orchestrating recurring training jobs.
+Modyn allows researchers to explore triggering and data selection policies (see [Technical Guidelines](docs/TECHNICAL.md) on how to add new policies to Modyn), while alleviating the burdens of managing large growing datasets and orchestrating recurring training jobs.
 However, we strive towards usage of Modyn in practical environments as well.
 We welcome input from both research and practice.
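The seven-part pipeline definition described in the last hunk can be pictured as one declarative config file. The sketch below is illustrative only: every field name is a hypothetical placeholder chosen to mirror components (1)–(7), not Modyn's actual schema — see the [Example Pipeline](docs/EXAMPLE.md) guide referenced above for a real pipeline definition.

```yaml
# Hypothetical sketch of a Modyn pipeline definition.
# All keys are invented for illustration; the real schema is
# documented in docs/EXAMPLE.md of the Modyn repository.
pipeline:
  model:                      # (1) model specification
    id: ResNet18
  dataset:                    # (2) training dataset + byte parsing function
    id: my_growing_dataset
    bytes_parser_function: |
      def bytes_parser_function(data: bytes):
          # convert raw sample bytes to model input
          ...
  trigger:                    # (3) triggering policy: when to retrain
    id: TimeTrigger
    every: 1d
  selection_strategy:         # (4) data selection policy: what to train on
    id: NewDataStrategy
  training:                   # (5) hyperparameters, (6) training configuration
    learning_rate: 0.1
    batch_size: 64
    dataloader_workers: 8
    gpus: 1
  model_storage:              # (7) how trained models are compressed and stored
    strategy: FullModel
```

Expressing the pipeline declaratively like this is what lets the platform, rather than the user, orchestrate the recurring training jobs mentioned at the end of the hunk.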