Description
Feature description
In the current state of browsermt/marian-dev, the workspace, which manages the allocation of tensors, sits behind the graph that is exposed through the library API bergamot-translator uses. This leads to an inefficient interim implementation of multiple-model handling (browsermt/bergamot-translator#210), where workspace memory grows in proportion to the number of active models.
@XapaJIaMnu and @kpu have previously solved swapping between multiple models by swapping tensors onto an active graph. This approach is "dynamic", and a reference implementation is available at https://github.com/kpu/marian-dev/blob/dynamic_swap_mvp/src/translator/swappable.cpp. While this is doable without much expense when the models share an architecture, a change in architecture requires reconstructing the graph (e.g. a tied-embedding model swapped out for a non-tied-embedding model).
It would be better to bind the workspace to the active threads/workers instead, and to keep the graph and architecture separate from it, so that memory usage does not blow up beyond what is actually required.
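To make the proposed ownership concrete, here is a minimal sketch of the idea, not marian code. `Model`, `Workspace`, `Worker` and `totalWorkspaceBytes` are hypothetical names invented for illustration, and the single `scratchBytes` number is an assumed stand-in for whatever a real graph would request from its allocator. The point is only that workspace memory scales with the number of workers and the largest model, not with the number of models.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a loaded model (parameters + architecture).
struct Model {
  std::string name;
  std::size_t scratchBytes;  // assumed scratch memory one forward pass needs
};

// Hypothetical per-worker scratch arena; grows to the largest request seen.
class Workspace {
public:
  void reserve(std::size_t bytes) {
    if (bytes > buffer_.size()) buffer_.resize(bytes);
  }
  std::size_t capacity() const { return buffer_.size(); }

private:
  std::vector<unsigned char> buffer_;
};

// A worker owns exactly one workspace, however many models it serves.
class Worker {
public:
  void translate(const Model& model) {
    workspace_.reserve(model.scratchBytes);
    assert(workspace_.capacity() >= model.scratchBytes);
    // ... build or look up the graph for `model`, run it on workspace_ ...
  }
  std::size_t workspaceBytes() const { return workspace_.capacity(); }

private:
  Workspace workspace_;
};

// Total workspace memory after every worker has served every model.
std::size_t totalWorkspaceBytes(std::size_t numWorkers,
                                const std::vector<Model>& models) {
  std::vector<Worker> workers(numWorkers);
  std::size_t total = 0;
  for (auto& worker : workers) {
    for (const auto& model : models) worker.translate(model);
    total += worker.workspaceBytes();
  }
  return total;
}
```

With two workers and models needing 100, 300 and 200 bytes, this layout holds 2 × 300 bytes of workspace, whereas one workspace per model per worker would hold 2 × (100 + 300 + 200).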
This issue is intended to investigate how best to modify this repository to solve the problem above.
/cc @graemenail