Description
Add support for serving many different models where each model handles a subset of the possible inputs (e.g. city-based models). Because each model is designed for only a subset of input queries, certain models may be queried more often than others. Serve the top N most-queried models, loading and unloading models on an LRU basis (a sketch of the cache behavior follows the use-case list below).
Here are the different use cases that could be handled:
- Thousands of models or just a few
- All models fit into memory or not
- A static list of models, or a dynamic one (e.g. point to an S3 prefix, so that new models can be added after the API is deployed)
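To make the LRU behavior concrete, here is a minimal sketch of an in-memory model cache. Everything in it is illustrative: the class name, the `load_fn`/`unload_fn` hooks, and the `max_models` limit are assumptions for this proposal, not an existing Cortex API.

```python
from collections import OrderedDict


class LRUModelCache:
    """Keeps at most `max_models` models in memory, evicting the least recently used."""

    def __init__(self, max_models, load_fn, unload_fn):
        self.max_models = max_models
        self.load_fn = load_fn      # loads a model from its on-disk path into memory
        self.unload_fn = unload_fn  # frees the model's memory (and GPU memory, if any)
        self._models = OrderedDict()  # model id -> loaded model, ordered by recency

    def get(self, model_id, disk_path):
        if model_id in self._models:
            self._models.move_to_end(model_id)  # mark as most recently used
            return self._models[model_id]
        model = self.load_fn(disk_path)
        self._models[model_id] = model
        while len(self._models) > self.max_models:
            _, evicted = self._models.popitem(last=False)  # evict the LRU entry
            self.unload_fn(evicted)
        return model
```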
Implementation
cron (sketched below):
- update the tree (the index of models/versions available under `model_dir`)
- for each model in memory, unload it if it is not in the tree
- for each model in memory that has a `latest` timestamp: if there is a new version && (the timestamp on `latest` is newer than the oldest timestamp currently in the cache, or the cache has space), download it and load it into memory
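A rough sketch of this cron pass, assuming hypothetical `tree` and `cache` objects with the listed operations; none of these names exist in Cortex today.

```python
def cron_refresh(tree, cache, list_s3_models, download_model):
    # 1. update the tree from the model_dir pool (hypothetical S3 listing helper)
    tree.update(list_s3_models())

    # 2. unload any in-memory model that is no longer in the tree
    for model_id in list(cache.loaded_ids()):
        if model_id not in tree:
            cache.unload(model_id)

    # 3. for in-memory models tracking "latest": pull the new version if its
    #    timestamp beats the oldest cached timestamp, or if the cache has space
    for model_id in list(cache.loaded_ids()):
        if not cache.tracks_latest(model_id):
            continue
        newest_ts = tree.latest_timestamp(model_id)
        if newest_ts > cache.timestamp(model_id) and (
            newest_ts > cache.oldest_timestamp() or cache.has_space()
        ):
            disk_path = download_model(model_id, newest_ts)
            cache.load(model_id, disk_path)
```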
request (sketched below):
- if the model is not in the tree:
  - option 1: error
  - option 2: if it is in S3, update the tree; otherwise error
- if not on disk: download the model
- if not in memory: load it into memory
- if the cache is too big, evict based on LRU
- predict()
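The request path, sketched with the same hypothetical objects (this version takes option 2 for tree misses):

```python
def handle_request(model_name, version, payload, tree, disk_cache, mem_cache):
    model_id = (model_name, version or "latest")

    # not in the tree: re-check S3 once (option 2); error out if still missing
    if model_id not in tree:
        tree.refresh_from_s3()
        if model_id not in tree:
            raise ValueError(f"model {model_name} (version {version}) not found")

    # not on disk: download it
    if not disk_cache.has(model_id):
        disk_cache.download(model_id)

    # not in memory: load it; the cache evicts on LRU if it grows too big
    model = mem_cache.get(model_id, disk_cache.path(model_id))

    return model.predict(payload)
```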
python:
- user defines `load_model(self, disk_path)` (example below)
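For example, with a PyTorch model the user-defined hook might look like this; the class shape and the `model.pt` filename are illustrative, and only the `load_model(self, disk_path)` signature comes from this proposal.

```python
import os

import torch  # assuming a PyTorch model; any framework could be used here


class PythonPredictor:
    def __init__(self, config):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def load_model(self, disk_path):
        # called whenever the cache needs to bring a model version into memory;
        # disk_path points at the downloaded directory for that version
        model = torch.load(os.path.join(disk_path, "model.pt"), map_location=self.device)
        model.eval()
        return model
```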
Open Questions
- Where to put Python cache helper
- How to unload models from memory? Anything special for GPU?
- pre-download and/or pre-load during `init()`?
Config questions
- Should the cron interval be configurable?
- Should the default have a cache size limit, or be infinite (i.e. no eviction)?
- `model_dir` would be a configurable field holding an S3 path / local path that points to a big pool of models. This is where all required models are pulled in from. The name of a model (or its unique identifier) is the name of a directory within the given `model_dir` path, within which multiple versions of that model can be found.
- The model disk cache size can be >= the model cache size (which resides in memory). A `disk_model_cache_size` field should exist in the Cortex config. All cached models must fit on disk / in memory. There should also be a `model_cache_size` field that controls the number of models that can fit in memory at any point in time (a config sketch follows this list).
- Should be able to point to the dynamic list (`model_dir`) or to a static list of models (`models`). The static list of models won't have the updating mechanism, and thus no version selection is possible when making predictions.
- How should we handle OOM issues when making predictions? When a prediction is made, memory has to be allocated for tensors, and this could exceed the available system memory (RAM or VRAM).
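To make the field placement concrete, a hypothetical API spec using these fields might look roughly like this; the overall layout is an assumption, and only the `model_dir`, `model_cache_size`, and `disk_model_cache_size` names (plus the static `models` alternative) come from this proposal.

```yaml
- name: city-classifier
  predictor:
    type: python
    path: predictor.py
    model_dir: s3://my-bucket/models/  # dynamic pool; one sub-directory per model
    model_cache_size: 50               # max models held in memory at once
    disk_model_cache_size: 200         # max models kept on disk (>= model_cache_size)
```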
Notes
- LRU memory cache and disk cache
- volumes are not shared across replicas
- `threads_per_process` > 1 is supported for TensorFlow and Python
- `processes_per_replica` > 1 is not supported for Python, maybe supported for TensorFlow (if easy)
- When serving, the requester may decide to use the `latest` version of a given model or a specific version of it (e.g. `v1`). If no version is specified, resort to using `latest`.
- `latest` has its own timestamp, separate from each version's. When evicting from the cache, the `latest` timestamp will be associated with the latest model (even though there may not be a timestamp associated with that model itself). A version-resolution sketch follows this list.
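A small sketch of the version-resolution rule described in the last two notes; the `tree` structure and helper name are hypothetical.

```python
def resolve_version(tree, model_name, requested_version=None):
    """Return (version, timestamp) to serve for a request.

    `tree` is assumed to map model name -> {version: timestamp}. An unspecified
    version resolves to "latest", whose timestamp is that of the newest version.
    """
    versions = tree[model_name]  # e.g. {"1": 1612345678, "2": 1612349999}
    if requested_version in (None, "latest"):
        version = max(versions, key=versions.get)  # newest timestamp wins
    else:
        version = requested_version
    return version, versions[version]
```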