Description
Add support for serving many different models where each model handles a subset of the possible inputs (e.g. city-based models). Because each model is designed for only a subset of input queries, certain models may be queried more often than others. Serve the top N most-queried models, loading and unloading models on an LRU basis (a sketch of the cache behavior follows the use-case list below).
Here are the different use cases that could be handled:
- Thousands of models or just a few
- All models fit into memory or not
- A static list of models, or a dynamic one (e.g. point to an S3 prefix, so that new models can be added after the API is deployed)
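To make the LRU behavior concrete, here is a minimal sketch of an in-memory model cache. Everything in it is illustrative: the class name, the `load_fn`/`unload_fn` hooks, and the `max_models` limit are assumptions for this proposal, not an existing Cortex API.

```python
from collections import OrderedDict


class LRUModelCache:
    """Keeps at most `max_models` models in memory, evicting the least recently used."""

    def __init__(self, max_models, load_fn, unload_fn):
        self.max_models = max_models
        self.load_fn = load_fn      # loads a model from its on-disk path into memory
        self.unload_fn = unload_fn  # frees the model's memory (and GPU memory, if any)
        self._models = OrderedDict()  # model id -> loaded model, ordered by recency

    def get(self, model_id, disk_path):
        if model_id in self._models:
            self._models.move_to_end(model_id)  # mark as most recently used
            return self._models[model_id]
        model = self.load_fn(disk_path)
        self._models[model_id] = model
        while len(self._models) > self.max_models:
            _, evicted = self._models.popitem(last=False)  # evict the LRU entry
            self.unload_fn(evicted)
        return model
```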
Implementation
cron (sketched below):
- update the tree (the index of models/versions available under `model_dir`)
- for each model in memory, unload it if it is not in the tree
- for each model in memory that has a `latest` timestamp: if there is a new version && (the timestamp on `latest` is newer than the oldest timestamp currently in the cache, or the cache has space), download it and load it into memory
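A rough sketch of this cron pass, assuming hypothetical `tree` and `cache` objects with the listed operations; none of these names exist in Cortex today.

```python
def cron_refresh(tree, cache, list_s3_models, download_model):
    # 1. update the tree from the model_dir pool (hypothetical S3 listing helper)
    tree.update(list_s3_models())

    # 2. unload any in-memory model that is no longer in the tree
    for model_id in list(cache.loaded_ids()):
        if model_id not in tree:
            cache.unload(model_id)

    # 3. for in-memory models tracking "latest": pull the new version if its
    #    timestamp beats the oldest cached timestamp, or if the cache has space
    for model_id in list(cache.loaded_ids()):
        if not cache.tracks_latest(model_id):
            continue
        newest_ts = tree.latest_timestamp(model_id)
        if newest_ts > cache.timestamp(model_id) and (
            newest_ts > cache.oldest_timestamp() or cache.has_space()
        ):
            disk_path = download_model(model_id, newest_ts)
            cache.load(model_id, disk_path)
```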
request (sketched below):
- if the model is not in the tree:
  - option 1: error
  - option 2: if it is in S3, update the tree; otherwise error
- if not on disk: download the model
- if not in memory: load it into memory
- if the cache is too big, evict based on LRU
- predict()
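The request path, sketched with the same hypothetical objects (this version takes option 2 for tree misses):

```python
def handle_request(model_name, version, payload, tree, disk_cache, mem_cache):
    model_id = (model_name, version or "latest")

    # not in the tree: re-check S3 once (option 2); error out if still missing
    if model_id not in tree:
        tree.refresh_from_s3()
        if model_id not in tree:
            raise ValueError(f"model {model_name} (version {version}) not found")

    # not on disk: download it
    if not disk_cache.has(model_id):
        disk_cache.download(model_id)

    # not in memory: load it; the cache evicts on LRU if it grows too big
    model = mem_cache.get(model_id, disk_cache.path(model_id))

    return model.predict(payload)
```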
python:
- user defines `load_model(self, disk_path)` (example below)
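For example, with a PyTorch model the user-defined hook might look like this; the class shape and the `model.pt` filename are illustrative, and only the `load_model(self, disk_path)` signature comes from this proposal.

```python
import os

import torch  # assuming a PyTorch model; any framework could be used here


class PythonPredictor:
    def __init__(self, config):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def load_model(self, disk_path):
        # called whenever the cache needs to bring a model version into memory;
        # disk_path points at the downloaded directory for that version
        model = torch.load(os.path.join(disk_path, "model.pt"), map_location=self.device)
        model.eval()
        return model
```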
Open Questions
- Where to put Python cache helper
- How to unload models from memory? Anything special for GPU?
- pre-download and/or pre-load during `init()`?
Config questions
- Should the cron interval be configurable?
- Should the default have a cache size limit, or be infinite (i.e. no eviction)?
- `model_dir` would be a configurable field holding an S3 path / local path that points to a big pool of models. This is where all required models are pulled in from. The name of a model (or its unique identifier) is the name of a directory within the given `model_dir` path, within which multiple versions of that model can be found.
- The model disk cache size can be >= the model cache size (which resides in memory). A `disk_model_cache_size` field should exist in the Cortex config. All cached models must fit on disk / in memory. There should also be a `model_cache_size` field that controls the number of models that can fit in memory at any point in time (a config sketch follows this list).
- Should be able to point to the dynamic list (`model_dir`) or to a static list of models (`models`). The static list of models won't have the updating mechanism, and thus no version selection is possible when making predictions.
- How should we handle OOM issues when making predictions? When a prediction is made, memory has to be allocated for tensors, and this could exceed the available system memory (RAM or VRAM).
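To make the field placement concrete, a hypothetical API spec using these fields might look roughly like this; the overall layout is an assumption, and only the `model_dir`, `model_cache_size`, and `disk_model_cache_size` names (plus the static `models` alternative) come from this proposal.

```yaml
- name: city-classifier
  predictor:
    type: python
    path: predictor.py
    model_dir: s3://my-bucket/models/  # dynamic pool; one sub-directory per model
    model_cache_size: 50               # max models held in memory at once
    disk_model_cache_size: 200         # max models kept on disk (>= model_cache_size)
```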
Notes
- LRU memory cache and disk cache
- volumes are not shared across replicas
- `threads_per_process` > 1 is supported for TensorFlow and Python
- `processes_per_replica` > 1 is not supported for Python, maybe supported for TensorFlow (if easy)
- When serving, the requester may decide to use the `latest` version of a given model or a specific version of it (e.g. `v1`). If no version is specified, resort to using `latest`.
- `latest` has its own timestamp, separate from each version's. When evicting from the cache, the `latest` timestamp will be associated with the latest model (even though there may not be a timestamp associated with that model itself). A version-resolution sketch follows this list.
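A small sketch of the version-resolution rule described in the last two notes; the `tree` structure and helper name are hypothetical.

```python
def resolve_version(tree, model_name, requested_version=None):
    """Return (version, timestamp) to serve for a request.

    `tree` is assumed to map model name -> {version: timestamp}. An unspecified
    version resolves to "latest", whose timestamp is that of the newest version.
    """
    versions = tree[model_name]  # e.g. {"1": 1612345678, "2": 1612349999}
    if requested_version in (None, "latest"):
        version = max(versions, key=versions.get)  # newest timestamp wins
    else:
        version = requested_version
    return version, versions[version]
```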