
Serve a collection of custom models based on LRU #619

Closed
@vishalbollu

Description

Add support for serving many different models, where each model handles a subset of the possible inputs (e.g. per-city models). Because each model is designed for only a subset of input queries, certain models may be queried more often than others. Serve the top N most queried models, loading and unloading models based on LRU.

Here are the different use cases that could be handled:

  • Thousands of models or just a few
  • All models fit in memory, or they don't
  • A static list of models, or a dynamic one (e.g. point to an S3 prefix so that new models can be added after the API is deployed)

Implementation

cron:

  1. update tree
  2. for each model in memory, unload it if it is not in the tree
  3. for each model in memory that is tracked as "latest": if there is a new version && (the timestamp on the new latest is newer than the oldest timestamp currently in the cache, or the cache has space): download it and load it into memory (see the sketch after this list)
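
To make the cron flow concrete, here is a minimal sketch of a single pass, assuming the tree has already been refreshed from S3 and that the in-memory cache is a dict keyed by (model name, version); cron_pass, download_fn, load_fn, and max_models are illustrative names, not the actual implementation:

```python
# Illustrative only: one cron pass over a hypothetical in-memory cache.
# `tree` maps model name -> {version: timestamp} (already refreshed from S3);
# `cache` maps (name, version) -> {"model": obj, "ts": timestamp}.

def cron_pass(tree, cache, max_models, download_fn, load_fn):
    # 2. unload models that are no longer in the tree
    for name, version in list(cache):
        if name not in tree or (version != "latest" and version not in tree[name]):
            del cache[(name, version)]

    # 3. for models tracked as "latest": if a newer version exists and the cache
    #    has space (or the new timestamp beats the oldest cached timestamp),
    #    download and load it
    for name, version in list(cache):
        if version != "latest":
            continue
        newest_version = max(tree[name], key=tree[name].get)
        newest_ts = tree[name][newest_version]
        oldest_cached_ts = min(entry["ts"] for entry in cache.values())
        if newest_ts > cache[(name, version)]["ts"] and (
            len(cache) < max_models or newest_ts > oldest_cached_ts
        ):
            disk_path = download_fn(name, newest_version)
            cache[(name, "latest")] = {"model": load_fn(disk_path), "ts": newest_ts}
```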

request:

  • if not in tree:
    option 1: error
    option 2: if in S3: update tree; else error
  • if not on disk: download model
  • if not in memory: load into memory
    • if cache is too big, evict based on LRU
  • predict() (see the sketch below)
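
A minimal sketch of the request path above, using collections.OrderedDict as the LRU structure; the helper names (download, load_model), the module-level caches, and the error-only behavior for option 1 are assumptions made for illustration:

```python
from collections import OrderedDict

MAX_MODELS_IN_MEMORY = 5          # illustrative cache size
memory_cache = OrderedDict()      # (name, version) -> loaded model, in LRU order
disk_cache = {}                   # (name, version) -> local disk path


def get_model(name, version, tree, download, load_model):
    key = (name, version)

    # not in tree -> error (option 1); option 2 would re-check S3 and update the tree
    if name not in tree or version not in tree[name]:
        raise KeyError(f"model {name}:{version} is not in the tree")

    # not on disk -> download it
    if key not in disk_cache:
        disk_cache[key] = download(name, version)

    # not in memory -> load it, evicting the least recently used model if needed
    if key not in memory_cache:
        if len(memory_cache) >= MAX_MODELS_IN_MEMORY:
            memory_cache.popitem(last=False)
        memory_cache[key] = load_model(disk_cache[key])

    memory_cache.move_to_end(key)  # mark as most recently used
    return memory_cache[key]       # caller runs predict() with this model
```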

python:

  • user defines load_model(self, disk_path); a rough sketch of this hook follows below
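
A rough sketch of what that hook could look like in a Python predictor; only the load_model(self, disk_path) signature comes from this issue, while the class shape and the pickle-based loading are placeholders:

```python
import pickle


class PythonPredictor:
    def __init__(self, config):
        self.config = config

    def load_model(self, disk_path):
        # called by the serving layer whenever a model must be (re)loaded from
        # its on-disk location into memory; must return the loaded model object
        with open(disk_path, "rb") as f:
            return pickle.load(f)

    def predict(self, payload):
        # the serving layer would resolve the requested model (and version)
        # through the cache, calling load_model as needed, before inference runs
        ...
```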

Open Questions

  • Where should the Python cache helper live?
  • How should models be unloaded from memory? Is anything special required for GPU?
  • Should models be pre-downloaded and/or pre-loaded during init()?

Config Questions

  • Should cron interval be configurable?
  • Should the default cache size be finite, or infinite (i.e. no eviction)?
  • model_dir would be a configurable field holding an S3 path or local path that points to a large pool of models; this is where all required models are pulled from. Each model's name (or unique identifier) corresponds to a directory within model_dir, which in turn can contain multiple versions of that model.
  • The disk cache can be at least as large as the in-memory model cache. A disk_model_cache_size field should exist in the Cortex config, and all cached models must fit on disk / in memory. There should also be a model_cache_size field controlling the number of models that can be held in memory at any point in time.
  • It should be possible to point to a dynamic pool of models (model_dir) or to a static list of models (models). The static list won't have the model-updating mechanism, so no version selection is possible when making predictions (a sketch of both layouts follows this list).
  • How should we handle OOM issues when making predictions? When a prediction is made, memory has to be allocated for tensors, and this could exceed the available system memory (RAM or VRAM).
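
To anchor the field discussion above, here is an illustrative sketch of the dynamic and static variants, written as Python dicts rather than the real API spec; the field names mirror this discussion, but the final schema is undecided:

```python
# Illustrative only: possible shapes for the dynamic and static configurations.

dynamic_api = {
    "predictor": {
        "type": "python",
        "model_dir": "s3://example-bucket/models/",  # pool of models; each subdirectory is one model
        "model_cache_size": 10,       # max models held in memory at any point in time
        "disk_model_cache_size": 50,  # max models kept on disk; >= model_cache_size
    }
}

static_api = {
    "predictor": {
        "type": "python",
        # static list: no update mechanism, so no version selection at request time
        "models": [
            {"name": "city-a", "path": "s3://example-bucket/models/city-a/"},
            {"name": "city-b", "path": "s3://example-bucket/models/city-b/"},
        ],
    }
}

# the disk cache must be able to hold at least as many models as the memory cache
assert (
    dynamic_api["predictor"]["disk_model_cache_size"]
    >= dynamic_api["predictor"]["model_cache_size"]
)
```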

Notes

  • LRU memory cache and disk cache
  • volumes are not shared across replicas
  • threads_per_process > 1 is supported for TensorFlow and Python
  • processes_per_replica > 1 is not supported for Python, and may be supported for TensorFlow (if it's easy)
  • When serving, the requester may choose the latest version of a given model or a specific version of it (e.g. v1). If no version is specified, resort to using the latest.
  • latest has its own timestamp, separate from each version's. When evicting from the cache, the latest timestamp is associated with the latest model (even though the underlying version may not have a timestamp of its own). See the sketch below.
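
A small sketch of the version-resolution and latest-timestamp idea from the notes above; resolve_version, latest_timestamps, and the tree layout are hypothetical names, not the actual design:

```python
import time

# `tree` maps model name -> {version: upload timestamp}. "latest" keeps its own
# timestamp (refreshed whenever it is resolved), separate from the per-version
# timestamps, so the eviction logic can treat the latest model uniformly.
latest_timestamps = {}  # model name -> timestamp of the "latest" entry


def resolve_version(tree, name, requested_version=None):
    versions = tree[name]
    if requested_version in (None, "latest"):
        # no explicit version requested: resort to the newest available one
        latest_timestamps[name] = time.time()
        return max(versions, key=versions.get)
    if requested_version not in versions:
        raise KeyError(f"{name} has no version {requested_version}")
    return requested_version
```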
