feat: preload textual model #12729
base: main
Conversation
I was thinking of handling this with the […]. I think it'd be simpler and more generic, although the advantage of this PR is that the model can still be unloaded when not in use. What do you think?
Yes, that's exactly why I worked on it. I don't want to have a CLIP model loaded all the time, I want it loaded only when I (probably) need it. From my tests, with the model I use (XLM-Roberta-Large-Vit-B-16Plus), it adds almost 2 GB of data to my RAM, it's nice to have it unloaded when I don't use Immich 😅
Thanks for working on this! This is a great way to lower latency for the first search without committing to 24/7 memory usage. That being said, there are some refinements I'd like to see.
machine-learning/app/schemas.py (outdated)
```python
class LoadModelEntry(InferenceEntry):
    ttl: int

    def __init__(self, name: str, task: ModelTask, type: ModelType, options: dict[str, Any], ttl: int):
        super().__init__(name=name, task=task, type=type, options=options)

        if ttl <= 0:
            raise ValueError("ttl must be a positive integer")
        self.ttl = ttl
```
`InferenceEntry` is just a type for a dict, so this `__init__` shouldn't exist.
Also, I have some reservations about the `ttl` field here. TTL is currently set by an env var and assumed to be the same for all models. At minimum, you would need to change the idle shutdown conditions so they no longer rely on the env var and instead (synchronously) check whether the model cache is empty.
Moreover, I'm not sure if this is something that the caller should be able to decide. The relationship between server and ML isn't necessarily one-to-one, so the settings should be designed with that in mind. If you have multiple servers that share a machine learning service, the effective TTL now depends on the order of requests. TTL is also relevant to whoever is deploying ML and may not be something they want a caller to be able to configure.
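The cache-driven idle shutdown described above can be sketched as follows. This is an illustrative sketch only: `ModelCache`, `evictExpired`, and `isEmpty` are hypothetical names, not Immich's actual API, and the point is simply that per-entry TTLs live in the cache so the shutdown logic can ask the cache directly instead of trusting a single env-configured TTL.

```typescript
// Hypothetical model cache with per-entry TTLs. Idle shutdown polls
// isEmpty() instead of assuming one global env-configured TTL.
class ModelCache<T> {
  private entries = new Map<string, { model: T; expiresAt: number }>();

  set(key: string, model: T, ttlMs: number): void {
    this.entries.set(key, { model, expiresAt: Date.now() + ttlMs });
  }

  // Drop models whose individual TTL has elapsed.
  evictExpired(now: number = Date.now()): void {
    for (const [key, entry] of this.entries) {
      if (entry.expiresAt <= now) {
        this.entries.delete(key);
      }
    }
  }

  // The idle-shutdown condition: safe to shut down only when no model
  // is cached, regardless of what any env var says.
  isEmpty(): boolean {
    return this.entries.size === 0;
  }
}
```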
```typescript
@Type(() => LoadTextualModelOnConnection)
@ValidateNested()
@IsObject()
loadTextualModelOnConnection!: LoadTextualModelOnConnection;
```
Can this be less wordy? Maybe `preloadTextualModel`?
Also, I imagine there are a lot of sessions where nothing gets searched and the model is loaded unnecessarily. If this were instead done when the user clicks the search bar, it'd almost always be a positive and could be enabled by default.
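The search-bar-click strategy suggested above could be wired up roughly like this on the frontend. This is a sketch under assumptions: `createPreloadTrigger` is a hypothetical helper and `/api/search/preload` is an assumed endpoint name, not part of Immich's actual API.

```typescript
// Wrap a preload call so it fires at most once per session, on first
// interaction with the search bar rather than on every connection.
function createPreloadTrigger(preload: () => Promise<void>): () => Promise<void> {
  let requested = false;
  return async () => {
    if (requested) {
      return; // already requested this session
    }
    requested = true;
    try {
      await preload();
    } catch {
      requested = false; // allow a retry if the preload request failed
    }
  };
}
```

Usage would be something like `searchInput.addEventListener('focus', createPreloadTrigger(() => fetch('/api/search/preload').then(() => undefined)))`, so only sessions that actually open search pay the memory cost.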
```diff
@@ -68,6 +75,16 @@ export class EventRepository implements OnGatewayConnection, OnGatewayDisconnect
       queryParams: {},
       metadata: { adminRoute: false, sharedLinkRoute: false, uri: '/api/socket.io' },
     });
+    if ('background' in client.handshake.query && client.handshake.query.background === 'false') {
+      const { machineLearning } = await this.configCore.getConfig({ withCache: true });
+      if (machineLearning.clip.loadTextualModelOnConnection.enabled) {
```
This should check if smart search is enabled.
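Combining the review feedback, the preload condition could be factored into a small pure helper. This is a hedged sketch with assumed config shapes (`machineLearning.clip.enabled` standing in for "smart search is enabled"), not Immich's actual types:

```typescript
// Assumed, simplified config shapes for illustration only.
interface ClipConfig {
  enabled: boolean; // smart search on/off
  loadTextualModelOnConnection: { enabled: boolean };
}
interface MachineLearningConfig {
  enabled: boolean;
  clip: ClipConfig;
}

// Preload only for interactive sessions (background=false in the
// handshake query) and only when smart search is actually enabled.
function shouldPreloadTextualModel(
  query: Record<string, string | string[] | undefined>,
  machineLearning: MachineLearningConfig,
): boolean {
  const interactive = query.background === 'false';
  return (
    interactive &&
    machineLearning.enabled &&
    machineLearning.clip.enabled &&
    machineLearning.clip.loadTextualModelOnConnection.enabled
  );
}
```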
Do we want to offer multiple strategies for when to load the textual model (on client connection, on search bar click...)? I guess model loading time may vary a lot from one instance to another.
The connection approach is better if the model takes more than a few seconds to load and the user searches right away, but it's worse outside of that. The TTL aspect makes it hard to optimize: the model can be unloaded before they search, or it can stay loaded longer than needed because you don't know whether it'll get used. I think the search bar approach is a safer bet for now, but I'm open to having different options if it turns out that it isn't enough.
We can use the […]
Just to add a frame of reference, it takes me 25 seconds to load-and-search (using […]). In this specific scenario, preloading when the search bar is clicked would have a negligible impact on search time.
I recently changed my clip model and noticed that the first search is terribly slow because the textual model is loaded only during the first search.
This PR adds a new setting that allows loading the text model on client connection via the websocket, which improves the time it takes for the first search a bit (the improvement depends on many factors, like CPU power and model size). To differentiate a client connection used for background upload from a "normal" session, it only loads the model if the websocket URL has the query parameter `background=false`.
Todo:
- machine-learning.repository