
feat: add ModelWeightsManager to auto-sync HF weights on endpoint creation #761

Open
lilyz-ai wants to merge 4 commits into main from MLI-5608

Conversation


@lilyz-ai lilyz-ai commented Feb 20, 2026

Summary

  • Adds a new ModelWeightsManager class that automatically checks for cached model weights in S3/GCS/ABS and downloads from HuggingFace Hub if missing, eliminating the need to manually run sync_model_weights.py before creating an endpoint
  • Adds upload_files() to LLMArtifactGateway (abstract) with implementations for S3, GCS, and ABS
  • Wires ModelWeightsManager into CreateLLMModelEndpointV1UseCase — triggers only when checkpoint_path is None and source == HUGGING_FACE
  • Fixes huggingface_hub.utils._errors import compatibility for hub >= 0.36

How it works

POST /v1/llm/model-endpoints
  └─ ModelWeightsManager.ensure_model_weights_available(hf_repo)
       ├─ list_files(s3_path) → non-empty → return early (cache hit, no download/upload)
       ├─ snapshot_download(hf_repo) → local_dir   [runs in thread executor]
       └─ upload_files(local_dir, s3_path)          [runs in thread executor]

Both HF download and S3 upload run in run_in_executor to avoid blocking the async event loop.
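The flow above can be sketched end to end. This is a minimal sketch, not the PR's code: `FakeGateway` and `fake_snapshot_download` stand in for the real `LLMArtifactGateway` and `huggingface_hub.snapshot_download`, and the manager is reduced to the three steps in the tree.

```python
import asyncio
import functools

class FakeGateway:
    """Stand-in for LLMArtifactGateway; the real ones hit S3/GCS/ABS."""
    def __init__(self, cached_files):
        self.cached_files = cached_files
        self.uploaded = []

    def list_files(self, remote_path):
        return self.cached_files

    def upload_files(self, local_dir, remote_path):
        self.uploaded.append((local_dir, remote_path))

def fake_snapshot_download(repo_id, local_dir):
    # Stand-in for huggingface_hub.snapshot_download.
    return local_dir

class ModelWeightsManagerSketch:
    def __init__(self, gateway, prefix):
        self.gateway = gateway
        self.prefix = prefix.rstrip("/")

    def ensure_model_weights_available(self, hf_repo):
        # Return the expected path immediately; the sync runs in the background.
        remote_path = f"{self.prefix}/{hf_repo}"
        self._task = asyncio.get_running_loop().create_task(
            self._sync_weights(hf_repo, remote_path)
        )
        return remote_path

    async def _sync_weights(self, hf_repo, remote_path):
        loop = asyncio.get_running_loop()
        # Cache hit: a non-empty listing means no download/upload is needed.
        files = await loop.run_in_executor(
            None, functools.partial(self.gateway.list_files, remote_path)
        )
        if files:
            return
        # Cache miss: blocking download + upload run in the thread executor.
        local_dir = await loop.run_in_executor(
            None, functools.partial(fake_snapshot_download, hf_repo, "/tmp/weights")
        )
        await loop.run_in_executor(
            None, functools.partial(self.gateway.upload_files, local_dir, remote_path)
        )

async def main():
    gw = FakeGateway(cached_files=[])  # empty listing → cache miss
    mgr = ModelWeightsManagerSketch(gw, "s3://bucket/hf-weights/")
    path = mgr.ensure_model_weights_available("org/model")
    await mgr._task  # awaited here for the demo; fire-and-forget in production
    return path, gw.uploaded

path, uploaded = asyncio.run(main())
print(path)      # s3://bucket/hf-weights/org/model
print(uploaded)  # [('/tmp/weights', 's3://bucket/hf-weights/org/model')]
```

The sketch uses `asyncio.get_running_loop()` rather than the deprecated `asyncio.get_event_loop()` that the file overview below flags in the actual PR.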

Test plan

  • test_cache_hit_skips_download — no download/upload on cache hit
  • test_cache_hit_returns_correct_s3_path — correct {prefix}/{hf_repo} path returned
  • test_cache_miss_calls_snapshot_download_and_upload — HF download + upload called with correct args
  • test_s3_path_construction — trailing slash stripped correctly from prefix
  • test_create_llm_model_endpoint_calls_weights_manager_on_hf_source — weights manager invoked from use case, resolved path forwarded to bundle creation
  • All 7 tests pass locally (pytest tests/unit/domain/test_model_weights_manager.py)
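The path construction that test_s3_path_construction and test_cache_hit_returns_correct_s3_path exercise can be sketched as a one-liner (a hypothetical standalone version of the manager's path helper; the repo name is illustrative):

```python
def get_remote_path(prefix: str, hf_repo: str) -> str:
    # Strip any trailing slash from the configured prefix so the result
    # is always exactly "{prefix}/{hf_repo}".
    return f"{prefix.rstrip('/')}/{hf_repo}"

print(get_remote_path("s3://bucket/hf-weights/", "org/model"))
print(get_remote_path("s3://bucket/hf-weights", "org/model"))
# both → s3://bucket/hf-weights/org/model
```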

🤖 Generated with Claude Code

Greptile Summary

Adds a ModelWeightsManager that automatically downloads model weights from HuggingFace Hub and uploads them to cloud storage (S3/GCS/ABS) in the background when creating an endpoint with checkpoint_path=None and source=HUGGING_FACE. This eliminates the need to manually run sync_model_weights.py before endpoint creation. The PR also adds upload_files() to the LLMArtifactGateway abstract class with implementations for all three cloud backends, wires the manager into the endpoint creation use case, adds a K8s init container to poll for weight availability, and fixes huggingface_hub import compatibility for hub >= 0.36.

  • The hf_weights_syncing flag is never cleared in the database after a successful sync, causing recover_hf_syncs to re-trigger downloads on every server restart for all endpoints that ever used this feature
  • The K8s init container poll script (_HF_WEIGHTS_POLL_SCRIPT) hardcodes boto3/S3 API calls, so it will fail at runtime for GCS and Azure deployments despite the upload logic supporting all three backends
  • The init container poll loop has no timeout — if the background sync fails, the pod hangs indefinitely
  • The upload_files() implementations and test coverage are solid across all three cloud providers
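The file overview below notes the S3 implementation walks the local directory with os.walk. The local-file → object-key mapping such an upload_files() needs can be sketched as a pure generator (the actual per-backend upload calls — bucket.upload_file, blob uploads — are omitted; names here are illustrative):

```python
import os
import tempfile

def iter_upload_pairs(local_dir: str, remote_prefix: str):
    """Yield (local_path, remote_key) pairs for every file under local_dir."""
    remote_prefix = remote_prefix.rstrip("/")
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            rel = os.path.relpath(local_path, local_dir)
            # Object keys use forward slashes regardless of OS.
            yield local_path, f"{remote_prefix}/{rel.replace(os.sep, '/')}"

# Demonstrate with a throwaway directory tree.
with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "sub"))
    open(os.path.join(d, "config.json"), "w").close()
    open(os.path.join(d, "sub", "model.safetensors"), "w").close()
    pairs = sorted(remote for _local, remote in iter_upload_pairs(d, "s3://bucket/weights/"))
print(pairs)
# ['s3://bucket/weights/config.json', 's3://bucket/weights/sub/model.safetensors']
```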

Confidence Score: 2/5

  • This PR introduces meaningful risks: the init container is S3-only despite multi-cloud support, and the hf_weights_syncing flag is never reset, causing perpetual re-triggering on restarts.
  • Score of 2 reflects two runtime-impacting issues: (1) the init container poll script only works on S3, silently breaking GCS/Azure deployments with an infinite hang, and (2) the hf_weights_syncing flag persists forever in the DB, causing recover_hf_syncs to re-trigger syncs on every server restart. The core ModelWeightsManager logic and upload implementations are well-structured, but these integration issues need addressing before merging.
  • Pay close attention to k8s_endpoint_resource_delegate.py (S3-only init container with no timeout) and app.py (perpetual re-trigger in recover_hf_syncs due to hf_weights_syncing never being cleared).

Important Files Changed

Filename Overview
model-engine/model_engine_server/domain/use_cases/model_weights_manager.py New ModelWeightsManager class that downloads HF weights in background and uploads to cloud storage. Core logic is sound but uses deprecated asyncio API.
model-engine/model_engine_server/api/app.py Adds startup hooks for model weights manager initialization and HF sync recovery. Recovery logic queries DB and re-triggers syncs.
model-engine/model_engine_server/api/llms_v1.py Wires model_weights_manager from app state into CreateLLMModelEndpointV1UseCase. Clean integration with existing route handler.
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py Integrates weights manager into endpoint creation flow. Skips validate_checkpoint_files when list is empty, which changes error behavior.
model-engine/model_engine_server/infra/gateways/resources/k8s_endpoint_resource_delegate.py Adds init container with S3-only poll script and no timeout. Only works on AWS despite multi-cloud upload support.
model-engine/model_engine_server/domain/entities/llm_entity.py Adds hf_weights_syncing boolean field to LLMMetadata dataclass with default False.
model-engine/model_engine_server/domain/gateways/llm_artifact_gateway.py Adds abstract upload_files method to LLMArtifactGateway. Clean interface addition.
model-engine/model_engine_server/infra/gateways/s3_llm_artifact_gateway.py S3 upload_files implementation using os.walk and bucket.upload_file. Consistent with existing patterns.
model-engine/model_engine_server/infra/gateways/gcs_llm_artifact_gateway.py GCS upload_files implementation. Follows same pattern as S3 with GCS-specific blob upload.
model-engine/model_engine_server/infra/gateways/abs_llm_artifact_gateway.py Azure Blob Storage upload_files implementation. Follows same pattern as S3/GCS implementations.
model-engine/model_engine_server/infra/repositories/live_tokenizer_repository.py Fixes huggingface_hub import compatibility for hub >= 0.36 with try/except fallback.
model-engine/tests/unit/conftest.py Adds upload_files no-op to FakeLLMArtifactGateway to satisfy abstract method requirement.
model-engine/tests/unit/domain/test_model_weights_manager.py Comprehensive unit tests covering cache hit, cache miss, deduplication, task lifecycle, and integration with use case.

Sequence Diagram

sequenceDiagram
    participant Client
    participant API as POST /v1/llm/model-endpoints
    participant UseCase as CreateLLMModelEndpointV1UseCase
    participant MWM as ModelWeightsManager
    participant BG as Background Task
    participant Storage as S3/GCS/ABS
    participant HF as HuggingFace Hub
    participant K8s as K8s Init Container

    Client->>API: Create endpoint (no checkpoint_path)
    API->>UseCase: execute(request)
    UseCase->>MWM: ensure_model_weights_available(hf_repo)
    MWM-->>UseCase: remote_path (immediate return)
    MWM->>BG: asyncio.create_task(_sync_weights)
    UseCase->>UseCase: Create bundle + endpoint (hf_weights_syncing=true)
    UseCase-->>Client: 200 OK (endpoint created)

    Note over BG,Storage: Background (non-blocking)
    BG->>Storage: list_files(remote_path)
    alt Cache hit
        Storage-->>BG: files[] (non-empty)
        BG->>BG: return early
    else Cache miss
        Storage-->>BG: [] (empty)
        BG->>HF: snapshot_download(hf_repo)
        HF-->>BG: local files
        BG->>Storage: upload_files(local, remote)
    end

    Note over K8s,Storage: Pod startup
    K8s->>Storage: Poll list_objects (every 30s)
    Storage-->>K8s: files found
    K8s->>K8s: exit(0) — main container starts

Last reviewed commit: 3a18aee

…ation

When a model endpoint is created via POST /v1/llm/model-endpoints with
source=HUGGING_FACE and no checkpoint_path, ModelWeightsManager now
automatically checks the configured S3/GCS/ABS prefix for cached weights
and downloads from HuggingFace Hub + uploads if missing — eliminating
the manual sync_model_weights.py step.

- Add ModelWeightsManager with ensure_model_weights_available()
  (async-safe via run_in_executor, cache-hit skips all I/O)
- Add upload_files() abstract method to LLMArtifactGateway with
  implementations for S3, GCS, and ABS
- Wire ModelWeightsManager into CreateLLMModelEndpointV1UseCase and
  the create_model_endpoint API handler
- Fix huggingface_hub.utils._errors import for hub>=0.36 compatibility
- Add unit tests covering cache hit/miss, path construction, and
  end-to-end integration with CreateLLMModelEndpointV1UseCase

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@greptile-apps greptile-apps bot left a comment


10 files reviewed, 2 comments


        The remote path (s3://, gs://, or https://) where the weights are stored.
        """
        remote_path = self._get_remote_path(hf_repo)
        files = self.llm_artifact_gateway.list_files(remote_path)


list_files() blocks the event loop

snapshot_download and upload_files are correctly offloaded via run_in_executor, but list_files() on line 46 is a synchronous I/O call (S3 ListObjects / GCS list_blobs / ABS list_blob_names) that runs directly on the async event loop. For consistency with the other two calls, this should also be wrapped in run_in_executor:

Suggested change
- files = self.llm_artifact_gateway.list_files(remote_path)
+ files = await loop.run_in_executor(
+     None,
+     functools.partial(self.llm_artifact_gateway.list_files, remote_path),
+ )

Note: loop would need to be obtained before this line — move the loop = asyncio.get_event_loop() line above this call.


Comment on lines +51 to +70

        logger.info(f"Cache miss for {hf_repo}. Downloading from HuggingFace Hub...")
        loop = asyncio.get_event_loop()
        with tempfile.TemporaryDirectory() as tmp_dir:
            await loop.run_in_executor(
                None,
                functools.partial(
                    snapshot_download,
                    repo_id=hf_repo,
                    local_dir=tmp_dir,
                    ignore_patterns=HF_IGNORE_PATTERNS,
                ),
            )
            await loop.run_in_executor(
                None,
                functools.partial(
                    self.llm_artifact_gateway.upload_files,
                    tmp_dir,
                    remote_path,
                ),
            )


Unhandled errors from snapshot_download will crash endpoint creation

If snapshot_download fails (e.g., gated model requiring HF auth token, rate limiting, network timeout), the exception propagates uncaught and returns a 500 to the caller. Many models in SUPPORTED_MODELS_INFO (like meta-llama/*) are gated and require authentication. Consider wrapping this in a try/except that logs the error and either raises a user-friendly error or falls back to checkpoint_path = None (allowing downstream logic to handle it):

try:
    await loop.run_in_executor(...)
    await loop.run_in_executor(...)
except Exception as e:
    logger.error(f"Failed to download/upload weights for {hf_repo}: {e}")
    raise ObjectHasInvalidValueException(
        f"Could not download model weights for {hf_repo}. "
        "Ensure the model is accessible and try again, or provide a checkpoint_path explicitly."
    )

ensure_model_weights_available is now synchronous — it returns the
expected checkpoint path immediately and fires a background asyncio task
to sync weights from HuggingFace Hub. An init container is injected
into the K8s deployment to poll storage until the weights are present
before the main container starts. LLMMetadata gains an hf_weights_syncing
flag to signal this flow downstream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@greptile-apps greptile-apps bot left a comment


12 files reviewed, 3 comments


Comment on lines +70 to +86

_HF_WEIGHTS_POLL_SCRIPT = """\
import boto3, os, sys, time
from urllib.parse import urlparse

cp = os.environ["CHECKPOINT_PATH"]
url = urlparse(cp)
bucket = url.netloc
prefix = url.path.lstrip("/")
s3 = boto3.client("s3")
while True:
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    if resp.get("Contents"):
        print(f"Model weights ready at {cp}", flush=True)
        sys.exit(0)
    print(f"Waiting for model weights at {cp}...", flush=True)
    time.sleep(30)
"""


Init container poll script only supports S3

_HF_WEIGHTS_POLL_SCRIPT hardcodes boto3 and s3.list_objects_v2, but the system supports three storage backends: S3, GCS (gs://), and Azure Blob Storage (https://*.blob.core.windows.net). If hf_user_fine_tuned_weights_prefix points to GCS or ABS, this script will fail at runtime — urlparse("gs://bucket/key") gives netloc="bucket" but the S3 API call will raise an error.

The upload_files implementations in GCSLLMArtifactGateway and ABSLLMArtifactGateway correctly handle their respective backends, so the background sync will succeed, but the init container waiting for the weights will never detect them.

Consider either:

  • Dispatching based on the URL scheme (s3:// → boto3, gs:// → google.cloud.storage, https://*.blob.core.windows.net → Azure SDK), or
  • Using a simpler polling mechanism (e.g., an HTTP HEAD request if the gateway can provide a presigned/public URL).
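The first option can be sketched with a small scheme dispatcher (a hypothetical helper, not code from the PR; the per-backend checkers it would dispatch to are omitted):

```python
from urllib.parse import urlparse

def detect_backend(checkpoint_path: str) -> str:
    """Map a checkpoint URL to the storage backend the poll should use."""
    url = urlparse(checkpoint_path)
    if url.scheme == "s3":
        return "s3"
    if url.scheme == "gs":
        return "gcs"
    if url.scheme == "https" and url.netloc.endswith(".blob.core.windows.net"):
        return "abs"
    raise ValueError(f"Unsupported checkpoint path: {checkpoint_path}")

print(detect_backend("s3://bucket/weights"))                      # s3
print(detect_backend("gs://bucket/weights"))                      # gcs
print(detect_backend("https://acct.blob.core.windows.net/c/w"))   # abs
```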

Comment on lines +79 to +85

while True:
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    if resp.get("Contents"):
        print(f"Model weights ready at {cp}", flush=True)
        sys.exit(0)
    print(f"Waiting for model weights at {cp}...", flush=True)
    time.sleep(30)


No timeout — pod hangs forever if weight sync fails

The while True loop has no timeout or maximum iteration count. If the background _sync_weights task fails (gated HF model, network error, permission issue, disk full), the init container will poll indefinitely, blocking pod startup forever. Since the background task is fire-and-forget with no error propagation back to the caller, there's no signal to the init container that it should stop.

Consider adding a maximum wait time (e.g., 1-2 hours) after which the container exits with a non-zero status and a descriptive error message:

MAX_WAIT_SECONDS = 7200  # 2 hours
elapsed = 0
while elapsed < MAX_WAIT_SECONDS:
    ...
    time.sleep(30)
    elapsed += 30
print(f"Timed out waiting for model weights at {cp}", flush=True)
sys.exit(1)
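With the storage check and clock injected, the bounded loop the reviewer suggests can be written and exercised without S3 (a sketch; in the real init container `check_ready` would be the `list_objects_v2(..., MaxKeys=1)` call and `sleep` would be `time.sleep`):

```python
def wait_for_weights(check_ready, sleep, max_wait_seconds=7200, interval=30):
    """Poll check_ready() until it succeeds; give up after max_wait_seconds."""
    elapsed = 0
    while elapsed < max_wait_seconds:
        if check_ready():
            return True
        sleep(interval)
        elapsed += interval
    return False

# Simulated success: weights appear on the third poll.
attempts = iter([False, False, True])
print(wait_for_weights(lambda: next(attempts), lambda s: None))  # True
# Simulated failure: sync never completes, loop exits after the budget.
print(wait_for_weights(lambda: False, lambda s: None, max_wait_seconds=60))  # False
```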

Comment on lines 654 to +656

  checkpoint_files = self.llm_artifact_gateway.list_files(checkpoint_path)
- validate_checkpoint_files(checkpoint_files)
+ if checkpoint_files:
+     validate_checkpoint_files(checkpoint_files)


Skipping validation silently changes error behavior for non-sync callers

Previously, validate_checkpoint_files was always called and would raise ObjectHasInvalidValueException("No safetensors found in the checkpoint path.") if the checkpoint contained no safetensors — catching user misconfiguration early. Now, when list_files returns an empty list (e.g., a wrong/empty S3 prefix that isn't related to weight syncing), validation is silently skipped and the endpoint creation proceeds, only failing later at inference time.

Consider guarding this skip more tightly — e.g., only skip when a hf_weights_syncing flag is passed through, rather than universally skipping for any empty file list:

if checkpoint_files:
    validate_checkpoint_files(checkpoint_files)
elif not hf_weights_syncing:
    raise ObjectHasInvalidValueException(
        f"No files found at checkpoint path: {checkpoint_path}"
    )

- Hold a strong set reference to each asyncio.Task to prevent GC cancellation
- Deduplicate concurrent sync requests for the same hf_repo via _in_progress dict
- Surface task exceptions via logger.error in _on_task_done callback
- Store ModelWeightsManager as app.state singleton so state persists across requests
- Add recover_hf_syncs startup handler to re-trigger syncs after server restart

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@greptile-apps greptile-apps bot left a comment


13 files reviewed, 2 comments


Comment on lines 350 to 351
"FROM endpoints "
"WHERE (endpoint_metadata->'_llm'->>'hf_weights_syncing')::boolean = true"


Raw SQL query missing schema prefix

The endpoints table lives in the hosted_model_inference schema (see Endpoint.__table_args__ in db/models/hosted_model_inference.py), but this query references FROM endpoints without the schema. Unless the database role's search_path includes hosted_model_inference (which is not configured in db/base.py), this will fail at runtime with relation "endpoints" does not exist.

Suggested change
- "FROM endpoints "
- "WHERE (endpoint_metadata->'_llm'->>'hf_weights_syncing')::boolean = true"
+ "SELECT DISTINCT endpoint_metadata->'_llm'->>'model_name' AS model_name "
+ "FROM hosted_model_inference.endpoints "
+ "WHERE (endpoint_metadata->'_llm'->>'hf_weights_syncing')::boolean = true"

quantize: Optional[Quantization] = None
checkpoint_path: Optional[str] = None
chat_template_override: Optional[str] = None
hf_weights_syncing: bool = False


hf_weights_syncing is never cleared after sync completes

Once set to True during endpoint creation, no code path ever sets this back to False in the endpoint metadata after the background sync finishes. ModelWeightsManager._on_task_done only cleans up in-memory tracking — it doesn't update the database.

This causes two problems:

  1. recover_hf_syncs re-triggers downloads on every server restart for all endpoints that ever had this flag set, even if their weights were successfully synced long ago.
  2. The init container (add_hf_weights_init_container) will be added on every subsequent deployment/update of these endpoints, adding unnecessary startup latency even when weights are already present.

Consider adding a callback in _on_task_done (on success) that updates the endpoint metadata to set hf_weights_syncing: false, or have the recover_hf_syncs startup handler check whether the weights are actually present before re-triggering.
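The first option can be sketched in memory. This is a hedged illustration, not the PR's code: `metadata_store` is a stand-in dict where the real fix would issue a DB update, and the model name is hypothetical.

```python
import asyncio

def make_on_task_done(metadata_store, model_name, log_error):
    """Build a done-callback that clears the syncing flag on success."""
    def on_task_done(task):
        exc = task.exception()
        if exc is not None:
            log_error(f"Weight sync failed for {model_name}: {exc}")
            return
        # Success: persist hf_weights_syncing=False so recover_hf_syncs
        # does not re-trigger this model on the next restart.
        metadata_store[model_name]["hf_weights_syncing"] = False
    return on_task_done

async def main():
    store = {"llama-3-8b": {"hf_weights_syncing": True}}
    errors = []
    task = asyncio.get_running_loop().create_task(asyncio.sleep(0))
    task.add_done_callback(make_on_task_done(store, "llama-3-8b", errors.append))
    await task
    await asyncio.sleep(0)  # give the done-callback a chance to run
    return store, errors

store, errors = asyncio.run(main())
print(store)   # {'llama-3-8b': {'hf_weights_syncing': False}}
print(errors)  # []
```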


@lilyz-ai lilyz-ai requested a review from a team February 20, 2026 06:42
The endpoints table lives in hosted_model_inference schema; bare
'FROM endpoints' would fail at runtime.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@greptile-apps greptile-apps bot left a comment


13 files reviewed, 1 comment


Comment on lines +1393 to +1406

        # Resolve checkpoint path: fires background sync and returns expected path immediately
        checkpoint_path = request.checkpoint_path
        hf_weights_syncing = False
        if (
            checkpoint_path is None
            and request.source == LLMSource.HUGGING_FACE
            and self.model_weights_manager is not None
        ):
            models_info = SUPPORTED_MODELS_INFO.get(request.model_name)
            if models_info and models_info.hf_repo:
                checkpoint_path = self.model_weights_manager.ensure_model_weights_available(
                    models_info.hf_repo
                )
                hf_weights_syncing = True


hf_weights_syncing set unconditionally on cache hit

ensure_model_weights_available returns the path immediately without checking whether weights are already cached (the cache check happens asynchronously in the background task). This means hf_weights_syncing is always set to True — even when the weights are already present in storage.

Consequence: on every HF endpoint creation (even with cached weights), the init container in add_hf_weights_init_container is added to the pod spec. The init container will exit quickly on cache hit (since it polls S3 and finds the files), but it still adds startup latency from the init container spin-up and S3 ListObjects call.

Consider making ensure_model_weights_available synchronously check the cache and return a flag indicating whether syncing is actually needed, so hf_weights_syncing is only True when the background task is genuinely downloading:

checkpoint_path, needs_sync = self.model_weights_manager.ensure_model_weights_available(
    models_info.hf_repo
)
hf_weights_syncing = needs_sync


@greptile-apps greptile-apps bot left a comment


13 files reviewed, 2 comments


Comment on lines +336 to +362

async def recover_hf_syncs():
    """Re-trigger weight syncs for endpoints that were syncing when server last stopped."""
    from model_engine_server.db.base import get_session_async
    from model_engine_server.infra.repositories.live_tokenizer_repository import (
        SUPPORTED_MODELS_INFO,
    )
    from sqlalchemy import text

    session_factory = get_session_async()
    try:
        async with session_factory() as session:
            result = await session.execute(
                text(
                    "SELECT DISTINCT endpoint_metadata->'_llm'->>'model_name' AS model_name "
                    "FROM hosted_model_inference.endpoints "
                    "WHERE (endpoint_metadata->'_llm'->>'hf_weights_syncing')::boolean = true"
                )
            )
            model_names = [row.model_name for row in result if row.model_name]
    except Exception:
        logger.warning("Could not query pending HF sync endpoints at startup", exc_info=True)
        return
    for model_name in model_names:
        info = SUPPORTED_MODELS_INFO.get(model_name)
        if info and info.hf_repo:
            app.state.model_weights_manager.ensure_model_weights_available(info.hf_repo)
            logger.info(f"Startup: re-triggered HF weight sync for {model_name}")


recover_hf_syncs re-triggers syncs on every restart, indefinitely

Since hf_weights_syncing is never cleared to false after a successful sync (the _on_task_done callback only cleans up in-memory state, not the DB), this startup handler will re-trigger downloads for every endpoint that has ever had hf_weights_syncing=true — even if their weights were successfully synced long ago.

This means on every server restart:

  1. Unnecessary background snapshot_download + upload_files tasks are spawned for already-synced models (the cache-hit path returns quickly, but still issues a list_files I/O call per model).
  2. The init container (add_hf_weights_init_container) will continue to be added to every subsequent deployment of these endpoints, adding startup latency even when weights are present.

Consider either:

  • Adding a step in _on_task_done (on success) that updates the DB to set hf_weights_syncing: false, or
  • Having this handler check whether weights are actually present (via list_files) before re-triggering the sync.
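The second suggestion can be sketched as a guard around the re-trigger loop. This is a minimal sketch, not the PR's implementation: the collaborators (`list_files`, the `SUPPORTED_MODELS_INFO`-style dict, a weights-path builder, the sync trigger) are passed in as plain callables — their names here are assumptions based on the PR description — so the skip logic is testable without a DB or cloud SDK:

```python
from typing import Callable, Dict, List, NamedTuple


class ModelInfo(NamedTuple):
    # Stand-in for entries in SUPPORTED_MODELS_INFO (assumed shape).
    hf_repo: str


def recover_pending_syncs(
    model_names: List[str],
    supported_models_info: Dict[str, ModelInfo],
    list_files: Callable[[str], List[str]],
    make_weights_path: Callable[[str], str],
    trigger_sync: Callable[[str], None],
) -> List[str]:
    """Re-trigger syncs only for models whose weights are actually missing."""
    triggered = []
    for model_name in model_names:
        info = supported_models_info.get(model_name)
        if not (info and info.hf_repo):
            continue
        # Mirrors the cache-hit early return in ensure_model_weights_available:
        # a non-empty listing means the weights already landed, so skip the sync.
        if list_files(make_weights_path(info.hf_repo)):
            continue
        trigger_sync(info.hf_repo)
        triggered.append(model_name)
    return triggered
```

With this guard, an endpoint whose `hf_weights_syncing` flag was never cleared costs one `list_files` call per restart instead of a full download/upload cycle.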
Path: model-engine/model_engine_server/api/app.py, lines 336–362

Comment on lines +70 to +86
```python
_HF_WEIGHTS_POLL_SCRIPT = """\
import boto3, os, sys, time
from urllib.parse import urlparse

cp = os.environ["CHECKPOINT_PATH"]
url = urlparse(cp)
bucket = url.netloc
prefix = url.path.lstrip("/")
s3 = boto3.client("s3")
while True:
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    if resp.get("Contents"):
        print(f"Model weights ready at {cp}", flush=True)
        sys.exit(0)
    print(f"Waiting for model weights at {cp}...", flush=True)
    time.sleep(30)
"""
```

Init container poll script hardcodes S3 — fails on GCS/Azure

_HF_WEIGHTS_POLL_SCRIPT imports boto3 and calls s3.list_objects_v2, but the system supports three storage backends (S3, GCS, Azure Blob Storage). The upload_files implementations for GCS and ABS correctly handle their respective backends, so the background sync will succeed — but the init container will never detect the uploaded weights on non-S3 backends and will poll indefinitely.

Additionally, the while True loop has no timeout or maximum iteration count. If the background sync task fails for any reason (gated HF model, network error, permissions), the init container blocks pod startup forever.

Consider:

  1. Dispatching on URL scheme (s3:// → boto3, gs:// → google.cloud.storage, https://*.blob.core.windows.net → azure SDK)
  2. Adding a maximum wait time (e.g., 1–2 hours) after which the init container exits non-zero with an error message
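Both fixes can be sketched together. This is a hedged sketch, not the PR's code: `storage_backend` only decides which SDK a branch would use (the GCS/Azure listing calls are noted in comments, not implemented), and `wait_for_weights` takes the per-backend presence check plus clock/sleep functions as parameters so the deadline logic is testable without real polling:

```python
import time
from urllib.parse import urlparse

MAX_WAIT_SECONDS = 2 * 60 * 60  # exit non-zero after 2 hours instead of looping forever
POLL_INTERVAL_SECONDS = 30


def storage_backend(checkpoint_path: str) -> str:
    """Pick the storage backend implied by the checkpoint URL scheme."""
    url = urlparse(checkpoint_path)
    if url.scheme == "s3":
        return "s3"   # boto3 list_objects_v2, as in the current script
    if url.scheme == "gs":
        return "gcs"  # would use google.cloud.storage bucket listing
    if url.scheme == "https" and url.netloc.endswith(".blob.core.windows.net"):
        return "abs"  # would use azure.storage.blob container listing
    raise ValueError(f"unsupported checkpoint URL: {checkpoint_path}")


def wait_for_weights(
    checkpoint_path: str,
    weights_present,          # per-backend listing call, injected for testability
    clock=time.monotonic,
    sleep=time.sleep,
) -> bool:
    """Poll until weights_present(path) is truthy; give up after MAX_WAIT_SECONDS."""
    deadline = clock() + MAX_WAIT_SECONDS
    while clock() < deadline:
        if weights_present(checkpoint_path):
            return True
        sleep(POLL_INTERVAL_SECONDS)
    return False  # caller exits non-zero so the pod fails visibly
```

The init container would call `wait_for_weights` with the checker chosen by `storage_backend` and `sys.exit(0 if ok else 1)`, so a stuck background sync surfaces as a pod failure rather than an indefinite hang.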
Path: model-engine/model_engine_server/infra/gateways/resources/k8s_endpoint_resource_delegate.py, lines 70–86
