
[azure-ai-ml] Bug when launching Job to AML and updating Schedule. #40288

Open
@Uranium2

  • azure-ai-ml==1.26.0 (tried other versions)
  • Python 3.10

Describe the bug
When creating a simple @pipeline in azure-ai-ml with one command step, I get an error when updating the schedule if I have launched the job beforehand. The issue did not occur a few weeks ago. If I update the schedule first and then redeploy the job, I get no errors.

To Reproduce
Steps to reproduce the behavior:

  1. I created a tool to make steps easier to create in SDK V2 (because we had it in V1) https://stackoverflow.com/a/77354028
  2. Create a step with the Step class
  3. Use create_pipeline function
  4. Deploy the pipeline
  5. Create a Schedule
  6. Update the schedule (here is the bug)

from datetime import datetime

from azure.ai.ml.constants import TimeZone
from azure.ai.ml.entities import CronTrigger, JobSchedule

# Step and create_pipeline come from the helper linked in step 1.
step_1 = Step(
    display_name="step_1",
    description="step_1",
    environment=...,
    command="python main.py",
    code=...,
    is_deterministic=False,
)
pipeline_job = create_pipeline(steps_graph, default_compute="my_compute", name="my_pipeline", experiment_name="my_experiment")

# Submit the job; pipeline_job now refers to the job object returned by the service
pipeline_job = ml_client.jobs.create_or_update(pipeline_job)

# Create the schedule
schedule_start_time = datetime.now()
cron_trigger = CronTrigger(
    expression="0 6 * * 1",
    start_time=schedule_start_time,
    time_zone=TimeZone.CENTRAL_EUROPEAN_STANDARD_TIME,
)
job_schedule = JobSchedule(
    name="my_schedule",
    trigger=cron_trigger,
    create_job=pipeline_job,
)
# Update the schedule (this call raises the error shown below)
job_schedule = ml_client.schedules.begin_create_or_update(
    schedule=job_schedule
).result()

Expected behavior
The schedule is updated without error.

Errors
Here is the error on Python side:

Source path of Step 'ds_view_check': /mnt/c/Users/XXXXX/Documents/GitHub/data_management
Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Warning: the provided asset name 'ds_view_check' will not be used for anonymous registration

Uploading data_management (15.81 MBs): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15807217/15807217 [00:01<00:00, 10587620.66it/s]

Pipeline created successfully.
name: lucid_card_xxxxxxxx
display_name: ds_view_check
description: Default pipeline function to be executed.
tags:
  TOOLS: AML
  TARGET: OTHER
  PROCESS: OTHER
  FRAMEWORK_VERSION: 2.6.0
  PYTHON: '3.10'
type: pipeline
jobs:
  ds_view_check:
    type: command
    inputs:
      DEPLOY_ENV: dev
      SUBSCRIPTION_ENV: dev
    component: azureml:azureml_anonymous:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    compute: azureml:cpu-16-128
    identity:
      type: managed_identity
      client_id: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      object_id: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
creation_context:
  created_at: '2025-03-31T07:33:44.055977+00:00'
  created_by: UserX
  created_by_type: User
experiment_name: ds_view_check
id: azureml:/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/rg-xxxx/providers/Microsoft.MachineLearningServices/workspaces/mlw-xxxx/jobs/lucid_card_xxxxxxxx  
properties:
  mlflow.source.git.repoURL: [REDACTED]
  mlflow.source.git.branch: main
  mlflow.source.git.commit: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  azureml.git.dirty: 'True'
services:
  Tracking:
    endpoint: [REDACTED]
    type: Tracking
  Studio:
    endpoint: [REDACTED]
    type: Studio
status: NotStarted

Readonly attribute status will be ignored in class <class 'azure.ai.ml._restclient.v2023_04_01_preview.models._models_py3.JobService'>

Traceback (most recent call last):
  File "/home/XXXXX/anaconda3/envs/data_management/lib/python3.10/site-packages/azure/core/polling/base_polling.py", line 788, in run
    self._poll()
  File "/home/XXXXX/anaconda3/envs/data_management/lib/python3.10/site-packages/azure/core/polling/base_polling.py", line 820, in _poll
    raise OperationFailed("Operation failed or canceled")
azure.core.polling.base_polling.OperationFailed: Operation failed or canceled

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/c/Users/XXXXX/Documents/GitHub/data_management/aml_scheduling.py", line 98, in <module>
    schedule = azureml_helper.create_or_update_schedule(
  File "/home/XXXXX/anaconda3/envs/data_management/lib/python3.10/site-packages/datalab_framework/aml_scheduling/azureml_helper.py", line 770, in create_or_update_schedule
    ).result()
  File "/home/XXXXX/anaconda3/envs/data_management/lib/python3.10/site-packages/azure/core/polling/_poller.py", line 254, in result
    self.wait(timeout)
  File "/home/XXXXX/anaconda3/envs/data_management/lib/python3.10/site-packages/azure/core/tracing/decorator.py", line 116, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/home/XXXXX/anaconda3/envs/data_management/lib/python3.10/site-packages/azure/core/polling/_poller.py", line 273, in wait
    raise self._exception  # type: ignore
  File "/home/XXXXX/anaconda3/envs/data_management/lib/python3.10/site-packages/azure/core/polling/_poller.py", line 188, in _start
    self._polling_method.run()
  File "/home/XXXXX/anaconda3/envs/data_management/lib/python3.10/site-packages/azure/core/polling/base_polling.py", line 803, in run
    raise HttpResponseError(response=self._pipeline_response.http_response, error=err) from err
azure.core.exceptions.HttpResponseError: (UserError) Invalid trigger definition, details: Microsoft.MachineLearning.Common.Core.ServiceInvocationException: Service invocation failed!
Request: POST smt.designer-westeurope.svc/studioservice/api/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/rg-xxxx/workspaces/mlw-xxxx/GenericTriggerJob/ParseJob
Status Code: 400 BadRequest
Error Code: UserError/BadArgument/ArgumentInvalid/InvalidPipelineJob/InvalidJobsOverride
Reason Phrase: Invalid jobs override for pipeline job since source job lucid_card_xxxxxxxx is specified.
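
For anyone who wants to detect this specific failure in their own scheduling code, here is a minimal sketch that pulls the `Status Code`, `Error Code`, and `Reason Phrase` lines out of the error text above. The helper names `parse_service_error` and `is_invalid_jobs_override` are hypothetical, and the sketch assumes the exception message keeps the line layout shown in this report.

```python
import re

# Error text copied from the traceback above (request line shortened).
SAMPLE_MESSAGE = """(UserError) Invalid trigger definition, details: Service invocation failed!
Request: POST .../GenericTriggerJob/ParseJob
Status Code: 400 BadRequest
Error Code: UserError/BadArgument/ArgumentInvalid/InvalidPipelineJob/InvalidJobsOverride
Reason Phrase: Invalid jobs override for pipeline job since source job lucid_card_xxxxxxxx is specified."""


def parse_service_error(message: str) -> dict:
    """Return the 'Key: value' lines of interest found in the error text."""
    fields = {}
    for key in ("Status Code", "Error Code", "Reason Phrase"):
        match = re.search(rf"^{key}:\s*(.+)$", message, flags=re.MULTILINE)
        if match:
            fields[key] = match.group(1).strip()
    return fields


def is_invalid_jobs_override(message: str) -> bool:
    """True when the innermost error code is InvalidJobsOverride."""
    code = parse_service_error(message).get("Error Code", "")
    return code.split("/")[-1] == "InvalidJobsOverride"


if __name__ == "__main__":
    print(parse_service_error(SAMPLE_MESSAGE)["Error Code"])
    print(is_invalid_jobs_override(SAMPLE_MESSAGE))
```

In practice this could be called with `str(err)` inside an `except HttpResponseError as err:` block, assuming the string form of the exception contains the same text as the traceback.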

Here is the error on AML UI side:

The schedule "[SCHEDULE_NAME]" could not be updated. Failure reason: UserError, Invalid trigger definition, details: Microsoft.MachineLearning.Common.Core.ServiceInvocationException: Service invocation failed!

Request: POST smt.designer-[REGION].svc/studioservice/api/subscriptions/[SUBSCRIPTION_ID]/resourceGroups/[RESOURCE_GROUP]/workspaces/[WORKSPACE]/GenericTriggerJob/ParseJob

Status Code: 400 BadRequest
Error Code: UserError/BadArgument/ArgumentInvalid/InvalidPipelineJob/InvalidJobsOverride
Reason Phrase: Invalid jobs override for pipeline job since source job [SOURCE_JOB_ID] is specified.

Response Body:
{
  "error": {
    "code": "UserError",
    "message": "Invalid jobs override for pipeline job since source job [SOURCE_JOB_ID] is specified.",
    "innerError": {
      "code": "BadArgument",
      "innerError": {
        "code": "ArgumentInvalid",
        "innerError": {
          "code": "InvalidPipelineJob",
          "innerError": {
            "code": "InvalidJobsOverride"
          }
        }
      }
    }
  },
  "correlation": {
    "operation": "[OPERATION_ID]",
    "request": "[REQUEST_ID]"
  },
  "environment": "[REGION]",
  "location": "[REGION]",
  "time": "2025-03-31T07:33:46.687716+00:00",
  "componentName": "Designer-MiddleTier-Service",
  "statusCode": 400
}
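
The nested `innerError` objects in the response body carry the same information as the slash-separated `Error Code` line. A small sketch of walking that chain, using a trimmed copy of the body above (the function name `error_code_path` is made up for illustration):

```python
import json

# Trimmed copy of the response body above; only the "error" object is kept.
RESPONSE_BODY = json.loads("""
{
  "error": {
    "code": "UserError",
    "message": "Invalid jobs override for pipeline job since source job [SOURCE_JOB_ID] is specified.",
    "innerError": {
      "code": "BadArgument",
      "innerError": {
        "code": "ArgumentInvalid",
        "innerError": {
          "code": "InvalidPipelineJob",
          "innerError": {
            "code": "InvalidJobsOverride"
          }
        }
      }
    }
  }
}
""")


def error_code_path(error: dict) -> list:
    """Collect the 'code' of each level, from the outermost error inward."""
    codes = []
    node = error
    while isinstance(node, dict):
        code = node.get("code")
        if code is not None:
            codes.append(code)
        node = node.get("innerError")
    return codes


if __name__ == "__main__":
    # Prints: UserError/BadArgument/ArgumentInvalid/InvalidPipelineJob/InvalidJobsOverride
    print("/".join(error_code_path(RESPONSE_BODY["error"])))
```

The last element, `InvalidJobsOverride`, is the most specific code and is the one worth matching on.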


If I update the schedule first and then create the job, it works. This issue is new: neither my code nor my version of azure-ai-ml has changed.
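
The ordering that works can be sketched as below. This is only an illustration of the workaround described in this report, not a confirmed fix: the wrapper `schedule_then_submit` and its `build_schedule` callback are hypothetical, and the only SDK calls used are the two that already appear in the reproduction (`ml_client.schedules.begin_create_or_update` and `ml_client.jobs.create_or_update`). The assumption is that the schedule must be built from the local, not-yet-submitted pipeline job, and that the object returned by the submission is kept in a separate variable.

```python
def schedule_then_submit(ml_client, pipeline_job, build_schedule):
    """Create or update the schedule first, then submit the job.

    ml_client      -- an MLClient-like object (only .schedules and .jobs are used)
    pipeline_job   -- the local pipeline job, not yet submitted
    build_schedule -- hypothetical callback: takes the job, returns a JobSchedule
    """
    # 1. Build the schedule from the local job, before any submission.
    schedule = ml_client.schedules.begin_create_or_update(
        schedule=build_schedule(pipeline_job)
    ).result()

    # 2. Only now submit the job. The returned object is kept separate so the
    #    local pipeline_job is not replaced by the submitted one.
    submitted_job = ml_client.jobs.create_or_update(pipeline_job)

    return schedule, submitted_job
```

With the objects from the reproduction, `build_schedule` could be `lambda job: JobSchedule(name="my_schedule", trigger=cron_trigger, create_job=job)`.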

Metadata

Labels

  • Client: This issue points to a problem in the data-plane of the library.
  • Machine Learning
  • Service Attention: Workflow: This issue is responsible by Azure service team.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • issue-addressed: Workflow: The Azure SDK team believes it to be addressed and ready to close.
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that
