At scale, some AWs do not enter the complete state #657

@asm582

Describe the Bug

At scale, some AWs never enter the complete state because the informer and etcd disagree.
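To make the kind of disagreement described above concrete, here is a minimal sketch that models an informer cache and the backing store as two plain dictionaries and lists the objects whose cached copy lags behind. Everything in it is an assumption for illustration: the `Snapshot` type, the `find_stale` helper, the object names, and the state strings are hypothetical, and it does not use MCAD or Kubernetes client APIs. Real Kubernetes resource versions are opaque strings, so the integer comparison is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One observed copy of an object: its version counter and state."""
    resource_version: int  # simplification: real resource versions are opaque strings
    state: str


def find_stale(cache, store):
    """Return the sorted names whose cached copy is missing or older than the store copy."""
    stale = []
    for name, truth in store.items():
        cached = cache.get(name)
        if cached is None or cached.resource_version < truth.resource_version:
            stale.append(name)
    return sorted(stale)


# Hypothetical data: the store records both AWs as finished,
# but the cache missed the last update for one of them.
store = {
    "aw-1": Snapshot(resource_version=7, state="Completed"),
    "aw-2": Snapshot(resource_version=9, state="Completed"),
}
cache = {
    "aw-1": Snapshot(resource_version=7, state="Completed"),
    "aw-2": Snapshot(resource_version=5, state="Running"),
}

print(find_stale(cache, store))  # ['aw-2']
```

A controller that acts only on the cached copy would keep treating `aw-2` as running in this model, which is one way an AW could fail to reach the complete state.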

Codeflare Stack Component Versions

Please specify the component versions in which you have encountered this bug.

Codeflare SDK:
MCAD:

Steps to Reproduce the Bug

Submit 1K AWs, each wrapping a very short job (10 seconds), and wait for all 1K AWs to complete.

What Have You Already Tried to Debug the Issue?

I have run scale tests to reproduce the issue.

Expected Behavior

All AWs should be completed.

Screenshots, Console Output, Logs, etc.

NA

Affected Releases

Current 1.35.0 release and main branch

Additional Context

NA

