Async API - errors with "invalid workload status: payload" #2104

Closed
@Haplo-Dragon

Version

0.33.0

Description

When a payload is uploaded for processing, the workload status seems to be set incorrectly. As a result, getStatus returns 500 errors when polling endpoint/<job_ID> for a result while the workload is processing. The payload (a video, in our case) is eventually processed correctly and a result is returned; the errors only occur while the workload is processing.

Configuration

cortex.yaml

- name: predict
  kind: AsyncAPI
  predictor:
    type: python
    path: api/async/predict.py
    dependencies:
      shell: startup.sh
    config:
      device: gpu
      input_shape: [512, 512]
  compute:
    gpu: 1
    cpu: 1

cluster.yaml

# EKS cluster name
cluster_name: senseye-api

# AWS region
region: us-east-1

node_groups:
  - name: ng-gpu
    instance_type: g4dn.8xlarge
    min_instances: 2
    max_instances: 10
    instance_volume_size: 150
    instance_volume_type: gp2
    # instance_volume_iops: 3000
    spot: false

# subnet visibility [public (instances will have public IPs) | private (instances will not have public IPs)]
subnet_visibility: public

# NAT gateway (required when using private subnets) [none | single | highly_available (a NAT gateway per availability zone)]
#nat_gateway: highly_available

# API load balancer scheme [internet-facing | internal]
api_load_balancer_scheme: internal

# operator load balancer scheme [internet-facing | internal]
# note: if using "internal", you must configure VPC Peering to connect your CLI to your cluster operator
operator_load_balancer_scheme: internal

# to install Cortex in an existing VPC, you can provide a list of subnets for your cluster to use
# subnet_visibility (specified above in this file) must match your subnets' visibility
# this is an advanced feature (not recommended for first-time users) and requires your VPC to be configured correctly; see https://eksctl.io/usage/vpc-networking/#use-existing-vpc-other-custom-configuration
# here is an example:
subnets:
  - availability_zone: us-east-1a
    subnet_id: <subnet ID>
  - availability_zone: us-east-1b
    subnet_id: <subnet ID>

# additional tags to assign to AWS resources (all resources will automatically be tagged with cortex.dev/cluster-name: <cluster_name>)
tags:
  model: hrnet_w40

# SSL certificate ARN (only necessary when using a custom domain)
#ssl_certificate_arn: <ARN>

# primary CIDR block for the cluster's VPC
vpc_cidr: <CIDR block>

I'm pretty sure the config isn't relevant here, but I could be wrong?

Steps to reproduce

  1. POST to prediction endpoint with a payload (our payloads are URLs of videos stored in S3 buckets, but I don't think the type of payload actually matters)
  2. Receive a job ID
  3. GET to prediction endpoint with job ID while workload is processing -> "error: invalid workload status: payload"
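
The steps above can be sketched as a short script. This is only a minimal sketch, not the client actually used here: the `ENDPOINT` URL, the `url` payload key, and the assumption that the POST response is JSON carrying the job ID in an `id` field are all hypothetical and should be adjusted to the real deployment. The `request` parameter is injectable so the flow can be exercised without a live cluster.

```python
import json
import urllib.error
import urllib.request

# Hypothetical endpoint; replace with the URL of the deployed AsyncAPI.
ENDPOINT = "http://localhost:8888/predict"


def http_json(method, url, body=None):
    """Send a request with an optional JSON body; return (status_code, body_text)."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        # Non-2xx responses (such as the 500 described here) arrive as HTTPError.
        return err.code, err.read().decode()


def reproduce(endpoint, payload, request=http_json):
    """Step 1: POST the payload. Step 2: read the job ID. Step 3: GET its status."""
    _, text = request("POST", endpoint, payload)
    job_id = json.loads(text)["id"]  # assumed response shape: {"id": "<job_ID>"}
    return request("GET", f"{endpoint}/{job_id}")
```

Calling `reproduce(ENDPOINT, {"url": "<S3 URL of a video>"})` right after submitting, while the workload is still processing, would be expected to return the 500 status and the error text quoted in step 3.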

Expected behavior

The GET call to endpoint/<job_ID> should return a status of in_queue | in_progress | failed | completed
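
To make the expectation concrete, here is a minimal sketch of a client-side check for the four statuses listed above. It assumes the response body is JSON with the status in a `status` field; that field name and the helper `check_status` are hypothetical, not taken from the Cortex source.

```python
# The four statuses the GET call is expected to return.
VALID_STATUSES = {"in_queue", "in_progress", "failed", "completed"}


def check_status(response_body: dict) -> str:
    """Return the workload status, or raise if it is not one of the expected values."""
    status = response_body.get("status")  # assumed field name
    if status not in VALID_STATUSES:
        raise ValueError(f"unexpected workload status: {status!r}")
    return status
```

With this check, a value such as `payload` would be rejected on the client side as well, since it is not one of the four expected statuses.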

Actual behavior

While the workload is processing, the GET call to endpoint/<job_ID> returns a 500 error with the message "error: invalid workload status: payload"
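
Since the result is eventually returned, a client can work around the problem for now by treating this specific 500 as "still processing" and continuing to poll. The following is a minimal sketch under that assumption, not a fix for the gateway itself: `poll_until_done` and its `get_status` callable are hypothetical, and the response is assumed to be JSON with a `status` field.

```python
import json
import time

# Error text reported by the gateway while the workload is processing.
KNOWN_ERROR = "invalid workload status: payload"
# Statuses after which polling can stop.
TERMINAL_STATUSES = {"failed", "completed"}


def poll_until_done(get_status, job_id, interval=1.0, max_attempts=60, sleep=time.sleep):
    """Poll until the job fails or completes, tolerating the known 500 error.

    get_status(job_id) must return a (status_code, body_text) tuple.
    """
    for _ in range(max_attempts):
        code, text = get_status(job_id)
        if code == 500 and KNOWN_ERROR in text:
            # Known issue: treat as "still processing" and retry.
            sleep(interval)
            continue
        if code != 200:
            raise RuntimeError(f"unexpected response {code}: {text}")
        body = json.loads(text)
        if body.get("status") in TERMINAL_STATUSES:
            return body
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} attempts")
```

Any other non-200 response still raises, so unrelated failures are not silently swallowed by the workaround.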

Stack traces

(error output from cortex logs <api name>)

{
    "caller": "async-gateway/endpoint.go:89",
    "error": "invalid workload status: payload",
    "id": "<job_ID>",
    "labels": {
        "apiKind": "AsyncAPI",
        "apiName": "predict",
        "cortex.dev/api": "true",
        "cortex.dev/async": "gateway",
        "deploymentID": "<deploymentID>",
        "pod-template-hash": "7d6f4bddb5",
        "predictorID": "<predictorID>"
    },
    "level": "error",
    "message": "failed to get workload",
    "stacktrace": 
"main.(*Endpoint).GetWorkload
  /workspace/async-gateway/endpoint.go:89
net/http.HandlerFunc.ServeHTTP
  /usr/local/go/src/net/http/server.go:2042
github.com/gorilla/mux.(*Router).ServeHTTP
  /go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210
github.com/gorilla/handlers.(*cors).ServeHTTP
  /go/pkg/mod/github.com/gorilla/handlers@v1.5.1/cors.go:54
net/http.serverHandler.ServeHTTP
  /usr/local/go/src/net/http/server.go:2843
net/http.(*conn).serve
  /usr/local/go/src/net/http/server.go:1925",
    "stream": "stdout",
    "time": "2021-04-19T19:40:46.765638612Z",
    "ts": 1618861246.7654822
}

Labels

bug (Something isn't working)
