Version
0.33.0
Description
When a payload is uploaded to be processed, the workload status appears to be set incorrectly. This causes 500 errors from getStatus when polling endpoint/<job_ID> for a result while the workload is processing. The video is eventually processed correctly and a result is returned; the errors only occur while the workload is in flight.
Configuration
cortex.yaml
- name: predict
  kind: AsyncAPI
  predictor:
    type: python
    path: api/async/predict.py
    dependencies:
      shell: startup.sh
    config:
      device: gpu
      input_shape: [512, 512]
  compute:
    gpu: 1
    cpu: 1
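api/async/predict.py isn't included here, but for context it follows the standard Cortex PythonPredictor interface. A rough sketch of its shape (the payload schema and model details below are placeholders, not our actual code):

# hypothetical sketch, not our actual predictor
class PythonPredictor:
    def __init__(self, config):
        # config is the `config` block from cortex.yaml above,
        # i.e. {"device": "gpu", "input_shape": [512, 512]}
        self.device = config["device"]
        self.input_shape = config["input_shape"]
        # model loading would happen here

    def predict(self, payload):
        # payload is the JSON body of the POST request; for us,
        # a URL to a video in S3 (the key name is illustrative)
        video_url = payload["url"]
        # download the video, run inference, return a JSON-serializable result
        return {"video_url": video_url, "result": "..."}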
cluster.yaml
# EKS cluster name
cluster_name: senseye-api

# AWS region
region: us-east-1

node_groups:
  - name: ng-gpu
    instance_type: g4dn.8xlarge
    min_instances: 2
    max_instances: 10
    instance_volume_size: 150
    instance_volume_type: gp2
    # instance_volume_iops: 3000
    spot: false

# subnet visibility [public (instances will have public IPs) | private (instances will not have public IPs)]
subnet_visibility: public

# NAT gateway (required when using private subnets) [none | single | highly_available (a NAT gateway per availability zone)]
# nat_gateway: highly_available

# API load balancer scheme [internet-facing | internal]
api_load_balancer_scheme: internal

# operator load balancer scheme [internet-facing | internal]
# note: if using "internal", you must configure VPC Peering to connect your CLI to your cluster operator
operator_load_balancer_scheme: internal

# to install Cortex in an existing VPC, you can provide a list of subnets for your cluster to use
# subnet_visibility (specified above in this file) must match your subnets' visibility
# this is an advanced feature (not recommended for first-time users) and requires your VPC to be configured correctly; see https://eksctl.io/usage/vpc-networking/#use-existing-vpc-other-custom-configuration
subnets:
  - availability_zone: us-east-1a
    subnet_id: <subnet ID>
  - availability_zone: us-east-1b
    subnet_id: <subnet ID>

# additional tags to assign to AWS resources (all resources will automatically be tagged with cortex.dev/cluster-name: <cluster_name>)
tags:
  model: hrnet_w40

# SSL certificate ARN (only necessary when using a custom domain)
# ssl_certificate_arn: <ARN>

# primary CIDR block for the cluster's VPC
vpc_cidr: <CIDR block>
I'm fairly sure the config isn't relevant here, but I could be wrong.
Steps to reproduce
- POST to the prediction endpoint with a payload (our payloads are URLs of videos stored in S3 buckets, but I don't think the payload type actually matters)
- Receive a job ID
- GET the prediction endpoint with the job ID while the workload is processing -> "error: invalid workload status: payload" (a minimal reproduction script follows this list)
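A minimal reproduction in Python (the endpoint URL and payload schema are placeholders; this assumes the job ID comes back under "id", which matches the "id" field in the logs below):

import requests

ENDPOINT = "http://<load balancer>/predict"  # placeholder

# submit a workload; the async POST response contains the job ID
resp = requests.post(ENDPOINT, json={"url": "s3://<bucket>/video.mp4"})
resp.raise_for_status()
job_id = resp.json()["id"]

# poll immediately, while the workload is still processing
status = requests.get(f"{ENDPOINT}/{job_id}")
print(status.status_code)  # 500
print(status.text)         # "error: invalid workload status: payload"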
Expected behavior
The GET call to endpoint/<job_ID> should return a status of in_queue | in_progress | failed | completed.
Actual behavior
While the workload is processing, the GET call to endpoint/<job_ID> returns a 500 error with the message "error: invalid workload status: payload".
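A client can paper over this by treating the 500 as "still processing" and polling until a terminal status, though that also swallows genuine server errors. A sketch of that workaround (same placeholder endpoint as above):

import time
import requests

def wait_for_result(endpoint, job_id, interval=5.0):
    # poll until the workload reaches a terminal status; tolerate the
    # spurious 500s described above by treating them as "in progress",
    # which unfortunately also hides real failures
    while True:
        resp = requests.get(f"{endpoint}/{job_id}")
        if resp.status_code == 500:
            time.sleep(interval)
            continue
        resp.raise_for_status()
        body = resp.json()
        if body["status"] in ("completed", "failed"):
            return body
        time.sleep(interval)  # in_queue or in_progress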
Stack traces
(error output from cortex logs <api name>)
{
  "caller": "async-gateway/endpoint.go:89",
  "error": "invalid workload status: payload",
  "id": "<job_ID>",
  "labels": {
    "apiKind": "AsyncAPI",
    "apiName": "predict",
    "cortex.dev/api": "true",
    "cortex.dev/async": "gateway",
    "deploymentID": "<deploymentID>",
    "pod-template-hash": "7d6f4bddb5",
    "predictorID": "<predictorID>"
  },
  "level": "error",
  "message": "failed to get workload",
  "stacktrace": "main.(*Endpoint).GetWorkload
      /workspace/async-gateway/endpoint.go:89
    net/http.HandlerFunc.ServeHTTP
      /usr/local/go/src/net/http/server.go:2042
    github.com/gorilla/mux.(*Router).ServeHTTP
      /go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210
    github.com/gorilla/handlers.(*cors).ServeHTTP
      /go/pkg/mod/github.com/gorilla/handlers@v1.5.1/cors.go:54
    net/http.serverHandler.ServeHTTP
      /usr/local/go/src/net/http/server.go:2843
    net/http.(*conn).serve
      /usr/local/go/src/net/http/server.go:1925",
  "stream": "stdout",
  "time": "2021-04-19T19:40:46.765638612Z",
  "ts": 1618861246.7654822
}