Container as a Service (CaaS) #2173

Merged
merged 87 commits into from
Jun 4, 2021
Commits
6fb6c44
Add dd
RobertLucian May 12, 2021
3eb736b
CaaS WIP
RobertLucian May 13, 2021
22067dc
Remove gRPC
RobertLucian May 13, 2021
c7a5274
Merge branch 'master' into feature/caas-api
RobertLucian May 13, 2021
aa6b367
WIP CaaS
RobertLucian May 13, 2021
74dc9c1
WIP CaaS
RobertLucian May 13, 2021
05caab0
Prevent use of reserved/duplicate container names
RobertLucian May 13, 2021
9b68260
Moving around consts
RobertLucian May 13, 2021
24ced7d
Fix make operator-local command
RobertLucian May 13, 2021
70a9f95
Merge branch 'fix/operator-local' into feature/caas-api
RobertLucian May 13, 2021
92adc2f
WIP CaaS
RobertLucian May 14, 2021
6bad607
Add docker client helper fn (might have to rm later on)
RobertLucian May 14, 2021
1af3ee6
Use kubexit just for batch/task APIs
RobertLucian May 14, 2021
6fb6bfb
WIP CaaS
RobertLucian May 14, 2021
6744ec8
Add task example (plus some fixes)
RobertLucian May 14, 2021
f326d62
Rename container for example API
RobertLucian May 14, 2021
73ddf51
Use istio metrics for cortex get (might have to do this in the proxy …
RobertLucian May 17, 2021
c49eff5
Use the rest of istio metrics for cortex get
RobertLucian May 17, 2021
d9b8e6e
HTTP Reverse Proxy implementation (#2172)
May 17, 2021
c501d53
Replace request-monitor with cortex proxy (#2174)
May 17, 2021
2ef2353
Fix make tests command
RobertLucian May 17, 2021
4976481
Operator changes for CaaS implementation (#2177)
RobertLucian May 19, 2021
a123f54
Add readiness probe to realtime cortex proxy (#2176)
May 19, 2021
fbbf1b5
Support binary data in configmap helper
deliahu May 19, 2021
9722b5a
Update CLI and Python Client (#2189)
deliahu May 22, 2021
5d21a40
Add readiness/liveness probes to k8s CaaS resources (#2187)
RobertLucian May 24, 2021
0cad67b
Support exec probe for realtime's readiness probe (#2190)
RobertLucian May 24, 2021
079c20c
Initial docs pass (#2192)
deliahu May 25, 2021
0a1e2c2
CaaS - test APIs and E2E tests (#2191)
RobertLucian May 27, 2021
d7c03dc
CaaS - fix cortex CLI (#2194)
RobertLucian May 27, 2021
481e17e
Fix build-and-push-test-images make cmd
RobertLucian May 27, 2021
ee712f3
Update docs (#2196)
deliahu May 27, 2021
86620a7
Update lint.sh
deliahu May 27, 2021
a51f947
Dequeuer proxy for Async and Batch APIs (#2181)
May 28, 2021
c04bd6e
CaaS - move max_concurrency/max_queue_length fields (#2198)
RobertLucian May 28, 2021
89c0665
CaaS - cleanup and fixes (#2195)
RobertLucian May 28, 2021
54ef26b
Update test status code offsets
deliahu May 28, 2021
42747e9
Remove offset value from print
RobertLucian May 28, 2021
1842bbb
Merge branch 'master' of github.com:cortexlabs/cortex into feature/ca…
deliahu May 28, 2021
66038e2
Update versions.md
deliahu May 28, 2021
e027242
Update CONTRIBUTING.md
deliahu May 28, 2021
fd55001
Remove dev/load
deliahu May 28, 2021
992e6d9
Misc changes
deliahu May 28, 2021
7508c76
Remove support for max_concurrency and max_queue_length for Async APIs
deliahu May 28, 2021
72578d6
Move the max_concurrency/max_queue_length fields for the test APIs (#…
RobertLucian May 28, 2021
85138e9
CaaS - move the shm field to the compute section (#2202)
RobertLucian May 28, 2021
ef41343
Delete projectID
deliahu May 28, 2021
8501030
Fix projectID deletion
deliahu May 28, 2021
8d82a18
Rename handlerID to podID
deliahu May 28, 2021
da7cc02
Delete unnecessary comment
deliahu May 28, 2021
b626d39
Rename Container Implementation
deliahu May 28, 2021
486584a
Update docs
deliahu May 28, 2021
30eea01
CaaS - Async/Batch Deployments (#2201)
RobertLucian May 28, 2021
30efa6d
Add configmap permissions to controller (#2205)
vishalbollu May 31, 2021
e4edc33
Update docs
deliahu May 31, 2021
7dadcd0
CaaS - healthcheck probes for Async/Batch/Realtime (#2206)
RobertLucian May 31, 2021
2b4badd
Add cortex prefix to labels in fluent bit (#2197)
vishalbollu Jun 1, 2021
8b27181
Fix statsd metrics exporting to prometheus (#2207)
vishalbollu Jun 1, 2021
750e7a0
Display cloudwatch url as output for cortex logs (#2208)
vishalbollu Jun 1, 2021
b9bdf43
Update respond.go (#2209)
RobertLucian Jun 1, 2021
527bf25
CaaS - fix respond function (#2213)
RobertLucian Jun 1, 2021
fae614a
Update test image build scripts
deliahu Jun 2, 2021
c0b85bf
Update test/utils/build.sh
deliahu Jun 2, 2021
596a556
Add docs for chaining APIs
deliahu Jun 2, 2021
6a966bf
Update docs
deliahu Jun 2, 2021
8a05a14
Remove hard coded API name and Job ID values in _completedJobLogURLTe…
vishalbollu Jun 2, 2021
2343383
Fix async API queue deletion during cluster down (#2216)
deliahu Jun 2, 2021
d4d7095
Prevent nil pointer and improve logging on failed health checks (#2211)
vishalbollu Jun 2, 2021
8a9a9a6
CaaS - E2E fixes (#2212)
RobertLucian Jun 2, 2021
cd19bbf
Schedule dequeuer first and then user provided containers for BatchAP…
vishalbollu Jun 2, 2021
7705d64
Add test/apis/realtime/hello-world/build-cpu.sh
deliahu Jun 2, 2021
531337f
cache dequeuer in registry.sh
deliahu Jun 2, 2021
48f42eb
Merge branch 'master' into feature/caas-api
RobertLucian Jun 2, 2021
d21509e
Close the request body in batch handler
vishalbollu Jun 2, 2021
76547a2
Persist metrics when status is in completed with failures state
vishalbollu Jun 2, 2021
0d871ce
Update test APIs
deliahu Jun 3, 2021
27d3cd8
Add build-and-push-test-images command
deliahu Jun 3, 2021
8a246af
Rename build-test-api-images
deliahu Jun 3, 2021
6215883
Rename containers.md
deliahu Jun 3, 2021
2882436
Nits
RobertLucian Jun 3, 2021
0e8839e
Update env list formatting
deliahu Jun 3, 2021
38ffa7d
CaaS - batch probe fixes (#2220)
RobertLucian Jun 3, 2021
9d0d7fe
Merge branch 'feature/caas-api' of github.com:cortexlabs/cortex into …
deliahu Jun 3, 2021
3d10a86
CaaS - nits and fixes (#2221)
RobertLucian Jun 3, 2021
7f2aaed
Update examples
deliahu Jun 3, 2021
b178912
Update main.py
deliahu Jun 3, 2021
ca7e454
Update dev workflow
deliahu Jun 4, 2021
19 changes: 8 additions & 11 deletions .circleci/config.yml
@@ -20,8 +20,9 @@ commands:
- run:
name: Install Go
command: |
wget https://dl.google.com/go/go1.14.7.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.14.7.linux-amd64.tar.gz
sudo rm -rf /usr/local/go
wget https://dl.google.com/go/go1.15.12.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.15.12.linux-amd64.tar.gz
rm -rf go*.tar.gz
echo 'export PATH=$PATH:/usr/local/go/bin' >> $BASH_ENV
echo 'export PATH=$PATH:~/go/bin' >> $BASH_ENV
@@ -75,18 +76,17 @@ commands:

jobs:
test:
docker:
- image: circleci/python:3.6
machine:
image: ubuntu-1604:202104-01 # machine executor necessary to run go integration tests
steps:
- checkout
- setup_remote_docker
- install-go
- run:
name: Install Linting Tools
command: |
go get -u -v golang.org/x/lint/golint
go get -u -v github.com/kyoh86/looppointer/cmd/looppointer
sudo pip install black aiohttp
pip3 install black aiohttp
- run:
name: Initialize Credentials
command: |
@@ -111,9 +111,6 @@ jobs:
- run:
name: Go Tests
command: make test-go
- run:
name: Python Tests
command: make test-python

build-and-deploy:
docker:
@@ -162,8 +159,8 @@ jobs:
node_groups:
- name: spot
instance_type: t3.medium
min_instances: 10
max_instances: 10
min_instances: 16
max_instances: 16
spot: true
- name: cpu
instance_type: c5.xlarge
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug-report.md
@@ -39,7 +39,7 @@ assignees: ''

### Stack traces

(error output from `cortex logs <api name>`)
(error output from CloudWatch Insights or from a random pod `cortex logs <api name>`)

```text
<paste stack traces here>
8 changes: 3 additions & 5 deletions CONTRIBUTING.md
@@ -2,7 +2,7 @@

## Remote development

We recommend that you run your development environment on an EC2 instance due to frequent docker registry pushing. We've had a good experience using [Mutagen](https://mutagen.io/documentation/introduction) to synchronize local / remote file systems.
We recommend that you run your development environment on an EC2 instance due to frequent docker registry pushing. We've had a good experience using [Mutagen](https://mutagen.io/documentation/introduction) to synchronize local / remote filesystems.

## Prerequisites

@@ -169,7 +169,7 @@ node_groups:
Add this to your bash profile (e.g. `~/.bash_profile`, `~/.profile` or `~/.bashrc`), replacing the placeholders accordingly:

```bash
# set the default image for APIs
# set the default image registry
export CORTEX_DEV_DEFAULT_IMAGE_REGISTRY="<account_id>.dkr.ecr.<region>.amazonaws.com/cortexlabs"

# redirect analytics and error reporting to our dev environment
@@ -209,7 +209,7 @@ Here is the typical full dev workflow which covers most cases:
1. `make cluster-up` (creates a cluster using `dev/config/cluster.yaml`)
2. `make devstart` (deletes the in-cluster operator, builds the CLI, and starts the operator locally; file changes will trigger the CLI and operator to re-build)
3. Make your changes
4. `make images-dev` (only necessary if API images or the manager are modified)
4. `make images-dev` (only necessary if changes were made outside of the operator and CLI)
5. Test your changes e.g. via `cortex deploy` (and repeat steps 3 and 4 as necessary)
6. `make cluster-down` (deletes your cluster)

@@ -224,6 +224,4 @@ If you are only modifying the CLI, `make cli-watch` will build the CLI and re-bu

If you are only modifying the operator, `make operator-local` will build and start the operator locally, and build/restart it when files are changed.

If you are modifying code in the API images (i.e. any of the Python serving code), `make images-dev` may build more images than you need during testing. For example, if you are only testing using the `python-handler-cpu` image, you can run `./dev/registry.sh update-single python-handler-cpu`.

See `Makefile` for additional dev commands.
21 changes: 6 additions & 15 deletions Makefile
@@ -124,8 +124,7 @@ async-gateway-update:
@./dev/registry.sh update-single async-gateway
@kubectl delete pods -l cortex.dev/async=gateway --namespace=default

# Docker images

# docker images
images-all:
@./dev/registry.sh update all
images-all-skip-push:
@@ -136,15 +135,8 @@ images-dev:
images-dev-skip-push:
@./dev/registry.sh update dev --skip-push

images-api:
@./dev/registry.sh update api
images-api-skip-push:
@./dev/registry.sh update api --skip-push

images-manager-skip-push:
@./dev/registry.sh update-single manager --skip-push
images-iris:
@./dev/registry.sh update-single python-handler-cpu

registry-create:
@./dev/registry.sh create
@@ -170,15 +162,14 @@ format:
# Tests #
#########

test:
@./build/test.sh
# build test api images
# make sure you login with your quay credentials
build-test-api-images:
@./test/utils/build-all.sh quay.io/cortexlabs-test

test-go:
test:
@./build/test.sh go

test-python:
@./build/test.sh python

# run e2e tests on an existing cluster
# read test/e2e/README.md for instructions first
test-e2e:
14 changes: 1 addition & 13 deletions build/build-image.sh
@@ -26,16 +26,4 @@ image=$1
if [ "$image" == "inferentia" ]; then
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 790709498068.dkr.ecr.us-west-2.amazonaws.com
fi

build_args=""

if [ "${image}" == "python-handler-gpu" ]; then
cuda=("10.0" "10.1" "10.1" "10.2" "10.2" "11.0" "11.1")
cudnn=("7" "7" "8" "7" "8" "8" "8")
for i in ${!cudnn[@]}; do
build_args="${build_args} --build-arg CUDA_VERSION=${cuda[$i]} --build-arg CUDNN=${cudnn[$i]}"
docker build "$ROOT" -f $ROOT/images/$image/Dockerfile $build_args -t quay.io/cortexlabs/${image}:${CORTEX_VERSION}-cuda${cuda[$i]}-cudnn${cudnn[$i]} -t cortexlabs/${image}:${CORTEX_VERSION}-cuda${cuda[$i]}-cudnn${cudnn[$i]}
done
else
docker build "$ROOT" -f $ROOT/images/$image/Dockerfile $build_args -t quay.io/cortexlabs/${image}:${CORTEX_VERSION} -t cortexlabs/${image}:${CORTEX_VERSION}
fi
docker build "$ROOT" -f $ROOT/images/$image/Dockerfile -t quay.io/cortexlabs/${image}:${CORTEX_VERSION} -t cortexlabs/${image}:${CORTEX_VERSION}
16 changes: 2 additions & 14 deletions build/images.sh
@@ -19,24 +19,15 @@

set -euo pipefail

api_images=(
"python-handler-cpu"
"python-handler-gpu"
"tensorflow-handler"
"python-handler-inf"
)

dev_images=(
"downloader"
"manager"
"request-monitor"
"proxy"
"async-gateway"
"enqueuer"
"dequeuer"
)

non_dev_images=(
"tensorflow-serving-cpu"
"tensorflow-serving-gpu"
"cluster-autoscaler"
"operator"
"controller-manager"
@@ -53,16 +44,13 @@ non_dev_images=(
"kube-rbac-proxy"
"grafana"
"event-exporter"
"tensorflow-serving-inf"
"metrics-server"
"inferentia"
"neuron-rtd"
"nvidia"
"kubexit"
)

all_images=(
"${api_images[@]}"
"${dev_images[@]}"
"${non_dev_images[@]}"
)
6 changes: 6 additions & 0 deletions build/lint.sh
@@ -84,6 +84,7 @@ output=$(cd "$ROOT" && find . -type f \
! -path "**/.vscode/*" \
! -path "**/.idea/*" \
! -path "**/.history/*" \
! -path "**/testbin/*" \
! -path "**/__pycache__/*" \
! -path "**/.pytest_cache/*" \
! -path "**/*.egg-info/*" \
@@ -118,6 +119,7 @@ if [ "$is_release_branch" = "true" ]; then
! -path "**/.vscode/*" \
! -path "**/.idea/*" \
! -path "**/.history/*" \
! -path "**/testbin/*" \
! -path "**/__pycache__/*" \
! -path "**/.pytest_cache/*" \
! -path "**/*.egg-info/*" \
@@ -141,6 +143,7 @@ output=$(cd "$ROOT" && find . -type f \
! -path "**/.idea/*" \
! -path "**/.history/*" \
! -path "**/.vscode/*" \
! -path "**/testbin/*" \
! -path "**/__pycache__/*" \
! -path "**/.pytest_cache/*" \
! -path "**/*.egg-info/*" \
@@ -164,6 +167,7 @@ output=$(cd "$ROOT" && find . -type f \
! -path "**/.idea/*" \
! -path "**/.history/*" \
! -path "**/.vscode/*" \
! -path "**/testbin/*" \
! -path "**/__pycache__/*" \
! -path "**/.pytest_cache/*" \
! -path "**/*.egg-info/*" \
@@ -188,6 +192,7 @@ output=$(cd "$ROOT" && find . -type f \
! -path "**/.vscode/*" \
! -path "**/.idea/*" \
! -path "**/.history/*" \
! -path "**/testbin/*" \
! -path "**/__pycache__/*" \
! -path "**/.pytest_cache/*" \
! -path "**/*.egg-info/*" \
@@ -210,6 +215,7 @@ output=$(cd "$ROOT" && find . -type f \
! -path "**/.idea/*" \
! -path "**/.history/*" \
! -path "**/.vscode/*" \
! -path "**/testbin/*" \
! -path "**/__pycache__/*" \
! -path "**/.pytest_cache/*" \
! -path "**/*.egg-info/*" \
11 changes: 1 addition & 10 deletions build/push-image.sh
@@ -23,13 +23,4 @@ host=$1
image=$2

echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USERNAME" --password-stdin

if [ "$image" == "python-handler-gpu" ]; then
cuda=("10.0" "10.1" "10.1" "10.2" "10.2" "11.0" "11.1")
cudnn=("7" "7" "8" "7" "8" "8" "8")
for i in ${!cudnn[@]}; do
docker push $host/cortexlabs/${image}:${CORTEX_VERSION}-cuda${cuda[$i]}-cudnn${cudnn[$i]}
done
else
docker push $host/cortexlabs/${image}:${CORTEX_VERSION}
fi
docker push $host/cortexlabs/${image}:${CORTEX_VERSION}
10 changes: 0 additions & 10 deletions build/test.sh
@@ -79,11 +79,6 @@ function run_go_tests() {
)
}

function run_python_tests() {
docker build $ROOT -f $ROOT/images/test/Dockerfile -t cortexlabs/test
docker run cortexlabs/test
}

function run_e2e_tests() {
if [ "$create_cluster" = "yes" ]; then
pytest $ROOT/test/e2e/tests --config "$sub_cmd"
@@ -94,11 +89,6 @@ function run_e2e_tests() {

if [ "$cmd" = "go" ]; then
run_go_tests
elif [ "$cmd" = "python" ]; then
run_python_tests
elif [ "$cmd" = "e2e" ]; then
run_e2e_tests
else
run_go_tests
run_python_tests
fi
32 changes: 30 additions & 2 deletions cli/cluster/logs.go
@@ -35,12 +35,40 @@ import (
"github.com/gorilla/websocket"
)

func GetLogs(operatorConfig OperatorConfig, apiName string) (schema.LogResponse, error) {
httpRes, err := HTTPGet(operatorConfig, "/logs/"+apiName)
if err != nil {
return schema.LogResponse{}, err
}

var logResponse schema.LogResponse
if err = json.Unmarshal(httpRes, &logResponse); err != nil {
return schema.LogResponse{}, errors.Wrap(err, "/logs/"+apiName, string(httpRes))
}

return logResponse, nil
}

func GetJobLogs(operatorConfig OperatorConfig, apiName string, jobID string) (schema.LogResponse, error) {
httpRes, err := HTTPGet(operatorConfig, "/logs/"+apiName, map[string]string{"jobID": jobID})
if err != nil {
return schema.LogResponse{}, err
}

var logResponse schema.LogResponse
if err = json.Unmarshal(httpRes, &logResponse); err != nil {
return schema.LogResponse{}, errors.Wrap(err, "/logs/"+apiName, string(httpRes))
}

return logResponse, nil
}

func StreamLogs(operatorConfig OperatorConfig, apiName string) error {
return streamLogs(operatorConfig, "/logs/"+apiName)
return streamLogs(operatorConfig, "/streamlogs/"+apiName)
}

func StreamJobLogs(operatorConfig OperatorConfig, apiName string, jobID string) error {
return streamLogs(operatorConfig, "/logs/"+apiName, map[string]string{"jobID": jobID})
return streamLogs(operatorConfig, "/streamlogs/"+apiName, map[string]string{"jobID": jobID})
}

func streamLogs(operatorConfig OperatorConfig, path string, qParams ...map[string]string) error {
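Reviewer note on `cli/cluster/logs.go`: `GetLogs` and `GetJobLogs` differ only in the optional `jobID` query parameter, and the streaming variants now use the `/streamlogs/` prefix in place of `/logs/`. The sketch below is not the code in this PR; it is a minimal, standalone illustration of how the path building and JSON decoding could be shared. The `LogResponse` struct and its `log_url` field are stand-ins for `schema.LogResponse` (whose fields are not shown in this diff), and `logsPath` and `decodeLogResponse` are hypothetical helper names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// LogResponse is a stand-in for schema.LogResponse.
// The LogURL field and its JSON tag are assumptions made for illustration.
type LogResponse struct {
	LogURL string `json:"log_url"`
}

// logsPath (hypothetical helper) builds the request path for a log endpoint.
// prefix is "/logs" or "/streamlogs"; jobID is omitted when empty.
func logsPath(prefix, apiName, jobID string) string {
	path := prefix + "/" + url.PathEscape(apiName)
	if jobID == "" {
		return path
	}
	q := url.Values{}
	q.Set("jobID", jobID)
	return path + "?" + q.Encode()
}

// decodeLogResponse (hypothetical helper) unmarshals a response body and,
// on failure, wraps the error with the request path and the raw body.
func decodeLogResponse(body []byte, path string) (LogResponse, error) {
	var res LogResponse
	if err := json.Unmarshal(body, &res); err != nil {
		return LogResponse{}, fmt.Errorf("%s: %w (body: %s)", path, err, string(body))
	}
	return res, nil
}

func main() {
	fmt.Println(logsPath("/logs", "my-api", ""))
	fmt.Println(logsPath("/streamlogs", "my-api", "abc123"))

	res, err := decodeLogResponse([]byte(`{"log_url":"https://example.com/logs"}`), "/logs/my-api")
	fmt.Println(res.LogURL, err)
}
```

With a shared decoder, both getters would reduce to one HTTP call followed by one decode, and the error context would stay consistent between them. Whether that refactor is worth it here is a judgment call for the authors.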
51 changes: 0 additions & 51 deletions cli/cluster/patch.go

This file was deleted.
