This repository provides a step-by-step guide for building serverless computing infrastructure to deploy inference at scale. By following this tutorial, you'll be able to deploy a DeepSparse pipeline on AWS Lambda for real-time inference or on AWS Fargate for batch inference processing. These concepts are demonstrated through a practical sentiment analysis use case, illustrating the capabilities of serverless computing for deep learning inference with DeepSparse.
The scope of this application encompasses:

- The construction of local Docker images.
- The creation of an ECR repo in AWS.
- Pushing the local images to ECR.
- Deploying either:
  - a realtime inference infrastructure, which creates a Lambda function via API Gateway in a CloudFormation stack, or
  - a batch inference infrastructure, which creates a serverless environment on AWS Fargate via an AWS Batch job triggered by Lambda in a CloudFormation stack.
The following credentials, tools, and libraries are also required:
- The AWS CLI version 2.x, configured. Double-check that the region configured in your AWS CLI matches the region passed in the `SparseLambda` class found in the `endpoint.py` file. Currently, the default region is `us-east-1`; see the snippet after this list to verify yours.
- Permissions for select AWS resources: ECR, API Gateway, CloudFormation, Lambda, ECS, S3, and Batch.
- IAM permissions for batch and realtime inference are auto-generated at startup. For IAM permission roles and their policies, please refer to the `template.yaml` files in the `batch` and `realtime` directories.
- The AWS Serverless Application Model (AWS SAM), an open-source CLI framework used for building serverless applications on AWS.
- Docker and the `docker` CLI.
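
If you want a quick way to confirm which region your AWS CLI and SDK are picking up, the following boto3 one-liner (a convenience sketch, not part of this repo) reads the same configuration:

```python
import boto3

# Prints the region from your AWS CLI/boto3 configuration; it should
# match the region used by the SparseLambda class in endpoint.py
# (us-east-1 by default).
print(boto3.session.Session().region_name)
```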
To use a different sparse model for batch inference, please edit the model zoo stub in this Dockerfile: `batch/app_inf/Dockerfile`. For realtime inference, edit here: `realtime/app/Dockerfile`.
To change the pipeline configuration (e.g., the task or engine), edit the pipeline object in either `app.py` file. Both files can be found in the `realtime/app/` and `batch/app_inf/` directories.
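
For orientation, the pipeline objects in those `app.py` files follow the standard DeepSparse `Pipeline.create` pattern. The sketch below is illustrative only; the model zoo stub is a placeholder, and the real task and stub are set in the files and Dockerfiles noted above:

```python
from deepsparse import Pipeline

# Illustrative sketch -- the actual task and model stub are set in
# realtime/app/app.py, batch/app_inf/app.py, and the Dockerfiles.
pipeline = Pipeline.create(
    task="sentiment-analysis",           # change the task here
    model_path="zoo:<your-model-stub>",  # placeholder SparseZoo stub
)

print(pipeline(sequences="i like pizza"))
```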
To get started, clone the DeepSparse repository and install the requirements:

git clone https://github.com/neuralmagic/deepsparse.git
cd deepsparse/examples/aws-serverless
pip install -r requirements.txt
After installation, you can choose to build either a batch or a realtime serverless infrastructure. Both options are detailed below.
Run the following command to build a batch inference infrastructure:
python endpoint.py create-batch
After the build completes, upload a CSV file to the auto-generated `batch-input-deepsparse` S3 bucket via the AWS console, or run the following CLI command, to start the batch job:
aws s3 cp <path/to/csv/file> s3://batch-input-deepsparse/
This upload will trigger a Lambda function to start a batch job with a Fargate container running DeepSparse. The CSV file will be read and its inputs passed to DeepSparse for prediction. Afterwards, the output will be automatically written to a CSV file called `outputs.csv` and pushed to the `batch-output-deepsparse` S3 bucket.
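
To make the trigger wiring concrete, the S3-invoked Lambda essentially submits an AWS Batch job that runs the Fargate container. Below is a minimal boto3 sketch of that step; the job name, queue, and definition here are placeholders, and the real resources are defined in `batch/template.yaml`:

```python
import boto3

batch = boto3.client("batch")

def handler(event, context):
    # Fired by the S3 upload; kicks off the Batch job that runs the
    # DeepSparse container on Fargate. All names are placeholders --
    # see batch/template.yaml for the actual resources.
    batch.submit_job(
        jobName="deepsparse-batch-inference",
        jobQueue="<your-job-queue>",
        jobDefinition="<your-job-definition>",
    )
```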
An example `sentiment-inputs.csv` file is available in the `sample` directory to familiarize yourself with the file structure the batch architecture expects for sentiment analysis.
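
If you'd rather generate an input file programmatically, here is a minimal sketch; the single-text-column layout is an assumption, so verify the exact structure against `sample/sentiment-inputs.csv` before uploading:

```python
import csv

# Assumes one text sequence per row; check the layout against
# sample/sentiment-inputs.csv before uploading.
sequences = ["i like pizza", "this movie was terrible"]
with open("my-inputs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for seq in sequences:
        writer.writerow([seq])
```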
To edit the hardware configuration of the Fargate container, you can edit the default values in the `template.yaml` file in the `batch` directory. Fargate is currently configured to deploy with 4 vCPUs and 8 GB of RAM.
Run the following command to build a realtime inference infrastructure:
python endpoint.py create-realtime
After the endpoint has been staged (~3 minutes), AWS SAM will print your API Gateway endpoint URL in the terminal. You can start making requests by passing this URL into the `LambdaClient` object and then run inference by passing in your text input:
from client import LambdaClient
LC = LambdaClient("https://#########.execute-api.us-east-1.amazonaws.com/inference")
answer = LC.client({"sequences": "i like pizza"})
print(answer)
answer: {'labels': ['positive'], 'scores': [0.9990884065628052]}
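
The `LambdaClient` is a thin convenience wrapper; assuming the endpoint accepts the same JSON payload over a plain POST (check `client.py` for the exact behavior), you could also call it directly with `requests`:

```python
import requests

# Assumes a plain JSON POST with the same payload the LambdaClient
# sends; adjust if client.py does something different.
url = "https://#########.execute-api.us-east-1.amazonaws.com/inference"
resp = requests.post(url, json={"sequences": "i like pizza"})
print(resp.json())
```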
On your first cold start, it will take ~60 seconds to invoke your first inference; afterwards, responses should return in milliseconds.
If you want to delete your batch or realtime infrastructure, run:
python endpoint.py destroy
If you want to delete the batch or realtime infrastructure AND the S3 buckets plus their files, run:
python endpoint.py destroy-all