SageMaker Serverless Inference using BYOC
Introduction
As we already know, SageMaker covers the full machine learning lifecycle: creating, training, deploying, and optimizing ML models. You can use built-in algorithms and models, browse AWS Marketplace to find specific model packages, or simply create your own - train it using SageMaker and deploy it. Everything is streamlined and organized from start to finish.
However, in some circumstances we want a completely custom solution. The idea is to bring our own packages and models, i.e. BYOC (Bring Your Own Container). To achieve this we could:
- Extend a prebuilt SageMaker container image - SageMaker provides containers for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, and PyTorch
- Adapt an existing container image - Modify an existing Docker image to enable training and inference using SageMaker
In this article we will focus on deploying our own inference code by adapting a Docker image that contains our production-ready model. Additionally, we will deploy it as a serverless inference endpoint, which means that we don’t have to configure or manage the underlying infrastructure and we only pay for the compute capacity used to process inference requests.
To do this we will:
- Create a Docker image and configure it for SageMaker inference
- Push the image to ECR
- Create a SageMaker model based on the Docker image
- Configure a SageMaker endpoint
- Deploy the SageMaker endpoint
There are two ways to do this through code: boto3 and CDK - we will cover both.
Note: We will go through setting up the server and invocation endpoints for a single model. If you are interested in a premade solution for hosting multi-model servers, please see: Amazon SageMaker Multi-Model Endpoints using your own algorithm container
Docker Image
Behind the scenes SageMaker makes extensive use of Docker containers. All the built-in algorithms and the supported deep learning frameworks used for training and inference are essentially stored in containers. The benefit of this approach is that it allows us to scale quickly and reliably. Consequently, there are certain rules that we have to respect when we implement our own containers:
- For model inference, SageMaker runs the container as docker run <image> serve. This overrides the default CMD statement in the container.
- Containers need to implement a web server that responds to /invocations and /ping on port 8080.
- To get a result from the model, the client sends a POST request to the SageMaker endpoint; SageMaker forwards it to the container's /invocations endpoint and returns the model's response to the client.
- Model containers must respond to requests within 60 seconds.
- SageMaker sends periodic GET requests to the /ping endpoint. The response can be just an HTTP 200 status with an empty body.
See the details at Use Your Own Inference Code with Hosting Services.
To implement our container and satisfy these requirements, we will use Nginx and gunicorn. The idea is to create a simple Flask application, set up a WSGI server using gunicorn, and then use Nginx as a reverse proxy.
The structure looks like this:
root/
├─ model
│   ├─ nginx.conf       Contains the configuration for reverse-proxy
│   ├─ predictor.py     Contains the Flask application
│   ├─ serve            Starts the Nginx and WSGI
│   └─ wsgi.py          Defines the WSGI application
└─ Dockerfile           Defines the Docker image configuration
This can also be found in the Amazon SageMaker Examples GitHub repository provided by AWS.
To define a reverse proxy to gunicorn, use the following configuration in nginx.conf.
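A minimal sketch, adapted from the AWS example repository; the Unix socket path, body-size limit, and the choice of routes are assumptions you should adjust for your model:
worker_processes 1;
daemon off; # Prevent forking

pid /tmp/nginx.pid;
error_log /var/log/nginx/error.log;

events {
  # defaults
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /var/log/nginx/access.log combined;

  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }

  server {
    listen 8080 deferred;
    client_max_body_size 5m;

    keepalive_timeout 5;

    # Only the two routes SageMaker uses are forwarded to gunicorn
    location ~ ^/(ping|invocations) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_pass http://gunicorn;
    }

    location / {
      return 404 "{}";
    }
  }
}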
The serve script starts gunicorn together with the Nginx reverse proxy.
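A sketch of serve, modeled on the AWS example: it launches Nginx and gunicorn as child processes, binds gunicorn to the Unix socket that Nginx proxies to, and exits if either child dies so that SageMaker can restart the container. The worker count and timeout values are assumptions:
#!/usr/bin/env python
import os
import signal
import subprocess
import sys

def start_server():
    # Start Nginx (reverse proxy) and gunicorn (WSGI server)
    nginx = subprocess.Popen(['nginx', '-c', '/opt/program/nginx.conf'])
    gunicorn = subprocess.Popen(['gunicorn',
                                 '--timeout', '60',
                                 '--workers', '1',
                                 '--bind', 'unix:/tmp/gunicorn.sock',
                                 'wsgi:app'])

    def sigterm_handler(*_):
        nginx.terminate()
        gunicorn.terminate()
        sys.exit(0)

    signal.signal(signal.SIGTERM, sigterm_handler)

    # If either process exits, shut the other down so SageMaker restarts us
    pids = {nginx.pid, gunicorn.pid}
    while True:
        pid, _ = os.wait()
        if pid in pids:
            break
    sigterm_handler()

if __name__ == '__main__':
    start_server()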
The predictor.py module contains the endpoint logic. The GET handler for /ping should check whether the model is loaded and configured properly:
import json

import flask

app = flask.Flask(__name__)

@app.route('/ping', methods=['GET'])
def ping():
    # is_model_ready() is a placeholder for your own model health check
    health = is_model_ready()
    status = 200 if health else 404
    return flask.Response(response='\n', status=status, mimetype='application/json')
Next we define the POST handler for /invocations; this part of the code should implement your custom model predictions:
@app.route('/invocations', methods=['POST'])
def transformation():
    # Parse the JSON request body
    input_json = flask.request.get_json()
    data = input_json['input']
    # custom_model is a placeholder for your loaded model object
    result = custom_model.predict(data)
    # Serialize the prediction and return it as JSON
    resultjson = json.dumps(result)
    return flask.Response(response=resultjson, status=200, mimetype='application/json')
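The wsgi.py module only has to expose the Flask application to gunicorn. A minimal version, matching the wsgi:app target used in the serve sketch above:
import predictor

# gunicorn is pointed at wsgi:app, so expose the Flask app under that name
app = predictor.app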
To build the Docker image, we define the Dockerfile:
FROM python:3.8

# The python:3.8 base image already ships Python and pip, so we only
# need Nginx plus the Python packages used for serving
RUN apt-get -y update && apt-get install -y --no-install-recommends \
         nginx \
         ca-certificates \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir flask gevent gunicorn
# Install all dependencies for your custom model
# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard
# output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE
# keeps Python from writing the .pyc files, which are unnecessary in this case. We also update
# PATH so that the serve program is found when the container is invoked.
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"
COPY model /opt/program
WORKDIR /opt/program
Note that the Dockerfile should contain the commands that install all the dependencies needed for the custom model.
Finally, we have to build and push the image to the ECR. To do this, we can use a simple bash script:
model_name=<model-name>
account=$(aws sts get-caller-identity --query Account --output text)
region=<region>
fullname="${account}.dkr.ecr.${region}.amazonaws.com/${model_name}:latest"
chmod +x model/serve
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com
docker build -t ${model_name} .
docker tag ${model_name} ${fullname}
docker push ${fullname}
Note: This script is a simplified version of build_and_push.sh that is provided in the official AWS GitHub repository.
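Before pushing, it can also be useful to run the container locally the same way SageMaker will. Continuing with the model_name variable from the script above, and assuming the hypothetical {"input": ...} request format from predictor.py:
docker run --rm -p 8080:8080 ${model_name} serve

# In another shell: health check, then a test invocation
curl http://localhost:8080/ping
curl -X POST -H "Content-Type: application/json" \
     -d '{"input": [1, 2, 3]}' \
     http://localhost:8080/invocations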
SageMaker
Boto3
Once we have the image in ECR, we can create a SageMaker model. We will do this using the SageMaker Boto3 Client. One of the parameters of the create_model() method is ExecutionRoleArn, which means that we have to create an IAM role beforehand or use get_execution_role(); please see SageMaker Roles.
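For example, inside a SageMaker notebook the role attached to the notebook instance can be reused directly. A small sketch, assuming the sagemaker Python SDK is available:
import sagemaker

# Returns the ARN of the IAM role attached to the current notebook instance
role_arn = sagemaker.get_execution_role()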
import boto3

sm_client = boto3.client(service_name='sagemaker')

def create_model():
    role_arn = "<role-arn>"
    image = "{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(
        "<account-id>", "<region>", "<image-name>"
    )
    create_model_response = sm_client.create_model(
        ModelName="<model-name>",
        ExecutionRoleArn=role_arn,
        Containers=[{"Image": image}],
    )
    print(create_model_response)
If everything went well, you should see the model in the SageMaker/Models console.
The next step is to define an endpoint configuration. This step is crucial since we are defining which model we want to host and the resources chosen to deploy for hosting it. In other words, we are configuring a ProductionVariant, which can take many arguments defining instance types, how to distribute traffic among multiple models, etc. However, we are only interested in ServerlessConfig.
def create_endpoint_configuration():
    create_endpoint_config_response = sm_client.create_endpoint_config(
        EndpointConfigName="<endpoint-config-name>",
        ProductionVariants=[
            {
                "ModelName": "<model-name>",
                "VariantName": "<variant-name>",
                "ServerlessConfig": {
                    "MemorySizeInMB": 2048,
                    "MaxConcurrency": 1,
                },
            }
        ],
    )
    print(create_endpoint_config_response)
The SageMaker console has an Endpoint Configurations section where we can confirm the configuration.
After configuring the endpoint, we can deploy it. This can take a few minutes. The SageMaker client offers the get_waiter() method, which returns an object that can wait for some condition, in this case for an endpoint to be in service.
def create_endpoint():
    create_endpoint_response = sm_client.create_endpoint(
        EndpointName="<endpoint-name>",
        EndpointConfigName="<endpoint-config-name>",
    )
    print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])
    resp = sm_client.describe_endpoint(EndpointName="<endpoint-name>")
    print("Endpoint Status: " + resp["EndpointStatus"])
    print("Waiting for {} endpoint to be in service".format("<endpoint-name>"))
    waiter = sm_client.get_waiter("endpoint_in_service")
    waiter.wait(EndpointName="<endpoint-name>")
Finally, we can use the SageMaker Runtime client to invoke the endpoint and run inference.
import json

runtime_sm_client = boto3.client(service_name="sagemaker-runtime")

def invoke_endpoint():
    content_type = "application/json"
    # The request body should match what the /invocations handler expects
    request_body = {}
    payload = json.dumps(request_body)
    response = runtime_sm_client.invoke_endpoint(
        EndpointName="<endpoint-name>",
        ContentType=content_type,
        Body=payload,
    )
    result = json.loads(response["Body"].read().decode())
    print(result)
CDK
The same process of creating a model, an endpoint configuration, and deployment of an endpoint can be achieved through a CDK application. This is usually a better option since we can manage the infrastructure from the code and deploy services as stacks.
In order to use SageMaker constructs we’ll need to install the @aws-cdk/aws-sagemaker module. There are L1 Cfn constructs for each service that we need to configure: CfnModel, CfnEndpointConfig, and CfnEndpoint.
We will approach this by setting up a CDK application and a separate stack construct for SageMaker services.
Note: If you are not sure how to start with a CDK application, please see Your first AWS CDK app
The stack will have three methods:
- create_model()- Creates an IAM role and a SageMaker model based on the Docker image name
- create_endpoint_configuration()- Creates an endpoint configuration for a specific model
- create_endpoint()- Deploys the endpoint based on the provided endpoint configuration
The code below implements a simple example of this stack:
from aws_cdk import core
from aws_cdk.aws_iam import Role, ManagedPolicy, ServicePrincipal
from aws_cdk.aws_sagemaker import CfnModel, CfnEndpointConfig, CfnEndpoint
class SageMakerStack(core.Stack):
    def __init__(
        self,
        scope: core.Construct,
        id_: str,
        env: core.Environment,
    ) -> None:
        super().__init__(scope=scope, id=id_, env=env)
        self.env = env
    def create_model(
        self,
        id_: str,
        model_name: str,
        image_name: str,
    ) -> CfnModel:
        role = Role(
            self,
            id=f"{id_}-SageMakerRole",
            role_name=f"{id_}-SageMakerRole",
            assumed_by=ServicePrincipal("sagemaker.amazonaws.com"),
            managed_policies=[
                ManagedPolicy.from_aws_managed_policy_name("AmazonSageMakerFullAccess")
            ],
        )
        container = CfnModel.ContainerDefinitionProperty(
            container_hostname="<container-hostname>",
            image="{}.dkr.ecr.eu-west-1.amazonaws.com/{}:latest".format(
                self.env.account, image_name
            ),
        )
        return CfnModel(
            self,
            id=f"{id_}-SageMakerModel",
            model_name=model_name,
            execution_role_arn=role.role_arn,
            containers=[container],
        )
    def create_endpoint_configuration(
        self,
        id_: str,
        model_name: str,
        endpoint_configuration_name: str,
    ) -> CfnEndpointConfig:
        return CfnEndpointConfig(
            self,
            id=f"{id_}-SageMakerEndpointConfiguration",
            endpoint_config_name=endpoint_configuration_name,
            production_variants=[
                CfnEndpointConfig.ProductionVariantProperty(
                    model_name=model_name,
                    initial_variant_weight=1.0,
                    variant_name="AllTraffic",
                    serverless_config=CfnEndpointConfig.ServerlessConfigProperty(
                        max_concurrency=1,
                        memory_size_in_mb=2048,
                    ),
                )
            ],
        )
    def create_endpoint(
        self,
        id_: str,
        endpoint_configuration_name: str,
        endpoint_name: str,
    ) -> CfnEndpoint:
        return CfnEndpoint(
            self,
            id=f"{id_}-SageMakerEndpoint",
            endpoint_config_name=endpoint_configuration_name,
            endpoint_name=endpoint_name,
        )
Now we can use this stack class to deploy multiple models in one or more stacks.
from aws_cdk import core
from stacks.sagemaker import SageMakerStack
class SimpleExampleApp(core.App):
    def __init__(self) -> None:
        super().__init__()
        env = core.Environment(
            account="<account>",
            region="<region>",
        )
        sagemaker = SageMakerStack(
            scope=self,
            id_="app-sagemaker-stack",
            env=env,
        )
        model = sagemaker.create_model(
            id_="AppModel",
            model_name="<model-name>",
            image_name="<image-name>",
        )
        endpoint_config = sagemaker.create_endpoint_configuration(
            id_="AppEndpointConfiguration",
            model_name="<model-name>",
            endpoint_configuration_name="app-endpoint-configuration",
        )
        endpoint_config.add_depends_on(model)
        endpoint = sagemaker.create_endpoint(
            id_="AppEndpoint",
            endpoint_configuration_name="app-endpoint-configuration",
            endpoint_name="app-endpoint",
        )
        endpoint.add_depends_on(endpoint_config)
simple_app = SimpleExampleApp()
simple_app.synth()
Sometimes CDK cannot infer the right order in which to provision our resources. For example, the creation of the endpoint configuration may start before the model is defined, which doesn’t make sense in this example. That’s why we add A.add_depends_on(B) to each CfnResource: it informs the CDK that the creation of resource A should follow the creation of resource B.
Now we can generate CloudFormation templates and deploy custom models for serverless inference as stacks that can be easily managed.
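With the app in place, synthesizing the CloudFormation template and deploying is the usual CDK workflow; the stack name below matches the id_ used in the example above:
cdk synth app-sagemaker-stack
cdk deploy app-sagemaker-stack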
Note: If you also want to manage Docker images through AWS CDK, please take a look at AWS CDK Docker Image Assets. However, this approach will publish image assets to the CDK-controlled ECR repository. To publish Docker images to an ECR repository in your control, please see cdk-ecr-deployment.
Final Words
I hope that this article gave you a better understanding of how to implement a custom model using SageMaker and deploy it for serverless inference. The key concepts here are the configuration of a custom Docker image and the connection between a model, an endpoint configuration, and an endpoint.
The code examples are deliberately simplified and serve only to introduce the key concepts and ideas. For more information and examples please check out the official AWS repository for Advanced SageMaker Functionality Examples.
If you have any questions or suggestions, please reach out, I’m always available.
Resources
- AWS Docs - Using Docker containers with SageMaker
- AWS Docs - Use Your Own Inference Code with Hosting Services
- AWS Docs - SageMaker Roles
- Amazon SageMaker Multi-Model Endpoints using your own algorithm container
- Bring Your Own Container With Amazon SageMaker
- Advanced Amazon SageMaker Functionality Examples