Amazon SageMaker AI Async Inference now supports inline payloads

Today, we're announcing inline payload support for Amazon SageMaker AI Async Inference. Customers can now send the inference payload directly in the request body of the InvokeEndpointAsync API, which removes the need to upload input data to Amazon Simple Storage Service (Amazon S3) before each request.

For payloads of up to 128,000 bytes, this eliminates a network round trip, simplifies client-side code, and reduces end-to-end latency.

In this post, we explain the motivation behind this feature, go over the before and after customer experience, and show you how you can start using inline payloads today.

Background: How async inference used to work

You can use Amazon SageMaker AI Async Inference to submit inference requests and process them asynchronously. It's a good fit for workloads with large payloads, dynamic traffic, or latency tolerances of seconds to minutes. It supports automatic scaling to zero, making it cost-effective for burst or batch-style workloads.

Until now, the workflow required two steps for every request:

Upload the payload to an Amazon S3 bucket.
Invoke the endpoint, passing the S3 object URI as InputLocation.

The endpoint processes the request asynchronously and writes the output to a designated S3 location, where the client polls for it or is notified through Amazon Simple Notification Service (Amazon SNS).

This two-step pattern works well for large payloads (images, audio, multi-MB documents). But for customers with smaller input payloads (in the KB range) that need longer processing times than real-time inference allows, the mandatory dependency on S3 added unnecessary complexity.

What's new: Inline payloads with the Body parameter

With today's launch, InvokeEndpointAsync accepts a new Body parameter. When it is present, the payload is sent inline in the API request itself, and the S3 upload can be skipped.

Important details:

Feature	Details
New parameter	`Body`: raw bytes, with a maximum of 128,000 bytes.
Maximum inline size	128,000 bytes (raw payload).
Mutual exclusivity	`Body` and `InputLocation` are mutually exclusive. The API rejects requests that set both.
Output behavior	Unchanged. The output is written to S3 at `OutputLocation`.
Endpoint compatibility	Designed to work with existing async endpoints; no model or container changes are expected.
Error handling	Size and mutual-exclusivity violations return synchronous `ValidationError` responses.
Availability	Available in 31 AWS regions (BOM, PDX, YUL, IAD, CMH, SFO, LHR, ICN, SYD, HKG, YYC, GRU, QRO, DUB, CDG, FRA, ZRH, ARN, ZAZ, NRT, KIX, SIN, CGK, MEL, KUL, BKK, HYD, MXP, CPT,.
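
The size and mutual-exclusivity rules in the table can also be checked on the client before the request is sent. The following is a minimal sketch, not part of the SDK: `build_async_request` and `MAX_INLINE_BYTES` are hypothetical names, and rejecting a request that sets neither input is an assumption, since the table only states that setting both is rejected.

```python
MAX_INLINE_BYTES = 128_000  # inline limit stated in the table above


def build_async_request(endpoint_name, content_type, body=None, input_location=None):
    """Return keyword arguments for invoke_endpoint_async (hypothetical helper).

    Raises ValueError locally for the cases the service is described as
    rejecting, so the caller fails fast without a network call.
    """
    # Exactly one input source must be set. Rejecting "neither" is an assumption.
    if (body is None) == (input_location is None):
        raise ValueError("set exactly one of body or input_location")

    kwargs = {"EndpointName": endpoint_name, "ContentType": content_type}
    if body is not None:
        if len(body) > MAX_INLINE_BYTES:
            raise ValueError(
                f"inline payload is {len(body)} bytes; limit is {MAX_INLINE_BYTES}"
            )
        kwargs["Body"] = body
    else:
        kwargs["InputLocation"] = input_location
    return kwargs
```

The returned dictionary can be passed straight to the client, for example `sagemaker_runtime.invoke_endpoint_async(**build_async_request("my-async-endpoint", "application/json", body=payload))`.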

Before and after: Customer experience

The change is clearest in the code. The following two examples make the same async request to the same endpoint. The first uses the S3 upload step that was needed until now, and the second uses the inline Body parameter that replaces it.

Before: Upload to S3 first, then invoke

import boto3, json, uuid

s3 = boto3.client("s3")
sagemaker_runtime = boto3.client("sagemaker-runtime")

payload = json.dumps({"inputs": "your prompt here"}).encode("utf-8")

# 1. Upload the request payload to S3 (extra latency + cost)
input_key = f"async-input/{uuid.uuid4()}.json"
s3.put_object(Bucket="my-async-bucket", Key=input_key, Body=payload)
input_location = f"s3://my-async-bucket/{input_key}"

# 2. Invoke the endpoint
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    InputLocation=input_location,
    ContentType="application/json",
)

print(response["OutputLocation"])

This method requires:

A provisioned S3 client and input bucket.
AWS Identity and Access Management (IAM) s3:PutObject permission for the caller.
A naming scheme (UUID or similar) to avoid key collisions.
A cleanup strategy for old input objects.

After: Send the payload inline

import boto3, json

sagemaker_runtime = boto3.client("sagemaker-runtime")

payload = json.dumps({"inputs": "your prompt here"}).encode("utf-8")

# One call, no S3 upload, no input bucket needed
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    Body=payload,
    ContentType="application/json",
)

print(response["OutputLocation"])

No S3 client, no uuid, no input bucket, no IAM grants on the input path, no cleanup of old objects.

Customer benefits

Sending payloads inline removes a network hop and an S3 dependency from each request. That translates into five tangible benefits:

Reduced latency. One network round trip and one S3 PUT are eliminated per request. For high fan-out workloads, these latency savings add up.
Simpler architecture. Avoids input bucket provisioning, lifecycle policies, cross-account access patterns, and the caller's IAM s3:PutObject permission on the input path.
Fewer error paths. A request is a single API call. Either it succeeds or it doesn't.
Lower cost. Removes the S3 PUT charge for uploading input from every inline request.
Faster validation feedback. Size and mutual-exclusivity errors are returned synchronously.
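
Because validation failures come back on the call itself, a caller can tell them apart from other errors at the call site. The sketch below is illustrative only: `is_validation_error` is a hypothetical helper, and it assumes the error code surfaces as the string `ValidationError` (the name used in this post) in the `response["Error"]["Code"]` field that botocore's `ClientError` carries.

```python
def is_validation_error(exc: Exception) -> bool:
    """Return True if exc looks like a synchronous validation failure.

    botocore's ClientError exposes the parsed error as exc.response, a dict
    shaped like {"Error": {"Code": ..., "Message": ...}}. The exact code
    string checked here is an assumption taken from this post.
    """
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return False
    return response.get("Error", {}).get("Code") == "ValidationError"


# Typical use (requires boto3 and botocore; not executed here):
#
#   from botocore.exceptions import ClientError
#   try:
#       sagemaker_runtime.invoke_endpoint_async(
#           EndpointName="my-async-endpoint",
#           Body=payload,
#           ContentType="application/json",
#       )
#   except ClientError as exc:
#       if is_validation_error(exc):
#           ...  # payload too large, or Body and InputLocation both set
#       else:
#           raise
```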

When to use each method

Inline payloads are often the simplest option for small inputs, but InputLocation still has its place. Use the following table to determine which method is appropriate for a particular workload:

Scenario	Recommended method
Payload <= 128,000 bytes (JSON data, structured data)	Inline `Body`. It's simpler and avoids one network round trip and the S3 PUT cost.
Payload > 128,000 bytes (images, audio, large documents)	`InputLocation`. Upload to S3 first.
A mixed workload with varying payload sizes	Branch by size. Use `Body` for small payloads and `InputLocation` for large ones.
You need to retain the input data in S3 for auditing or replay	`InputLocation`. Keeps the input in your bucket.
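
For the mixed-workload row, the branch by size could look like the following sketch. `input_kwargs` and the `upload_to_s3` callback are hypothetical names introduced for illustration; the callback is assumed to upload the bytes and return the resulting `s3://` URI, for example by wrapping the `put_object` call from the earlier "before" example.

```python
from typing import Callable

MAX_INLINE_BYTES = 128_000  # inline limit stated in this post


def input_kwargs(payload: bytes, upload_to_s3: Callable[[bytes], str]) -> dict:
    """Pick the input argument for invoke_endpoint_async based on payload size.

    Small payloads go inline as Body. Larger ones are uploaded through the
    caller-supplied upload_to_s3 callback and referenced by InputLocation.
    """
    if len(payload) <= MAX_INLINE_BYTES:
        return {"Body": payload}
    return {"InputLocation": upload_to_s3(payload)}
```

A caller would merge the result into the request, for example `invoke_endpoint_async(EndpointName="my-async-endpoint", ContentType="application/json", **input_kwargs(payload, upload))`. Only the large-payload path touches S3, so the upload callback is never invoked for small requests.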

Getting started

See the example code notebook for a full walkthrough.

Before you begin, make sure you have:

An existing Amazon SageMaker AI Async Inference endpoint (verify with aws sagemaker describe-endpoint --endpoint-name my-async-endpoint).
The latest AWS SDK for Python (Boto3), installed and configured with credentials.
IAM permissions for sagemaker:InvokeEndpointAsync.
An output S3 bucket configured for your async endpoint (for example, my-output-bucket).

Note: Following this guide uses billable AWS services. SageMaker AI async endpoints incur instance-hour charges, and S3 buckets incur storage and request charges. Follow the cleanup steps after completing the walkthrough to avoid ongoing charges.

Steps

Inline payload support is available today. To use it:

Update your AWS SDK. Install or upgrade Boto3 to the latest version: pip install --upgrade boto3.
Confirm the installation: pip show boto3.
Update your application code. In your application, replace the S3 upload + InputLocation pattern with the inline Body parameter, as shown in the preceding code example.
Test your application by calling the InvokeEndpointAsync API with the Body parameter.
Confirm that the response contains the OutputLocation field.
Poll or monitor the S3 OutputLocation to confirm that your inference result is written successfully.

No changes are required to your endpoint configuration, model container, or S3 output setup.
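
The last step, polling the S3 OutputLocation, can be sketched as follows. This is one possible approach under stated assumptions, not the only one: `split_s3_uri` and `wait_for_output` are hypothetical helpers, the timeout and interval defaults are arbitrary, and the S3 client is passed in by the caller. It relies on the Boto3 S3 client raising `NoSuchKey` from `get_object` while the object does not yet exist.

```python
import time
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def wait_for_output(s3_client, output_location, timeout_s=300.0,
                    interval_s=5.0, sleep=time.sleep):
    """Poll S3 until the async result exists, then return its bytes.

    s3_client is expected to behave like boto3.client("s3"). Raises
    TimeoutError if the object has not appeared within timeout_s seconds.
    """
    bucket, key = split_s3_uri(output_location)
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            obj = s3_client.get_object(Bucket=bucket, Key=key)
            return obj["Body"].read()
        except s3_client.exceptions.NoSuchKey:
            # Result not written yet: wait and retry until the deadline.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no output at {output_location}")
            sleep(interval_s)
```

With Boto3 this would be called as `wait_for_output(boto3.client("s3"), response["OutputLocation"])`. For production use, the Amazon SNS notification mentioned earlier avoids polling altogether.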

Clean up

To avoid ongoing charges, remove the resources used for this walkthrough:

Delete the SageMaker AI endpoint if it was created for testing:

aws sagemaker delete-endpoint --endpoint-name my-async-endpoint

Delete the output S3 bucket (if no longer needed). Warning: Deleting an S3 bucket permanently deletes its contents. Make sure to back up any inference results you need to keep.
```
aws s3 rb s3://my-output-bucket --force
```
Delete any IAM policies created for this walkthrough.

Conclusion

SageMaker AI Async Inference's inline payload support removes a common point of friction in inference workflows: the mandatory S3 upload for every request. For payloads of up to 128,000 bytes, you can now make one API call and let SageMaker AI handle the rest.

The feature is designed to be backward compatible. The existing InputLocation workflow continues unchanged. Inline and S3 inputs are processed the same way once a request is received, and models receive the same requests regardless of the input source.

Get started today by updating your AWS SDK and using the Body parameter in the SageMaker AI InvokeEndpointAsync API. To learn more about asynchronous inference, see the Amazon SageMaker AI Async Inference documentation.