Amazon SageMaker AI Async Inference now supports inline payloads

Today, we're announcing inline payload support for Amazon SageMaker AI Async Inference. Customers can now send the inference payload directly in the request body of the InvokeEndpointAsync API, removing the need to upload input data to Amazon Simple Storage Service (Amazon S3) before each request.
For payloads of up to 128,000 bytes, this eliminates a network round trip, simplifies client-side code, and reduces per-request latency.
In this post, we explain the motivation behind this feature, go over the before and after customer experience, and show you how you can start using inline payloads today.
Background: How async inference used to work
You can use Amazon SageMaker AI Async Inference to submit inference requests and process them asynchronously. It's a good fit for workloads with large payloads, dynamic traffic, or latency tolerances of seconds to minutes. It supports automatic scaling to zero, making it cost-effective for burst or batch-style workloads.
Until now, the workflow required two steps for every request:
- Upload the input payload to an Amazon S3 bucket.
- Invoke the endpoint, passing the S3 object URI as InputLocation.
The endpoint processes the request asynchronously and writes the output to a designated S3 location, where the client polls it or receives it via an Amazon Simple Notification Service (Amazon SNS) notification.
This two-step pattern works well for large payloads (images, audio, multi-MB documents). But for customers with smaller input payloads (in the KB range) that need longer processing times than real-time inference allows, the mandatory dependency on S3 added unnecessary complexity.
What's new: Inline payloads with the Body parameter
With today's launch, InvokeEndpointAsync accepts a new Body parameter. When it is present, the payload is sent inline in the API request itself, and the S3 upload is no longer required.
Important details:
| Feature | Details |
| New parameter | Body: raw bytes, with a maximum of 128,000 bytes. |
| Maximum inline size | 128,000 bytes (raw payload). |
| Mutual exclusivity | Body and InputLocation are mutually exclusive. The API rejects requests that set both. |
| Output behavior | Unchanged. The output is written to the S3 OutputLocation. |
| Endpoint compatibility | Designed to work with existing async endpoints; no model or container changes are expected. |
| Error handling | Size and mutual-exclusivity violations return synchronous ValidationError responses. |
| Availability | Available in 31 AWS regions (BOM, PDX, YUL, IAD, CMH, SFO, LHR, ICN, SYD, HKG, YYC, GRU, QRO, DUB, CDG, FRA, ZRH, ARN, ZAZ, NRT, KIX, SIN, CGK, MEL, KUL, BKK, HYD, MXP, CPT,. |
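The size and mutual-exclusivity rules in the table can also be checked on the client before a request is sent. The following is a minimal sketch, not part of the SDK: check_async_input and MAX_INLINE_BYTES are hypothetical names, and the rule that one of the two inputs must be present is an assumption. The service remains the authority on what it accepts.

```python
from typing import Optional

# Inline limit quoted in the table above (raw payload bytes).
MAX_INLINE_BYTES = 128_000


def check_async_input(body: Optional[bytes] = None,
                      input_location: Optional[str] = None) -> None:
    """Raise ValueError if the arguments break the constraints listed above.

    Hypothetical client-side helper; the API performs its own validation.
    """
    if body is not None and input_location is not None:
        raise ValueError("Body and InputLocation are mutually exclusive")
    if body is None and input_location is None:
        # Assumption: a request needs exactly one input source.
        raise ValueError("Provide either Body or InputLocation")
    if body is not None and len(body) > MAX_INLINE_BYTES:
        raise ValueError(
            f"Body is {len(body)} bytes; the inline limit is {MAX_INLINE_BYTES}"
        )
```

Calling this helper before InvokeEndpointAsync turns a rejected request into a local error, at the cost of duplicating a limit that the service could change.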
Before and after: Customer experience
The change is easiest to see in code. The following two examples make the same async request to the same endpoint. The first uses the S3 upload step that was required until now, and the second uses the inline Body parameter that replaces it.
Before: Upload to S3 first, then request
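A minimal sketch of the two-step pattern, assuming Boto3 clients are created by the caller and passed in. The helper name invoke_via_s3, the async-input/ key prefix, and the my-input-bucket name in the usage comment are illustrative choices, not values from the service documentation.

```python
import json
import uuid


def invoke_via_s3(s3_client, runtime_client, endpoint_name, input_bucket, payload):
    """Two-step pattern: upload the input to S3, then invoke with InputLocation."""
    # A unique key avoids collisions between concurrent requests.
    key = f"async-input/{uuid.uuid4()}.json"
    s3_client.put_object(
        Bucket=input_bucket,
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{input_bucket}/{key}",
        ContentType="application/json",
    )
    # S3 URI where the endpoint will write the inference result.
    return response["OutputLocation"]


# Example wiring (needs AWS credentials and a deployed async endpoint):
#   import boto3
#   s3 = boto3.client("s3")
#   runtime = boto3.client("sagemaker-runtime")
#   invoke_via_s3(s3, runtime, "my-async-endpoint", "my-input-bucket", {"inputs": "hello"})
```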
This method requires:
- An S3 client and a provisioned input bucket.
- AWS Identity and Access Management (IAM) s3:PutObject permission for the caller.
- A naming scheme (UUID or similar) to avoid key collisions.
- A cleanup strategy for stale input objects.
After: Send the payload inline
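A minimal sketch of the single-call pattern. It assumes the Boto3 keyword is spelled Body, matching the API parameter named in this post, and that the installed SDK version already includes it. The helper name invoke_inline is illustrative.

```python
import json


def invoke_inline(runtime_client, endpoint_name, payload):
    """Single-call pattern: send the payload inline through the Body parameter."""
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        # Assumption: the keyword is Body, as the API parameter is named in
        # this post; older SDK versions may reject it as unknown.
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )
    # The output is still written to S3, exactly as with InputLocation.
    return response["OutputLocation"]


# Example wiring (needs AWS credentials and a deployed async endpoint):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   invoke_inline(runtime, "my-async-endpoint", {"inputs": "hello"})
```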
No S3 client, no uuid, no input bucket, no IAM grants on the input path, no stale object cleanup.
Customer benefits
Sending payloads inline removes a network hop and a dependency from each request. That translates into five tangible benefits:
- Reduced latency. One network round trip and one S3 PUT are eliminated per request. At high request volumes, these latency savings add up.
- Simpler architecture. Avoids input bucket provisioning, lifecycle policies, cross-account access patterns, and the caller's IAM s3:PutObject permission on the input path.
- Fewer error paths. A request is a single API call. Either it succeeds or it doesn't.
- Lower cost. Removes the S3 PUT charge for input uploads from all inline requests.
- Faster validation feedback. Size and mutual-exclusivity errors are returned synchronously.
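Because validation failures come back synchronously, a caller can handle them at the call site instead of discovering them later in S3. The sketch below assumes the Boto3 client exposes the error as runtime_client.exceptions.ValidationError and accepts a Body keyword. The helper name submit_or_explain and its return shape are illustrative.

```python
def submit_or_explain(runtime_client, endpoint_name, body):
    """Return (output_location, None) on success or (None, message) if rejected."""
    try:
        response = runtime_client.invoke_endpoint_async(
            EndpointName=endpoint_name,
            Body=body,  # assumed keyword, as named in this post
            ContentType="application/json",
        )
    except runtime_client.exceptions.ValidationError as err:
        # The service rejected the request before queuing it, for example
        # because the inline payload exceeded the size limit.
        message = err.response.get("Error", {}).get("Message", str(err))
        return None, message
    return response["OutputLocation"], None
```

Other failures, such as throttling or missing permissions, are deliberately left to propagate.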
When to use each method
Inline payloads are often the simplest option for small inputs, but InputLocation still has its place. Use the following table to determine which method is appropriate for a particular workload:
| Situation | Recommended method |
| Payload <= 128,000 bytes (JSON data, structured data) | Inline Body. It's simpler and avoids one network round trip and the S3 PUT cost. |
| Payload > 128,000 bytes (images, audio, large documents) | InputLocation. Upload to S3 first. |
| A mixed workload with varying payload sizes | Branch by size. Use Body for small payloads and InputLocation for large ones. |
| You need to retain the input data in S3 for auditing or replay | InputLocation. Keeps the input in your bucket. |
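The branching in the table can be sketched as a small helper that builds the request arguments. This is an illustration under assumptions: build_invoke_kwargs, the keep_input flag, and the upload_to_s3 callable (which stores the bytes and returns their s3:// URI) are hypothetical names, and Body is assumed to be the Boto3 keyword.

```python
MAX_INLINE_BYTES = 128_000  # inline limit stated in this post


def build_invoke_kwargs(endpoint_name, payload, upload_to_s3,
                        keep_input=False, content_type="application/json"):
    """Choose Body or InputLocation following the table above.

    upload_to_s3 is a hypothetical callable that stores the bytes and returns
    their s3:// URI. keep_input forces the S3 path for audit or replay.
    """
    kwargs = {"EndpointName": endpoint_name, "ContentType": content_type}
    if len(payload) <= MAX_INLINE_BYTES and not keep_input:
        kwargs["Body"] = payload  # small payload: send inline
    else:
        kwargs["InputLocation"] = upload_to_s3(payload)  # large or retained input
    return kwargs
```

The result can be passed to the client as runtime_client.invoke_endpoint_async(**kwargs), so the size decision stays in one place.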
Getting started
See the example code notebook for a full walkthrough.
Before you begin, make sure you have:
- An existing Amazon SageMaker AI Async Inference endpoint (verify with aws sagemaker describe-endpoint --endpoint-name my-async-endpoint).
- The latest AWS SDK for Python (Boto3) installed and configured with credentials.
- IAM permissions for sagemaker:InvokeEndpointAsync.
- An S3 output bucket configured for your async endpoint (for example, my-output-bucket).
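The endpoint check in the first prerequisite can also be done from Python. The following sketch uses the DescribeEndpoint API through a Boto3 SageMaker client passed in by the caller. The helper name async_output_path is illustrative.

```python
def async_output_path(sagemaker_client, endpoint_name):
    """Return the configured S3 output path of an in-service async endpoint."""
    description = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    status = description["EndpointStatus"]
    if status != "InService":
        raise RuntimeError(f"Endpoint {endpoint_name} is {status}, not InService")
    async_config = description.get("AsyncInferenceConfig")
    if async_config is None:
        # Real-time endpoints have no async inference configuration.
        raise RuntimeError(f"Endpoint {endpoint_name} is not an async endpoint")
    return async_config["OutputConfig"].get("S3OutputPath")


# Example wiring (needs AWS credentials):
#   import boto3
#   print(async_output_path(boto3.client("sagemaker"), "my-async-endpoint"))
```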
Note: Following this guide uses billable AWS services. SageMaker AI async endpoints incur instance-hour charges, and S3 buckets incur storage and request charges. Follow the cleanup steps after completing the walkthrough to avoid ongoing charges.
Steps
Inline payload support is available today. To use it:
- Update your AWS SDK. Install or upgrade Boto3 to the latest version: pip install --upgrade boto3.
- Confirm the installation: pip show boto3.
- Update your application code. In your application, replace the S3 upload + InputLocation pattern with the inline Body parameter, as described in the Before and after section.
- Test your application by calling the InvokeEndpointAsync API with the Body parameter.
- Confirm the response contains the OutputLocation field.
- Poll or monitor the S3 OutputLocation to confirm that your inference output is written successfully.
No changes are required to your endpoint configuration, model container, or S3 output setup.
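The final step, polling the OutputLocation, might look like the following sketch. It assumes a Boto3 S3 client passed in by the caller and that the caller can list the bucket, so a missing object surfaces as NoSuchKey rather than an access error. The helper name wait_for_output and the default timings are illustrative.

```python
import time
from urllib.parse import urlparse


def wait_for_output(s3_client, output_location, timeout_s=300.0, interval_s=5.0):
    """Poll the S3 OutputLocation until the result exists, then return its bytes."""
    parsed = urlparse(output_location)  # s3://bucket/key
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        except s3_client.exceptions.NoSuchKey:
            # Not written yet: retry until the deadline passes.
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"No output at {output_location} after {timeout_s}s"
                )
            time.sleep(interval_s)
```

This sketch only watches the success path. For production use, an Amazon SNS notification avoids polling altogether.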
Clean up
To avoid ongoing charges, remove the resources used in this walkthrough:
- Delete the SageMaker AI endpoint if it was created for testing.
- Delete the output S3 bucket (if no longer needed). Warning: Deleting an S3 bucket permanently deletes its contents. Make sure to back up any inference results you need to keep.
- Delete any IAM policies created for this walkthrough.
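The endpoint deletion in the first cleanup step can be sketched with the Boto3 SageMaker client. The helper name delete_test_endpoint is illustrative, and the optional removal of the endpoint configuration is an extra step that this post does not call for. Enable it only if the configuration was also created for testing.

```python
def delete_test_endpoint(sagemaker_client, endpoint_name, delete_config=False):
    """Delete an endpoint created for testing and return its config name."""
    config_name = sagemaker_client.describe_endpoint(
        EndpointName=endpoint_name
    )["EndpointConfigName"]
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    if delete_config:
        # Optional extra, not part of the steps above.
        sagemaker_client.delete_endpoint_config(EndpointConfigName=config_name)
    return config_name


# Example wiring (needs AWS credentials; this deletes a real resource):
#   import boto3
#   delete_test_endpoint(boto3.client("sagemaker"), "my-async-endpoint")
```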
Conclusion
SageMaker AI Async Inference's inline payload support removes a common point of friction in inference workflows: the mandatory S3 upload for every request. For payloads of up to 128,000 bytes, you can now make one API call and let SageMaker AI handle the rest.
The feature is designed to be backwards compatible. The existing InputLocation workflow continues to work unchanged. Both inline and S3 inputs are processed the same way once a request is received, and models receive the same requests regardless of the input source.
Get started today by updating your AWS SDK and using the Body parameter in the SageMaker AI InvokeEndpointAsync API. To learn more about asynchronous inference, see the Amazon SageMaker AI Async Inference documentation.
About the authors



