
Enhanced metrics for Amazon SageMaker AI endpoints: deeper visibility for better performance

Effective machine learning (ML) in production requires more than robust, scalable infrastructure. You also need continuous visibility into how your resources are operating and being used. When latency rises, requests fail, or resources become constrained, you need to identify and resolve the issue quickly, before it impacts your customers.

Until now, Amazon SageMaker AI published Amazon CloudWatch metrics that offered high-level operational visibility, but these metrics were aggregated across instances and containers. Although useful for overall health monitoring, such aggregates obscure what each instance and container is doing, making it difficult to identify bottlenecks, improve resource utilization, or troubleshoot effectively.

SageMaker AI endpoints now support enhanced metrics with a configurable publishing frequency. This release provides the granular visibility needed to monitor, troubleshoot, and optimize your production workloads. With SageMaker AI's enhanced endpoint metrics, you can drill down into container-level and instance-level metrics, enabling capabilities such as:

  1. View metrics for a specific model copy. With multiple model copies deployed across a SageMaker AI endpoint using inference components, it is useful to view per-copy metrics such as concurrent requests, GPU utilization, and CPU utilization to help identify issues and provide visibility into production traffic patterns.
  2. See how much each model costs. Because many models share the same infrastructure, calculating the actual cost per model can be complicated. With enhanced metrics, you can now calculate and attribute cost per model based on GPU utilization at the inference component level.

What's new

Enhanced metrics introduce two categories of metrics at multiple levels of granularity:

  • Resource utilization metrics: Track CPU, GPU, and memory usage at the instance and container level.
  • Invocation metrics: Monitor request patterns, errors, and latency with precise dimensions.

Each category provides different levels of visibility depending on your endpoint configuration.

Instance-level metrics: available on all endpoints

Every SageMaker AI endpoint now has access to instance-level metrics, giving you visibility into what's happening with each Amazon Elastic Compute Cloud (Amazon EC2) instance in your endpoint.

Resource usage (CloudWatch namespace: /aws/sagemaker/Endpoints)

Track CPU utilization, memory utilization, and GPU utilization for each host. If a problem occurs, you can quickly identify which instance needs attention. On accelerator-based instances, you'll see utilization metrics for each individual accelerator.

Request metrics (CloudWatch namespace: AWS/SageMaker)

Track request patterns, errors, and latency down to the instance level. Monitor invocations, 4XX/5XX errors, model latency, and overhead latency with precise dimensions that help you pinpoint exactly where problems occur. These metrics help you spot uneven traffic distribution, identify recurring error conditions, and associate performance issues with specific instances.
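To make this concrete, here is a minimal, hypothetical sketch of how per-instance invocation counts, once retrieved from CloudWatch, can reveal uneven traffic distribution. The instance IDs and counts below are invented for illustration:

```python
# Hypothetical per-instance invocation counts, as they might be assembled from a
# CloudWatch query using instance-level dimensions (instance IDs are made up).
per_instance_invocations = {
    "i-0abc123": 4200,
    "i-0def456": 900,
}

total = sum(per_instance_invocations.values())
shares = {k: v / total for k, v in per_instance_invocations.items()}

# Flag instances receiving far more than an even share of traffic.
even_share = 1 / len(per_instance_invocations)
hot_instances = [k for k, v in shares.items() if v > 1.5 * even_share]
print(hot_instances)  # → ['i-0abc123']
```

A skewed share like this usually points to a load-balancing or sticky-routing issue worth investigating before it degrades latency on the hot instance.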

Container-level metrics: for inference components

When using inference components to host multiple models on a single endpoint, you now also get container-level visibility.

Resource usage (CloudWatch namespace: /aws/sagemaker/InferenceComponents)

Monitor resource usage per container. See CPU utilization, memory utilization, GPU utilization, and GPU memory utilization for each model copy. This helps you understand which inference component copies are consuming resources, maintain fair allocation in multi-tenant scenarios, and identify containers with performance issues. These detailed metrics include the InferenceComponentName and ContainerId dimensions.

Request metrics (CloudWatch namespace: AWS/SageMaker)

Track request patterns, errors, and latency at the container level. Monitor invocations, 4XX/5XX errors, model latency, and overhead latency with precise dimensions that help pinpoint where problems occur.

Enabling enhanced metrics

Enable enhanced metrics by adding one parameter when creating your endpoint configuration:

response = sagemaker_client.create_endpoint_config(
    EndpointConfigName='my-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'my-model',
        'InstanceType': 'ml.g6.12xlarge',
        'InitialInstanceCount': 2
    }],
    MetricsConfig={
        'EnableEnhancedMetrics': True,
        'MetricsPublishFrequencyInSeconds': 10  # Default is 60s
    }
)

Choosing your publishing frequency

After enabling enhanced metrics, configure the publishing frequency based on your monitoring needs:

Standard resolution (60 seconds): The default frequency provides detailed visibility for most production workloads. It is sufficient for capacity planning, troubleshooting, and optimization while keeping costs under control.

High resolution (10 or 30 seconds): For critical applications that require near real-time monitoring, enable 10-second publishing. This is valuable for aggressive auto scaling, highly variable traffic patterns, or deep troubleshooting sessions.
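As a rough way to reason about the cost difference, you can compare how many datapoints a single metric emits per hour at each frequency. This is a back-of-envelope sketch, not an official pricing formula:

```python
SECONDS_PER_HOUR = 3600

def datapoints_per_hour(publish_frequency_seconds: int) -> int:
    """Datapoints one metric emits per hour at a given publishing frequency."""
    return SECONDS_PER_HOUR // publish_frequency_seconds

print(datapoints_per_hour(60))  # → 60
print(datapoints_per_hour(10))  # → 360
```

In other words, 10-second publishing produces six times the datapoint volume of the default, which is why it's best reserved for critical endpoints or troubleshooting windows.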

Example use cases

In this post, we walk through three common scenarios where enhanced metrics deliver measurable business value, all of which are available in this notebook:

  1. Real-time GPU utilization tracking across inference components

When running multiple models on shared infrastructure using inference components, understanding GPU allocation and utilization is critical for cost and performance optimization. With enhanced metrics, you can query GPU utilization for each inference component:

response = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        'Id': 'm1',
        'Expression': "SEARCH('{/aws/sagemaker/InferenceComponents,InferenceComponentName,GpuId} MetricName=\"GPUUtilizationNormalized\" InferenceComponentName=\"IC-my-model\"', 'SampleCount', 10)"
    }, {
        'Id': 'e1',
        'Expression': 'SUM(m1)'  # Returns GPU count
    }],
    StartTime=start_time,
    EndTime=end_time
)

This query uses the GpuId dimension to identify the individual GPUs allocated to each inference component. Using the SampleCount statistic, you get an accurate count of the GPUs used by a specific inference component, which is important for:

  • Verifying that resource allocation matches your configuration
  • Detecting when inference components scale up or down
  • Calculating per-GPU cost for chargeback models
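The response shape below is a simplified, hypothetical illustration of what GetMetricData returns for the e1 expression; a real response comes from boto3's cloudwatch.get_metric_data, but you would parse it the same way:

```python
# Simplified, hypothetical GetMetricData response for the queries above;
# a real response comes from cloudwatch.get_metric_data(...).
sample_response = {
    "MetricDataResults": [
        {"Id": "m1", "Values": []},               # raw SEARCH series (empty here)
        {"Id": "e1", "Values": [4.0, 4.0, 4.0]},  # SUM(m1): GPU count per period
    ]
}

def latest_value(response: dict, result_id: str) -> float:
    """Return the most recent value for a given query ID (GetMetricData
    returns values newest-first with the default ScanBy setting)."""
    for result in response["MetricDataResults"]:
        if result["Id"] == result_id and result["Values"]:
            return result["Values"][0]
    return 0.0

print(int(latest_value(sample_response, "e1")))  # → 4
```

A small parser like this is handy when feeding the GPU count into your own dashboards or chargeback tooling.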
  2. Cost allocation for each model in multi-model deployments

One of the most requested capabilities is understanding the true cost of each model when multiple models share the same endpoint infrastructure. Enhanced metrics make this possible by tracking GPU utilization at the container level. Here's how to calculate the incremental cost of each model:

response = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        'Id': 'e1',
        'Expression': "SEARCH('{/aws/sagemaker/InferenceComponents,InferenceComponentName,GpuId} MetricName=\"GPUUtilizationNormalized\" InferenceComponentName=\"IC-my-model\"', 'SampleCount', 10)"
    }, {
        'Id': 'e2',
        'Expression': 'SUM(e1)'  # GPU count
    }, {
        'Id': 'e3',
        'Expression': 'e2 * 5.752 / 4 / 360'  # Cost per 10s based on ml.g6.12xlarge hourly cost
    }, {
        'Id': 'e4',
        'Expression': 'RUNNING_SUM(e3)'  # Cumulative cost
    }],
    StartTime=start_time, EndTime=end_time
)

This query:

  • Counts the GPUs assigned to the inference component (e2)
  • Calculates the cost per 10 seconds based on the instance's hourly cost (e3)
  • Accumulates the total cost over time using RUNNING_SUM (e4)

For example, with an ml.g6.12xlarge instance ($5.752/hour for 4 GPUs), a model using all 4 GPUs costs about $0.016 per 10 seconds. The RUNNING_SUM expression produces a monotonically increasing total, ideal for dashboards and spend tracking.
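The same arithmetic can be reproduced locally to sanity-check the CloudWatch expressions. This sketch mirrors the e3 and e4 math, using the example hourly price and GPU count from above:

```python
from itertools import accumulate

def cost_per_interval(gpu_count: int, hourly_cost: float = 5.752,
                      gpus_per_instance: int = 4,
                      interval_seconds: int = 10) -> float:
    """Mirror of the e3 expression: GPUs * hourly cost / GPUs per instance
    / number of intervals per hour."""
    intervals_per_hour = 3600 / interval_seconds
    return gpu_count * hourly_cost / gpus_per_instance / intervals_per_hour

# A model holding all 4 GPUs of an ml.g6.12xlarge:
print(round(cost_per_interval(4), 3))  # → 0.016

# Cumulative cost over one minute of 10-second samples, mirroring RUNNING_SUM:
running = list(accumulate([cost_per_interval(4)] * 6))
print(round(running[-1], 4))  # → 0.0959
```

Keeping this logic in a tested helper function makes it easier to swap in a different instance type's price or GPU count without re-deriving the expression.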

  3. Cluster-wide resource monitoring

Enhanced metrics enable comprehensive cluster monitoring by combining metrics across all inference components on the endpoint:

response = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        'Id': 'e1',
        'Expression': "SUM(SEARCH('{/aws/sagemaker/InferenceComponents,EndpointName,GpuId} MetricName=\"GPUUtilizationNormalized\" EndpointName=\"my-endpoint\"', 'SampleCount', 10))"
    }, {
        'Id': 'm2',
        'MetricStat': {
            'Metric': {
                'Namespace': '/aws/sagemaker/Endpoints',
                'MetricName': 'CPUUtilizationNormalized',
                'Dimensions': [{
                    'Name': 'EndpointName',
                    'Value': 'my-endpoint'
                }, {
                    'Name': 'VariantName',
                    'Value': 'AllTraffic'
                }]
            },
            'Period': 10,
            'Stat': 'SampleCount'  # Returns instance count
        }
    }, {
        'Id': 'e2',
        'Expression': 'm2 * 4 - e1'  # Free GPUs (assuming 4 GPUs per instance)
    }],
    StartTime=start_time, EndTime=end_time
)

This query provides:

  • The total number of GPUs used by all inference components (e1)
  • The number of instances in the endpoint (m2)
  • The number of GPUs available for new workloads (e2)

This is important for capacity planning and for making sure you have enough headroom to deploy new models or scale existing ones.
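The e2 arithmetic is simple enough to mirror locally, which is handy for unit-testing alarm thresholds. The instance and GPU counts here are example values:

```python
def free_gpus(instance_count: int, gpus_per_instance: int,
              gpus_in_use: int) -> int:
    """Mirror of the e2 expression: instances * GPUs per instance - GPUs in use."""
    return instance_count * gpus_per_instance - gpus_in_use

# Two ml.g6.12xlarge instances (4 GPUs each) with 5 GPUs allocated:
print(free_gpus(2, 4, 5))  # → 3
```

If this number approaches zero, you have no headroom left to place a new inference component or scale up an existing one.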

Creating performance dashboards

The accompanying notebook shows how to programmatically create CloudWatch dashboards that include these metrics:

from endpoint_metrics_helper import create_dashboard 
create_dashboard( 
  dashboard_name="my-endpoint-monitoring", 
  endpoint_name="my-endpoint", 
  inference_components=[ {
    'name': 'IC-model-a', 
    'label': 'MODEL_A'
  }, {
    'name': 'IC-model-b',
    'label': 'MODEL_B'
  } ], 
  cost_per_hour=5.752, 
  region='us-east-1' )

This creates a dashboard with:

  • Cluster-level resource usage (for example, used and free GPUs)
  • Per-model cost tracking with cumulative totals
  • Near real-time cost at 10-second intervals

The notebook also includes useful widgets for ad-hoc analysis.

from endpoint_metrics_helper import create_metrics_widget, create_cost_widget

# Cluster metrics
create_metrics_widget('my-endpoint')

# Per-model cost analysis
create_cost_widget('IC-model-a', cost_per_hour=5.752)

These widgets provide a drop-down time range selection (last 5/10/30 minutes, 1 hour, or a custom range) and display:

  • Instance count
  • Total/used/free GPUs
  • Cumulative cost per model
  • Cost per 10-second interval

Best practices

  1. Start with 60-second resolution: This provides sufficient granularity for most use cases while keeping CloudWatch costs manageable. Note that only the utilization metrics incur CloudWatch charges; all other metric types are published at no additional cost.
  2. Use 10-second resolution selectively: Enable high-resolution metrics only on critical endpoints or during troubleshooting sessions.
  3. Use dimensions strategically: Use the InferenceComponentName, ContainerId, and GpuId dimensions to drill down from a cluster-wide view to specific containers.
  4. Create cost allocation dashboards: Implement RUNNING_SUM expressions to track the cost accrued by each model for accurate chargeback and budgeting.
  5. Set alarms on unused GPU capacity: Monitor free GPU capacity to make sure you maintain buffer capacity for scaling or new deployments.
  6. Correlate with invocation metrics: Compare resource utilization with invocation patterns to understand the relationship between traffic and resource usage.

Conclusion

Enhanced metrics for Amazon SageMaker AI endpoints transform the way you monitor, optimize, and operate production ML workloads. By providing container-level visibility with a configurable publishing frequency, you gain the operational intelligence needed to:

  • Accurately attribute costs to individual models in multi-tenant deployments
  • Monitor real-time GPU allocation and utilization across inference components
  • Track cluster-wide resource availability for capacity planning
  • Troubleshoot operational issues with precise, granular metrics

The combination of detailed metrics, flexible publishing frequency, and rich dimensions helps you build sophisticated monitoring solutions that scale with your ML operations. Whether you're running a single model or managing dozens of inference components across multiple endpoints, enhanced metrics provide the visibility you need to operate AI at scale effectively.

Get started today by enabling enhanced metrics on your SageMaker AI endpoints, and check out the accompanying notebook for complete use cases and reusable helper functions.


About the authors


Dan Ferguson

Dan Ferguson is a Solutions Architect at AWS, based in New York, USA. As a machine learning services specialist, he works to support customers on their journey to integrating ML workflows effectively, efficiently, and sustainably.


Marc Karp

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He specializes in helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
