Optimizing PyTorch Model Inference on AWS Graviton

Running AI / ML models can be a very expensive endeavor. Many of our posts focus on a variety of tips, tricks, and techniques for analyzing and optimizing AI / ML runtime performance. Our contention is twofold:
- Performance and efficiency analysis must be an essential process for every AI / ML development project, and,
- Achieving meaningful operational efficiencies and cost reductions does not require large scale. Any AI / ML developer can do it. Every AI / ML developer should do it.
In our previous posts we discussed the challenge of optimizing ML workloads for inference on the Intel® Xeon® processor. We started by reviewing several cases in which the CPU can be the best option for AI / ML inference, even in an era when many chips are dedicated to AI. We then presented a PyTorch toy model and demonstrated several techniques for increasing its runtime performance on an Amazon EC2 c7i.xlarge instance, powered by a 4th generation Intel Xeon processor. In this post, we expand our discussion to AWS's homegrown Graviton processors. We will revisit many of the configurations we discussed in our previous posts (some of which will need to be adapted to the Arm processor) and examine their impact on the same toy model. Given the considerable differences between Arm and Intel processors, the paths to the best configuration can differ.
AWS Graviton
AWS Graviton is a family of processors based on Arm Neoverse CPUs, custom designed and built by AWS for cost and power efficiency. Their dedicated engines for vector processing (NEON and SVE/SVE2) and matrix multiplication (MMLA), as well as their support for bfloat16 operations (from Graviton3 on), make them a compelling option for compute-intensive workloads like AI / ML inference. To extract the highest AI / ML performance from Graviton, the entire software stack is optimized for its use:
- Low-level kernels from the Arm Compute Library (ACL) are highly optimized for the Graviton hardware accelerators (e.g., SVE and MMLA).
- ML middleware libraries like oneDNN and OpenBLAS route deep learning and linear algebra operations to the specialized ACL kernels.
- AI / ML frameworks like PyTorch and TensorFlow are integrated and optimized to use these libraries.
In this post we will use an Amazon EC2 c8g.xlarge instance powered by AWS Graviton4 processors and the AWS ARM64 PyTorch Deep Learning AMI (DLAMI).
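Before applying any optimizations, it can be helpful to confirm the PyTorch version on the instance and whether the build includes the oneDNN backend, which routes operations to ACL kernels on Graviton. A minimal sketch (the exact output depends on the build):

```python
import torch

# Report the installed PyTorch version and whether the oneDNN backend
# is available in this build.
version = torch.__version__
has_onednn = torch.backends.mkldnn.is_available()
print(f"PyTorch {version}, oneDNN available: {has_onednn}")
```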
The purpose of this post is to demonstrate tips for maximizing runtime performance on AWS Graviton instances. Importantly, our goal is not to draw comparisons between AWS Graviton and other chips, nor to encourage the use of one chip over another. The best choice of processor depends on a thorough evaluation of considerations that are beyond the scope of this post. One of the important considerations will be the maximum runtime performance of your model on each chip. In other words: how much "bang" can we get for our buck? Making an informed decision about the best processor is therefore one of the motivations for optimizing the runtime performance on each one.
Another goal is to enable running our model on multiple inference devices, in order to increase its resilience. The AI / ML playing field is very dynamic, and tolerance to changing conditions is essential for success. It is not uncommon for certain types of compute instances to be unavailable or in short supply. At the same time, the increased capacity of AWS Graviton instances can mean greater availability, sometimes at discounted rates.
Disclaimers
The source code we will share, the optimization steps we will discuss, and the results we will report are intended as examples of the ML efficiency gains you may see on AWS Graviton instances. The results you see with your own model and runtime environment can vary greatly. Please do not rely on the accuracy or validity of the contents of this post, and please do not interpret the mention of any library, framework, or platform as an endorsement of its use.
Running Inference on AWS Graviton
As in our previous posts, we will demonstrate the optimization steps on a toy image classification model:
import torch, torchvision
import time

def get_model(channels_last=False, compile=False):
    model = torchvision.models.resnet50()
    if channels_last:
        model = model.to(memory_format=torch.channels_last)
    model = model.eval()
    if compile:
        model = torch.compile(model)
    return model

def get_input(batch_size, channels_last=False):
    batch = torch.randn(batch_size, 3, 224, 224)
    if channels_last:
        batch = batch.to(memory_format=torch.channels_last)
    return batch

def get_inference_fn(model, enable_amp=False):
    def infer_fn(batch):
        with torch.inference_mode(), torch.amp.autocast(
            'cpu',
            dtype=torch.bfloat16,
            enabled=enable_amp
        ):
            output = model(batch)
        return output
    return infer_fn

def benchmark(infer_fn, batch):
    # warm-up
    for _ in range(20):
        _ = infer_fn(batch)
    iters = 100
    start = time.time()
    for _ in range(iters):
        _ = infer_fn(batch)
    end = time.time()
    return (end - start) / iters

batch_size = 1
model = get_model()
batch = get_input(batch_size)
infer_fn = get_inference_fn(model)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")
The initial throughput is 12 samples per second (SPS).
Upgrade to the Latest PyTorch Release
While the version of PyTorch in our DLAMI is 2.8, the latest version of PyTorch at the time of this writing is 2.9. Given the rapid pace of development in the AI / ML field, it is highly recommended to use the latest library packages. As our first step, we upgrade to PyTorch 2.9, which includes important updates to its backend.
pip3 install -U torch torchvision --index-url
In the case of our model in its initial configuration, upgrading the PyTorch version has no effect. However, this step is important for getting the most out of the optimization strategies we will explore.
Batched Inference
To reduce overhead and maximize the utilization of the HW accelerators, we group samples together and run batched inference. The table below shows how the model throughput varies as a function of batch size:
Memory Optimizations
We apply several techniques from our previous posts to optimize memory allocation and usage. These include the channels-last memory format, mixed precision with the bfloat16 data type (supported from Graviton3 on), the TCMalloc memory allocation library, and transparent huge page allocation. Please see our previous posts for details. We also enable the fast math mode of the ACL GEMM kernels and LRU caching of kernel primitives, two optimizations from the official guidelines for running PyTorch inference on Graviton.
The shell commands required to enable these optimizations are shown below:
# install TCMalloc
sudo apt-get install google-perftools
# Preload TCMalloc
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4
# Enable huge page memory allocation
export THP_MEM_ALLOC_ENABLE=1
# Enable the fast math mode of the GEMM kernels
export DNNL_DEFAULT_FPMATH_MODE=BF16
# Set LRU Cache capacity to cache the kernel primitives
export LRU_CACHE_CAPACITY=1024
The following table captures the effect of each memory optimization, applied in order:

In the case of our toy model, the channels-last format and bfloat16 mixed precision had the greatest impact. After applying all of the memory optimizations, the average throughput is 53.03 SPS.
Model Compilation
PyTorch's compilation support for AWS Graviton is an area of focused effort for the AWS Graviton team. However, in the case of our toy model, it results in a slight reduction in throughput, from 53.03 SPS to 52.23.
Multi-Worker Inference
While multi-worker inference is more commonly used in settings with many vCPUs, we demonstrate its use here by extending our script to pin each worker to a dedicated subset of CPU cores:
if __name__ == '__main__':
    # pin CPUs according to worker rank
    import os, psutil
    rank = int(os.environ.get('RANK', '0'))
    world_size = int(os.environ.get('WORLD_SIZE', '1'))
    cores = list(range(psutil.cpu_count(logical=True)))
    num_cores = len(cores)
    cores_per_process = num_cores // world_size
    start_index = rank * cores_per_process
    end_index = (rank + 1) * cores_per_process
    pid = os.getpid()
    p = psutil.Process(pid)
    p.cpu_affinity(cores[start_index:end_index])

    batch_size = 8
    model = get_model(channels_last=True)
    batch = get_input(batch_size, channels_last=True)
    infer_fn = get_inference_fn(model, enable_amp=True)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")
We note that, unlike other types of AWS EC2 CPU instances, each Graviton vCPU maps directly to a physical core. We use the torchrun utility to start four workers, each running on a single vCPU:
export OMP_NUM_THREADS=1 #set one OpenMP thread per worker
torchrun --nproc_per_node=4 main.py
This results in a throughput of 55.15 SPS, a 4% improvement over our previous best result.
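To confirm that the pinning took effect, each worker can inspect its own affinity mask. A small sketch (on Linux, where psutil exposes cpu_affinity):

```python
import os
import psutil

# After pinning, each torchrun worker should report a disjoint set of cores.
proc = psutil.Process(os.getpid())
affinity = proc.cpu_affinity()
print(f"worker PID {proc.pid} pinned to cores: {affinity}")
```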
Int8 Quantization on Arm
Another area of active development and continuous improvement on Arm is int8 quantization. Int8 quantization tools are usually tightly coupled to the target instance type. In our previous post we showed how to apply PyTorch 2 export quantization for x86 with Inductor using the TorchAO (0.12.1) library. Fortunately, the latest versions of TorchAO include a quantizer dedicated to Arm. The quantization sequence is shown below. As in our previous post, we are only interested in the potential performance impact. In practice, int8 quantization can have a significant impact on the quality of the model and may require a far more sophisticated strategy.
from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, convert_pt2e
import torchao.quantization.pt2e.quantizer.arm_inductor_quantizer as aiq

def quantize_model(model):
    x = torch.randn(4, 3, 224, 224).contiguous(
        memory_format=torch.channels_last)
    example_inputs = (x,)
    batch_dim = torch.export.Dim("batch")
    with torch.no_grad():
        exported_model = torch.export.export(
            model,
            example_inputs,
            dynamic_shapes=((batch_dim,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC),
                            )
        ).module()
    quantizer = aiq.ArmInductorQuantizer()
    quantizer.set_global(aiq.get_default_arm_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    prepared_model(*example_inputs)  # calibration pass
    converted_model = convert_pt2e(prepared_model)
    optimized_model = torch.compile(converted_model)
    return optimized_model

batch_size = 8
model = get_model(channels_last=True)
model = quantize_model(model)
batch = get_input(batch_size, channels_last=True)
infer_fn = get_inference_fn(model, enable_amp=True)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")
The resulting throughput is 56.77 SPS, a 7.1% improvement over the bfloat16 solution.
AOT Compilation Using ONNX and OpenVINO
In our previous post, we explored ahead-of-time (AOT) compilation techniques using Open Neural Network Exchange (ONNX) and OpenVINO. Both libraries include dedicated support for AWS Graviton (e.g., see here and here). The experiments in this section require the following library installations:
pip install onnxruntime onnxscript openvino nncf
The following code block demonstrates exporting the model and running inference with ONNX:
def export_to_onnx(model, onnx_path="resnet50.onnx"):
    dummy_input = torch.randn(4, 3, 224, 224)
    batch = torch.export.Dim("batch")
    torch.onnx.export(
        model,
        dummy_input,
        onnx_path,
        input_names=["input"],
        output_names=["output"],
        dynamic_shapes=((batch,
                         torch.export.Dim.STATIC,
                         torch.export.Dim.STATIC,
                         torch.export.Dim.STATIC),
                        ),
        dynamo=True
    )
    return onnx_path
def onnx_infer_fn(onnx_path):
    import onnxruntime as ort
    # the session options must be created and configured before the
    # session so that they can be passed to its constructor
    sess_options = ort.SessionOptions()
    sess_options.add_session_config_entry(
        "mlas.enable_gemm_fastmath_arm64_bfloat16", "1")
    sess = ort.InferenceSession(
        onnx_path,
        sess_options,
        providers=["CPUExecutionProvider"]
    )
    input_name = sess.get_inputs()[0].name
    def infer_fn(batch):
        result = sess.run(None, {input_name: batch})
        return result
    return infer_fn

batch_size = 8
model = get_model()
onnx_path = export_to_onnx(model)
batch = get_input(batch_size).numpy()
infer_fn = onnx_infer_fn(onnx_path)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")
It should be noted that ONNX Runtime supports a dedicated ACL Execution Provider optimized for Arm, but using it requires a custom build of ONNX Runtime (as of the time of this writing).
Alternatively, we can compile the model using OpenVINO. The code block below demonstrates its use, including an optional int8 quantization path using NNCF:
import openvino as ov
import nncf

def openvino_infer_fn(compiled_model):
    def infer_fn(batch):
        result = compiled_model([batch])[0]
        return result
    return infer_fn

class RandomDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 10000
    def __getitem__(self, idx):
        return torch.randn(3, 224, 224)

quantize_model = False
batch_size = 8
model = get_model()
calibration_loader = torch.utils.data.DataLoader(RandomDataset())
calibration_dataset = nncf.Dataset(calibration_loader)
if quantize_model:
    # quantize the PyTorch model
    model = nncf.quantize(model, calibration_dataset)
ovm = ov.convert_model(model, example_input=torch.randn(1, 3, 224, 224))
ovm = ov.compile_model(ovm)
batch = get_input(batch_size).numpy()
infer_fn = openvino_infer_fn(ovm)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")
In the case of our toy model, the OpenVINO compilation achieves an additional throughput improvement, to 63.48 SPS, but the NNCF int8 quantization result is disappointing, reaching just 55.18 SPS.
Results
The results of our tests are summarized in the table below:

As before, we repeated our tests on a second model, the Vision Transformer (ViT) from the timm library, to demonstrate that the effect of the optimizations discussed can vary with the architecture and details of the model. The results are captured below:

Summary
In this post, we reviewed several simple optimization techniques and applied them to two toy PyTorch models. As the results showed, the impact of each optimization step can vary with the details of the model, and the journey to optimal performance can take many different paths. The steps we presented in this post were just an appetizer; there is no doubt that further tuning could unlock even greater performance.
Along the way, we noted many AI / ML libraries that have introduced dedicated support for the Graviton architecture, as well as a visible community-wide effort to improve performance on it. The performance gains we achieved, combined with this clear dedication, show that AWS Graviton instances are firmly in the "big leagues" when it comes to running AI / ML workloads.



