
Accelerate HPC research and AI at universities through Amazon SageMaker HyperPod

This post was written with Mohamed Hossam of Brightskies.

Research universities engaged in artificial intelligence (AI) and high performance computing (HPC) work often face significant infrastructure challenges that stifle innovation and delay research outcomes. On-premises HPC clusters come with long GPU procurement cycles, rigid scaling limits, and complex maintenance requirements. These issues constrain researchers pursuing AI workloads such as natural language processing (NLP), computer vision, and foundation model (FM) training. Amazon SageMaker HyperPod removes the undifferentiated heavy lifting involved in building AI models. It helps quickly scale model development tasks such as training, fine-tuning, or inference across a cluster of hundreds or thousands of AI accelerators.

In this post, we show how a research university used SageMaker HyperPod to accelerate AI research, track GPU costs, and manage shared compute consumption in a HyperPod environment.

Solution overview

Amazon SageMaker HyperPod is designed to support the large-scale training projects of research teams and ML scientists. The service is fully managed by AWS, removing operational overhead while maintaining enterprise-grade reliability and performance.

The following architecture diagram shows how users submit jobs to SageMaker HyperPod. End users can use AWS Site-to-Site VPN, AWS Client VPN, or AWS Direct Connect to securely reach the SageMaker HyperPod cluster. These connections terminate at a Network Load Balancer that distributes SSH traffic to the login nodes, which are the primary entry point for job submission and cluster interaction. The backbone is the SageMaker HyperPod compute layer: a controller node that orchestrates work, and multiple compute nodes organized in a cluster. This layout supports efficient distributed training with fast inter-node communication, all contained within a security-hardened private subnet.

The infrastructure is completed by two main storage components: Amazon FSx for Lustre provides a high-performance file system, and dedicated Amazon S3 buckets store datasets and checkpoints. This storage design provides low-latency data access for training workloads and durable persistence for training artifacts.
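One way the two tiers can be connected (shown here as an illustrative sketch, not necessarily this deployment's exact setup) is an FSx for Lustre data repository association, which exposes S3 objects as files on the cluster's file system; the file system ID, path, and bucket name are placeholders:

$aws fsx create-data-repository-association \
  --file-system-id <fsx-file-system-id> \
  --file-system-path /datasets \
  --data-repository-path s3://<datasets-bucket> \
  --batch-import-meta-data-on-create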

The implementation involved several phases. In the following steps, we show how to deploy and configure the solution.

Prerequisites

Before deploying Amazon SageMaker HyperPod, make sure the following prerequisites are in place:

  • AWS configuration:
    • The AWS Command Line Interface (AWS CLI) installed and configured with appropriate permissions
    • The cluster configuration files prepared: cluster-config.json and provisioning-parameters.json (a minimal sketch of cluster-config.json follows this list)
  • Network setup: a VPC, subnets, and security groups for the cluster
  • An AWS Identity and Access Management (IAM) role with the permissions required by SageMaker HyperPod
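The exact cluster layout is deployment-specific; the following minimal sketch of cluster-config.json follows the shape used in the Amazon SageMaker HyperPod workshop, with placeholder names, instance types, counts, role, and bucket:

{
  "ClusterName": "ml-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.m5.xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://<bucket>/lifecycle-scripts/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::<account-id>:role/<execution-role>",
      "ThreadsPerCore": 1
    },
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.g5.12xlarge",
      "InstanceCount": 4,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://<bucket>/lifecycle-scripts/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::<account-id>:role/<execution-role>",
      "ThreadsPerCore": 1
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": ["<security-group-id>"],
    "Subnets": ["<subnet-id>"]
  }
}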

Launch the CloudFormation stack

We deployed an AWS CloudFormation stack to provision the required infrastructure, including the VPC and subnets, the FSx for Lustre file system, and the S3 bucket used by Amazon SageMaker HyperPod.
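The stack can be launched from the AWS CloudFormation console or from the CLI. The following is a minimal sketch of the CLI form; the stack name and template file name are hypothetical stand-ins for the workshop-provided template:

$aws cloudformation create-stack \
  --stack-name hyperpod-infra \
  --template-body file://hyperpod-infra.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --region us-west-2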

Customize the Slurm cluster configuration

To align compute resources with departmental research needs, we created Slurm partitions that mirror the organization's structure, for example NLP, computer vision, and deep learning groups. We used the cluster's lifecycle configuration to customize slurm.conf with these partitions. Slurm accounting was enabled by setting up slurmdbd, so that usage could be attributed to departmental accounts for management.
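As a minimal sketch, partition definitions of this kind live in slurm.conf; the node ranges, time limits, and default partition below are illustrative, not the university's actual values:

# slurm.conf (excerpt): one partition per department
PartitionName=nlp       Nodes=ip-10-1-0-[1-4]  MaxTime=48:00:00 State=UP
PartitionName=vision    Nodes=ip-10-1-0-[5-8]  MaxTime=48:00:00 State=UP
PartitionName=deeplearn Nodes=ip-10-1-0-[9-12] MaxTime=72:00:00 State=UP Default=YES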

To support fractional GPU sharing and efficient utilization, we enabled Slurm's Generic Resource (GRES) scheduling. With GPU-level scheduling, multiple users can run jobs on GPUs in the same node without conflict. The GRES setup followed the guidelines from the Amazon SageMaker HyperPod workshop.
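A minimal sketch of the GRES wiring, assuming four-GPU nodes (matching the illustrative ml.g5.12xlarge worker group earlier); node names and counts are placeholders:

# slurm.conf (excerpt): declare the gpu resource and each node's GPU count
GresTypes=gpu
NodeName=ip-10-1-0-[1-12] Gres=gpu:4 CPUs=48 State=UNKNOWN

# gres.conf: map the gpu resource to the NVIDIA device files
NodeName=ip-10-1-0-[1-12] Name=gpu File=/dev/nvidia[0-3]

Researchers can then request individual GPUs rather than whole nodes, for example:

$srun --partition=nlp --gres=gpu:1 python train.py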

Provision and validate the cluster

We validated the cluster-config.json and provisioning-parameters.json files using the AWS CLI and the SageMaker HyperPod validation script:

$curl -O 

$pip3 install boto3

$python3 validate-config.py --cluster-config cluster-config.json --provisioning-parameters provisioning-parameters.json

Then we created the cluster:

$aws sagemaker create-cluster \
  --cli-input-json file://cluster-config.json \
  --region us-west-2
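Cluster provisioning takes some time; progress can be checked with a describe call until ClusterStatus reaches InService (the cluster name here is illustrative):

$aws sagemaker describe-cluster --cluster-name ml-cluster --region us-west-2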

Set up cost tracking and budgets

To monitor and control usage costs, each resource in the SageMaker HyperPod environment (for example, Amazon EC2 instances, FSx for Lustre, and so on) carries a unique ClusterName tag. AWS Budgets and AWS Cost Explorer reports were configured to track monthly spending by tag. Additionally, alerts were set up to notify researchers when they approach their quota or budget threshold.
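As one possible implementation of the alerting, a tag-filtered monthly budget with an 80% notification threshold can be created from the CLI; the account ID, budget amount, tag value, and email address below are placeholders:

$aws budgets create-budget \
  --account-id 111122223333 \
  --budget '{"BudgetName": "hyperpod-monthly", "BudgetType": "COST", "TimeUnit": "MONTHLY", "BudgetLimit": {"Amount": "5000", "Unit": "USD"}, "CostFilters": {"TagKeyValue": ["user:ClusterName$ml-cluster"]}}' \
  --notifications-with-subscribers '[{"Notification": {"NotificationType": "ACTUAL", "ComparisonOperator": "GREATER_THAN", "Threshold": 80}, "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "pi@example.edu"}]}]'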

This integration helped enforce financial accountability and efficient spending.

Enable login node load balancing

As the number of active users increased, the university adopted a multi-login-node architecture. Two login nodes were deployed in an EC2 Auto Scaling group. A Network Load Balancer was configured with target groups to distribute SSH and Systems Manager traffic across them. Finally, an AWS Lambda function enforces per-user session limits using Run As tags with Session Manager, helping prevent any single user from overloading the system.
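A minimal sketch of the SSH side of that setup, with placeholder identifiers; the Network Load Balancer itself and the Systems Manager target group are configured analogously:

$aws elbv2 create-target-group \
  --name hyperpod-login-ssh \
  --protocol TCP --port 22 \
  --vpc-id <vpc-id> \
  --target-type instance

$aws elbv2 register-targets \
  --target-group-arn <target-group-arn> \
  --targets Id=<login-node-1-id> Id=<login-node-2-id>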

For details about the full implementation, see Implementing login node load balancing in SageMaker HyperPod for enhanced multi-user experience.

Configure federated access and user mapping

To facilitate secure and seamless access for researchers, the institution integrated AWS IAM Identity Center with its on-premises Active Directory (AD) using AWS Directory Service. This allowed centralized management of user identities and their access rights to SageMaker HyperPod cluster accounts. The implementation covered the following key areas:

  • Consolidated user mapping – Maps AD user names to POSIX user names using Systems Manager Run As tags, allowing deterministic identity control on compute nodes (see the sketch after this list)
  • Secure session management – Configures Session Manager so users connect to compute nodes under their own accounts, not the default ssm-user
  • Tag-based attribution – Federated user names are automatically attached to users' directories, jobs, and budgets through resource tags
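As a minimal sketch of the Run As mapping: Session Manager honors the SSMSessionRunAs tag key, so for a standalone IAM user the mapping is a single tag (federated identities from IAM Identity Center carry the same key as a principal tag). The user name below is illustrative:

$aws iam tag-user \
  --user-name researcher1 \
  --tags Key=SSMSessionRunAs,Value=researcher1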

For the full step-by-step guide, refer to the Amazon SageMaker HyperPod workshop.

This approach streamlined user provisioning and access management while maintaining strong alignment with institutional policies and compliance requirements.

Post-deployment optimizations

To help prevent unnecessary consumption by idle sessions, the university configured Slurm integration with PAM. This setup logs users out of compute nodes after their Slurm job completes or is canceled, so that resources become immediately available to queued jobs.

The configuration improves overall cluster throughput by promptly releasing idle resources.
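A common way to implement this behavior, and the one assumed in this sketch, is Slurm's pam_slurm_adopt module, which rejects SSH logins to a compute node from users with no running job there:

# /etc/pam.d/sshd (excerpt): adopt SSH sessions into the user's job;
# users without an active job on this node are denied access
account    required     pam_slurm_adopt.so

# slurm.conf: required so tasks are placed in trackable cgroups
PrologFlags=Contain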

In addition, QOS policies were configured to control resource usage, restrict job duration, and provide fair GPU access across users and departments. For example (a sketch of the corresponding sacctmgr commands follows this list):

  • MaxTRESPerUser – Makes sure each user's GPU or CPU usage stays within specified limits
  • MaxWallDurationPerJob – Helps prevent long jobs from monopolizing resources
  • Priority – Assigns scheduling priority based on research team or project
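A minimal sketch of creating and assigning such a QOS with Slurm's accounting tool; the QOS name, limits, and user are illustrative:

$sacctmgr add qos nlp-standard Priority=10 MaxTRESPerUser=gres/gpu=4 MaxWallDurationPerJob=24:00:00
$sacctmgr modify user researcher1 set qos=nlp-standard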

These enhancements support a disciplined, high-utilization HPC model that suits the shared infrastructure needs of academic research centers.

Clean up

To delete the resources and avoid ongoing charges, complete the following steps:

  1. Delete the SageMaker HyperPod cluster:
$aws sagemaker delete-cluster --cluster-name <cluster-name>

  2. Delete the CloudFormation stack used for the SageMaker HyperPod infrastructure:
$aws cloudformation delete-stack --stack-name <stack-name> --region <region>

This automatically deletes the associated resources, such as the VPC and subnets, the FSx for Lustre file system, the S3 bucket, and IAM roles. If you created these resources outside of CloudFormation, you must delete them manually.

Conclusion

SageMaker HyperPod provides research universities with a robust, fully managed HPC solution designed for diverse AI workloads. By automating infrastructure provisioning, scaling, and resource management, institutions can accelerate innovation while keeping budgets under control. With custom Slurm partitions, GRES-based GPU allocation, federated access, and load-balanced login nodes, researchers can concentrate on science, not infrastructure.

For more information about getting the most out of SageMaker HyperPod, see the Amazon SageMaker HyperPod workshop and check out other blog posts about SageMaker HyperPod.


About the authors

Tasneem Fathima is a Senior Solutions Architect at AWS. She supports higher education and research customers in the United Arab Emirates, helping them adopt cloud technologies, improve their time to science, and innovate on AWS.

Mohamed Hossam is an HPC Cloud Solutions Architect at Brightskies, specializing in high performance computing (HPC) and AI infrastructure on AWS. He supports universities and research centers across the Gulf and the Middle East in building GPU clusters, accelerating AI adoption, and running HPC/AI/ML workloads. In his free time, Mohamed enjoys playing video games.
