Benchmarking custom models on Amazon Bedrock using LLMPerf and LiteLLM

Open foundation models (FMs) allow organizations to build customized AI applications by fine-tuning models for their specific domains or tasks, while retaining control over costs and deployments. However, deployment can be a significant part of the effort, often requiring around 30% of project time, because engineers must carefully select instance types and configure serving parameters. This process can be complex and time-consuming, requiring specialized knowledge and iterative testing to achieve the desired performance.
Amazon Bedrock Custom Model Import simplifies deployments of custom models by offering a straightforward API for model deployment and invocation. You can upload model weights and let AWS handle the deployment end to end, fully managed, making sure that deployments are performant and cost-effective. Amazon Bedrock Custom Model Import deploys the custom model and handles scaling automatically, including scaling to zero: when the model receives no invocations for 5 minutes, it scales down to zero, and you pay only for what you use, in 5-minute increments. It also scales up automatically, increasing the number of active model copies when higher concurrency is required. These features make Amazon Bedrock Custom Model Import an attractive solution for organizations that want to use custom models on Amazon Bedrock with simplicity and cost-efficiency.
Before deploying these models to production, it is important to evaluate their performance using benchmarking tools. These tools help proactively detect potential production issues such as throttling and verify that the deployment can handle the expected production load.
This post begins a blog series exploring DeepSeek and open FMs on Amazon Bedrock Custom Model Import. It covers the process of performance benchmarking of custom models in Amazon Bedrock using popular open source tools: LLMPerf and LiteLLM. It includes a notebook with step-by-step instructions for deploying a DeepSeek-R1-Distill-Llama-8B model, but the same steps apply to any other model supported by Amazon Bedrock Custom Model Import.
Requirements
This post requires an Amazon Bedrock custom model. If you don't have one in your AWS account, follow the instructions in Deploy DeepSeek-R1 distilled Llama models with Amazon Bedrock Custom Model Import.
Using the open source tools LLMPerf and LiteLLM for performance benchmarking
To conduct performance benchmarking, you will use LLMPerf, a popular open source library for benchmarking foundation models. LLMPerf simulates load tests on model APIs by creating concurrent Ray clients and analyzing their responses. A key advantage of LLMPerf is broad support for foundation model APIs. This includes LiteLLM, which supports all models available on Amazon Bedrock.
Setting up your custom model invocation with LiteLLM
LiteLLM is a versatile open source tool that can be used both as a Python SDK and as a proxy server (AI gateway) for accessing more than 100 different FMs through a standardized format. LiteLLM normalizes inputs to match each FM provider's specific endpoint requirements. It supports Amazon Bedrock APIs, including InvokeModel and the Converse API, and the FMs available on Amazon Bedrock, including imported custom models.
To invoke a custom model with LiteLLM, you use the model parameter (see Amazon Bedrock documentation on LiteLLM). This is a string that follows the bedrock/provider_route/model_arn format.
The provider_route indicates the request/response specification that LiteLLM should use. DeepSeek R1 models can be invoked with their custom chat template using the DeepSeek R1 route, or with the Llama chat template using the Llama route.
The model_arn is the Amazon Resource Name (ARN) of the imported model. You can get the ARN of your imported model in the console or by sending a ListImportedModels request.
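As a minimal sketch of the programmatic option using the boto3 Bedrock client (the Region is an assumption to adjust for your account):

import boto3

# Create a Bedrock control-plane client (adjust the Region to your deployment)
bedrock = boto3.client("bedrock", region_name="us-east-1")

# List imported custom models and print their names and ARNs
response = bedrock.list_imported_models()
for model in response["modelSummaries"]:
    print(model["modelName"], "->", model["modelArn"])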
For example, the following script invokes the custom model using the DeepSeek R1 chat template.
import time
from litellm import completion

model_id = "<your imported model ARN>"  # placeholder: set to your model's ARN

# Retry in a loop, because a model that has scaled to zero needs time
# to become ready after the first invocation
while True:
    try:
        response = completion(
            model=f"bedrock/deepseek_r1/{model_id}",
            messages=[{"role": "user", "content": """Given the following financial data:
- Company A's revenue grew from $10M to $15M in 2023
- Operating costs increased by 20%
- Initial operating costs were $7M
Calculate the company's operating margin for 2023. Please reason step by step."""},
                      {"role": "assistant", "content": ""}],
            max_tokens=4096,
        )
        print(response['choices'][0]['message']['content'])
        break
    except Exception:
        # The model may still be warming up; wait and retry
        time.sleep(60)
After you have verified the model invocation parameters, you can configure LLMPerf for benchmarking.
Configuring a token benchmark test with LLMPerf
To conduct a benchmark, LLMPerf uses Ray, a distributed computing framework, to simulate realistic loads. It spawns multiple remote clients, each capable of sending concurrent requests to model-serving APIs. These clients are implemented as actors that execute in parallel. llmperf.requests_launcher manages the distribution of requests across the Ray clients, which allows simulation of various load scenarios and concurrent request patterns. Meanwhile, each client collects performance metrics during the requests, including latency, throughput, and error rates.
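To illustrate the pattern, here is a conceptual sketch of Ray actors issuing requests in parallel. This is not LLMPerf's actual implementation; the RequestClient actor is hypothetical and model_id is a placeholder.

import time
import ray
from litellm import completion

ray.init()
model_id = "<your imported model ARN>"  # placeholder

@ray.remote
class RequestClient:
    # Each actor runs in its own process, so clients send requests in parallel
    def send(self, model, prompt):
        start = time.perf_counter()
        response = completion(model=model, messages=[{"role": "user", "content": prompt}])
        return {"latency_s": time.perf_counter() - start,
                "text": response['choices'][0]['message']['content']}

# Two clients issuing one request each, concurrently
clients = [RequestClient.remote() for _ in range(2)]
futures = [c.send.remote(f"bedrock/deepseek_r1/{model_id}", "Hello!") for c in clients]
print([r["latency_s"] for r in ray.get(futures)])  # blocks until all requests finish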
Two key performance metrics are latency and throughput, which you can measure directly as shown in the sketch after this list:
- Latency measures the time it takes to complete a single request.
- Throughput measures the number of tokens generated per second.
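As a minimal sketch of measuring these directly, the following times a streaming invocation, assuming LiteLLM streaming support for this route and approximating one token per streamed chunk; model_id is a placeholder.

import time
from litellm import completion

model_id = "<your imported model ARN>"  # placeholder

start = time.perf_counter()
first_token_time = None
num_chunks = 0

response = completion(
    model=f"bedrock/deepseek_r1/{model_id}",
    messages=[{"role": "user", "content": "Briefly explain operating margin."}],
    max_tokens=256,
    stream=True,  # stream chunks so the first token can be timed
)
for _ in response:
    if first_token_time is None:
        first_token_time = time.perf_counter()
    num_chunks += 1  # each chunk roughly corresponds to one token
end = time.perf_counter()

print(f"Time to first token: {first_token_time - start:.3f}s")
print(f"End-to-end latency:  {end - start:.3f}s")
print(f"~Throughput:         {num_chunks / (end - start):.1f} tokens/s")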
Selecting the right configuration for serving FMs typically involves experimenting with different batch sizes while monitoring GPU utilization and considering available memory, model size, and the specific requirements of the workload. To learn more, see Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference. Although Amazon Bedrock Custom Model Import simplifies this by providing pre-optimized serving configurations, it is still important to verify the latency and throughput of your deployment.
Start by configuring token_benchmark_ray.py, a sample script that facilitates setting up a benchmarking test. In the script, you can define parameters such as:
- LLM API: Use LiteLLM to invoke Amazon Bedrock custom imported models.
- Model: Define the route, API, and model ARN to invoke, similar to the previous section.
- Mean and standard deviation of input tokens: Parameters of the distribution from which the number of input tokens is sampled.
- Mean and standard deviation of output tokens: Parameters of the distribution from which the number of output tokens is sampled.
- Number of concurrent requests: The number of users the application is likely to support concurrently.
- Number of completed requests: The total number of requests to send to the LLM API during the test.
The following script shows an example of how to invoke the model. See this notebook for step-by-step instructions on importing a custom model and running a benchmarking test.
python3 ${{LLM_PERF_SCRIPT_DIR}}/token_benchmark_ray.py \
--model "bedrock/llama/{model_id}" \
--mean-input-tokens {mean_input_tokens} \
--stddev-input-tokens {stddev_input_tokens} \
--mean-output-tokens {mean_output_tokens} \
--stddev-output-tokens {stddev_output_tokens} \
--max-num-completed-requests ${{LLM_PERF_MAX_REQUESTS}} \
--timeout 1800 \
--num-concurrent-requests ${{LLM_PERF_CONCURRENT}} \
--results-dir "${{LLM_PERF_OUTPUT}}" \
--llm-api litellm \
--additional-sampling-params '{{}}'
At the end of the test, LLMPerf will output two JSON files: one with aggregate metrics, and one with individual entries for every invocation.
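As a minimal sketch for inspecting these files (the file name patterns and the results directory are assumptions; check what your LLMPerf version actually writes):

import json
from pathlib import Path

results_dir = Path("results")  # matches the --results-dir argument above
summary_path = next(results_dir.glob("*summary.json"))  # aggregate metrics
summary = json.loads(summary_path.read_text())
print(json.dumps(summary, indent=2))

individual_path = next(results_dir.glob("*individual_responses.json"))  # per-request metrics
requests = json.loads(individual_path.read_text())
print(f"Recorded {len(requests)} individual requests")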
Scale to zero and cold start latency
One thing to remember is that because Amazon Bedrock Custom Model Import scales down to zero when the model is not in use, you first need to send a request to make sure there is at least one active model copy. If you get an error indicating that the model isn't ready, wait for approximately ten seconds and up to about 1 minute for Amazon Bedrock to prepare at least one active model copy. When it's ready, run a test invocation again and proceed with benchmarking.
Example scenario for DeepSeek-R1-Distill-Llama-8B
Consider a DeepSeek-R1-Distill-Llama-8B model hosted on Amazon Bedrock Custom Model Import, supporting an AI application with low traffic of no more than two concurrent requests. To account for variability, you can adjust the token count parameters for prompts and completions. For example, with the following values, which map onto the sample command shown after this list:
- Number of clients: 2
- Mean input token count: 500
- Standard deviation of input tokens: 25
- Mean output token count: 1000
- Standard deviation of output tokens: 100
- Number of requests per client: 50
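Filling these values into the command from the previous section might look like the following; 2 clients sending 50 requests each gives 100 completed requests in total, and the Llama provider route, model_id, and results path are placeholders to adjust for your setup.

python3 token_benchmark_ray.py \
--model "bedrock/llama/{model_id}" \
--mean-input-tokens 500 \
--stddev-input-tokens 25 \
--mean-output-tokens 1000 \
--stddev-output-tokens 100 \
--max-num-completed-requests 100 \
--timeout 1800 \
--num-concurrent-requests 2 \
--results-dir "results" \
--llm-api litellm \
--additional-sampling-params '{}'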
This illustrative test takes approximately 8 minutes. At the end of the test, you will get a summary of the aggregate metrics, such as the following:
inter_token_latency_s
p25 = 0.010615988283217918
p50 = 0.010694698716183695
p75 = 0.010779359342088015
p90 = 0.010945443657517748
p95 = 0.01100556307365132
p99 = 0.011071086908721675
mean = 0.010710014800224604
min = 0.010364670612635254
max = 0.011485444453299149
stddev = 0.0001658793389904756
ttft_s
p25 = 0.3356793452499005
p50 = 0.3783651359990472
p75 = 0.41098671700046907
p90 = 0.46655246950049334
p95 = 0.4846706690498647
p99 = 0.6790834719300077
mean = 0.3837810468001226
min = 0.1878921090010408
max = 0.7590946710006392
stddev = 0.0828713133225014
end_to_end_latency_s
p25 = 9.885957818500174
p50 = 10.561580732000039
p75 = 11.271923759749825
p90 = 11.87688222009965
p95 = 12.139972019549713
p99 = 12.6071144856102
mean = 10.406450886010116
min = 2.6196457750011177
max = 12.626598834998731
stddev = 1.4681851822617253
request_output_throughput_token_per_s
p25 = 104.68609252502657
p50 = 107.24619111072519
p75 = 108.62997591951486
p90 = 110.90675007239598
p95 = 113.3896235445618
p99 = 116.6688412475626
mean = 107.12082450567561
min = 97.0053466021563
max = 129.40680882698936
stddev = 3.9748004356837137
number_input_tokens
p25 = 484.0
p50 = 500.0
p75 = 514.0
p90 = 531.2
p95 = 543.1
p99 = 569.1200000000001
mean = 499.06
min = 433
max = 581
stddev = 26.549294727074212
number_output_tokens
p25 = 1050.75
p50 = 1128.5
p75 = 1214.25
p90 = 1276.1000000000001
p95 = 1323.75
p99 = 1372.2
mean = 1113.51
min = 339
max = 1392
stddev = 160.9598415942952
Number Of Errored Requests: 0
Overall Output Throughput: 208.0008834264341
Number Of Completed Requests: 100
Completed Requests Per Minute: 11.20784995697034
In addition to the summary, you will receive metrics for individual requests that can be used to prepare detailed reports, such as the following histograms of time to first token and token throughput.
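A minimal sketch for producing such histograms, assuming the per-request JSON contains "ttft_s" and "request_output_throughput_token_per_s" fields (inspect the file to confirm the exact field names for your LLMPerf version):

import json
from pathlib import Path
import matplotlib.pyplot as plt

# Load the per-request results written by LLMPerf (file name is an assumption)
requests = json.loads(next(Path("results").glob("*individual_responses.json")).read_text())

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist([r["ttft_s"] for r in requests], bins=20)
ax1.set_xlabel("Time to first token (s)")
ax1.set_ylabel("Requests")
ax2.hist([r["request_output_throughput_token_per_s"] for r in requests], bins=20)
ax2.set_xlabel("Output throughput (tokens/s)")
ax2.set_ylabel("Requests")
plt.tight_layout()
plt.show()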
Analyzing performance results from LLMPerf and estimating costs using Amazon CloudWatch
LLMPerf gives you the ability to benchmark the performance of custom models served on Amazon Bedrock without having to inspect the details of how Amazon Bedrock serves your model. This information is valuable because it represents the expected end user experience of your application.
In addition, the benchmarking exercise can serve as a valuable tool for cost estimation. Using Amazon CloudWatch, you can observe the number of active model copies that Amazon Bedrock Custom Model Import scales to in response to the load test. ModelCopy is exposed as a CloudWatch metric in the AWS/Bedrock namespace and is reported using the imported model ARN as a label. The plot of the ModelCopy metric is shown in the figure below. This data helps estimate costs, because billing is based on the number of active model copies at a given time.
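As a rough sketch of this estimation (the ModelId dimension name and the per-copy price below are assumptions and placeholders; verify the metric dimensions in the CloudWatch console and the actual rate on the Amazon Bedrock pricing page):

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
model_arn = "<your imported model ARN>"  # placeholder

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)  # window covering the load test

# Maximum number of active copies per 5-minute period during the test
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="ModelCopy",
    Dimensions=[{"Name": "ModelId", "Value": model_arn}],  # dimension name is an assumption
    StartTime=start,
    EndTime=end,
    Period=300,  # 5 minutes, matching the billing increment
    Statistics=["Maximum"],
)

price_per_copy_per_5_min = 0.10  # hypothetical rate; check Amazon Bedrock pricing
estimated_cost = sum(dp["Maximum"] for dp in stats["Datapoints"]) * price_per_copy_per_5_min
print(f"Estimated cost for the window: ${estimated_cost:.2f}")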
Conclusion
While Amazon Bedrock Custom Model Import simplifies model deployment and scaling, performance benchmarking remains essential to predict production performance and to compare models across key metrics such as cost, latency, and throughput.
To learn more, try the example notebook with your custom model.
About the authors
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.
Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on model serving and MLOps on Amazon SageMaker. Prior to this role, he worked as a machine learning engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
Paras Mehra is a Senior Product Manager at AWS, focused on helping build Amazon Bedrock. In his spare time, Paras enjoys spending time with his family and biking around the Bay Area.
Prashant Patel is a Senior Software Development Engineer at AWS Bedrock. He is passionate about scaling large language models for enterprise applications. Prior to joining AWS, he worked at IBM, productionizing large-scale AI/ML workloads in its labs. Prashant has a master's degree from NYU Tandon School of Engineering. When not at work, he enjoys hiking and playing with his dogs.