
Load Testing LLMs Using LLMPerf | Towards Data Science

Deploying a large language model (LLM) is not the last step in productionizing your generative AI application. An often forgotten, yet crucial, part of the MLOps lifecycle is properly load testing your LLM and ensuring it is ready to withstand your expected production traffic. Load testing, at a high level, is the practice of hitting your application or model with the traffic it would expect in a production environment to ensure that it remains functional and performant.

In a past article we discussed load testing traditional ML models using open source tools such as Locust. Locust helps capture general performance metrics such as requests per second (RPS) and latency percentiles. While this works well with traditional APIs and ML models, it does not capture the full story for LLMs.

LLMs traditionally have much lower RPS and higher latency than traditional ML models because of their size and heavier compute requirements. The RPS metric also does not usually provide the most accurate picture, because requests to an LLM can vary drastically in size. For example, one request might ask the model to summarize a lengthy document while another might only require a single-word response.

This is why tokens are seen as a much more accurate measure of performance for an LLM. At a high level, a token is a chunk of text; whenever an LLM processes your input it "tokenizes" it. What exactly a token corresponds to varies depending on the LLM you are using, but you can think of it as a word, a sequence of words, or a set of characters.
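As a quick illustration, here is a small sketch using OpenAI's open source tiktoken tokenizer. Claude and other models use their own tokenizers, so the exact counts will differ, but the idea is the same:

import tiktoken  # pip install tiktoken; used here purely for illustration

enc = tiktoken.get_encoding("cl100k_base")
text = "Who is Roger Federer?"
token_ids = enc.encode(text)

print(len(token_ids))                        # number of tokens in the prompt
print([enc.decode([t]) for t in token_ids])  # the individual text chunks the model actually sees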

Image by author

In this article we will explore how we can generate these token-based metrics, giving us a load testing tool purpose-built for LLMs.

Let's get hands-on! If you are more of a video-based learner, feel free to follow along with my corresponding YouTube video:

https://www.youtube.com/watch?v=abirlc9gain

NOTE: This article assumes a basic understanding of Python, LLMs, and Amazon Bedrock/SageMaker. If you are new to Amazon Bedrock, please check out my starter guide here. If you want to learn more about SageMaker JumpStart LLM deployments, you can refer to the video here.

DISCLAIMER: I work at AWS and my opinions are my own.

Table of Contents

  1. LLM-Specific Metrics
  2. LLMPerf Intro
  3. Applying LLMPerf to Amazon Bedrock
  4. Additional Resources & Conclusion

LLM-Specific Metrics

As we briefly discussed in the introduction, when it comes to LLM hosting, token-based metrics generally provide a much better representation of how your model responds to different payload sizes or types of queries (e.g., summarization vs. Q&A).

Traditionally we have always tracked RPS and latency, which we will still see here, but more so at the token level. Here are some of the metrics to be aware of before we get started with load testing:

  1. Time to First Token (TTFT): This is the time it takes for the first token to be generated. It is especially useful when streaming. For instance, when using ChatGPT we start processing information as soon as the first piece of text (token) appears.
  2. Total Output Tokens Per Second: This is the total number of tokens generated per second; you can think of it as a more granular alternative to requests per second.

These are the major metrics we will focus on, and there are a few others, such as inter-token latency, that will also be displayed as part of the load tests. Keep in mind that the parameters that influence these metrics include the expected input and output token sizes. We specifically play with these parameters to get an accurate understanding of how our LLM responds to different generation tasks.
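To make the relationship between requests per second and token throughput concrete, here is a quick back-of-the-envelope sketch. The numbers are purely illustrative assumptions, not measured results:

# Illustrative arithmetic only: how token throughput relates to RPS.
completed_requests = 30      # assumed requests finished during the test window
mean_output_tokens = 1024    # assumed average tokens generated per response
test_duration_s = 300        # assumed length of the test in seconds

rps = completed_requests / test_duration_s
output_tokens_per_s = completed_requests * mean_output_tokens / test_duration_s
print(f"{rps:.2f} requests/s is roughly {output_tokens_per_s:.0f} output tokens/s")

The same request rate can correspond to very different token throughputs depending on how long the responses are, which is exactly why the token-level view matters.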

Now let's take a look at a tool that enables us to toggle these parameters and display the relevant metrics we need.

LLMPerf Intro

LLMPerf is built on top of Ray, a popular distributed computing Python framework. LLMPerf specifically leverages Ray to create distributed load tests where we can simulate real-time, production-level traffic.

Note that any load testing tool can only generate your expected amount of traffic if the client machine it runs on has enough compute power to drive that load. For instance, as you scale the concurrency or throughput expected of your model, you will also want to scale the client machine(s) where you run your load test.
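As a rough, hedged sanity check (a rule of thumb, not an LLMPerf requirement), you can compare the concurrency you plan to simulate against the vCPUs available on the client machine:

import os

planned_concurrency = 25          # hypothetical value for your planned test
available_vcpus = os.cpu_count()  # vCPUs on the machine generating the load

if planned_concurrency > available_vcpus:
    print(f"{planned_concurrency} concurrent requests on {available_vcpus} vCPUs: "
          "the client may become the bottleneck; consider a larger instance.")
else:
    print(f"{available_vcpus} vCPUs should be enough headroom for "
          f"{planned_concurrency} concurrent requests.")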

Within LLMPerf there are a few parameters that are tailored towards LLM load testing, as we have discussed:

  • Model: This is the model provider and the hosted model that you are working with. For our use case it will be Amazon Bedrock and Claude 3 Sonnet specifically.
  • LLM API: This is the API format in which the payload should be structured. We use LiteLLM, which provides a standardized payload structure across different model providers, simplifying the setup process for us, especially if we want to test different models hosted on different platforms.
  • Input Tokens: The mean input token length; you can also specify a standard deviation for this number.
  • Output Tokens: The mean output token length; you can also specify a standard deviation for this number.
  • Concurrent Requests: The number of concurrent requests for the load test to simulate.
  • Test Duration: You can control the duration of the test; this parameter is specified in seconds.

LLMPerf exposes all of these parameters through its token_benchmark_ray.py script, which we configure with our specific values. Let's now take a look at how we can configure this specifically for Amazon Bedrock.

Applying LLMPerf to Amazon Bedrock

Setup

For this example we will be working in a SageMaker Classic Notebook Instance with a conda_python3 kernel on an ml.g5.12xlarge instance. Note that you want to select an instance with enough compute to generate the traffic load that you want to simulate. Ensure that you also have your AWS credentials configured for LLMPerf to access the hosted model, be it on Bedrock or SageMaker.
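A minimal setup sketch for the notebook itself, assuming you pull the token_benchmark_ray.py script from the public ray-project/llmperf GitHub repository (adjust the URL if you use a fork):

%%sh
# Clone LLMPerf and install its dependencies into the notebook environment.
git clone https://github.com/ray-project/llmperf.git
pip install -e ./llmperf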

LiteLLM Configuration

We first configure our LLM API structure of choice, which in this case is LiteLLM. LiteLLM has support across various model providers; here we configure a completion call to work with Amazon Bedrock:

import os
from litellm import completion

# AWS credentials and region for Bedrock access
os.environ["AWS_ACCESS_KEY_ID"] = "Enter your access key ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "Enter your secret access key"
os.environ["AWS_REGION_NAME"] = "us-east-1"

# Invoke Claude 3 Sonnet on Bedrock through LiteLLM's OpenAI-style completion API
response = completion(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"content": "Who is Roger Federer?", "role": "user"}]
)
output = response.choices[0].message.content
print(output)

To work with Bedrock we set the model ID to point towards Claude 3 Sonnet and pass in our prompt. The neat part of LiteLLM is that the messages key has a consistent format across model providers.
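Before wiring up the full load test, here is a rough sketch of how you could eyeball time to first token for a single request by switching the same call to streaming. This assumes LiteLLM's OpenAI-style streaming interface and the credentials configured above; it is a one-off measurement, not a substitute for LLMPerf:

import time
from litellm import completion

start = time.time()
ttft = None
chunks = []

response = completion(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"content": "Summarize the history of tennis in one paragraph.", "role": "user"}],
    stream=True,
)
for chunk in response:
    piece = chunk.choices[0].delta.content or ""
    if piece and ttft is None:
        ttft = time.time() - start  # time until the first streamed chunk arrives
    chunks.append(piece)

total = time.time() - start
print(f"Time to first token: {ttft:.2f}s, total generation time: {total:.2f}s")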

With that working, we can focus on combining LLMPerf with Bedrock directly.

LLMPerf Bedrock Integration

To execute a load test with LLMPerf we simply use the provided token_benchmark_ray.py script and pass in the parameters we discussed earlier:

  • Input token mean and standard deviation
  • Output token mean and standard deviation
  • Maximum number of requests to complete during the test
  • Duration of the test
  • Number of concurrent requests

In this case we also specify our API format to be LiteLLM, and we can execute the load test with a simple shell script like the following:

%%sh
python llmperf/token_benchmark_ray.py \
    --model bedrock/anthropic.claude-3-sonnet-20240229-v1:0 \
    --mean-input-tokens 1024 \
    --stddev-input-tokens 200 \
    --mean-output-tokens 1024 \
    --stddev-output-tokens 200 \
    --max-num-completed-requests 30 \
    --num-concurrent-requests 1 \
    --timeout 300 \
    --llm-api litellm \
    --results-dir bedrock-outputs

In this case we keep the concurrency low, but feel free to toggle this number depending on what you expect in production. Our test will run for up to 300 seconds, and afterwards you should see a results directory with two JSON files: one with statistics for each individual inference and one with summary metrics across all requests in the duration of the test.
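Optionally, a quick check of what the test actually wrote out (the directory name comes from the --results-dir flag above; the exact file names depend on your model and token settings):

from pathlib import Path

# List the JSON artifacts produced by the load test.
for path in sorted(Path("bedrock-outputs").glob("*.json")):
    print(path.name)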

We can make this look a little neater by parsing the two files with pandas:

import json
from pathlib import Path
import pandas as pd

# Load JSON files
individual_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_individual_responses.json")
summary_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_summary.json")

with open(individual_path, "r") as f:
    individual_data = json.load(f)

with open(summary_path, "r") as f:
    summary_data = json.load(f)

# Print summary metrics
df = pd.DataFrame(individual_data)
summary_metrics = {
    "Model": summary_data.get("model"),
    "Mean Input Tokens": summary_data.get("mean_input_tokens"),
    "Stddev Input Tokens": summary_data.get("stddev_input_tokens"),
    "Mean Output Tokens": summary_data.get("mean_output_tokens"),
    "Stddev Output Tokens": summary_data.get("stddev_output_tokens"),
    "Mean TTFT (s)": summary_data.get("results_ttft_s_mean"),
    "Mean Inter-token Latency (s)": summary_data.get("results_inter_token_latency_s_mean"),
    "Mean Output Throughput (tokens/s)": summary_data.get("results_mean_output_throughput_token_per_s"),
    "Completed Requests": summary_data.get("results_num_completed_requests"),
    "Error Rate": summary_data.get("results_error_rate")
}
print("Claude 3 Sonnet - Performance Summary:n")
for k, v in summary_metrics.items():
    print(f"{k}: {v}")

The final load test results will look something like the following:

Screenshot by author

Here we can see the input parameters that we configured along with the corresponding results, such as time to first token (TTFT) and throughput in terms of mean output tokens per second.

In a real-world use case you might use LLMPerf across many different model providers and run tests across those platforms. With this tool you can holistically identify the right model and serving stack for your use case when testing at scale.

Additional Resources & Conclusion

All of the sample code can be found in the associated GitHub repository. If you also want to work with SageMaker endpoints, you can find a Llama JumpStart deployment load testing sample here.

All in all, load testing and evaluation are both crucial to ensuring that your LLM performs well against your expected traffic before pushing to production. In future articles we will cover not just the evaluation portion, but how we can build a holistic test with both components.

As always, thank you for reading, and feel free to leave any feedback and connect with me on LinkedIn and X.

