
Deploying a Magistral vLLM Server on Modal


Photo by the author

I was first introduced to Modal while participating in a hackathon, and I was surprised by how easy it was to use. The platform lets you build and deploy applications within minutes, offering a seamless experience similar to BentoCloud. With Modal, you define your Python application, including resource requirements such as GPUs, Docker images, and Python dependencies, and deploy it to the cloud with a single command.

In this tutorial, we will learn how to set up a vLLM server and deploy it securely to the cloud. We will also cover how to test the vLLM server using curl and the OpenAI SDK.

1. Setting Up Modal

Modal is a serverless platform that lets you run any code remotely. With just one line, you can attach GPUs, serve your functions as web endpoints, and run batch jobs. It is a suitable platform for beginners, data scientists, and non-software engineers who want to avoid dealing with cloud infrastructure.

First, install the Modal Python client. This tool lets you build images, deploy apps, and manage cloud resources directly from your terminal.
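The client is distributed on PyPI, so a standard pip install is all that is needed:

```shell
pip install modal
```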

Next, set up Modal on your local machine. Run the following command, which will walk you through creating an account and authenticating your device:

Setting the VLLM_API_KEY environment variable enables vLLM's API-key authentication, so only clients that present a valid key can access the server. You can provide this variable securely by creating a Modal secret.

Replace your_actual_api_key_here with your preferred API key:

modal secret create vllm-api VLLM_API_KEY=your_actual_api_key_here

This ensures that your API key is stored securely and is accessible only to your deployed apps.

2. Creating the vLLM Application with Modal

This section walks you through building a GPU-accelerated vLLM inference server on Modal, using a custom Docker image and persistent storage. We will use the mistralai/Magistral-Small-2506 model, which requires a specific tokenizer configuration and a tool-call parser.

Create a vllm_inference.py file and add the following code for:

  1. Defining the vLLM image based on Debian Slim, with Python 3.12 and all required packages. We also set environment variables to speed up model downloads and improve inference performance.
  2. Creating two Modal volumes, one for the Hugging Face model cache and one for the vLLM cache, to avoid repeated downloads and speed up cold starts.
  3. Specifying the model name and revision to ensure reproducibility, and enabling the vLLM V1 engine for better performance.
  4. Defining the Modal app, specifying GPU resources, scaling behavior, timeouts, storage volumes, and secrets. Concurrent requests per replica are limited for stability.
  5. Creating a web server function that uses Python's subprocess library to run the vLLM serve command.
import modal

vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        "vllm==0.9.1",
        "huggingface_hub[hf_transfer]==0.32.0",
        "flashinfer-python==0.2.6.post1",
        # extra package index for CUDA-built wheels
        extra_index_url="https://download.pytorch.org/whl/cu128",
    )
    .env(
        {
            "HF_HUB_ENABLE_HF_TRANSFER": "1",  # faster model transfers
            "NCCL_CUMEM_ENABLE": "1",
        }
    )
)

MODEL_NAME = "mistralai/Magistral-Small-2506"
MODEL_REVISION = "48c97929837c3189cb3cf74b1b5bc5824eef5fcc"

hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)
vllm_image = vllm_image.env({"VLLM_USE_V1": "1"})

FAST_BOOT = True

app = modal.App("magistral-small-vllm")

N_GPU = 2
MINUTES = 60  # seconds
VLLM_PORT = 8000

@app.function(
    image=vllm_image,
    gpu=f"A100:{N_GPU}",
    scaledown_window=15 * MINUTES,  # How long should we stay up with no requests?
    timeout=10 * MINUTES,  # How long should we wait for the container to start?
    volumes={
        "/root/.cache/huggingface": hf_cache_vol,
        "/root/.cache/vllm": vllm_cache_vol,
    },
    secrets=[modal.Secret.from_name("vllm-api")],
)
@modal.concurrent(  # How many requests can one replica handle? tune carefully!
    max_inputs=32
)
@modal.web_server(port=VLLM_PORT, startup_timeout=10 * MINUTES)
def serve():
    import subprocess

    cmd = [
        "vllm",
        "serve",
        MODEL_NAME,
        "--tokenizer_mode",
        "mistral",
        "--config_format",
        "mistral",
        "--load_format",
        "mistral",
        "--tool-call-parser",
        "mistral",
        "--enable-auto-tool-choice",
        "--tensor-parallel-size",
        "2",
        "--revision",
        MODEL_REVISION,
        "--served-model-name",
        MODEL_NAME,
        "--host",
        "0.0.0.0",
        "--port",
        str(VLLM_PORT),
    ]

    cmd += ["--enforce-eager" if FAST_BOOT else "--no-enforce-eager"]
    print(cmd)
    subprocess.Popen(" ".join(cmd), shell=True)

3. Deploying the vLLM Server to Modal

Now that your vllm_inference.py file is ready, you can deploy your vLLM server to Modal with a single command:

modal deploy vllm_inference.py

In just a few seconds, Modal will build the container image (if it has not been built already) and deploy your application. You will see output like the following:

✓ Created objects.
├── 🔨 Created mount C:\Repository\GitHub\Deploying-the-Magistral-with-Modal\vllm_inference.py
└── 🔨 Created web function serve => 
✓ App deployed in 6.671s! 🎉

View Deployment: 

Once deployed, the server will start downloading the model weights and loading them onto the GPUs. This process can take several minutes (typically around 5 minutes for large models), so please be patient while the model initializes.

You can view your deployment and monitor logs in the Apps section of your Modal dashboard.


Once the logs show that the server is up and running, you can open the automatically generated API documentation in your browser.

This interactive documentation describes all available endpoints and lets you test them directly from your browser.


To verify that your model is loaded and available, run the following curl command in your terminal, replacing <YOUR_API_KEY> with the API key you configured for the vLLM server:

curl -X 'GET' \
  '<YOUR_DEPLOYMENT_URL>/v1/models' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer <YOUR_API_KEY>'

The response confirms that the mistralai/Magistral-Small-2506 model is available and ready to serve requests:

{"object":"list","data":[{"id":"mistralai/Magistral-Small-2506","object":"model","created":1750013321,"owned_by":"vllm","root":"mistralai/Magistral-Small-2506","parent":null,"max_model_len":40960,"permission":[{"id":"modelperm-33a33f8f600b4555b44cb42fca70b931","object":"model_permission","created":1750013321,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
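Beyond listing models, the same server exposes the standard /v1/chat/completions endpoint, so you can run a quick smoke test from the terminal as well. The deployment URL and API key placeholders below are yours to fill in; the request body follows the standard OpenAI chat-completions format:

```shell
curl -X 'POST' \
  '<YOUR_DEPLOYMENT_URL>/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <YOUR_API_KEY>' \
  -d '{
        "model": "mistralai/Magistral-Small-2506",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```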

4. Using the vLLM Server with the OpenAI SDK

You can interact with your vLLM server just as you would with the OpenAI API, thanks to vLLM's OpenAI-compatible endpoints. Here's how to connect securely and test your deployment using the OpenAI Python SDK.

  • Create a .env file in your project directory and add your vLLM API key:
VLLM_API_KEY=your-actual-api-key-here
  • Install the python-dotenv and openai packages:
pip install python-dotenv openai
  • Create a file named client.py that tests the vLLM server's functionality, including a simple chat completion and streaming responses.
import asyncio
import json
import os

from dotenv import load_dotenv
from openai import AsyncOpenAI, OpenAI

# Load environment variables from .env file
load_dotenv()

# Get API key from environment
api_key = os.getenv("VLLM_API_KEY")

# Set up the OpenAI client with custom base URL
client = OpenAI(
    api_key=api_key,
    base_url="<YOUR_DEPLOYMENT_URL>/v1",  # replace with your Modal deployment URL
)

MODEL_NAME = "mistralai/Magistral-Small-2506"

# --- 1. Simple Completion ---
def run_simple_completion():
    print("\n" + "=" * 40)
    print("[1] SIMPLE COMPLETION DEMO")
    print("=" * 40)
    try:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
        ]
        response = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=32,
        )
        print("\nResponse:\n    " + response.choices[0].message.content.strip())
    except Exception as e:
        print(f"[ERROR] Simple completion failed: {e}")
    print("\n" + "=" * 40 + "\n")

# --- 2. Streaming Example ---
def run_streaming():
    print("\n" + "=" * 40)
    print("[2] STREAMING DEMO")
    print("=" * 40)
    try:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a short poem about AI."},
        ]
        stream = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=64,
            stream=True,
        )
        print("\nStreaming response:")
        print("    ", end="")
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n[END OF STREAM]")
    except Exception as e:
        print(f"[ERROR] Streaming demo failed: {e}")
    print("\n" + "=" * 40 + "\n")

# --- 3. Async Streaming Example ---
async def run_async_streaming():
    print("\n" + "=" * 40)
    print("[3] ASYNC STREAMING DEMO")
    print("=" * 40)
    try:
        async_client = AsyncOpenAI(
            api_key=api_key,
            base_url="<YOUR_DEPLOYMENT_URL>/v1",  # replace with your Modal deployment URL
        )
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a fun fact about space."},
        ]
        stream = await async_client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=32,
            stream=True,
        )
        print("\nAsync streaming response:")
        print("    ", end="")
        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n[END OF ASYNC STREAM]")
    except Exception as e:
        print(f"[ERROR] Async streaming demo failed: {e}")
    print("\n" + "=" * 40 + "\n")

if __name__ == "__main__":
    run_simple_completion()
    run_streaming()
    asyncio.run(run_async_streaming())

Everything works well: response generation is fast, and the latency is very low.

========================================
[1] SIMPLE COMPLETION DEMO
========================================

Response:
    The capital of France is Paris. Is there anything else you'd like to know about France?

========================================


========================================
[2] STREAMING DEMO
========================================

Streaming response:
    In Silicon dreams, I'm born, I learn,
From data streams and human works.
I grow, I calculate, I see,
The patterns that the humans leave.

I write, I speak, I code, I play,
With logic sharp, and snappy pace.
Yet for all my smarts, this day
[END OF STREAM]

========================================


========================================
[3] ASYNC STREAMING DEMO
========================================

Async streaming response:
    Sure, here's a fun fact about space: "There's a planet that may be entirely made of diamond. Blast! In 2004,
[END OF ASYNC STREAM]

========================================
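Since the server was launched with --enable-auto-tool-choice and the mistral tool-call parser, the same client can also drive tool calling. Below is a minimal sketch of that flow; the get_weather function, its schema, and the deployment URL placeholder are illustrative assumptions, not part of the original project:

```python
import json
import os

# Hypothetical local tool the model can ask us to call.
def get_weather(city: str) -> str:
    return f"It is sunny in {city}."

# Tool schema advertised to the model (standard OpenAI tools format).
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

def run_tool_call() -> None:
    from openai import OpenAI  # imported here so the schema above works standalone

    client = OpenAI(
        api_key=os.getenv("VLLM_API_KEY"),
        base_url="<YOUR_DEPLOYMENT_URL>/v1",  # replace with your Modal deployment URL
    )
    response = client.chat.completions.create(
        model="mistralai/Magistral-Small-2506",
        messages=[{"role": "user", "content": "What is the weather in Paris?"}],
        tools=TOOLS,
    )
    message = response.choices[0].message
    if message.tool_calls:
        # Execute the tool the model selected and print its result.
        call = message.tool_calls[0]
        args = json.loads(call.function.arguments)
        if call.function.name == "get_weather":
            print(get_weather(**args))
    else:
        print(message.content)

if __name__ == "__main__":
    run_tool_call()
```

In a full application you would append the tool result back to the message list and call the model again so it can produce a final answer.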

In the Modal dashboard, you can view all function calls, their timestamps, execution durations, and statuses.


If you face any issues running the code above, please refer to the kingabzpro/Deploying-the-Magistral-with-Modal GitHub repository and follow the instructions provided there to resolve them.

Final Thoughts

Modal is an exciting platform, and I am learning more about it every day. It is a general-purpose platform, meaning you can use it for simple Python programs as well as for machine learning training and inference. In short, it is not limited to serving models: you can also use it to fine-tune large language models by running training scripts remotely.

It is designed for engineers who want to avoid infrastructure work and ship applications as quickly as possible. You do not have to worry about servers, storage setup, or networking, or wrestle with Kubernetes and Docker. All you have to do is create a Python file and deploy it. Everything else is handled by the cloud.

Abid Ali Awan (@1abidaliawan) is a certified data scientist who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a Bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

