How to Serve Multiple Local LLMs Using llama-swap on One Server


Image by author | Ideogram
Running multiple large language models locally can be useful, whether for comparing model outputs, falling back when one model fails, or specializing behavior (such as one model for coding and another for technical writing). This is how we often use LLMs at work. Applications such as poe.com offer this kind of setup: a single platform where you can run many LLMs. But what if you want to do it all locally, avoid API costs, and keep your data private?
That is where the real pain begins. Setting this up usually means juggling different ports, manual processes, and switching between models by hand. Not fun.
This is exactly the pain llama-swap solves. It is a lightweight open-source proxy (just a single binary) that lets you switch between many local LLMs. In simple terms, it listens for OpenAI-style API calls on your machine and automatically starts or stops the right backend server for the model you request. Let's break down how it works and walk step by step through getting it running on your local machine.
How llama-swap Works
Essentially, llama-swap sits in front of your LLM servers as a smart router. When an API request arrives (e.g., a POST /v1/chat/completions call), it looks at the "model" field in the JSON payload. It then loads the appropriate process for that model, shutting down any other model if required. For example, if you first request model "A" and then request model "B", llama-swap will automatically stop the server for "A" and start the server for "B", so each request is served by the right model. This swapping happens transparently, so clients see the expected response without worrying about the underlying processes.
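Conceptually, the swap-on-request behavior can be sketched in a few lines of Python. This is an illustrative toy, not llama-swap's actual implementation; the model names and commands below are made up:

```python
# Hypothetical model registry: model ID -> command that would serve it.
MODEL_COMMANDS = {
    "A": ["llama-server", "--model", "a.gguf", "--port", "9001"],
    "B": ["llama-server", "--model", "b.gguf", "--port", "9002"],
}

class SwappingRouter:
    """Keeps at most one backend alive, swapping on demand."""

    def __init__(self, commands):
        self.commands = commands
        self.current = None   # model ID currently loaded
        self.process = None   # handle of its running backend, if any

    def route(self, requested_model):
        """Ensure the requested model's backend is running, swapping if needed."""
        if requested_model not in self.commands:
            raise KeyError(f"unknown model: {requested_model}")
        if requested_model != self.current:
            if self.process is not None:
                self.process.terminate()   # stop the old backend
                self.process = None
            # A real router would spawn the backend here, e.g. with
            # subprocess.Popen(self.commands[requested_model]); commented out
            # so the sketch stays runnable without llama-server installed.
            self.current = requested_model
        return self.current
```

The proxy part is omitted; the point is only that each incoming "model" name is mapped to a backend process, and requesting a different model tears down the old one first.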
By default, llama-swap allows only one model to run at a time (unloading others when switching). However, its groups feature lets you change this behavior. A group can list several models and control how they are swapped. For example, setting swap: false in a group means all of its members can run together without being unloaded. In practice, you might use one group for heavy models (only one active at a time) and another group of small models that you want running at all times. This gives you full control over resources and concurrency on a single server.
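A groups section in config.yaml might look like the following. The field names (swap, members) follow the llama-swap documentation as best I recall; treat this as a sketch and check the project's example config before relying on it:

```yaml
models:
  "qwen2.5":
    cmd: |
      llama-server --model /path/to/qwen.gguf --port ${PORT}
  "smollm2":
    cmd: |
      llama-server --model /path/to/smollm2.gguf --port ${PORT}

groups:
  # Small models that may run at the same time: no swapping inside the group.
  "always-on":
    swap: false
    members:
      - "smollm2"
  # Heavy models: only one member loaded at a time.
  "heavy":
    swap: true
    members:
      - "qwen2.5"
```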
Requirements
Before you start, make sure your system has the following:
- Python 3 (>= 3.8): Required for basic scripting and tooling.
- Homebrew (on macOS): Makes installing LLM runtimes easier. For example, you can install the llama.cpp server with:
brew install llama.cpp
This gives you the llama-server binary along with the tooling to run models locally.
- llama.cpp (llama-server): An OpenAI-compatible server binary (installed with Homebrew above, or built from source) that actually runs the LLM models.
- Hugging Face CLI: For downloading models directly to your local machine without signing in on the website or navigating model pages. Install it with:
pip install -U "huggingface_hub[cli]"
- Hardware: Any modern CPU will work; a GPU helps with speed. (On Apple Silicon Macs, you can run on the CPU or try PyTorch's MPS backend for supported models. On Linux/Windows with NVIDIA GPUs, you can use Docker/CUDA containers.)
- Docker (optional): For running the pre-built Docker images. However, I chose not to use it for this guide because those images are mainly built for x86 systems (Intel/AMD) and are unreliable on Apple Silicon (M1/M2) Macs. Instead, I used the bare-metal installation method, which runs directly on macOS without any container overhead.
In short, you will need a Python environment and a local LLM server (such as the llama.cpp server). We will use these to serve two example models on one machine.
Step-by-Step Instructions
// 1. Installing llama-swap
Download the latest llama-swap release for your OS from the GitHub releases page. For example, I saw v126 as the most recent release. Run the following command:
# Step 1: Download the correct file
curl -L -o llama-swap.tar.gz
Output:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 3445k 100 3445k 0 0 1283k 0 0:00:02 0:00:02 --:--:-- 5417k
Now, extract the archive, make the binary executable, and verify it by checking the version:
# Step 2: Extract it
tar -xzf llama-swap.tar.gz
# Step 3: Make it executable
chmod +x llama-swap
# Step 4: Test it
./llama-swap --version
Output:
version: 126 (591a9cdf4d3314fe4b3906e939a17e76402e1655), built at 2025-06-16T23:53:50Z
// 2. Downloading and Preparing Two or More LLMs
Choose two small models you can run. We will use Qwen2.5-0.5B and SmolLM2-135M (both small models) from Hugging Face. You need the model files (in GGUF or a similar format) on your machine. For example, using the Hugging Face CLI:
mkdir -p ~/llm-models
huggingface-cli download bartowski/SmolLM2-135M-Instruct-GGUF \
  --include "SmolLM2-135M-Instruct-Q4_K_M.gguf" --local-dir ~/llm-models
huggingface-cli download bartowski/Qwen2.5-0.5B-Instruct-GGUF \
  --include "Qwen2.5-0.5B-Instruct-Q4_K_M.gguf" --local-dir ~/llm-models
This will:
- Create an llm-models directory in your user's home folder
- Download the GGUF model files into that folder
After downloading, you can confirm the files are there:
ls ~/llm-models
Output:
SmolLM2-135M-Instruct-Q4_K_M.gguf
Qwen2.5-0.5B-Instruct-Q4_K_M.gguf
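As an extra sanity check, GGUF files begin with the ASCII magic "GGUF" in their first four bytes (per the GGUF specification), so a short script can verify that the downloads are intact. The directory path matches the one used above:

```python
from pathlib import Path

def looks_like_gguf(header: bytes) -> bool:
    """GGUF files start with the ASCII magic 'GGUF' in the first four bytes."""
    return header[:4] == b"GGUF"

def check_models(directory: str) -> None:
    """Print whether each .gguf file in the directory has the expected magic."""
    for path in Path(directory).expanduser().glob("*.gguf"):
        with open(path, "rb") as f:
            ok = looks_like_gguf(f.read(4))
        print(f"{path.name}: {'OK' if ok else 'not a GGUF file?'}")

# Example (assumes the download directory used above):
# check_models("~/llm-models")
```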
// 3. Creating the llama-swap Configuration
llama-swap uses a single YAML file to define models and their server commands. Create a config.yaml file with content like this:
models:
  "smollm2":
    cmd: |
      llama-server
      --model /path/to/models/llm-models/SmolLM2-135M-Instruct-Q4_K_M.gguf
      --port ${PORT}
  "qwen2.5":
    cmd: |
      llama-server
      --model /path/to/models/llm-models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf
      --port ${PORT}
Replace /path/to/models/ with your actual local path. Each entry under models: gives an ID (like "qwen2.5") with a shell cmd: that runs its server. We use llama-server (from llama.cpp) with --model pointing to a GGUF file and --port ${PORT}. The ${PORT} macro tells llama-swap to assign a free port to each model automatically. The groups section is optional; I left it out of this example, so by default llama-swap will run only one model at a time. You can customize many options per model (aliases, timeouts, etc.) in this configuration. For more information on the available options, see the full example configuration file.
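For example, a model entry with extra options might look like this. The option names aliases and ttl are taken from the llama-swap documentation as best I recall, so verify them against the project's full example config before using this:

```yaml
models:
  "qwen2.5":
    cmd: |
      llama-server
      --model /path/to/models/llm-models/Qwen2.5-0.5B-Instruct-Q4_K_M.gguf
      --port ${PORT}
    # Alternative names clients may send in the "model" field.
    aliases:
      - "qwen"
    # Unload the model after this many seconds of inactivity.
    ttl: 300
```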
// 4. Running llama-swap
With the binary and config.yaml ready, start llama-swap and point it at your config:
./llama-swap --config config.yaml --listen 127.0.0.1:8080
This starts a proxy server on localhost:8080. It will read config.yaml and (at startup) load no models until the first request arrives. llama-swap will then manage API requests on port 8080, routing each to the correct llama-server process based on the "model" parameter.
// 5. Interacting with Your Models
You can now make OpenAI-style API calls to test each model. Install jq if you do not already have it before running the commands below:
// Using qwen2.5
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "qwen2.5",
    "prompt": "User: What is Python?\nAssistant:",
    "max_tokens": 100
  }' | jq '.choices[0].text'
Output:
"Python is a popular general-purpose programming language. It is easy to learn, has a large standard library, and is compatible with many operating systems. Python is used for web development, data analysis, scientific computing, and machine learning.\nPython is a language that is popular for web development due to its simplicity, versatility and its use of modern features. It is used in a wide range of applications including web development, data analysis, scientific computing, machine learning and more. Python is a popular language in the"
// Using smollm2
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "smollm2",
    "prompt": "User: What is Python?\nAssistant:",
    "max_tokens": 100
  }' | jq '.choices[0].text'
Output:
"Python is a high-level programming language designed for simplicity and efficiency. It's known for its readability, syntax, and versatility, making it a popular choice for beginners and developers alike.\n\nWhat is Python?"
Each model responds according to its training. The beauty of llama-swap is that you don't have to restart anything manually – just change the "model" field, and it handles the rest. As shown in the examples above, you will see:
- qwen2.5: A more detailed, technical response
- smollm2: A simpler, shorter answer
This confirms that llama-swap is routing requests to the relevant model!
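If you prefer Python over curl, the same calls can be made with the standard library alone. This is a minimal sketch assuming llama-swap is listening on localhost:8080 as configured above; the endpoint and payload simply mirror the curl examples:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # where llama-swap is listening

def build_payload(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Build an OpenAI-style completion payload; the "model" field
    decides which backend llama-swap routes the request to."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(model: str, prompt: str) -> str:
    """Send the request through the proxy and return the generated text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer no-key"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

# Example (requires llama-swap to be running):
#   print(complete("qwen2.5", "User: What is Python?\nAssistant:"))
#   print(complete("smollm2", "User: What is Python?\nAssistant:"))
```

As with the curl version, swapping happens entirely server-side: only the "model" field changes between calls.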
Wrapping Up
Congratulations! You have set up llama-swap to serve two LLMs on one machine, and you can now switch between them on the fly via API calls. We downloaded the proxy binary, prepared a YAML configuration with two models, and saw how llama-swap routes requests to the right one.
Next steps: you can extend this setup with:
- Larger models (such as TinyLlama, Phi-2, or Mistral)
- Groups for running models concurrently
- Integration with LangChain, FastAPI, or other tools
Enjoy experimenting with different models and configurations!
Kanwal Mehreen is a machine learning engineer and a technical writer with a deep passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.



