Building AI Agents with llama.cpp


Photo by the writer
llama.cpp is a high-performance inference framework that powers many popular local AI tools, including Ollama, local chat applications, and other LLM solutions. By working with llama.cpp directly, you can reduce overhead, gain fine-grained control, and optimize performance for your specific hardware, making your local agents and applications as fast as possible.
In this tutorial, I will guide you through building AI agents using llama.cpp, a powerful C/C++ library for running large language models (LLMs) efficiently. We will cover setting up the llama.cpp server, integrating it with LangChain, and building a ReAct-style agent that can use tools such as web search and a Python REPL.
1. Setting Up the llama.cpp Server
This section covers installing llama.cpp and its dependencies, configuring the build for CUDA support, compiling the necessary binaries, and running the server.
Note: We are using an NVIDIA RTX 4090 graphics card on a Linux machine with the CUDA toolkit pre-installed. If you don't have access to similar hardware, you can rent GPU instances from Vast.ai at a low cost.


Screenshot from Vast.ai | Console
- Update your system's package list and install essential tools like build-essential, cmake, curl, and git. pciutils is installed for hardware information, and libcurl4-openssl-dev is required by llama.cpp to download models from Hugging Face.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git -y
- Clone the official llama.cpp repository from GitHub and use cmake to configure the build with CUDA support.
# Clone llama.cpp repository
git clone
# Configure build with CUDA support
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_CUDA=ON \
    -DLLAMA_CURL=ON
- Build llama.cpp and all of its tools, including the server. For convenience, copy all compiled binaries from the llama.cpp/build/bin/ directory into the llama.cpp/ directory.
# Build all necessary binaries including server
cmake --build llama.cpp/build --config Release -j --clean-first
# Copy all binaries to main directory
cp llama.cpp/build/bin/* llama.cpp/
- Start the llama.cpp server with the unsloth/gemma-3-4b-it-GGUF model.
./llama.cpp/llama-server \
    -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL \
    --host 0.0.0.0 \
    --port 8000 \
    --n-gpu-layers 999 \
    --ctx-size 8192 \
    --threads $(nproc) \
    --temp 0.6 \
    --cache-type-k q4_0 \
    --jinja
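Downloading and loading the model can take a while, so before sending requests it helps to wait until the server reports ready. llama-server exposes a simple GET /health endpoint; below is a minimal polling sketch in Python (the endpoint path and response shape reflect recent llama.cpp builds and may differ in older versions):

```python
import json
import time
import urllib.error
import urllib.request


def parse_health(body: str) -> bool:
    """Interpret the /health JSON body; the server answers {"status":"ok"} when ready."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("status") == "ok"


def server_is_healthy(base_url: str = "http://localhost:8000") -> bool:
    """Ask the server's health endpoint whether the model has finished loading."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return parse_health(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return False


def wait_for_server(base_url: str = "http://localhost:8000", retries: int = 30) -> bool:
    """Poll once per second until the server is up or retries run out."""
    for _ in range(retries):
        if server_is_healthy(base_url):
            return True
        time.sleep(1)
    return False
```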
- You can check whether the server is working correctly by sending a POST request using curl.
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Hello! How are you today?"}
        ],
        "max_tokens": 150,
        "temperature": 0.7
    }'
Output:
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"\nOkay, user greeted me with a simple \"Hello! How are you today?\"\n\nHmm, this seems like a casual opening. The user might be testing the waters to see if I respond naturally, or maybe they genuinely want to know how an AI assistant conceptualizes \"being\" but in a friendly way.\n\nI notice they used an exclamation mark, which feels warm and possibly playful. Maybe they're in a good mood or just trying to make conversation feel less robotic.\n\nSince I don't have emotions, I should clarify that gently but still keep it warm. The response should acknowledge their greeting while explaining my nature as an AI.\n\nI wonder if they're asking because they're curious about AI consciousness, or just being polite"}}],"created":1749319250,"model":"gpt-3.5-turbo","system_fingerprint":"b5605-5787b5da","object":"chat.completion","usage":{"completion_tokens":150,"prompt_tokens":9,"total_tokens":159},"id":"chatcmpl-jNfif9mcYydO2c6nK0BYkrtpNXSnseV1","timings":{"prompt_n":9,"prompt_ms":65.502,"prompt_per_token_ms":7.278,"prompt_per_second":137.40038472107722,"predicted_n":150,"predicted_ms":1207.908,"predicted_per_token_ms":8.052719999999999,"predicted_per_second":124.1816429728092}}
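If you prefer to test from Python instead of curl, the same OpenAI-compatible endpoint can be called with the standard library alone. A minimal sketch, assuming the server above is listening on localhost:8000:

```python
import json
import urllib.request


def build_chat_payload(prompt: str, max_tokens: int = 150, temperature: float = 0.7) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def ask_local_server(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to llama-server's OpenAI-compatible chat endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# With the server running:
# print(ask_local_server("Hello! How are you today?"))
```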
2. Building an AI Agent with LangGraph and llama.cpp
Now, let's use LangGraph and LangChain to connect to the llama.cpp server and build a multi-tool AI agent.
- Set your Tavily API key to enable the web search capability.
- For LangChain to work with the llama.cpp server (which exposes an OpenAI-compatible API), you can set OPENAI_API_KEY to "local" or any placeholder string, since the base_url will route requests to your local server.
export TAVILY_API_KEY="your_api_key_here"
export OPENAI_API_KEY=local
- Install the required Python libraries: langgraph for building agents, tavily-python for the search tool, and the various LangChain packages for LLM integration and tools.
%%capture
!pip install -U \
    langgraph tavily-python langchain langchain-community langchain-experimental langchain-openai
- Configure ChatOpenAI from LangChain to communicate with your local llama.cpp server.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="unsloth/gemma-3-4b-it-GGUF:Q4_K_XL",
    temperature=0.6,
    base_url="http://localhost:8000/v1",
)
- Define the tools the agent will be able to use.
- TavilySearchResults: allows the agent to search the web.
- PythonREPLTool: provides the agent with a Python Read-Eval-Print Loop to execute code.
from langchain_community.tools import TavilySearchResults
from langchain_experimental.tools.python.tool import PythonREPLTool
search_tool = TavilySearchResults(max_results=5, include_answer=True)
code_tool = PythonREPLTool()
tools = [search_tool, code_tool]
- Create the agent using LangGraph's prebuilt create_react_agent function.
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    model=llm,
    tools=tools,
)
3. Testing the AI Agent with Example Queries
Now we will test the AI agent and show which tools it uses for each query.
- This helper extracts the names of the tools the agent invoked from the conversation history. This is helpful for understanding the agent's decision-making process.
def extract_tool_names(conversation: dict) -> list[str]:
    """Collect the names of all tools invoked during an agent run."""
    tool_names = set()
    for msg in conversation.get('messages', []):
        calls = []
        # LangChain message objects expose tool_calls as an attribute
        if hasattr(msg, 'tool_calls'):
            calls = msg.tool_calls or []
        # Plain dict messages may carry tool_calls directly or in additional_kwargs
        elif isinstance(msg, dict):
            calls = msg.get('tool_calls') or []
            if not calls and isinstance(msg.get('additional_kwargs'), dict):
                calls = msg['additional_kwargs'].get('tool_calls', [])
        else:
            ak = getattr(msg, 'additional_kwargs', None)
            if isinstance(ak, dict):
                calls = ak.get('tool_calls', [])
        for call in calls:
            if isinstance(call, dict):
                if 'name' in call:
                    tool_names.add(call['name'])
                elif 'function' in call and isinstance(call['function'], dict):
                    fn = call['function']
                    if 'name' in fn:
                        tool_names.add(fn['name'])
    return sorted(tool_names)
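You can sanity-check this logic without running the agent by feeding it a hand-built conversation. The sketch below uses a trimmed re-implementation covering dict messages only (renamed so it does not shadow the full helper); the message shapes are hypothetical stand-ins for what LangChain produces:

```python
def extract_tool_names_from_dicts(conversation: dict) -> list[str]:
    """Trimmed re-implementation of the helper above, for dict messages only."""
    tool_names = set()
    for msg in conversation.get("messages", []):
        for call in (msg.get("tool_calls") or []):
            if "name" in call:
                tool_names.add(call["name"])
            elif isinstance(call.get("function"), dict) and "name" in call["function"]:
                tool_names.add(call["function"]["name"])
    return sorted(tool_names)


# Hand-built conversation mimicking an agent run (hypothetical message shapes)
conversation = {
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant",
         "tool_calls": [{"name": "Python_REPL", "args": {"query": "print(2 + 2)"}}]},
        {"role": "assistant",
         "tool_calls": [{"function": {"name": "tavily_search_results_json"}}]},
    ]
}

print(extract_tool_names_from_dicts(conversation))
# ['Python_REPL', 'tavily_search_results_json']
```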
- Define a function that runs the agent on a given question and returns the tools used along with the final response.
def run_agent(question: str):
    result = agent.invoke({"messages": [{"role": "user", "content": question}]})
    raw_answer = result["messages"][-1].content
    tools_used = extract_tool_names(result)
    return tools_used, raw_answer
- Let's ask the agent for the top 5 breaking news stories. It should use the tavily_search_results_json tool.
tools, answer = run_agent("What are the top 5 breaking news stories?")
print("Tools used ➡️", tools)
print(answer)
Output:
Tools used ➡️ ['tavily_search_results_json']
Here are the top 5 breaking news stories based on the provided sources:
1. **Gaza Humanitarian Crisis:** Ongoing conflict and challenges in Gaza, including the Eid al-Adha holiday, and the retrieval of a Thai hostage's body.
2. **Russian Drone Attacks on Kharkiv:** Russia continues to target Ukrainian cities with drone and missile strikes.
3. **Wagner Group Departure from Mali:** The Wagner Group is leaving Mali after heavy losses, but Russia's Africa Corps remains.
4. **Trump-Musk Feud:** A dispute between former President Trump and Elon Musk could have implications for Tesla stock and the U.S. space program.
5. **Education Department Staffing Cuts:** The Biden administration is seeking Supreme Court intervention to block planned staffing cuts at the Education Department.
- Let's ask the agent to write and execute Python code for the Fibonacci series. It should use the Python_REPL tool.
tools, answer = run_agent(
"Write a code for the Fibonacci series and execute it using Python REPL."
)
print("Tools used ➡️", tools)
print(answer)
Output:
Tools used ➡️ ['Python_REPL']
The Fibonacci series up to 10 terms is [0, 1, 1, 2, 3, 5, 8, 13, 21, 34].
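The article does not show the code the agent generated, but a Python_REPL call producing that answer would look roughly like the following (a hypothetical reconstruction):

```python
def fibonacci(n: int) -> list[int]:
    """Iteratively build the first n Fibonacci numbers."""
    series = []
    a, b = 0, 1
    for _ in range(n):
        series.append(a)
        a, b = b, a + b
    return series


print(fibonacci(10))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```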
Final Thoughts
In this guide, I used a small LLM, which sometimes struggles with accuracy, especially when it comes to tool selection. If your goal is to build AI agents for production, I highly recommend using the latest full-size models with llama.cpp. Larger, more recent models offer better and more reliable results.
It is important to note that setting up llama.cpp can be more challenging than using beginner-friendly tools like Ollama. However, if you are willing to invest time in debugging and optimizing it for your hardware, the performance and flexibility of llama.cpp are well worth it.
One of the greatest benefits of llama.cpp is its efficiency: you do not need high-end hardware to get started. It runs well on regular CPUs and on laptops without a dedicated GPU, making local AI accessible to almost everyone. And if you ever need more power, you can always rent a GPU instance in the cloud.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a Bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.



