A Coding Implementation of a Local LLM Workflow with Ollama, the REST API, and a Gradio Chat Interface

In this tutorial, we run Ollama entirely within Google Colab using a small model such as qwen2.5:0.5b-instruct or llama3.2:1b, which works within Colab's free-tier constraints using only the CPU. We interact with the model programmatically through the /api/chat endpoint, consuming its streamed responses with Python's requests library. Finally, we build a Gradio-based UI on top of this client so that we can chat with the model, maintain multi-turn history, tune parameters such as temperature and context length, and view results in real time.
import os, sys, subprocess, time, json, requests, textwrap
from pathlib import Path

def sh(cmd, check=True):
    """Run a shell command and stream its output."""
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for line in p.stdout:
        print(line, end="")
    p.wait()
    if check and p.returncode != 0:
        raise RuntimeError(f"Command failed: {cmd}")

if not Path("/usr/local/bin/ollama").exists() and not Path("/usr/bin/ollama").exists():
    print("🔧 Installing Ollama ...")
    sh("curl -fsSL https://ollama.com/install.sh | sh")
else:
    print("✅ Ollama already installed.")

try:
    import gradio
except Exception:
    print("🔧 Installing Gradio ...")
    sh("pip -q install gradio==4.44.0")
We begin by checking whether Ollama is already installed on the system and, if not, install it using the official install script. We also ensure that Gradio is available by importing it, or installing the required version when the import fails. In this way, we prepare our Colab environment for the rest of the pipeline.
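As a side note, the hard-coded path check can be made more portable with `shutil.which`, which searches the entire PATH rather than two fixed directories. A minimal sketch (the helper name `binary_available` is ours, not from the original script):

```python
import shutil

def binary_available(name: str) -> bool:
    """Return True if `name` resolves to an executable on PATH."""
    # shutil.which searches every PATH entry, which is more portable
    # than hard-coding /usr/local/bin and /usr/bin.
    return shutil.which(name) is not None
```

In the script above, this would replace the two `Path(...).exists()` checks with a single `binary_available("ollama")` call.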
def start_ollama():
    try:
        requests.get("http://localhost:11434/api/tags", timeout=1)
        print("✅ Ollama server already running.")
        return None
    except Exception:
        pass
    print("🚀 Starting Ollama server ...")
    proc = subprocess.Popen(["ollama", "serve"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for _ in range(60):
        time.sleep(1)
        try:
            r = requests.get("http://localhost:11434/api/tags", timeout=1)
            if r.ok:
                print("✅ Ollama server is up.")
                break
        except Exception:
            pass
    else:
        raise RuntimeError("Ollama did not start in time.")
    return proc

server_proc = start_ollama()
We start the Ollama server in the background and poll its health endpoint until it responds successfully. This ensures the server is active and ready before we send any API requests.
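The readiness loop above follows a generic poll-until-healthy pattern that is easy to factor out and test with an injectable probe function. A sketch under that assumption (`wait_until` is a hypothetical helper, not part of the original script):

```python
import time

def wait_until(probe, attempts=60, delay=1.0):
    """Call `probe` up to `attempts` times, sleeping `delay` seconds
    between tries; return True as soon as it succeeds, else False."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

In `start_ollama`, `probe` would wrap the `requests.get` health check against the local Ollama port, and a `False` result would raise the timeout error.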
MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5:0.5b-instruct")
print(f"🧠 Using model: {MODEL}")

try:
    tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
    have = any(m.get("name") == MODEL for m in tags.get("models", []))
except Exception:
    have = False

if not have:
    print(f"⬇️ Pulling model {MODEL} (first time only) ...")
    sh(f"ollama pull {MODEL}")
Here we define the default model, check whether it is already available on the Ollama server, and pull it automatically if it is not. This guarantees the selected model is ready before we start any chat sessions.
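The availability check relies on the shape of the /api/tags response: a JSON object with a `models` list whose entries carry a `name` field. Isolating that check as a pure function makes it easy to verify offline (the name `model_present` is ours):

```python
def model_present(tags_json: dict, model: str) -> bool:
    """Check whether `model` appears in an /api/tags-style response."""
    # Expected shape: {"models": [{"name": "qwen2.5:0.5b-instruct", ...}, ...]}
    return any(m.get("name") == model for m in tags_json.get("models", []))
```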
OLLAMA_URL = "http://localhost:11434/api/chat"
def ollama_chat_stream(messages, model=MODEL, temperature=0.2, num_ctx=None):
    """Yield streaming text chunks from Ollama /api/chat."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": True,
        "options": {"temperature": float(temperature)}
    }
    if num_ctx:
        payload["options"]["num_ctx"] = int(num_ctx)
    with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line:
                continue
            data = json.loads(line.decode("utf-8"))
            if "message" in data and "content" in data["message"]:
                yield data["message"]["content"]
            if data.get("done"):
                break
We build a streaming client for Ollama's /api/chat endpoint: we send the messages as a JSON payload and yield tokens as they arrive. This lets us handle responses incrementally, so we see the model's output in real time instead of waiting for the full completion.
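The wire format here is newline-delimited JSON: each line carries a partial `message.content` chunk, and the final record sets `done` to true. The parsing logic can be exercised offline against canned lines, with no server involved (the helper `parse_chat_stream` is ours, mirroring the loop above):

```python
import json

def parse_chat_stream(lines):
    """Yield content chunks from /api/chat-style NDJSON lines."""
    for raw in lines:
        if not raw:           # skip blank keep-alive lines
            continue
        data = json.loads(raw)
        msg = data.get("message", {})
        if "content" in msg:
            yield msg["content"]
        if data.get("done"):  # final record: stop reading the stream
            break
```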
def smoke_test():
    print("\n🧪 Smoke test:")
    sys_msg = {"role": "system", "content": "You are concise. Use short bullets."}
    user_msg = {"role": "user", "content": "Give 3 quick tips to sleep better."}
    out = []
    for chunk in ollama_chat_stream([sys_msg, user_msg], temperature=0.3):
        print(chunk, end="")
        out.append(chunk)
    print("\n🧪 Done.\n")

try:
    smoke_test()
except Exception as e:
    print("⚠️ Smoke test skipped:", e)
We run a quick smoke test by sending a simple prompt through our streaming client to verify the model responds correctly. This confirms that Ollama is installed, the server is running, and the selected model is loaded before we build the full chat UI.
import gradio as gr

SYSTEM_PROMPT = "You are a helpful, crisp assistant. Prefer bullets when helpful."

def chat_fn(message, history, temperature, num_ctx):
    msgs = [{"role": "system", "content": SYSTEM_PROMPT}]
    for u, a in history:
        if u: msgs.append({"role": "user", "content": u})
        if a: msgs.append({"role": "assistant", "content": a})
    msgs.append({"role": "user", "content": message})
    acc = ""
    try:
        for part in ollama_chat_stream(msgs, model=MODEL, temperature=temperature, num_ctx=num_ctx or None):
            acc += part
            yield acc
    except Exception as e:
        yield f"⚠️ Error: {e}"

with gr.Blocks(title="Ollama Chat (Colab)", fill_height=True) as demo:
    gr.Markdown("# 🦙 Ollama Chat (Colab)\nSmall local-ish LLM via Ollama + Gradio.\n")
    with gr.Row():
        temp = gr.Slider(0.0, 1.0, value=0.3, step=0.1, label="Temperature")
        num_ctx = gr.Slider(512, 8192, value=2048, step=256, label="Context Tokens (num_ctx)")
    chat = gr.Chatbot(height=460)
    msg = gr.Textbox(label="Your message", placeholder="Ask anything…", lines=3)
    clear = gr.Button("Clear")

    def user_send(m, h):
        m = (m or "").strip()
        if not m: return "", h
        return "", h + [[m, None]]

    def bot_reply(h, temperature, num_ctx):
        u = h[-1][0]
        stream = chat_fn(u, h[:-1], temperature, int(num_ctx))
        acc = ""
        for partial in stream:
            acc = partial
            h[-1][1] = acc
            yield h

    msg.submit(user_send, [msg, chat], [msg, chat]).then(bot_reply, [chat, temp, num_ctx], [chat])
    clear.click(lambda: None, None, chat)

print("🌐 Launching Gradio ...")
demo.launch(share=True)
This Gradio chat UI connects to the Ollama server: the user's input and the chat history are converted into the proper message format, and the model's answer is streamed back into the chatbot. Sliders let us adjust parameters such as temperature and context length, while the textbox and a Clear button offer a simple, real-time way to test different settings.
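The history-to-messages conversion inside `chat_fn` is the glue between Gradio's `[[user, assistant], ...]` pair format and the role/content list the chat API expects, so it is worth isolating as a pure function. A standalone sketch (the function name `build_messages` is ours):

```python
def build_messages(history, user_msg, system="You are a helpful, crisp assistant."):
    """Flatten Gradio-style [user, assistant] pairs into chat messages."""
    msgs = [{"role": "system", "content": system}]
    for u, a in history:
        if u:
            msgs.append({"role": "user", "content": u})
        if a:  # the in-progress turn has assistant == None, so it is skipped
            msgs.append({"role": "assistant", "content": a})
    msgs.append({"role": "user", "content": user_msg})
    return msgs
```

Keeping this separate also makes the None-handling explicit: the pending turn appended by `user_send` carries no assistant text yet, so it contributes only its user half.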
In conclusion, we establish a complete Ollama pipeline on Colab: installation, server startup, API access, and a visual chat interface. The setup uses Ollama's REST API as the core layer, Python's requests library for programmatic access and streaming, and Gradio for interactivity and rapid iteration. This approach preserves the overall design of a local LLM workflow while adapting it to Colab's constraints, where Docker and GPU support for Ollama are unavailable. The result is a compact yet fully functional framework that lets us test multiple LLMs, tune parameters dynamically, and explore AI interactively from a notebook.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.



