An Implementation of a Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows

In this tutorial, we implement a complete workflow for building, tracing, and evaluating an LLM pipeline with Opik. We structure the system step by step, starting with a lightweight model, then adding prompt-based planning, dataset creation, and finally automated evaluation. As we go through each snippet, we see how Opik helps us track every function span, visualize the pipeline's behavior, and measure output quality with clear, reproducible metrics. By the end, we have a fully instrumented QA system that we can extend, compare, and monitor with ease.
!pip install -q opik transformers accelerate torch
import torch
from transformers import pipeline
import textwrap
import opik
from opik import Opik, Prompt, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, LevenshteinRatio
device = 0 if torch.cuda.is_available() else -1
print("Using device:", "cuda" if device == 0 else "cpu")
opik.configure()
PROJECT_NAME = "opik-hf-tutorial"
We set up our environment by installing the required libraries and configuring Opik. We load the core modules, detect the device, and set the project name so that all traces flow into the right workspace. This lays the foundation for the rest of the tutorial.
llm = pipeline(
    "text-generation",
    model="distilgpt2",
    device=device,
)

def hf_generate(prompt: str, max_new_tokens: int = 80) -> str:
    # The pipeline returns the prompt followed by the continuation.
    result = llm(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.3,
        pad_token_id=llm.tokenizer.eos_token_id,
    )[0]["generated_text"]
    # Keep only the newly generated text.
    return result[len(prompt):].strip()
We load a lightweight Hugging Face model and create a small helper function that returns clean generated text. We configure the LLM to run locally without external APIs. This gives us a consistent, self-contained generation layer for the rest of the pipeline.
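By default, a transformers text-generation pipeline returns the prompt followed by the continuation, which is why hf_generate slices off the first len(prompt) characters. The sketch below illustrates that slicing with a hypothetical stand-in generator (fake_llm is not part of the tutorial) so that it runs without downloading a model.

```python
def fake_llm(prompt: str) -> list[dict]:
    """Hypothetical stand-in that mimics the pipeline's output shape."""
    return [{"generated_text": prompt + "  Opik traces LLM calls.\n"}]


def strip_prompt(prompt: str) -> str:
    # Same slicing as hf_generate: drop the echoed prompt, trim whitespace.
    full_text = fake_llm(prompt)[0]["generated_text"]
    return full_text[len(prompt):].strip()


completion = strip_prompt("Question: What does Opik do?\nAnswer:")
print(completion)
```

Passing return_full_text=False to the pipeline call is an alternative way to get only the continuation, if you prefer not to slice manually.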
plan_prompt = Prompt(
    name="hf_plan_prompt",
    prompt=textwrap.dedent("""
        You are an assistant that creates a plan to answer a question
        using ONLY the given context.
        Context:
        {{context}}
        Question:
        {{question}}
        Return exactly 3 bullet points as a plan.
    """).strip(),
)

answer_prompt = Prompt(
    name="hf_answer_prompt",
    prompt=textwrap.dedent("""
        You answer based only on the given context.
        Context:
        {{context}}
        Question:
        {{question}}
        Plan:
        {{plan}}
        Answer the question in 2–4 concise sentences.
    """).strip(),
)
We define two structured prompts using Opik's Prompt class. We control the planning phase and the answering phase through clear templates. This helps us maintain consistency and observe how structured prompting affects the model's behavior.
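To make the {{variable}} placeholders concrete, here is a minimal sketch of mustache-style substitution. The render_template helper is hypothetical and written only for illustration; it is not Opik's implementation, and Prompt.format may behave differently in edge cases.

```python
import re


def render_template(template: str, **variables: str) -> str:
    """Replace each {{name}} placeholder with its value (illustrative only)."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"missing template variable: {key}")
        return variables[key]

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = "Context:\n{{context}}\nQuestion:\n{{question}}"
rendered = render_template(
    template,
    context="Opik provides tracing.",
    question="What does Opik provide?",
)
print(rendered)
```

Failing loudly on a missing variable is a deliberate choice in this sketch: a silently unfilled placeholder would otherwise be sent to the model as literal text.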
DOCS = {
"overview": """
Opik is an open-source platform for debugging, evaluating,
and monitoring LLM and RAG applications. It provides tracing,
datasets, experiments, and evaluation metrics.
""",
"tracing": """
Tracing in Opik logs nested spans, LLM calls, token usage,
feedback scores, and metadata to inspect complex LLM pipelines.
""",
"evaluation": """
Opik evaluations are defined by datasets, evaluation tasks,
scoring metrics, and experiments that aggregate scores,
helping detect regressions or issues.
""",
}

@track(project_name=PROJECT_NAME, type="tool", name="retrieve_context")
def retrieve_context(question: str) -> str:
    # Route the question to a document by simple keyword matching.
    q = question.lower()
    if "trace" in q or "span" in q:
        return DOCS["tracing"]
    if "metric" in q or "dataset" in q or "evaluate" in q:
        return DOCS["evaluation"]
    return DOCS["overview"]
We build a small document store and a retrieval function that Opik tracks as a tool. The pipeline selects a context based on keywords in the user's question. This lets us simulate a RAG-style workflow without an actual vector database.
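Substring matching is sensitive to word forms: "tracing" does not contain "trace", and "evaluation" does not contain "evaluate", so questions phrased that way fall through to the overview document. The sketch below reproduces the routing rules on document keys only and compares them with a hypothetical stem-based variant (route_by_stem is an assumption for illustration, not part of the tutorial).

```python
def route(question: str) -> str:
    """Return the DOCS key that the tutorial's keyword rules would select."""
    q = question.lower()
    if "trace" in q or "span" in q:
        return "tracing"
    if "metric" in q or "dataset" in q or "evaluate" in q:
        return "evaluation"
    return "overview"


def route_by_stem(question: str) -> str:
    """Hypothetical variant: match word stems so inflected forms also route."""
    q = question.lower()
    if any(stem in q for stem in ("trac", "span")):
        return "tracing"
    if any(stem in q for stem in ("metric", "dataset", "evaluat")):
        return "evaluation"
    return "overview"


questions = [
    "What kind of platform is Opik?",
    "What does tracing in Opik log?",
    "What are the components of an Opik evaluation?",
]
for question in questions:
    print(f"{route(question):<10} {route_by_stem(question):<10} {question}")
```

Stems are a blunt instrument as well ("trac" also matches "track"), so for anything beyond a demo an embedding-based retriever would likely be a better fit.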
@track(project_name=PROJECT_NAME, type="llm", name="plan_answer")
def plan_answer(context: str, question: str) -> str:
    rendered = plan_prompt.format(context=context, question=question)
    return hf_generate(rendered, max_new_tokens=80)

@track(project_name=PROJECT_NAME, type="llm", name="answer_from_plan")
def answer_from_plan(context: str, question: str, plan: str) -> str:
    rendered = answer_prompt.format(
        context=context,
        question=question,
        plan=plan,
    )
    return hf_generate(rendered, max_new_tokens=120)

@track(project_name=PROJECT_NAME, type="general", name="qa_pipeline")
def qa_pipeline(question: str) -> str:
    # Each nested call below is recorded as a child span of qa_pipeline.
    context = retrieve_context(question)
    plan = plan_answer(context, question)
    answer = answer_from_plan(context, question, plan)
    return answer

print("Sample answer:\n", qa_pipeline("What does Opik help developers do?"))
We bring together retrieval, planning, and answering in a fully traced LLM pipeline. We capture each step with Opik's decorators so we can analyze the spans in the dashboard. By running the pipeline once, we confirm that all the components are connected properly.
client = Opik()
dataset = client.get_or_create_dataset(
    name="HF_Opik_QA_Dataset",
    description="Small QA dataset for HF + Opik tutorial",
)
dataset.insert([
    {
        "question": "What kind of platform is Opik?",
        "context": DOCS["overview"],
        "reference": "Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
    },
    {
        "question": "What does tracing in Opik log?",
        "context": DOCS["tracing"],
        "reference": "Tracing logs nested spans, LLM calls, token usage, feedback scores, and metadata.",
    },
    {
        "question": "What are the components of an Opik evaluation?",
        "context": DOCS["evaluation"],
        "reference": "An Opik evaluation uses datasets, evaluation tasks, scoring metrics and experiments that aggregate scores.",
    },
])
We create and populate a dataset inside Opik that our evaluation will use. We insert several question-answer pairs that cover different aspects of Opik. This dataset serves as the ground truth for our QA evaluation later on.
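Because the evaluation reads the question and reference fields from every item, it can help to check items before inserting them. The following is a minimal sketch of such a check; find_invalid_items and REQUIRED_KEYS are hypothetical helpers, and the sample items use shortened placeholder text rather than the tutorial's documents.

```python
REQUIRED_KEYS = ("question", "context", "reference")


def find_invalid_items(items: list[dict]) -> list[tuple[int, str]]:
    """Return (index, problem) pairs for items missing a required field."""
    problems = []
    for index, item in enumerate(items):
        for key in REQUIRED_KEYS:
            value = item.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((index, f"missing or empty '{key}'"))
    return problems


# Placeholder items for illustration; the second one has an empty context.
items = [
    {
        "question": "What kind of platform is Opik?",
        "context": "Opik is an open-source platform.",
        "reference": "An open-source platform.",
    },
    {
        "question": "What does tracing in Opik log?",
        "context": "",
        "reference": "Nested spans.",
    },
]
print(find_invalid_items(items))
```

An empty result means every item carries the three fields the pipeline and metrics rely on.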
equals_metric = Equals()
lev_metric = LevenshteinRatio()

def evaluation_task(item: dict) -> dict:
    # Run the traced pipeline and return the fields the metrics score.
    output = qa_pipeline(item["question"])
    return {
        "output": output,
        "reference": item["reference"],
    }
We define an evaluation task and choose two metrics, Equals and LevenshteinRatio, to measure output quality. We make sure the task returns its results in the exact format the metrics require for scoring. This connects our pipeline to Opik's evaluation engine.
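Exact equality is a strict criterion for free-form generations, so a graded string similarity is useful alongside it. The sketch below shows the idea behind both metrics in plain Python. The normalization in similarity (one minus distance over the longer length) is an assumption chosen for illustration; Opik's LevenshteinRatio may normalize differently, so the numbers are not guaranteed to match.

```python
def exact_match(output: str, reference: str) -> float:
    """1.0 when the strings are identical, otherwise 0.0."""
    return 1.0 if output == reference else 0.0


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(output: str, reference: str) -> float:
    """Distance normalized to [0, 1]; an assumed normalization for illustration."""
    longest = max(len(output), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(output, reference) / longest


print(exact_match("kitten", "sitting"))
print(levenshtein_distance("kitten", "sitting"))
print(round(similarity("kitten", "sitting"), 3))
```

A near-miss answer scores 0.0 on exact match but still earns partial credit on the similarity score, which is why the two metrics complement each other.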
evaluation_result = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[equals_metric, lev_metric],
    experiment_name="HF_Opik_QA_Experiment",
    project_name=PROJECT_NAME,
    task_threads=1,  # run items sequentially
)
print("\nExperiment URL:", evaluation_result.experiment_url)
We run the experiment with Opik's evaluate function. We keep execution sequential (task_threads=1) for stability in Colab. When it finishes, we get a link to view the experiment details in the Opik dashboard.
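Conceptually, a sequential evaluation run is a loop: apply the task to each dataset item, then score the returned output against the reference with every metric. The sketch below shows that loop in isolation. The names run_evaluation and echo_task are hypothetical, and this is not how Opik implements evaluate, which also records experiments and traces.

```python
from typing import Callable

Metric = Callable[[str, str], float]


def run_evaluation(items, task, metrics: dict[str, Metric]) -> list[dict]:
    """Run the task on each item in order and score its output."""
    rows = []
    for item in items:
        result = task(item)  # expected keys: "output" and "reference"
        scores = {
            name: metric(result["output"], result["reference"])
            for name, metric in metrics.items()
        }
        rows.append({"question": item["question"], **scores})
    return rows


def echo_task(item: dict) -> dict:
    # Hypothetical task that returns the reference itself, so it always matches.
    return {"output": item["reference"], "reference": item["reference"]}


rows = run_evaluation(
    items=[{"question": "What kind of platform is Opik?", "reference": "Open source."}],
    task=echo_task,
    metrics={"equals": lambda out, ref: float(out == ref)},
)
print(rows)
```

Swapping echo_task for a function that calls a real pipeline gives the same row structure, one score per metric per item.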
agg = evaluation_result.aggregate_evaluation_scores()
print("\nAggregated scores:")
for metric_name, stats in agg.aggregated_scores.items():
    print(metric_name, "=>", stats)
We aggregate and print the evaluation scores to understand how well our pipeline performs. We inspect the metric results to see where the outputs align with the references and where improvements are needed. This closes the loop on our LLM workflow.
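As a rough picture of what aggregation involves, the sketch below groups per-item scores by metric and summarizes each group. The aggregate helper is hypothetical, and the scores are placeholder numbers for illustration only, not results from this tutorial; the statistics Opik reports may differ.

```python
from collections import defaultdict
from statistics import mean


def aggregate(rows: list[dict]) -> dict[str, dict[str, float]]:
    """Group per-item scores by metric and summarize each group."""
    by_metric = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            by_metric[name].append(value)
    return {
        name: {"mean": mean(values), "min": min(values), "max": max(values)}
        for name, values in by_metric.items()
    }


# Placeholder scores for illustration only; not results from this tutorial.
placeholder_rows = [
    {"equals": 0.0, "levenshtein": 0.25},
    {"equals": 1.0, "levenshtein": 1.0},
    {"equals": 0.0, "levenshtein": 0.5},
]
summary = aggregate(placeholder_rows)
for metric_name, stats in summary.items():
    print(metric_name, "=>", stats)
```

Looking at the minimum alongside the mean helps spot individual items that drag a metric down, which is often where a pipeline change is most worthwhile.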
In conclusion, we set up a small but fully functional LLM evaluation environment. We see how traces, prompts, datasets, and metrics come together to give us a clear view of the model's reasoning process. As we finalize our evaluation and review the aggregated scores, we appreciate how Opik lets us iterate quickly, test systematically, and verify improvements reliably.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform draws more than two million monthly views, which illustrates its popularity among readers.



