Implementing a Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows

nimda November 21, 2025

In this tutorial, we implement a complete workflow for building, tracing, and evaluating an LLM pipeline with Opik. We structure the system step by step, starting with a lightweight model, then adding prompt-based planning, dataset creation, and finally automated evaluation. As we go through each snippet, we see how Opik helps us track every function span, visualize the pipeline's behavior, and measure output quality with clear, reproducible metrics. In the end, we have a fully instrumented QA system that we can extend, compare, and monitor with ease.

!pip install -q opik transformers accelerate torch


import torch
from transformers import pipeline
import textwrap


import opik
from opik import Opik, Prompt, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, LevenshteinRatio


device = 0 if torch.cuda.is_available() else -1
print("Using device:", "cuda" if device == 0 else "cpu")


opik.configure()
PROJECT_NAME = "opik-hf-tutorial"

We set up our environment by installing the required libraries and initializing Opik. We import the core modules, detect the available device, and configure our project so that all traces flow into the correct workspace. This lays the foundation for the rest of the tutorial.

llm = pipeline(
   "text-generation",
   model="distilgpt2",
   device=device,
)


def hf_generate(prompt: str, max_new_tokens: int = 80) -> str:
   result = llm(
       prompt,
       max_new_tokens=max_new_tokens,
       do_sample=True,
       temperature=0.3,
       pad_token_id=llm.tokenizer.eos_token_id,
   )[0]["generated_text"]
   return result[len(prompt):].strip()

We load a lightweight Hugging Face model and create a small helper function that returns clean generated text. We configure the LLM to run locally, without external APIs. This gives us a dependable local generation layer for the rest of the pipeline.
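By default, a text-generation pipeline returns the prompt followed by the continuation, which is why the helper slices off the first len(prompt) characters before stripping whitespace. The following is a minimal sketch of that step only. It assumes a hypothetical fake_llm stub in place of the real model so that it runs without downloading weights, and the strip_prompt helper and its startswith guard are additions for illustration, not part of the tutorial code.

```python
def fake_llm(prompt: str) -> list[dict]:
    """Hypothetical stand-in for the HF pipeline: echoes the prompt, then a continuation."""
    return [{"generated_text": prompt + "  Opik traces LLM calls.\n"}]


def strip_prompt(prompt: str, generated_text: str) -> str:
    """Keep only the newly generated continuation, trimmed of whitespace."""
    # Guard (an addition): only slice when the output really starts with the prompt.
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


prompt = "Question: What does Opik do?\nAnswer:"
raw = fake_llm(prompt)[0]["generated_text"]
continuation = strip_prompt(prompt, raw)
print(continuation)
```

The text-generation pipeline also accepts return_full_text=False, which should make the manual slicing unnecessary; the slicing approach is kept here because it mirrors the helper above.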

plan_prompt = Prompt(
   name="hf_plan_prompt",
   prompt=textwrap.dedent("""
       You are an assistant that creates a plan to answer a question
       using ONLY the given context.


       Context:
       {{context}}


       Question:
       {{question}}


       Return exactly 3 bullet points as a plan.
   """).strip(),
)


answer_prompt = Prompt(
   name="hf_answer_prompt",
   prompt=textwrap.dedent("""
       You answer based only on the given context.


       Context:
       {{context}}


       Question:
       {{question}}


       Plan:
       {{plan}}


       Answer the question in 2–4 concise sentences.
   """).strip(),
)

We define two structured prompts using Opik's Prompt class. We handle the planning phase and the answering phase with clear templates. This helps us keep the prompts consistent and observe how structured prompting affects the model's output.
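The templates use mustache-style {{variable}} placeholders that are filled in when a prompt is formatted. The sketch below is a simplified stand-in written for illustration; the render helper is hypothetical and is not Opik's implementation. It only shows the kind of substitution the templates rely on.

```python
import re


def render(template: str, **values: str) -> str:
    """Replace each {{name}} placeholder with the matching keyword argument."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            # Fail loudly instead of leaving a placeholder in the prompt.
            raise KeyError(f"missing value for placeholder: {key}")
        return str(values[key])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = "Context:\n{{context}}\n\nQuestion:\n{{question}}"
rendered = render(template, context="Opik logs spans.", question="What is logged?")
print(rendered)
```

Raising on a missing placeholder is a design choice of this sketch: an unfilled {{plan}} reaching the model would be easy to miss in a trace.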

DOCS = {
   "overview": """
       Opik is an open-source platform for debugging, evaluating,
       and monitoring LLM and RAG applications. It provides tracing,
       datasets, experiments, and evaluation metrics.
   """,
   "tracing": """
       Tracing in Opik logs nested spans, LLM calls, token usage,
       feedback scores, and metadata to inspect complex LLM pipelines.
   """,
   "evaluation": """
       Opik evaluations are defined by datasets, evaluation tasks,
       scoring metrics, and experiments that aggregate scores,
       helping detect regressions or issues.
   """,
}


@track(project_name=PROJECT_NAME, type="tool", name="retrieve_context")
def retrieve_context(question: str) -> str:
   q = question.lower()
   if "trace" in q or "span" in q:
       return DOCS["tracing"]
   if "metric" in q or "dataset" in q or "evaluate" in q:
       return DOCS["evaluation"]
   return DOCS["overview"]

We build a small document store and a retrieval function that Opik tracks as a tool. We let the pipeline select a context based on the user's question. This lets us simulate a minimal RAG-style workflow without needing an actual vector database.
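The routing relies on plain substring checks, so it is worth testing against the questions we plan to ask: "tracing" does not contain "trace", and "evaluation" does not contain "evaluate", so such questions fall through to the overview document. The sketch below reproduces the same keyword logic in table-driven form, without the @track decorator so that it runs standalone, and compares it with a stem-based variant. The names EXACT_ROUTES, STEM_ROUTES, and route are hypothetical, and the function returns the document key rather than the document text.

```python
# Same keywords as retrieve_context, checked in the same order.
EXACT_ROUTES = [
    (("trace", "span"), "tracing"),
    (("metric", "dataset", "evaluate"), "evaluation"),
]

# Hypothetical alternative: shorter stems that also match "tracing" and "evaluation".
STEM_ROUTES = [
    (("trac", "span"), "tracing"),
    (("metric", "dataset", "evaluat"), "evaluation"),
]


def route(question: str, routes: list, default: str = "overview") -> str:
    """Return the key of the first route whose keyword occurs in the question."""
    q = question.lower()
    for keywords, doc_key in routes:
        if any(keyword in q for keyword in keywords):
            return doc_key
    return default


questions = [
    "What kind of platform is Opik?",
    "What does tracing in Opik log?",
    "What are the components of an Opik evaluation?",
]
for question in questions:
    exact = route(question, EXACT_ROUTES)
    stem = route(question, STEM_ROUTES)
    print(f"{question!r}: exact={exact}, stem={stem}")
```

Stems trade precision for recall: "trac" would also match words such as "extract" or "contract", so whether this is an improvement depends on the questions the pipeline is expected to see.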

@track(project_name=PROJECT_NAME, type="llm", name="plan_answer")
def plan_answer(context: str, question: str) -> str:
   rendered = plan_prompt.format(context=context, question=question)
   return hf_generate(rendered, max_new_tokens=80)


@track(project_name=PROJECT_NAME, type="llm", name="answer_from_plan")
def answer_from_plan(context: str, question: str, plan: str) -> str:
   rendered = answer_prompt.format(
       context=context,
       question=question,
       plan=plan,
   )
   return hf_generate(rendered, max_new_tokens=120)


@track(project_name=PROJECT_NAME, type="general", name="qa_pipeline")
def qa_pipeline(question: str) -> str:
   context = retrieve_context(question)
   plan = plan_answer(context, question)
   answer = answer_from_plan(context, question, plan)
   return answer


print("Sample answer:\n", qa_pipeline("What does Opik help developers do?"))

We bring planning, reasoning, and answering together in a fully traced LLM pipeline. We capture every step with Opik's decorators so we can analyze the spans in the dashboard. By running the pipeline on a sample question, we confirm that all the components are connected properly.

client = Opik()


dataset = client.get_or_create_dataset(
   name="HF_Opik_QA_Dataset",
   description="Small QA dataset for HF + Opik tutorial",
)


dataset.insert([
   {
       "question": "What kind of platform is Opik?",
       "context": DOCS["overview"],
       "reference": "Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
   },
   {
       "question": "What does tracing in Opik log?",
       "context": DOCS["tracing"],
       "reference": "Tracing logs nested spans, LLM calls, token usage, feedback scores, and metadata.",
   },
   {
       "question": "What are the components of an Opik evaluation?",
       "context": DOCS["evaluation"],
       "reference": "An Opik evaluation uses datasets, evaluation tasks, scoring metrics and experiments that aggregate scores.",
   },
])

We create and populate a dataset within Opik that our evaluation will use. We insert several question-answer pairs that cover different aspects of Opik. This dataset serves as the ground truth for our QA evaluation.

equals_metric = Equals()
lev_metric = LevenshteinRatio()


def evaluation_task(item: dict) -> dict:
   output = qa_pipeline(item["question"])
   return {
       "output": output,
       "reference": item["reference"],
   }

We define an evaluation task and select two metrics, Equals and LevenshteinRatio, to measure output quality. We make sure the task returns its results in the format the metrics require for scoring. This connects our pipeline to Opik's evaluation engine.
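The two metrics behave very differently on free-form text: an exact-match check scores 0 unless the output is identical to the reference, while an edit-distance score gives partial credit for near misses. The sketch below is a plain-Python illustration of both ideas. The function names are hypothetical, and the normalization (one minus distance divided by the longer length) is a simplification that may differ from the formula Opik's LevenshteinRatio uses.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(output: str, reference: str) -> float:
    """Simplified normalization: 1.0 for identical strings, lower as they diverge."""
    longest = max(len(output), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(output, reference) / longest


def exact_match(output: str, reference: str) -> float:
    """1.0 only when the two strings are identical."""
    return 1.0 if output == reference else 0.0


reference = "Tracing logs nested spans."
output = "Tracing logs nested span."
print("exact match:", exact_match(output, reference))
print("similarity:", round(similarity(output, reference), 3))
```

With a small sampled model such as distilgpt2, exact matches against a reference sentence are unlikely, so the edit-distance score is probably the more informative of the two here.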

evaluation_result = evaluate(
   dataset=dataset,
   task=evaluation_task,
   scoring_metrics=[equals_metric, lev_metric],
   experiment_name="HF_Opik_QA_Experiment",
   project_name=PROJECT_NAME,
   task_threads=1,
)


print("\nExperiment URL:", evaluation_result.experiment_url)

We run the experiment using Opik's evaluate function. We keep execution sequential (task_threads=1) for stability in Colab. When it finishes, we get a link to view the experiment details in the Opik dashboard.

agg = evaluation_result.aggregate_evaluation_scores()


print("\nAggregated scores:")
for metric_name, stats in agg.aggregated_scores.items():
   print(metric_name, "=>", stats)

We aggregate and print the evaluation scores to understand how well our pipeline performs. We inspect the metric results to see where the outputs align with the references and where improvements are needed. This closes the loop on our fully tracked LLM workflow.
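Aggregation of this kind amounts to grouping per-item scores by metric and summarizing each group. The sketch below shows that idea in plain Python. The rows list, the metric names, and every score value in it are placeholders made up for illustration; they are not results from this experiment, and the summary fields are an assumption rather than the exact statistics Opik reports.

```python
from collections import defaultdict
from statistics import mean

# Placeholder per-item scores (made-up values, one row per dataset item and metric).
rows = [
    {"metric": "equals_metric", "value": 0.0},
    {"metric": "equals_metric", "value": 0.0},
    {"metric": "equals_metric", "value": 0.0},
    {"metric": "levenshtein_ratio_metric", "value": 0.25},
    {"metric": "levenshtein_ratio_metric", "value": 0.5},
    {"metric": "levenshtein_ratio_metric", "value": 0.75},
]

# Group the values by metric name.
grouped = defaultdict(list)
for row in rows:
    grouped[row["metric"]].append(row["value"])

# Summarize each metric with a few simple statistics.
summary = {
    name: {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "count": len(values),
    }
    for name, values in grouped.items()
}

for name, stats in summary.items():
    print(name, "=>", stats)
```

Looking at the minimum and maximum alongside the mean helps separate a pipeline that is uniformly mediocre from one that fails badly on a single item.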

In conclusion, we set up a small but fully functional LLM evaluation environment. We see how traces, prompts, datasets, and metrics come together to give us a clear view of the model's reasoning process. As we finalize our evaluation and review the aggregated scores, we appreciate how Opik lets us iterate quickly, experiment systematically, and verify improvements reliably.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform draws over two million monthly views, which illustrates its popularity among readers.
