A Coding Implementation of End-to-End Transformer Model Optimization with Hugging Face Optimum, ONNX Runtime, and Quantization

In this tutorial, we walk through how we use Hugging Face Optimum to speed up transformer models while keeping accuracy intact. We start from a DistilBERT fine-tuned on the SST-2 dataset and compare different inference engines, including PyTorch eager, torch.compile, ONNX Runtime, and quantized ONNX. By proceeding step by step, we gain hands-on experience with model export, benchmarking, quantization, and side-by-side comparison, all within a Google Colab environment. Check out the full codes here.
!pip -q install "transformers>=4.49" "optimum[onnxruntime]>=1.20.0" "datasets>=2.20" "evaluate>=0.4" accelerate
from pathlib import Path
import os, time, numpy as np, torch
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import QuantizationConfig
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
ORT_DIR = Path("onnx-distilbert")
Q_DIR = Path("onnx-distilbert-quant")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH = 16
MAXLEN = 128
N_WARM = 3
N_ITERS = 8
print(f"Device: {DEVICE} | torch={torch.__version__}")
We start by installing the required libraries and setting up Hugging Face Optimum with ONNX Runtime support. We configure paths, batch size, and iteration settings, and we confirm whether we are running on CPU or GPU. Check out the full codes here.
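Before exporting anything, it can help to confirm which ONNX Runtime execution providers are actually available in this session. The short check below is an optional sketch, not part of the original walkthrough; it only assumes the onnxruntime package pulled in by optimum[onnxruntime] is importable.
import onnxruntime as rt
# List the execution providers compiled into this onnxruntime build; on a GPU runtime
# with onnxruntime-gpu installed, "CUDAExecutionProvider" should appear here.
print("Available ORT providers:", rt.get_available_providers())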
ds = load_dataset("glue", "sst2", split="validation[:20%]")
texts, labels = ds["sentence"], ds["label"]
metric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
def make_batches(texts, max_len=MAXLEN, batch=BATCH):
    for i in range(0, len(texts), batch):
        yield tokenizer(texts[i:i+batch], padding=True, truncation=True,
                        max_length=max_len, return_tensors="pt")
def run_eval(predict_fn, texts, labels):
    preds = []
    for toks in make_batches(texts):
        preds.extend(predict_fn(toks))
    return metric.compute(predictions=preds, references=labels)["accuracy"]
def bench(predict_fn, texts, n_warm=N_WARM, n_iters=N_ITERS):
    for _ in range(n_warm):
        for toks in make_batches(texts[:BATCH*2]):
            predict_fn(toks)
    times = []
    for _ in range(n_iters):
        t0 = time.time()
        for toks in make_batches(texts):
            predict_fn(toks)
        times.append((time.time() - t0) * 1000)
    return float(np.mean(times)), float(np.std(times))
We load a validation slice of SST-2 and prepare tokenization, the accuracy metric, and batching. We define run_eval to compute accuracy from any prediction function and bench to warm up and measure end-to-end latency. With these helpers, we can compare different engines fairly on the same data and batching. Check out the full codes here.
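As an optional sanity check (a minimal sketch, not part of the original flow), we can pull a single batch out of make_batches and inspect its tensor shapes before launching the full benchmarks.
sample_toks = next(make_batches(texts[:4]))
# Each field (input_ids, attention_mask) is a [batch, seq_len] tensor, padded per batch.
print({k: tuple(v.shape) for k, v in sample_toks.items()})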
torch_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE).eval()
@torch.no_grad()
def pt_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = torch_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()
pt_ms, pt_sd = bench(pt_predict, texts)
pt_acc = run_eval(pt_predict, texts, labels)
print(f"[PyTorch eager] {pt_ms:.1f}±{pt_sd:.1f} ms | acc={pt_acc:.4f}")
compiled_model = torch_model
compile_ok = False
try:
    compiled_model = torch.compile(torch_model, mode="reduce-overhead", fullgraph=False)
    compile_ok = True
except Exception as e:
    print("torch.compile unavailable or failed -> skipping:", repr(e))
@torch.no_grad()
def ptc_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = compiled_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()
if compile_ok:
    ptc_ms, ptc_sd = bench(ptc_predict, texts)
    ptc_acc = run_eval(ptc_predict, texts, labels)
    print(f"[torch.compile] {ptc_ms:.1f}±{ptc_sd:.1f} ms | acc={ptc_acc:.4f}")
We load the baseline PyTorch classifier, define the pt_predict helper, and benchmark and score it on SST-2. We then attempt just-in-time graph optimization with torch.compile and, if compilation succeeds, run the same benchmarks to compare speed and accuracy under an identical setup. Check out the full codes here.
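To complement the full-dataset numbers, an optional sketch like the one below times a single warmed-up batch through the eager and compiled models, reusing the predict helpers defined above; single-batch timings are noisy, so treat them only as a quick spot check.
if compile_ok:
    toks = next(make_batches(texts[:BATCH]))
    t0 = time.time(); pt_predict(toks); eager_ms = (time.time() - t0) * 1000
    t0 = time.time(); ptc_predict(toks); compiled_ms = (time.time() - t0) * 1000
    # The bench() results above remain the authoritative comparison.
    print(f"one batch -> eager: {eager_ms:.1f} ms | compiled: {compiled_ms:.1f} ms")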
provider = "CUDAExecutionProvider" if DEVICE == "cuda" else "CPUExecutionProvider"
ort_model = ORTModelForSequenceClassification.from_pretrained(
    MODEL_ID, export=True, provider=provider, cache_dir=ORT_DIR
)
@torch.no_grad()
def ort_predict(toks):
    logits = ort_model(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()
ort_ms, ort_sd = bench(ort_predict, texts)
ort_acc = run_eval(ort_predict, texts, labels)
print(f"[ONNX Runtime] {ort_ms:.1f}±{ort_sd:.1f} ms | acc={ort_acc:.4f}")
Q_DIR.mkdir(parents=True, exist_ok=True)
quantizer = ORTQuantizer.from_pretrained(ORT_DIR)
qconfig = QuantizationConfig(approach="dynamic", per_channel=False, reduce_range=True)
quantizer.quantize(model_input=ORT_DIR, quantization_config=qconfig, save_dir=Q_DIR)
ort_quant = ORTModelForSequenceClassification.from_pretrained(Q_DIR, provider=provider)
@torch.no_grad()
def ortq_predict(toks):
    logits = ort_quant(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()
oq_ms, oq_sd = bench(ortq_predict, texts)
oq_acc = run_eval(ortq_predict, texts, labels)
print(f"[ORT Quantized] {oq_ms:.1f}±{oq_sd:.1f} ms | acc={oq_acc:.4f}")
We export the model to ONNX, run it with ONNX Runtime, then apply dynamic quantization with Optimum's ORTQuantizer and benchmark again to see how latency improves and how well accuracy holds up. Check out the full codes here.
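It is also worth glancing at the on-disk footprint after quantization. The helper below is an optional sketch that assumes the exported FP32 and quantized INT8 .onnx files landed under ORT_DIR and Q_DIR respectively; adjust the paths if your export was cached elsewhere.
def dir_mb(path: Path) -> float:
    # Sum the sizes of all .onnx files under a directory, in megabytes.
    return sum(f.stat().st_size for f in path.rglob("*.onnx")) / 1e6
print(f"ONNX FP32: {dir_mb(ORT_DIR):.1f} MB | ONNX INT8: {dir_mb(Q_DIR):.1f} MB")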
pt_pipe = pipeline("sentiment-analysis", model=torch_model, tokenizer=tokenizer,
                   device=0 if DEVICE=="cuda" else -1)
ort_pipe = pipeline("sentiment-analysis", model=ort_model, tokenizer=tokenizer, device=-1)
samples = [
"What a fantastic movie—performed brilliantly!",
"This was a complete waste of time.",
"I’m not sure how I feel about this one."
]
print("nSample predictions (PT | ORT):")
for s in samples:
    a = pt_pipe(s)[0]["label"]
    b = ort_pipe(s)[0]["label"]
    print(f"- {s}\n  PT={a} | ORT={b}")
import pandas as pd
rows = [["PyTorch eager", pt_ms, pt_sd, pt_acc],
["ONNX Runtime", ort_ms, ort_sd, ort_acc],
["ORT Quantized", oq_ms, oq_sd, oq_acc]]
if compile_ok: rows.insert(1, ["torch.compile", ptc_ms, ptc_sd, ptc_acc])
df = pd.DataFrame(rows, columns=["Engine", "Mean ms (↓)", "Std ms", "Accuracy"])
display(df)
print("""
Notes:
- BetterTransformer is deprecated on transformers>=4.49, hence omitted.
- For larger gains on GPU, also try FlashAttention2 models or FP8 with TensorRT-LLM.
- For CPU, tune threads: set OMP_NUM_THREADS/MKL_NUM_THREADS; try NUMA pinning.
- For static (calibrated) quantization, use QuantizationConfig(approach="static") with a calibration set.
""")
We run quick sanity predictions through both pipelines and print the PyTorch and ONNX labels side by side. We then assemble a summary table comparing latency and accuracy across engines, including the torch.compile result when it is available. We close with practical notes so we can extend the workflow to other backends and push latency down further.
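If we prefer the comparison expressed as relative gains, a small optional addition to the same DataFrame divides the PyTorch eager latency by each engine's mean latency to obtain a speedup factor.
# Speedup > 1.0 means the engine is faster than the PyTorch eager baseline.
df["Speedup vs eager"] = (pt_ms / df["Mean ms (↓)"]).round(2)
display(df)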
In conclusion, we see how Optimum helps us bridge the gap between standard PyTorch models and production-ready, optimized deployments. We achieve speedups with ONNX Runtime and quantization while preserving accuracy, and we also see how torch.compile delivers gains directly within PyTorch. This workflow reflects a practical way to balance performance and efficiency for transformer models, and it provides a foundation that can be extended with more advanced backends such as OpenVINO or TensorRT.