How AutoGluon Enables Modern AutoML Pipelines for Production-Grade Tabular Models with Ensembling and Distillation

In this tutorial, we build a production-grade tabular machine learning pipeline with AutoGluon, taking a real-world mixed-type dataset from raw inputs to deployment-ready artifacts. We train high-quality stacked and bagged ensembles, evaluate performance with robust metrics, perform subgroup and feature-level analyses, and optimize the model for real-time inference using refit-full and distillation. Throughout the workflow, we focus on practical decisions that balance accuracy, latency, and deployability.

!pip -q install -U "autogluon==1.5.0" "scikit-learn>=1.3" "pandas>=2.0" "numpy>=1.24"


import os, time, json, warnings
warnings.filterwarnings("ignore")


import numpy as np
import pandas as pd


from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, log_loss, accuracy_score, classification_report, confusion_matrix


from autogluon.tabular import TabularPredictor

We set up the environment by installing the required libraries and importing all the dependencies used throughout the tutorial. We suppress warnings to keep the output clean and make sure the numerical, tabular, and evaluation utilities are available.

from sklearn.datasets import fetch_openml
df = fetch_openml(data_id=40945, as_frame=True).frame


target = "survived"
df[target] = df[target].astype(int)


drop_cols = [c for c in ["boat", "body", "home.dest"] if c in df.columns]
df = df.drop(columns=drop_cols, errors="ignore")


df = df.replace({None: np.nan})
print("Shape:", df.shape)
print("Target positive rate:", df[target].mean().round(4))
print("Columns:", list(df.columns))


train_df, test_df = train_test_split(
   df,
   test_size=0.2,
   random_state=42,
   stratify=df[target],
)

We load a real-world mixed-type dataset and apply light preprocessing to prepare a clean training signal. We define the target, drop the columns that would leak the outcome, and inspect the structure of the dataset. We then create a stratified train/test split to preserve class balance.
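To make the effect of stratification concrete, here is a minimal, dependency-free sketch of the idea. It is not scikit-learn's implementation; the helper names `stratified_split` and `positive_rate` and the toy labels are assumptions made for illustration only.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # shuffle within each class
        n_test = round(len(idx) * test_size)   # per-class test quota
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with 40% positives (illustrative, not the Titanic data)
labels = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(labels)

def positive_rate(idx):
    return sum(labels[i] for i in idx) / len(idx)

print(len(train_idx), len(test_idx))
print(positive_rate(train_idx), positive_rate(test_idx))
```

Both parts end up with the same positive rate as the full dataset, which is the property we rely on when we pass `stratify=df[target]` to `train_test_split`.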

def has_gpu():
   try:
       import torch
       return torch.cuda.is_available()
   except Exception:
       return False


presets = "extreme" if has_gpu() else "best_quality"


save_path = "/content/autogluon_titanic_advanced"
os.makedirs(save_path, exist_ok=True)


predictor = TabularPredictor(
   label=target,
   eval_metric="roc_auc",
   path=save_path,
   verbosity=2
)

We detect hardware availability to dynamically select the most suitable AutoGluon training preset. We prepare a directory for persisting the models and initialize the tabular predictor with the appropriate evaluation metric.

start = time.time()
predictor.fit(
   train_data=train_df,
   presets=presets,
   time_limit=7 * 60,
   num_bag_folds=5,
   num_stack_levels=2,
   refit_full=False
)
train_time = time.time() - start
print(f"\nTraining done in {train_time:.1f}s with presets='{presets}'")

We train a high-quality ensemble using bagging and stacking within a controlled time budget. We rely on AutoGluon's automated model search to explore strong architectures efficiently. We also record the training time to understand the computational cost.
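The following toy sketch illustrates the idea behind k-fold bagging: each row receives a prediction from a model that never trained on it, and these out-of-fold predictions can then serve as inputs to the next stacking layer. The "model" here just predicts the training mean, and all helper names are hypothetical; AutoGluon's internals are considerably more involved.

```python
def kfold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def fit_mean_model(y_train):
    """Toy 'model': always predicts the mean of its training labels."""
    mean = sum(y_train) / len(y_train)
    return lambda: mean

def out_of_fold_predictions(y, k=5):
    """Predict each row with the fold model that did not see it during training."""
    oof = [None] * len(y)
    for holdout in kfold_indices(len(y), k):
        held = set(holdout)
        y_train = [y[i] for i in range(len(y)) if i not in held]
        model = fit_mean_model(y_train)
        for i in holdout:
            oof[i] = model()
    return oof

y = [0, 1, 0, 1, 1, 0, 0, 1, 1, 1]
oof = out_of_fold_predictions(y, k=5)
# In stacking, a column like `oof` is added to the features seen by the next layer
print(oof)
```

Because every entry in `oof` comes from a model that did not train on that row, the next layer can learn from these predictions without the optimistic bias of in-sample outputs.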

lb = predictor.leaderboard(test_df, silent=True)
print("\n=== Leaderboard (top 15) ===")
display(lb.head(15))


proba = predictor.predict_proba(test_df)
pred = predictor.predict(test_df)


y_true = test_df[target].values
if isinstance(proba, pd.DataFrame) and 1 in proba.columns:
   y_proba = proba[1].values
else:
   y_proba = np.asarray(proba).reshape(-1)


print("\n=== Test Metrics ===")
# Built-in round() works whether the metric returns a Python float or a NumPy scalar
print("ROC-AUC:", round(roc_auc_score(y_true, y_proba), 5))
print("LogLoss:", round(log_loss(y_true, np.clip(y_proba, 1e-6, 1 - 1e-6)), 5))
print("Accuracy:", round(accuracy_score(y_true, pred), 5))
print("\nClassification report:\n", classification_report(y_true, pred))

We evaluate the trained models on a held-out test set and inspect the leaderboard to compare performance. We compute both probabilistic and discrete predictions and derive key classification metrics. This gives us a broader view of model accuracy and calibration.
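As a worked illustration of what the two probabilistic metrics measure, the sketch below computes ROC-AUC by pairwise comparison and log loss with clipping, using only the standard library. The function names `auc_by_pairs` and `binary_log_loss` are our own and are chosen so they do not shadow the scikit-learn imports; the four-row example is illustrative.

```python
import math

def auc_by_pairs(y_true, y_score):
    """Share of (positive, negative) pairs where the positive scores higher; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def binary_log_loss(y_true, y_prob, eps=1e-6):
    """Mean negative log-likelihood of the true labels, with probabilities clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(y_true, y_prob))   # 3 of 4 pairs are ranked correctly -> 0.75
print(round(binary_log_loss(y_true, y_prob), 4))
```

ROC-AUC depends only on how the scores rank the rows, while log loss also penalizes overconfident probabilities, which is why we report both.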

if "pclass" in test_df.columns:
   print("\n=== Slice AUC by pclass ===")
   for grp, part in test_df.groupby("pclass"):
       part_proba = predictor.predict_proba(part)
       part_proba = part_proba[1].values if isinstance(part_proba, pd.DataFrame) and 1 in part_proba.columns else np.asarray(part_proba).reshape(-1)
       auc = roc_auc_score(part[target].values, part_proba)
       print(f"pclass={grp}: AUC={auc:.4f} (n={len(part)})")


fi = predictor.feature_importance(test_df, silent=True)
print("\n=== Feature importance (top 20) ===")
display(fi.head(20))

We analyze model behavior through subgroup performance slicing and permutation-based feature importance. We see how performance varies across meaningful segments of the data. This helps us check robustness and interpretability before shipping.
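Permutation importance can be sketched in a few lines: shuffle one feature column, re-score the model, and treat the drop in the metric as that feature's importance. The example below is a simplified illustration with a hypothetical stand-in model and toy data; it does not reproduce AutoGluon's implementation, which also reports statistics over repeated shuffles.

```python
import random

def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Importance of feature j = baseline accuracy minus accuracy after shuffling column j."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(predict, shuffled, labels))
    return importances

# Toy data: the label equals feature 0; feature 1 is noise that the model ignores
rng = random.Random(1)
rows = [(rng.randint(0, 1), rng.random()) for _ in range(200)]
labels = [r[0] for r in rows]
predict = lambda r: r[0]  # hypothetical stand-in for a fitted model

imp = permutation_importance(predict, rows, labels, n_features=2)
print(imp)
```

Shuffling the informative feature costs a large share of the accuracy, while shuffling the ignored feature costs nothing, which is the pattern to look for in the `feature_importance` table.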

t0 = time.time()
refit_map = predictor.refit_full()
t_refit = time.time() - t0


print(f"\nrefit_full completed in {t_refit:.1f}s")
print("Refit mapping (sample):", dict(list(refit_map.items())[:5]))


lb_full = predictor.leaderboard(test_df, silent=True)
print("\n=== Leaderboard after refit_full (top 15) ===")
display(lb_full.head(15))


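# model_best / model_names() are preferred over the older get_model_best() / get_model_names()
# in recent AutoGluon releases.
# Note: with default settings, refit_full() may already have promoted a _FULL model to best,
# in which case the two latencies compared below can refer to the same model.
best_model = predictor.model_best
full_candidates = [m for m in predictor.model_names() if m.endswith("_FULL")]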


def bench_infer(model_name, df_in, repeats=3):
   times = []
   for _ in range(repeats):
       t1 = time.time()
       _ = predictor.predict(df_in, model=model_name)
       times.append(time.time() - t1)
   return float(np.median(times))


small_batch = test_df.drop(columns=[target]).head(256)
lat_best = bench_infer(best_model, small_batch)
print(f"\nBest model: {best_model} | median predict() latency on 256 rows: {lat_best:.4f}s")


if full_candidates:
   lb_full_sorted = lb_full.sort_values(by="score_test", ascending=False)
   best_full = lb_full_sorted[lb_full_sorted["model"].str.endswith("_FULL")].iloc[0]["model"]
   lat_full = bench_infer(best_full, small_batch)
   print(f"Best FULL model: {best_full} | median predict() latency on 256 rows: {lat_full:.4f}s")
   print(f"Speedup factor (best / full): {lat_best / max(lat_full, 1e-9):.2f}x")


try:
   t0 = time.time()
   distill_result = predictor.distill(
       train_data=train_df,
       time_limit=4 * 60,
       augment_method="spunge",
   )
   t_distill = time.time() - t0
   print(f"\nDistillation completed in {t_distill:.1f}s")
except Exception as e:
   print("\nDistillation step failed")
   print("Error:", repr(e))


lb2 = predictor.leaderboard(test_df, silent=True)
print("\n=== Leaderboard after distillation attempt (top 20) ===")
display(lb2.head(20))


predictor.save()
reloaded = TabularPredictor.load(save_path)


sample = test_df.drop(columns=[target]).sample(8, random_state=0)
sample_pred = reloaded.predict(sample)
sample_proba = reloaded.predict_proba(sample)


print("\n=== Reloaded predictor sanity-check ===")
print(sample.assign(pred=sample_pred).head())


print("\nProbabilities (head):")
display(sample_proba.head())


artifacts = {
   "path": save_path,
   "presets": presets,
   # model_best / model_names() are preferred over the older get_* accessors
   "best_model": reloaded.model_best,
   "model_names": reloaded.model_names(),
   "leaderboard_top10": lb2.head(10).to_dict(orient="records"),
}
with open(os.path.join(save_path, "run_summary.json"), "w") as f:
   json.dump(artifacts, f, indent=2)


print("\nSaved summary to:", os.path.join(save_path, "run_summary.json"))
print("Done.")

We optimize the trained ensemble for inference by collapsing the bagged models with refit-full and benchmarking the resulting latency. We optionally distill the ensemble into faster models and verify persistence with a save-and-reload check. Finally, we export the structured artifacts required for production deployment.
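The latency comparison above follows a general pattern that can be sketched independently of AutoGluon: time the same batch several times and report the median. The variant below adds an untimed warm-up call and uses `time.perf_counter`, both of which are our own suggestions rather than part of the code above; `slow_predict` and `fast_predict` are hypothetical stand-ins for a bagged ensemble and a lighter model.

```python
import statistics
import time

def median_latency(fn, batch, repeats=5, warmup=1):
    """Median wall-clock seconds for fn(batch), measured after untimed warm-up calls."""
    for _ in range(warmup):
        fn(batch)
    timings = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(batch)
        timings.append(time.perf_counter() - t0)
    return statistics.median(timings)

def slow_predict(batch):
    # Simulates a heavy model by doing extra work per row
    return [sum(x * x for x in range(2000)) + row for row in batch]

def fast_predict(batch):
    # Simulates a lightweight model
    return [row for row in batch]

batch = list(range(256))
lat_slow = median_latency(slow_predict, batch)
lat_fast = median_latency(fast_predict, batch)
print(f"slow={lat_slow:.6f}s fast={lat_fast:.6f}s speedup={lat_slow / max(lat_fast, 1e-9):.1f}x")
```

The median is less sensitive than the mean to a single slow run, and the warm-up call keeps one-time costs such as lazy model loading out of the measurement.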

In conclusion, we implemented an end-to-end workflow with AutoGluon that turns raw tabular data into production-ready models with minimal manual intervention, while keeping tight control over accuracy, robustness, and inference efficiency. We performed systematic error analysis and feature importance analysis, optimized large ensembles through refit-full and distillation, and verified deployment readiness using latency benchmarking and artifact packaging. This workflow supports deploying tabular models that are performant, scalable, interpretable, and well suited to real-world production environments.

