LLMs + Pandas: How to Use Generative AI to Produce DataFrame Summaries

Working with large datasets and want quick insights without endless manual grinding? You're in the right place.
In 2025, datasets often contain millions of rows and hundreds of columns, which makes manual inspection impractical. Large language models can turn your raw summary statistics into clear, readable reports in seconds – a couple of minutes at most. This approach removes the tedious process of analyzing data and writing reports by hand, especially when the data structure doesn't change.
Pandas handles the heavy lifting of data extraction while the LLM turns your technical results into presentable reports. You will still need to write the code that pulls the key statistics out of your dataset, but that is a one-time effort.
This guide assumes you have Ollama installed locally. If you don't, you can still use other LLM providers, but I won't explain how to connect to their APIs.
Contents:
- Dataset Introduction and Exploration
- The Boring Part: Extracting Summary Statistics
- The Cool Part: Working with LLMs
- Ways to Improve
Dataset Introduction and Exploration
Throughout this guide, I use the MBA Admissions dataset from Kaggle. Download it if you want to follow along.
The dataset is licensed under the Apache 2.0 License, which means you can use it in both personal and commercial projects.
To get started, you will need a few Python data libraries installed on your system – Pandas and the LangChain Ollama integration are the ones used here.
Once everything is installed, import the required libraries in a new script or notebook:
import pandas as pd
from langchain_ollama import ChatOllama
from typing import Literal
Data Loading and Preprocessing
Start by loading the data with Pandas. This snippet loads the CSV file, prints basic information about the dataset's shape, and shows how many missing values exist in each column:
df = pd.read_csv("data/MBA.csv")
# Basic dataset info
print(f"Dataset shape: {df.shape}n")
print("Missing value stats:")
print(df.isnull().sum())
print("-" * 25)
df.sample(5)

Since data cleaning is not the main focus of this article, I will keep this part brief. The data has a few missing values that need attention:
df["race"] = df["race"].fillna("Unknown")
df["admission"] = df["admission"].fillna("Deny")
That's all! Next, let's see how you can turn this into a meaningful report.
The Boring Part: Extracting Summary Statistics
Even with all the recent progress in AI capabilities, you may not want to send your entire dataset to an LLM provider. There are a few good reasons why.
You would likely use too many tokens, which translates directly into higher costs. Processing large datasets takes a long time, especially when you run models locally on your own hardware. And you might be dealing with sensitive data that should not leave your organization.
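To get a feel for the scale, here is a rough back-of-the-envelope estimate of what sending the raw CSV verbatim would involve (assuming the common heuristic of roughly 4 characters per token):
# Estimate how many tokens the raw CSV would consume if sent to an LLM as-is
raw_text = df.to_csv(index=False)
approx_tokens = len(raw_text) / 4  # rough heuristic: ~4 characters per token
print(f"Raw CSV: ~{len(raw_text):,} characters (~{approx_tokens:,.0f} tokens)")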
Writing a dedicated extraction function is the way to go instead.
This approach requires you to write a function that extracts the key facts and statistics from your Pandas DataFrame. You will have to rewrite this function from scratch for different datasets, but the basic idea transfers easily between projects.
The get_summary_context_message() function takes the DataFrame and returns a single multi-line string with a detailed summary. Here is what it includes:
- Total application count and gender distribution
- International vs. domestic breakdown
- GPA and GMAT score quartile statistics
- Admission rates by academic major (sorted from highest to lowest)
- Admission rates by work industry (top 8 industries)
- Work experience analysis with a categorical breakdown
- Key insights highlighting the best-performing categories from the above
Here is the full source code:
def get_summary_context_message(df: pd.DataFrame) -> str:
    """
    Generate a comprehensive summary report of MBA admissions dataset statistics.

    This function analyzes MBA application data to provide detailed statistics on
    applicant demographics, academic performance, professional backgrounds, and
    admission rates across various categories. The summary includes gender and
    international status distributions, GPA and GMAT score statistics, admission
    rates by academic major and work industry, and work experience impact analysis.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing MBA admissions data with the following expected columns:
        - 'gender', 'international', 'gpa', 'gmat', 'major', 'work_industry', 'work_exp', 'admission'

    Returns
    -------
    str
        A formatted multi-line string containing comprehensive MBA admissions
        statistics.
    """
    # Basic application statistics
    total_applications = len(df)

    # Gender distribution
    gender_counts = df["gender"].value_counts()
    male_count = gender_counts.get("Male", 0)
    female_count = gender_counts.get("Female", 0)

    # International status
    international_count = (
        df["international"].sum()
        if df["international"].dtype == bool
        else (df["international"] == True).sum()
    )

    # GPA statistics
    gpa_data = df["gpa"].dropna()
    gpa_avg = gpa_data.mean()
    gpa_25th = gpa_data.quantile(0.25)
    gpa_50th = gpa_data.quantile(0.50)
    gpa_75th = gpa_data.quantile(0.75)

    # GMAT statistics
    gmat_data = df["gmat"].dropna()
    gmat_avg = gmat_data.mean()
    gmat_25th = gmat_data.quantile(0.25)
    gmat_50th = gmat_data.quantile(0.50)
    gmat_75th = gmat_data.quantile(0.75)

    # Major analysis - admission rates by major
    major_stats = []
    for major in df["major"].unique():
        major_data = df[df["major"] == major]
        admitted = len(major_data[major_data["admission"] == "Admit"])
        total = len(major_data)
        rate = (admitted / total) * 100
        major_stats.append((major, admitted, total, rate))

    # Sort by admission rate (descending)
    major_stats.sort(key=lambda x: x[3], reverse=True)

    # Work industry analysis - admission rates by industry
    industry_stats = []
    for industry in df["work_industry"].unique():
        if pd.isna(industry):
            continue
        industry_data = df[df["work_industry"] == industry]
        admitted = len(industry_data[industry_data["admission"] == "Admit"])
        total = len(industry_data)
        rate = (admitted / total) * 100
        industry_stats.append((industry, admitted, total, rate))

    # Sort by admission rate (descending)
    industry_stats.sort(key=lambda x: x[3], reverse=True)

    # Work experience analysis
    work_exp_data = df["work_exp"].dropna()
    avg_work_exp_all = work_exp_data.mean()

    # Work experience for admitted students
    admitted_students = df[df["admission"] == "Admit"]
    admitted_work_exp = admitted_students["work_exp"].dropna()
    avg_work_exp_admitted = admitted_work_exp.mean()

    # Work experience ranges analysis
    def categorize_work_exp(exp):
        if pd.isna(exp):
            return "Unknown"
        elif exp < 2:
            return "0-1 years"
        elif exp < 4:
            return "2-3 years"
        elif exp < 6:
            return "4-5 years"
        elif exp < 8:
            return "6-7 years"
        else:
            return "8+ years"

    df["work_exp_category"] = df["work_exp"].apply(categorize_work_exp)

    work_exp_category_stats = []
    for category in ["0-1 years", "2-3 years", "4-5 years", "6-7 years", "8+ years"]:
        category_data = df[df["work_exp_category"] == category]
        if len(category_data) > 0:
            admitted = len(category_data[category_data["admission"] == "Admit"])
            total = len(category_data)
            rate = (admitted / total) * 100
            work_exp_category_stats.append((category, admitted, total, rate))

    # Build the summary message
    summary = f"""MBA Admissions Dataset Summary (2025)
Total Applications: {total_applications:,} people applied to the MBA program.
Gender Distribution:
- Male applicants: {male_count:,} ({male_count/total_applications*100:.1f}%)
- Female applicants: {female_count:,} ({female_count/total_applications*100:.1f}%)
International Status:
- International applicants: {international_count:,} ({international_count/total_applications*100:.1f}%)
- Domestic applicants: {total_applications-international_count:,} ({(total_applications-international_count)/total_applications*100:.1f}%)
Academic Performance Statistics:
GPA Statistics:
- Average GPA: {gpa_avg:.2f}
- 25th percentile: {gpa_25th:.2f}
- 50th percentile (median): {gpa_50th:.2f}
- 75th percentile: {gpa_75th:.2f}
GMAT Statistics:
- Average GMAT: {gmat_avg:.0f}
- 25th percentile: {gmat_25th:.0f}
- 50th percentile (median): {gmat_50th:.0f}
- 75th percentile: {gmat_75th:.0f}
Major Analysis - Admission Rates by Academic Background:"""

    for major, admitted, total, rate in major_stats:
        summary += (
            f"\n- {major}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    summary += (
        "\n\nWork Industry Analysis - Admission Rates by Professional Background:"
    )

    # Show top 8 industries by admission rate
    for industry, admitted, total, rate in industry_stats[:8]:
        summary += (
            f"\n- {industry}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    summary += "\n\nWork Experience Impact on Admissions:\n\nOverall Work Experience Comparison:"
    summary += (
        f"\n- Average work experience (all applicants): {avg_work_exp_all:.1f} years"
    )
    summary += f"\n- Average work experience (admitted students): {avg_work_exp_admitted:.1f} years"

    summary += "\n\nAdmission Rates by Work Experience Range:"
    for category, admitted, total, rate in work_exp_category_stats:
        summary += (
            f"\n- {category}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    # Key insights
    best_major = major_stats[0]
    best_industry = industry_stats[0]

    summary += "\n\nKey Insights:"
    summary += (
        f"\n- Highest admission rate by major: {best_major[0]} at {best_major[3]:.1f}%"
    )
    summary += f"\n- Highest admission rate by industry: {best_industry[0]} at {best_industry[3]:.1f}%"

    if avg_work_exp_admitted > avg_work_exp_all:
        summary += f"\n- Admitted students have slightly more work experience on average ({avg_work_exp_admitted:.1f} vs {avg_work_exp_all:.1f} years)"
    else:
        summary += "\n- Work experience shows minimal difference between admitted and all applicants"

    return summary
Once you have defined the function, simply call it and print the results:
print(get_summary_context_message(df))

Now let's move on to the fun part.
The Cool Part: Working with LLMs
This is where things get interesting and your data extraction work pays off.
Setting Up the LLM Connection
If your hardware allows it, I strongly recommend running local LLMs for a task like this. I am using Ollama and the latest version of the Mistral model for the actual LLM inference.

If you want to use something like ChatGPT through the OpenAI API, you can still do that. You will just need to change the function below to set your API key and return the appropriate model class from LangChain.
Regardless of what you choose, calling get_llm() with a test message should not return an error:
def get_llm(model_name: str = "mistral:latest") -> ChatOllama:
    """
    Create and configure a ChatOllama instance for local LLM inference.

    This function initializes a ChatOllama client configured to connect to a
    local Ollama server. The client is set up with deterministic output
    (temperature=0) for consistent responses across multiple calls with the
    same input.

    Parameters
    ----------
    model_name : str, optional
        The name of the Ollama model to use for chat completions.
        Must be a valid model name that is available on the local Ollama
        installation. Default is "mistral:latest".

    Returns
    -------
    ChatOllama
        A configured ChatOllama instance ready for chat completions.
    """
    return ChatOllama(
        model=model_name,
        base_url="http://localhost:11434",  # default local Ollama server address
        temperature=0,
    )
print(get_llm().invoke("test").content)
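For reference, a hypothetical OpenAI-backed variant could look like the sketch below (it assumes the langchain-openai package is installed and an OPENAI_API_KEY environment variable is set; the function name and model choice are just examples):
from langchain_openai import ChatOpenAI  # assumes: pip install langchain-openai

def get_openai_llm(model_name: str = "gpt-4o-mini") -> ChatOpenAI:
    # ChatOpenAI reads the OPENAI_API_KEY environment variable by default
    return ChatOpenAI(model=model_name, temperature=0)

print(get_openai_llm().invoke("test").content)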

The Summarization Prompt
This is where you get to be creative and write the instructions for your LLM. I decided to keep things simple for demonstration purposes, but feel free to experiment here.
There is no single correct prompt.
Whatever you do, be sure to include formatting arguments using curly brackets – the values will be filled in automatically at runtime:
SUMMARIZE_DATAFRAME_PROMPT = """
You are an expert data analyst and data summarizer. Your task is to take in complex datasets
and return user-friendly descriptions and findings.
You were given this dataset:
- Name: {dataset_name}
- Source: {dataset_source}
This dataset was analyzed in a pipeline before it was given to you.
These are the findings returned by the analysis pipeline:
{context}
Based on these findings, write a detailed report in {report_format} format.
Give the report a meaningful title and separate findings into sections with headings and subheadings.
Output only the report in {report_format} and nothing else.
Report:
"""
The Summarization Function
With the prompt and the get_llm() function in place, the only thing left is to connect the dots. The get_report_summary() function takes the arguments needed to fill the prompt's format fields.
You can choose between Markdown and HTML formats:
def get_report_summary(
    dataset: pd.DataFrame,
    dataset_name: str,
    dataset_source: str,
    report_format: Literal["markdown", "html"] = "markdown",
) -> str:
    """
    Generate an AI-powered summary report from a pandas DataFrame.

    This function analyzes a dataset and generates a comprehensive summary report
    using a large language model (LLM). It first extracts statistical context
    from the dataset, then uses an LLM to create a human-readable report in the
    specified format.

    Parameters
    ----------
    dataset : pd.DataFrame
        The pandas DataFrame to analyze and summarize.
    dataset_name : str
        A descriptive name for the dataset that will be included in the
        generated report for context and identification.
    dataset_source : str
        Information about the source or origin of the dataset.
    report_format : {"markdown", "html"}, optional
        The desired output format for the generated report. Options are:
        - "markdown" : Generate report in Markdown format (default)
        - "html" : Generate report in HTML format

    Returns
    -------
    str
        A formatted summary report.
    """
    context_message = get_summary_context_message(df=dataset)
    prompt = SUMMARIZE_DATAFRAME_PROMPT.format(
        dataset_name=dataset_name,
        dataset_source=dataset_source,
        context=context_message,
        report_format=report_format,
    )
    return get_llm().invoke(input=prompt).content
Using the function is straightforward – just pass in the DataFrame, its name, and its source. Here is the report in Markdown format:
md_report = get_report_summary(
    dataset=df,
    dataset_name="MBA Admissions (2025)",
    dataset_source="Kaggle",  # the original passes a link to the Kaggle dataset page
)
print(md_report)

The HTML report contains the same details, but it could use some styling. Maybe you can ask the LLM to add that!
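For example, you could request the HTML version and write it straight to a file (the file name here is arbitrary):
# Generate the HTML report and save it to disk
html_report = get_report_summary(
    dataset=df,
    dataset_name="MBA Admissions (2025)",
    dataset_source="Kaggle",
    report_format="html",
)

with open("mba_admissions_report.html", "w", encoding="utf-8") as f:
    f.write(html_report)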

Ways to Improve
I could easily have spent another 30 minutes building out all the details of this pipeline, but I kept it simple for demonstration purposes. You don't have to (and shouldn't) stop here.
Here are some options for making this pipeline more powerful:
- Write a function that saves the report (Markdown or HTML) directly to disk. This way you can automate the whole process and generate reports programmatically without any manual steps.
- In the prompt, ask the LLM to add CSS styling to the HTML report to make it look more polished. You can provide your company's brand colors and fonts so that all of your data reports stay consistent.
- Expand the prompt to follow specific instructions. You might want reports that focus on certain business metrics, follow a specific template, or phrase findings with compliance requirements in mind.
- Extend the get_llm() function so it can connect both to Ollama and to other providers like OpenAI, Anthropic, or Google (see the sketch after this list). This gives you the flexibility to switch between local and cloud models according to your needs.
- Fine-tune what goes into the get_summary_context_message() function, since it serves as the basis for all the information given to the LLM. This is where feature engineering, statistical analysis, and your understanding of the data matter most for your specific use case.
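On the multi-provider point, a hypothetical provider-aware variant of get_llm() might look like this sketch (it assumes the relevant LangChain integration package is installed for whichever provider you enable; function and model names are illustrative):
from typing import Literal, Optional

def get_llm_v2(
    provider: Literal["ollama", "openai"] = "ollama",
    model_name: Optional[str] = None,
):
    # Sketch only: route to a local Ollama model or a cloud provider
    if provider == "ollama":
        return ChatOllama(model=model_name or "mistral:latest", temperature=0)
    if provider == "openai":
        # Assumes langchain-openai is installed and OPENAI_API_KEY is set
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model=model_name or "gpt-4o-mini", temperature=0)
    raise ValueError(f"Unsupported provider: {provider}")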
I hope this small example sets you on the right track to transform your data reporting workflow.



