
LLMs + Pandas: How I Use AI to Produce DataFrame Summaries

Have a dataset and want quick insights without extra grinding? You've come to the right place.

In 2025, datasets often contain millions of rows and hundreds of columns, making manual analysis impractical. Large language models can turn your raw data statistics into clean, readable reports in seconds, or a few minutes at most. This approach removes the tedious process of turning data analysis into handwritten reports, especially when the data structure does not change.

Pandas handles the heavy lifting of data extraction, while the LLM turns your technical results into readable reports. You will still need to write functions that pull the main statistics out of your datasets, but that is a one-time effort.

This guide assumes you have Ollama installed locally. If you don't, you can still use other LLM providers, but I won't explain how to connect to their APIs.

Content:

  • Dataset Introduction and Exploration
  • The Boring Part: Extracting Summary Statistics
  • The Cool Part: Working with LLMs
  • Ways to Improve

Dataset Introduction and Exploration

Throughout this guide, I use the MBA Admissions dataset from Kaggle. Download it if you want to follow along.

The dataset is licensed under the Apache 2.0 License, which means you can use it in both personal and commercial projects.

To get started, you will need a few Python data libraries installed on your system.

Image 1 – Required Python libraries and versions (Image by author)
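If any of these are missing, they can be installed with pip. The package name for the Ollama integration is assumed to be langchain-ollama, matching the import used below:

```shell
# Install the two third-party libraries used in this guide
# (typing is part of the Python standard library)
pip install pandas langchain-ollama
```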

Once everything is installed, import the required libraries in a new script or notebook:

import pandas as pd
from langchain_ollama import ChatOllama
from typing import Literal

Data Loading and Preprocessing

Start by loading the data with pandas. The following snippet loads the CSV file, prints basic information about the dataset's shape, and shows how many missing values exist in each column:

df = pd.read_csv("data/MBA.csv")

# Basic dataset info
print(f"Dataset shape: {df.shape}\n")
print("Missing value stats:")
print(df.isnull().sum())
print("-" * 25)
df.sample(5)
Image 2 – Basic dataset statistics (Image by author)

Since data cleaning is not the main focus of this article, I will keep the preprocessing to a minimum. The data has a few missing values that need attention:

df["race"] = df["race"].fillna("Unknown")
df["admission"] = df["admission"].fillna("Deny")
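As a quick sanity check, the same fillna pattern on a toy frame with the article's column names (the values here are made up) behaves like this:

```python
import pandas as pd

# Toy frame mimicking the two columns with missing values
toy = pd.DataFrame({
    "race": ["White", None, "Asian"],
    "admission": [None, "Admit", None],
})

# Same imputation strategy as in the article
toy["race"] = toy["race"].fillna("Unknown")
toy["admission"] = toy["admission"].fillna("Deny")

print(toy["race"].tolist())       # ['White', 'Unknown', 'Asian']
print(toy["admission"].tolist())  # ['Deny', 'Admit', 'Deny']
print(toy.isnull().sum().sum())   # 0 missing values remain
```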

That's all! Next, let's see how you can turn this into a meaningful report.

The Boring Part: Extracting Summary Statistics

Even with all the recent progress in AI capabilities, you may not want to send all of your data to an LLM provider. There are a few good reasons why.

You would likely use too many tokens, which translates directly to higher costs. Processing large datasets takes a long time, especially when you run models locally on your own hardware. And you might be dealing with sensitive data that should not leave your organization.
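To see why token counts matter, here is a rough back-of-the-envelope sketch, assuming the common heuristic of roughly four characters per token (the numbers are illustrative, not exact):

```python
import pandas as pd

def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate using the ~4-characters-per-token heuristic."""
    return len(text) // chars_per_token

# A small synthetic frame stands in for a real dataset here
df_demo = pd.DataFrame({
    "gpa": [3.1, 3.5, 3.9] * 1000,
    "gmat": [650, 700, 720] * 1000,
})

raw_tokens = estimate_tokens(df_demo.to_string())          # the whole dataset as text
summary_tokens = estimate_tokens(str(df_demo.describe()))  # summary statistics only

print(raw_tokens, summary_tokens)  # the summary is far smaller than the raw dump
```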

Writing custom extraction functions is the way to go.

This approach requires you to write a function that extracts the key facts and statistics from your pandas DataFrame. You will have to rewrite this function from scratch for different datasets, but the basic idea transfers easily between projects.
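To illustrate how the idea transfers, here is a small sketch of a reusable rate-by-group helper on a toy frame (the helper and data are my own, using groupby rather than explicit loops):

```python
import pandas as pd

def rate_by_group(df: pd.DataFrame, group_col: str, target_col: str, positive: str) -> pd.Series:
    """Share of rows per group where target_col equals `positive`, in percent, sorted descending."""
    return (
        df.groupby(group_col)[target_col]
        .apply(lambda s: (s == positive).mean() * 100)
        .sort_values(ascending=False)
    )

# Hypothetical mini-dataset reusing the article's column names
mini = pd.DataFrame({
    "major": ["STEM", "STEM", "Business", "Business"],
    "admission": ["Admit", "Deny", "Admit", "Admit"],
})
rates = rate_by_group(mini, "major", "admission", "Admit")
print(rates)  # Business 100.0, STEM 50.0
```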

The get_summary_context_message() function takes a DataFrame and returns a single multi-line string with a detailed summary. Here is what it includes:

  • Total application count and gender distribution
  • International vs. domestic applicant breakdown
  • GPA and GMAT score statistics with quartiles
  • Admission rates by academic major (sorted by rate)
  • Admission rates by work industry (top 8 industries)
  • Work experience analysis with a categorical breakdown
  • Key insights highlighting the top categories from the analysis above

Here is the full source code:

def get_summary_context_message(df: pd.DataFrame) -> str:
    """
    Generate a comprehensive summary report of MBA admissions dataset statistics.
    
    This function analyzes MBA application data to provide detailed statistics on
    applicant demographics, academic performance, professional backgrounds, and
    admission rates across various categories. The summary includes gender and
    international status distributions, GPA and GMAT score statistics, admission
    rates by academic major and work industry, and work experience impact analysis.
    
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing MBA admissions data with the following expected columns:
        - 'gender', 'international', 'gpa', 'gmat', 'major', 'work_industry', 'work_exp', 'admission'
    
    Returns
    -------
    str
        A formatted multi-line string containing comprehensive MBA admissions
        statistics.
    """
    # Basic application statistics
    total_applications = len(df)

    # Gender distribution
    gender_counts = df["gender"].value_counts()
    male_count = gender_counts.get("Male", 0)
    female_count = gender_counts.get("Female", 0)

    # International status
    international_count = (
        df["international"].sum()
        if df["international"].dtype == bool
        else (df["international"] == True).sum()
    )

    # GPA statistics
    gpa_data = df["gpa"].dropna()
    gpa_avg = gpa_data.mean()
    gpa_25th = gpa_data.quantile(0.25)
    gpa_50th = gpa_data.quantile(0.50)
    gpa_75th = gpa_data.quantile(0.75)

    # GMAT statistics
    gmat_data = df["gmat"].dropna()
    gmat_avg = gmat_data.mean()
    gmat_25th = gmat_data.quantile(0.25)
    gmat_50th = gmat_data.quantile(0.50)
    gmat_75th = gmat_data.quantile(0.75)

    # Major analysis - admission rates by major
    major_stats = []
    for major in df["major"].unique():
        major_data = df[df["major"] == major]
        admitted = len(major_data[major_data["admission"] == "Admit"])
        total = len(major_data)
        rate = (admitted / total) * 100
        major_stats.append((major, admitted, total, rate))

    # Sort by admission rate (descending)
    major_stats.sort(key=lambda x: x[3], reverse=True)

    # Work industry analysis - admission rates by industry
    industry_stats = []
    for industry in df["work_industry"].unique():
        if pd.isna(industry):
            continue
        industry_data = df[df["work_industry"] == industry]
        admitted = len(industry_data[industry_data["admission"] == "Admit"])
        total = len(industry_data)
        rate = (admitted / total) * 100
        industry_stats.append((industry, admitted, total, rate))

    # Sort by admission rate (descending)
    industry_stats.sort(key=lambda x: x[3], reverse=True)

    # Work experience analysis
    work_exp_data = df["work_exp"].dropna()
    avg_work_exp_all = work_exp_data.mean()

    # Work experience for admitted students
    admitted_students = df[df["admission"] == "Admit"]
    admitted_work_exp = admitted_students["work_exp"].dropna()
    avg_work_exp_admitted = admitted_work_exp.mean()

    # Work experience ranges analysis
    def categorize_work_exp(exp):
        if pd.isna(exp):
            return "Unknown"
        elif exp < 2:
            return "0-1 years"
        elif exp < 4:
            return "2-3 years"
        elif exp < 6:
            return "4-5 years"
        elif exp < 8:
            return "6-7 years"
        else:
            return "8+ years"

    df["work_exp_category"] = df["work_exp"].apply(categorize_work_exp)
    work_exp_category_stats = []

    for category in ["0-1 years", "2-3 years", "4-5 years", "6-7 years", "8+ years"]:
        category_data = df[df["work_exp_category"] == category]
        if len(category_data) > 0:
            admitted = len(category_data[category_data["admission"] == "Admit"])
            total = len(category_data)
            rate = (admitted / total) * 100
            work_exp_category_stats.append((category, admitted, total, rate))

    # Build the summary message
    summary = f"""MBA Admissions Dataset Summary (2025)
    
Total Applications: {total_applications:,} people applied to the MBA program.

Gender Distribution:
- Male applicants: {male_count:,} ({male_count/total_applications*100:.1f}%)
- Female applicants: {female_count:,} ({female_count/total_applications*100:.1f}%)

International Status:
- International applicants: {international_count:,} ({international_count/total_applications*100:.1f}%)
- Domestic applicants: {total_applications-international_count:,} ({(total_applications-international_count)/total_applications*100:.1f}%)

Academic Performance Statistics:

GPA Statistics:
- Average GPA: {gpa_avg:.2f}
- 25th percentile: {gpa_25th:.2f}
- 50th percentile (median): {gpa_50th:.2f}
- 75th percentile: {gpa_75th:.2f}

GMAT Statistics:
- Average GMAT: {gmat_avg:.0f}
- 25th percentile: {gmat_25th:.0f}
- 50th percentile (median): {gmat_50th:.0f}
- 75th percentile: {gmat_75th:.0f}

Major Analysis - Admission Rates by Academic Background:"""

    for major, admitted, total, rate in major_stats:
        summary += (
            f"\n- {major}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    summary += (
        "\n\nWork Industry Analysis - Admission Rates by Professional Background:"
    )

    # Show top 8 industries by admission rate
    for industry, admitted, total, rate in industry_stats[:8]:
        summary += (
            f"\n- {industry}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    summary += "\n\nWork Experience Impact on Admissions:\n\nOverall Work Experience Comparison:"
    summary += (
        f"\n- Average work experience (all applicants): {avg_work_exp_all:.1f} years"
    )
    summary += f"\n- Average work experience (admitted students): {avg_work_exp_admitted:.1f} years"

    summary += "\n\nAdmission Rates by Work Experience Range:"
    for category, admitted, total, rate in work_exp_category_stats:
        summary += (
            f"\n- {category}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    # Key insights
    best_major = major_stats[0]
    best_industry = industry_stats[0]

    summary += "\n\nKey Insights:"
    summary += (
        f"\n- Highest admission rate by major: {best_major[0]} at {best_major[3]:.1f}%"
    )
    summary += f"\n- Highest admission rate by industry: {best_industry[0]} at {best_industry[3]:.1f}%"

    if avg_work_exp_admitted > avg_work_exp_all:
        summary += f"\n- Admitted students have slightly more work experience on average ({avg_work_exp_admitted:.1f} vs {avg_work_exp_all:.1f} years)"
    else:
        summary += "\n- Work experience shows minimal difference between admitted and all applicants"

    return summary

Once you have defined the function, simply call it and print the results:

print(get_summary_context_message(df))
Image 3 – Extracted findings and statistics from the dataset (Image by author)

Now let's move on to the fun part.

The Cool Part: Working with LLMs

This is where things get interesting and your data extraction work pays off.

Setting Up the LLM Connection

If your hardware allows it, I strongly recommend using a local LLM for a task like this. I am using Ollama with the latest version of the Mistral model, which delivers solid performance for a locally runnable model.

Image 4 – Available models in Ollama (Image by author)

If you want to use something like ChatGPT through the OpenAI API, you can still do that. You will just need to modify the function below to set your API key and return the appropriate model instance from LangChain.

Regardless of what you choose, calling get_llm() with a test message should not return an error:

def get_llm(model_name: str = "mistral:latest") -> ChatOllama:
    """
    Create and configure a ChatOllama instance for local LLM inference.
    
    This function initializes a ChatOllama client configured to connect to a
    local Ollama server. The client is set up with deterministic output
    (temperature=0) for consistent responses across multiple calls with the
    same input.
    
    Parameters
    ----------
    model_name : str, optional
        The name of the Ollama model to use for chat completions.
        Must be a valid model name that is available on the local Ollama
        installation. Default is "mistral:latest".
    
    Returns
    -------
    ChatOllama
        A configured ChatOllama instance ready for chat completions.
    """
    return ChatOllama(
        model=model_name, base_url="http://localhost:11434", temperature=0
    )


print(get_llm().invoke("test").content)
Image 5 – LLM test message (Image by author)

Writing the Summarization Prompt

This is where you can get creative and write the instructions for your LLM. I decided to keep things simple for demonstration purposes, but feel free to experiment here.

There is no single correct prompt.

Whatever you do, make sure to include formatting placeholders in curly brackets; the values will be filled in automatically at runtime:

SUMMARIZE_DATAFRAME_PROMPT = """
You are an expert data analyst and data summarizer. Your task is to take in complex datasets
and return user-friendly descriptions and findings.

You were given this dataset:
- Name: {dataset_name}
- Source: {dataset_source}

This dataset was analyzed in a pipeline before it was given to you.
These are the findings returned by the analysis pipeline:


{context}


Based on these findings, write a detailed report in {report_format} format.
Give the report a meaningful title and separate findings into sections with headings and subheadings.
Output only the report in {report_format} and nothing else.

Report:
"""
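Under the hood this placeholder mechanism is plain str.format; a minimal sketch with a shortened template (the values are illustrative):

```python
# Shortened stand-in for SUMMARIZE_DATAFRAME_PROMPT, to show how .format fills placeholders
template = (
    "Dataset: {dataset_name} (source: {dataset_source})\n"
    "Write the report in {report_format} format."
)
filled = template.format(
    dataset_name="MBA Admissions (2025)",
    dataset_source="Kaggle",
    report_format="markdown",
)
print(filled)
```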

The Report Summary Function

With get_summary_context_message() and get_llm() defined, the only thing left is to connect the dots. The get_report_summary() function takes arguments that fill in the prompt's format placeholders.

You can choose between Markdown and HTML formats:

def get_report_summary(
    dataset: pd.DataFrame,
    dataset_name: str,
    dataset_source: str,
    report_format: Literal["markdown", "html"] = "markdown",
) -> str:
    """
    Generate an AI-powered summary report from a pandas DataFrame.
    
    This function analyzes a dataset and generates a comprehensive summary report
    using a large language model (LLM). It first extracts statistical context
    from the dataset, then uses an LLM to create a human-readable report in the
    specified format.
    
    Parameters
    ----------
    dataset : pd.DataFrame
        The pandas DataFrame to analyze and summarize.
    dataset_name : str
        A descriptive name for the dataset that will be included in the
        generated report for context and identification.
    dataset_source : str
        Information about the source or origin of the dataset.
    report_format : {"markdown", "html"}, optional
        The desired output format for the generated report. Options are:
        - "markdown" : Generate report in Markdown format (default)
        - "html" : Generate report in HTML format
    
    Returns
    -------
    str
        A formatted summary report.
    
    """
    context_message = get_summary_context_message(df=dataset)
    prompt = SUMMARIZE_DATAFRAME_PROMPT.format(
        dataset_name=dataset_name,
        dataset_source=dataset_source,
        context=context_message,
        report_format=report_format,
    )
    return get_llm().invoke(input=prompt).content

Using the function is straightforward: just pass in the DataFrame, its name, and its source. Here is the report in Markdown format:

md_report = get_report_summary(
    dataset=df,
    dataset_name="MBA Admissions (2025)",
    dataset_source="Kaggle",
)
print(md_report)
Image 6 – The final report in Markdown format (Image by author)

The HTML report has the same contents, but it could use some styling. Maybe you can ask the LLM to handle that, too!

Image 7 – The final report in HTML format (Image by author)
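If you would rather handle the styling yourself instead of asking the LLM, one option is to wrap the returned HTML fragment in a styled page. A minimal sketch (the helper name and colors are my own):

```python
def wrap_with_style(html_report: str, accent: str = "#2c3e50") -> str:
    """Embed an HTML report fragment in a full page with basic CSS styling."""
    return f"""<!DOCTYPE html>
<html>
<head>
<style>
  body {{ font-family: sans-serif; max-width: 800px; margin: 2rem auto; }}
  h1, h2 {{ color: {accent}; }}
</style>
</head>
<body>
{html_report}
</body>
</html>"""

# Usage sketch; a real fragment would come from get_report_summary(..., report_format="html")
styled = wrap_with_style("<h1>MBA Admissions Report</h1><p>Findings...</p>")
```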

Ways to Improve

I could easily turn this into a 30-minute read by building out every detail of the pipeline, but I kept it simple for demonstration purposes. You don't have to (and shouldn't) stop here.

Here are some options to make this pipeline more powerful:

  • Write a function that saves the report (Markdown or HTML) directly to disk. This way you can automate the whole process and produce scheduled reports without any manual steps.
  • Similarly, ask the LLM to add CSS styling to the HTML report to make it look more polished. You can provide your company's brand colors and fonts so that all your data reports stay consistent.
  • Extend the prompt with more specific instructions. You might want reports that focus on certain business metrics, follow a specific template, or frame findings with compliance in mind.
  • Extend the get_llm() function so it can connect both to Ollama and to other providers like OpenAI, Anthropic, or Google. This gives you the flexibility to switch between local and cloud models depending on your needs.
  • Most importantly, invest in the get_summary_context_message() function, since it serves as the foundation for all the information the LLM receives. This is where feature engineering, statistical analysis, and surfacing the insights that matter for your specific use case come in.
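The first suggestion, saving reports to disk, can be sketched in a few lines (the save_report helper and its extension mapping are my own, not part of the article's code):

```python
from pathlib import Path

def save_report(report: str, path: str, report_format: str = "markdown") -> Path:
    """Write a generated report to disk with a file extension matching its format."""
    ext = {"markdown": ".md", "html": ".html"}[report_format]
    out = Path(path).with_suffix(ext)
    out.parent.mkdir(parents=True, exist_ok=True)  # create target folder if needed
    out.write_text(report, encoding="utf-8")
    return out

# Usage sketch; the report string would come from get_report_summary()
saved = save_report("# MBA Admissions Report\n\nFindings...", "reports/mba_summary")
print(saved)
```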

I hope this small example sets you on the right track for transforming your data reporting workflows.
