A Coding Implementation of Document Parsing Benchmarking with LlamaIndex ParseBench Using Python, Hugging Face, and Test Benchmarks

In this tutorial, we explore how to use the ParseBench dataset to test document parsing systems in a systematic, realistic way. We start by loading the dataset directly from Hugging Face, explore its multiple dimensions, such as text, tables, charts, and layout, and convert it into a unified dataframe for in-depth analysis. As we progress, we identify key fields, find linked PDFs, and build a lightweight baseline using PyMuPDF to extract and compare text. Throughout this process, we focus on creating a flexible pipeline that allows us to understand the schema of the dataset, check parsing quality, and prepare inputs for more advanced OCR or vision language models.

!pip install -q -U datasets huggingface_hub pandas matplotlib rich pymupdf rapidfuzz tqdm


import json, re, textwrap, random, math
from pathlib import Path
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from huggingface_hub import hf_hub_download, list_repo_files
from rapidfuzz import fuzz
import fitz


console = Console()
DATASET_ID = "llamaindex/ParseBench"
WORKDIR = Path("/content/parsebench_tutorial")
WORKDIR.mkdir(parents=True, exist_ok=True)


console.print(Panel.fit("Advanced ParseBench Tutorial on Google Colab", style="bold green"))


files = list_repo_files(DATASET_ID, repo_type="dataset")
jsonl_files = [f for f in files if f.endswith(".jsonl")]
pdf_files = [f for f in files if f.endswith(".pdf")]


console.print(f"Found {len(jsonl_files)} JSONL files")
console.print(f"Found {len(pdf_files)} PDF files")


table = Table(title="ParseBench JSONL Files")
table.add_column("File")
table.add_column("Dimension")
for f in jsonl_files:
   table.add_row(f, Path(f).stem)
console.print(table)

We install all the necessary libraries and set up our Colab workspace. We initialize the dataset source and configure a working directory to store results. We also list all JSONL and PDF files in the ParseBench repository to understand the structure of the dataset.
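As a minimal illustration of the file-listing step, the snippet below filters a hypothetical repository listing by extension. The filenames here are made up for illustration; in the tutorial, the listing comes from `list_repo_files`.

```python
# Hypothetical repo listing, standing in for
# list_repo_files(DATASET_ID, repo_type="dataset")
files = [
    "text/examples.jsonl",
    "tables/examples.jsonl",
    "docs/report_01.pdf",
    "README.md",
]

# Split the listing by extension, as the tutorial does
jsonl_files = [f for f in files if f.endswith(".jsonl")]
pdf_files = [f for f in files if f.endswith(".pdf")]

print(jsonl_files)  # ['text/examples.jsonl', 'tables/examples.jsonl']
print(pdf_files)    # ['docs/report_01.pdf']
```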

def load_jsonl_from_hf(filename, max_rows=None):
   path = hf_hub_download(repo_id=DATASET_ID, filename=filename, repo_type="dataset")
   rows = []
   with open(path, "r", encoding="utf-8") as fp:
       for i, line in enumerate(fp):
           if max_rows and i >= max_rows:
               break
           line = line.strip()
           if line:
               rows.append(json.loads(line))
   return rows, path


def flatten_dict(d, parent_key="", sep="."):
   items = {}
   if isinstance(d, dict):
       for k, v in d.items():
           new_key = f"{parent_key}{sep}{k}" if parent_key else str(k)
           if isinstance(v, dict):
               items.update(flatten_dict(v, new_key, sep=sep))
           else:
               items[new_key] = v
   return items


dimension_data = {}
for jf in jsonl_files:
   rows, local_path = load_jsonl_from_hf(jf)
   dimension_data[Path(jf).stem] = rows
   console.print(f"{jf}: {len(rows)} examples loaded")


summary_rows = []
for dim, rows in dimension_data.items():
   keys = Counter()
   for r in rows[:100]:
       keys.update(flatten_dict(r).keys())
   summary_rows.append({
       "dimension": dim,
       "examples": len(rows),
       "top_fields": ", ".join([k for k, _ in keys.most_common(12)])
   })


summary_df = pd.DataFrame(summary_rows)
display(summary_df)


plt.figure(figsize=(10, 5))
plt.bar(summary_df["dimension"], summary_df["examples"])
plt.title("ParseBench Examples by Dimension")
plt.xlabel("Dimension")
plt.ylabel("Number of Examples")
plt.xticks(rotation=30, ha="right")
plt.show()


for dim, rows in dimension_data.items():
   console.print(Panel.fit(f"Sample schema for {dim}", style="bold cyan"))
   if rows:
       console.print(json.dumps(rows[0], indent=2)[:3000])

We load the JSONL files from the Hugging Face dataset and parse each line into Python dictionaries. We flatten nested structures for easy analysis in tabular form. We also summarize each dimension and visualize the distribution of examples across the different parsing tasks.
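To see exactly what the flattening step produces, here is `flatten_dict` applied to a small hypothetical record. The field names are illustrative, not the dataset's actual schema.

```python
def flatten_dict(d, parent_key="", sep="."):
    # Recursively flatten nested dicts into dotted keys
    items = {}
    if isinstance(d, dict):
        for k, v in d.items():
            new_key = f"{parent_key}{sep}{k}" if parent_key else str(k)
            if isinstance(v, dict):
                items.update(flatten_dict(v, new_key, sep=sep))
            else:
                items[new_key] = v
    return items

# Illustrative record with nesting two levels deep
record = {"doc": {"path": "a.pdf", "meta": {"pages": 3}}, "label": "table"}
flat = flatten_dict(record)
print(flat)
# {'doc.path': 'a.pdf', 'doc.meta.pages': 3, 'label': 'table'}
```

Flattening like this lets every record, however deeply nested, become a single row in a pandas dataframe.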

all_records = []
for dim, rows in dimension_data.items():
   for i, r in enumerate(rows):
       flat = flatten_dict(r)
       flat["_dimension"] = dim
       flat["_row_id"] = i
       all_records.append(flat)


df = pd.DataFrame(all_records)
console.print(f"Combined dataframe shape: {df.shape}")
display(df.head())


missing_report = []
for col in df.columns:
   missing_report.append({
       "column": col,
       "non_null": int(df[col].notna().sum()),
       "missing": int(df[col].isna().sum()),
       "coverage_pct": round(100 * df[col].notna().mean(), 2)
   })


missing_df = pd.DataFrame(missing_report).sort_values("coverage_pct", ascending=False)
display(missing_df.head(40))


def find_candidate_columns(df, keywords):
   cols = []
   for c in df.columns:
       lc = c.lower()
       if any(k.lower() in lc for k in keywords):
           cols.append(c)
   return cols


doc_cols = find_candidate_columns(df, ["doc", "pdf", "file", "path", "source", "image"])
text_cols = find_candidate_columns(df, ["text", "content", "markdown", "ground", "answer", "expected", "target", "reference"])
rule_cols = find_candidate_columns(df, ["rule", "check", "assert", "criteria", "question", "prompt"])
bbox_cols = find_candidate_columns(df, ["bbox", "box", "polygon", "coordinates", "layout"])


console.print("[bold]Possible document columns:[/bold]", doc_cols[:30])
console.print("[bold]Possible text/reference columns:[/bold]", text_cols[:30])
console.print("[bold]Possible rule/question columns:[/bold]", rule_cols[:30])
console.print("[bold]Possible layout columns:[/bold]", bbox_cols[:30])

We combine all flattened records into a single dataframe for unified analysis. We check for missing values and identify which fields are most informative across the dataset. We also find candidate columns related to documents, text, rules, and layout to guide downstream processing.
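The column-discovery heuristic can be sketched without pandas: the simplified version below operates on a plain list of column names (the names are invented for the example) and shows how keyword matching surfaces candidates.

```python
def find_candidate_columns(columns, keywords):
    # Keep any column whose lowercased name contains a keyword
    cols = []
    for c in columns:
        lc = c.lower()
        if any(k.lower() in lc for k in keywords):
            cols.append(c)
    return cols

# Hypothetical flattened column names
columns = ["doc.path", "ground_truth.text", "bbox", "score"]

print(find_candidate_columns(columns, ["doc", "pdf", "file"]))  # ['doc.path']
print(find_candidate_columns(columns, ["text", "ground"]))      # ['ground_truth.text']
```

Because the matching is substring-based, a single keyword list can catch several naming conventions ("doc", "document", "doc_path") at once.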

def pick_first_existing(row, candidates):
   for c in candidates:
       if c in row and pd.notna(row[c]):
           value = row[c]
           if isinstance(value, str) and value.strip():
               return value
           if not isinstance(value, str):
               return value
   return None


def normalize_text(x):
   if x is None or (isinstance(x, float) and math.isnan(x)):
       return ""
   x = str(x)
   x = re.sub(r"\s+", " ", x)
   return x.strip().lower()


def simple_text_similarity(a, b):
   a = normalize_text(a)
   b = normalize_text(b)
   if not a or not b:
       return None
   return fuzz.token_set_ratio(a, b) / 100


def locate_pdf_path(value):
   if value is None:
       return None
   value = str(value)
   candidates = []
   if value.endswith(".pdf"):
       candidates.append(value)
       candidates.extend([f for f in pdf_files if f.endswith(value.split("/")[-1])])
   else:
       candidates.extend([
           f for f in pdf_files
           if value in f or Path(f).stem in value or value in Path(f).stem
       ])
   return candidates[0] if candidates else None


def extract_pdf_text_from_hf(pdf_repo_path, max_pages=2):
   local_pdf = hf_hub_download(repo_id=DATASET_ID, filename=pdf_repo_path, repo_type="dataset")
   doc = fitz.open(local_pdf)
   texts = []
   for page_idx in range(min(max_pages, len(doc))):
       texts.append(doc[page_idx].get_text("text"))
   doc.close()
   return "\n".join(texts), local_pdf


def render_pdf_first_page(pdf_repo_path, zoom=2):
   local_pdf = hf_hub_download(repo_id=DATASET_ID, filename=pdf_repo_path, repo_type="dataset")
   doc = fitz.open(local_pdf)
   page = doc[0]
   pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
   out_path = WORKDIR / (Path(pdf_repo_path).stem + "_page1.png")
   pix.save(out_path)
   doc.close()
   return out_path


sample_records = df.sample(min(25, len(df)), random_state=42).to_dict("records")
pdf_candidates = []


for row in sample_records:
   for c in doc_cols:
       pdf_path = locate_pdf_path(row.get(c))
       if pdf_path:
           pdf_candidates.append((row["_dimension"], row["_row_id"], pdf_path))
           break


pdf_candidates = list(dict.fromkeys(pdf_candidates))
console.print(f"Detected {len(pdf_candidates)} PDF-linked sampled records")


if pdf_candidates:
   dim, row_id, pdf_path = pdf_candidates[0]
   console.print(Panel.fit(f"Rendering sample PDF\nDimension: {dim}\nRow: {row_id}\nPDF: {pdf_path}", style="bold yellow"))
   image_path = render_pdf_first_page(pdf_path)
   img = plt.imread(image_path)
   plt.figure(figsize=(10, 12))
   plt.imshow(img)
   plt.axis("off")
   plt.title(f"{dim}: {Path(pdf_path).name}")
   plt.show()
else:
   console.print("[yellow]No PDF-linked rows were detected from the sample.[/yellow]")

We define helper functions for text normalization, fuzzy similarity scoring, and PDF handling. We locate and download PDF files linked to dataset entries and extract their text content. We also render a sample PDF page for visual inspection of the document layout.
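The normalization-plus-similarity logic can be sketched with only the standard library, using `difflib.SequenceMatcher` as a stand-in for RapidFuzz (scores will differ from `fuzz.token_set_ratio`, but the shape of the pipeline is the same):

```python
import math
import re
from difflib import SequenceMatcher

def normalize_text(x):
    # Collapse whitespace and lowercase; treat None/NaN as empty
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return ""
    x = str(x)
    x = re.sub(r"\s+", " ", x)
    return x.strip().lower()

def simple_text_similarity(a, b):
    # Stdlib stand-in for fuzz.token_set_ratio(a, b) / 100
    a, b = normalize_text(a), normalize_text(b)
    if not a or not b:
        return None
    return SequenceMatcher(None, a, b).ratio()

# Line breaks, casing, and extra spaces are normalized away
score = simple_text_similarity("Quarterly  Revenue:\n$1.2M",
                               "quarterly revenue: $1.2m")
print(score)  # 1.0
```

Normalizing before scoring keeps the comparison focused on content rather than on whitespace or casing differences introduced by PDF extraction.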

preferred_gt_cols = [
   c for c in text_cols
   if any(k in c.lower() for k in ["ground", "expected", "target", "answer", "content", "text", "markdown", "reference"])
]


evaluation_rows = []
eval_sample = df.sample(min(50, len(df)), random_state=7).to_dict("records")


for row in tqdm(eval_sample, desc="Running lightweight PDF text extraction baseline"):
   pdf_path = None
   for c in doc_cols:
       pdf_path = locate_pdf_path(row.get(c))
       if pdf_path:
           break


   if not pdf_path:
       evaluation_rows.append({
           "dimension": row.get("_dimension"),
           "row_id": row.get("_row_id"),
           "pdf": None,
           "ground_truth_column": None,
           "similarity_score": None,
           "status": "no_pdf_detected"
       })
       continue


   gt_col = None
   gt = None
   for c in preferred_gt_cols:
       if c in row and pd.notna(row[c]):
           gt_col = c
           gt = row[c]
           break


   if gt is None:
       evaluation_rows.append({
           "dimension": row.get("_dimension"),
           "row_id": row.get("_row_id"),
           "pdf": pdf_path,
           "ground_truth_column": None,
           "similarity_score": None,
           "status": "no_reference_detected"
       })
       continue


   try:
       extracted, local_pdf = extract_pdf_text_from_hf(pdf_path, max_pages=2)
       score = simple_text_similarity(extracted, gt)
       evaluation_rows.append({
           "dimension": row.get("_dimension"),
           "row_id": row.get("_row_id"),
           "pdf": pdf_path,
           "ground_truth_column": gt_col,
           "similarity_score": score,
           "extracted_chars": len(extracted),
           "ground_truth_chars": len(str(gt)),
           "status": "scored"
       })
   except Exception as e:
       evaluation_rows.append({
           "dimension": row.get("_dimension"),
           "row_id": row.get("_row_id"),
           "pdf": pdf_path,
           "ground_truth_column": gt_col,
           "similarity_score": None,
           "status": "error",
           "error": str(e)
       })


eval_df = pd.DataFrame(evaluation_rows)


if eval_df.empty:
   eval_df = pd.DataFrame(columns=[
       "dimension", "row_id", "pdf", "ground_truth_column",
       "similarity_score", "extracted_chars", "ground_truth_chars",
       "status", "error"
   ])


display(eval_df.head(30))


if "status" in eval_df.columns:
   display(eval_df["status"].value_counts().reset_index().rename(columns={"index": "status", "status": "count"}))


if not eval_df.empty and "similarity_score" in eval_df.columns:
   valid_eval = eval_df.dropna(subset=["similarity_score"])


   if len(valid_eval):
       console.print(f"Average lightweight text similarity: {valid_eval['similarity_score'].mean():.3f}")


       plt.figure(figsize=(8, 5))
       plt.hist(valid_eval["similarity_score"], bins=10)
       plt.title("Lightweight Baseline Similarity Distribution")
       plt.xlabel("RapidFuzz Token Set Similarity")
       plt.ylabel("Count")
       plt.show()


       per_dim = valid_eval.groupby("dimension")["similarity_score"].mean().reset_index()
       display(per_dim)


       plt.figure(figsize=(9, 5))
       plt.bar(per_dim["dimension"], per_dim["similarity_score"])
       plt.title("Average Baseline Similarity by Dimension")
       plt.xlabel("Dimension")
       plt.ylabel("Average Similarity")
       plt.xticks(rotation=30, ha="right")
       plt.show()
   else:
       console.print("[yellow]No valid similarity scores were produced. This usually means sampled rows did not contain both detectable PDFs and reference text.[/yellow]")
else:
   console.print("[yellow]No similarity_score column found.[/yellow]")

We run a lightweight evaluation pipeline by comparing the extracted PDF text with the available reference fields. We compute similarity scores and analyze how well plain-text extraction performs across the different dimensions. We also visualize the results to understand performance trends and limitations.
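The aggregation at the end of the loop boils down to counting statuses and averaging the valid scores. A stdlib-only sketch over hypothetical evaluation rows:

```python
from collections import Counter
from statistics import mean

# Hypothetical evaluation rows, mirroring the eval_df schema
rows = [
    {"dimension": "text",   "similarity_score": 0.91, "status": "scored"},
    {"dimension": "text",   "similarity_score": 0.72, "status": "scored"},
    {"dimension": "tables", "similarity_score": None, "status": "no_pdf_detected"},
]

# Count outcomes, then average only the rows that produced a score
statuses = Counter(r["status"] for r in rows)
valid = [r["similarity_score"] for r in rows if r["similarity_score"] is not None]

print(statuses["scored"])      # 2
print(round(mean(valid), 3))   # 0.815
```

Separating the status tally from the score average matters: rows with no detected PDF or reference should lower coverage, not drag down the similarity metric.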

def inspect_dimension(dimension_name, n=3):
   rows = dimension_data.get(dimension_name, [])
   console.print(Panel.fit(f"Inspecting {dimension_name}: {len(rows)} rows", style="bold magenta"))
   for idx, row in enumerate(rows[:n]):
       console.print(f"\n[bold]Example {idx}[/bold]")
       console.print(json.dumps(row, indent=2)[:2500])


for dim in list(dimension_data.keys())[:5]:
   inspect_dimension(dim, n=1)


def make_parsebench_subset(dimension=None, n=20, seed=123):
   subset = df.copy()
   if dimension:
       subset = subset[subset["_dimension"] == dimension]
   if len(subset) == 0:
       return subset
   return subset.sample(min(n, len(subset)), random_state=seed)


subset = make_parsebench_subset(n=20)
display(subset.head())


def create_llm_parser_prompt(row):
   dimension = row.get("_dimension", "unknown")
   candidate_truth = pick_first_existing(row, preferred_gt_cols)
   rule_hint = pick_first_existing(row, rule_cols)


   prompt = f"""
You are evaluating a document parser on ParseBench.


Dimension:
{dimension}


Task:
Parse the PDF page into a structured representation that preserves the information needed for agentic workflows.


Relevant benchmark hint or rule:
{rule_hint if rule_hint is not None else "No obvious rule field detected."}


Reference field preview:
{str(candidate_truth)[:1000] if candidate_truth is not None else "No obvious reference field detected."}


Return:
1. Markdown representation
2. Extracted tables as JSON arrays when tables exist
3. Extracted chart values as JSON when charts exist
4. Layout-sensitive notes when visual grounding matters
"""
   return textwrap.dedent(prompt).strip()


prompt_examples = []
if len(subset):
   for _, row in subset.head(3).iterrows():
       prompt_examples.append(create_llm_parser_prompt(row.to_dict()))


if prompt_examples:
   console.print(Panel.fit("Example prompt for testing an external OCR or VLM parser", style="bold blue"))
   console.print(prompt_examples[0])
else:
   console.print("[yellow]No prompt examples could be created because the subset is empty.[/yellow]")


def compare_parser_outputs(reference, candidate):
   return {
       "token_set_similarity": simple_text_similarity(reference, candidate),
       "partial_ratio": fuzz.partial_ratio(normalize_text(reference), normalize_text(candidate)) / 100 if reference and candidate else None,
       "candidate_length": len(str(candidate)) if candidate else 0,
       "reference_length": len(str(reference)) if reference else 0
   }


if not eval_df.empty and "similarity_score" in eval_df.columns:
   scored_eval = eval_df.dropna(subset=["similarity_score"])


   if len(scored_eval):
       best = scored_eval.sort_values("similarity_score", ascending=False).head(1)
       worst = scored_eval.sort_values("similarity_score", ascending=True).head(1)


       console.print(Panel.fit("Best lightweight baseline example", style="bold green"))
       display(best)


       console.print(Panel.fit("Worst lightweight baseline example", style="bold red"))
       display(worst)
   else:
       console.print("[yellow]No valid similarity scores were available for best/worst comparison.[/yellow]")


output_path = WORKDIR / "parsebench_flattened_sample.csv"
df.head(500).to_csv(output_path, index=False)
console.print(f"Saved flattened sample to: {output_path}")


console.print(Panel.fit("""
Tutorial complete.


What we built:
1. Load ParseBench files directly from Hugging Face.
2. Inspect benchmark dimensions and schemas.
3. Flatten records into a dataframe.
4. Detect linked PDFs and render sample pages when possible.
5. Run a lightweight PyMuPDF extraction baseline.
6. Score extracted text when reference fields are available.
7. Generate reusable prompts for OCR, VLM, and document parser evaluation.
""", style="bold green"))

We inspect sample records and create subsets for closer examination. We generate structured prompts for testing external parsing systems, such as OCR and vision language models. We also compare outputs, identify the best and worst cases, and save the processed data for future use.
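Identifying the best and worst cases reduces to a sort (or a max/min) over the scored rows. A minimal sketch with hypothetical scores:

```python
# Hypothetical scored rows, mirroring the scored_eval dataframe
scored = [
    {"row_id": 0, "similarity_score": 0.42},
    {"row_id": 1, "similarity_score": 0.88},
    {"row_id": 2, "similarity_score": 0.61},
]

# Equivalent to sort_values(...).head(1) on the dataframe
best = max(scored, key=lambda r: r["similarity_score"])
worst = min(scored, key=lambda r: r["similarity_score"])

print(best["row_id"], worst["row_id"])  # 1 0
```

Surfacing the extremes like this is a quick way to spot which document types a plain-text baseline handles well and where OCR or a vision model is needed.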

In conclusion, we built a complete workflow that allows us to load, inspect, and benchmark document parsing using the ParseBench dataset. We extracted and compared text content, and we created structured prompts for testing external parsing systems, such as OCR and VLM engines. This approach helps us move beyond simple text output toward agent-friendly representations that preserve structure, layout, and semantic meaning. We have established a solid foundation that we can extend to larger scale, stronger parsing models, and the integration of document understanding into real-world AI pipelines.



The post A Coding Implementation of Document Parsing Benchmarking with LlamaIndex ParseBench Using Python, Hugging Face, and Test Benchmarks appeared first on MarkTechPost.
