A Step-by-Step Guide to Building a Complete Pipeline for PII Detection and Redaction with OpenAI's Privacy Filter

In this tutorial, we build a complete, production-style pipeline to detect and redact personally identifiable information using OpenAI's Privacy Filter. We start by setting up the environment and loading a token classification model that identifies multiple categories of sensitive data, including names, emails, phone numbers, addresses, and passwords. We then design helper functions to normalize labels, extract structured spans, and convert raw model output into usable formats. From there, we apply a configurable redaction system that replaces sensitive entities with typed placeholders, preserving privacy while keeping the surrounding context readable. Throughout the process, we test the pipeline on sample texts, convert the output into structured data frames, and extend the system toward batch processing and real-world use.

!pip install -q -U transformers accelerate torch pandas matplotlib huggingface_hub


import os, re, json, time, textwrap, warnings
from pathlib import Path
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline


warnings.filterwarnings("ignore")


MODEL_ID = "openai/privacy-filter"
OUT_DIR = Path("/content/privacy_filter_outputs")
OUT_DIR.mkdir(parents=True, exist_ok=True)


device = 0 if torch.cuda.is_available() else -1
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32


print("Device:", "GPU" if torch.cuda.is_available() else "CPU")
print("Torch dtype:", torch_dtype)
print("Model:", MODEL_ID)

We install the required libraries and configure the pipeline's runtime environment. We set up device and dtype selection and create the output directory. We also print the system information to make sure everything is ready before loading the model.

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
   MODEL_ID,
   torch_dtype=torch_dtype,
   device_map="auto" if torch.cuda.is_available() else None
)


classifier = pipeline(
   task="token-classification",
   model=model,
   tokenizer=tokenizer,
   aggregation_strategy="simple",
   device=device if not torch.cuda.is_available() else None  # device_map="auto" already placed the model on GPU
)


LABEL_MASKS = {
   "account_number": "[ACCOUNT_NUMBER]",
   "private_address": "[PRIVATE_ADDRESS]",
   "private_email": "[PRIVATE_EMAIL]",
   "private_person": "[PRIVATE_PERSON]",
   "private_phone": "[PRIVATE_PHONE]",
   "private_url": "[PRIVATE_URL]",
   "private_date": "[PRIVATE_DATE]",
   "secret": "[SECRET]"
}

We load the tokenizer and the token classification model, then wrap them in a Hugging Face pipeline with simple aggregation so that adjacent sub-tokens merge into full entity spans. We also define a mapping from each PII label to a typed placeholder mask that the redaction step uses.
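
Before writing the helpers, it is worth a quick look at the raw pipeline output. With aggregation_strategy="simple", the pipeline returns one dictionary per merged entity, carrying entity_group, score, word, start, and end fields, and these are exactly the fields the helper functions below consume. A minimal peek, using an illustrative sentence (the printed values depend on the model):

# Inspect the raw pipeline output; the sentence is illustrative and each
# item is a dict of the form:
#   {"entity_group": "private_person", "score": 0.97,
#    "word": "Alice Smith", "start": 11, "end": 22}
preview = classifier("My name is Alice Smith and I live in Berlin.")
for item in preview:
   print(item)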

def normalize_label(label):
   label = label.replace("B-", "").replace("I-", "").replace("E-", "").replace("S-", "")
   return label.strip()


def detect_pii(text):
   raw = classifier(text)
   spans = []
   for item in raw:
       label = normalize_label(item.get("entity_group", item.get("entity", "")))
       if label == "O" or not label:
           continue
       spans.append({
           "label": label,
           "score": float(item["score"]),
           "text": item["word"],
           "start": int(item["start"]),
           "end": int(item["end"])
       })
   spans = sorted(spans, key=lambda x: (x["start"], x["end"]))
   return spans


def redact_text(text, spans, min_score=0.50, mode="typed"):
   filtered = [s for s in spans if s["score"] >= min_score]
   filtered = sorted(filtered, key=lambda x: x["start"], reverse=True)
   redacted = text
   for span in filtered:
       replacement = LABEL_MASKS.get(span["label"], "[PII]") if mode == "typed" else "[REDACTED]"
       redacted = redacted[:span["start"]] + replacement + redacted[span["end"]:]
   return redacted


def privacy_report(text, min_score=0.50):
   spans = detect_pii(text)
   redacted = redact_text(text, spans, min_score=min_score)
   return {
       "original_text": text,
       "redacted_text": redacted,
       "span_count": len([s for s in spans if s["score"] >= min_score]),
       "spans": [s for s in spans if s["score"] >= min_score]
   }

We define helper functions to normalize labels and extract PII spans from the model predictions. We implement a redaction function that replaces sensitive segments once they clear a confidence threshold. We combine everything into a single reporting function that returns structured output.
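
One design detail deserves a note: redact_text sorts the filtered spans by start offset in reverse and replaces from right to left, so substituting a placeholder of a different length never invalidates the start/end offsets of the spans still waiting to be replaced. A minimal sketch with hand-built spans (hypothetical values, not actual model output) shows the behavior:

# Hand-built spans (hypothetical, not produced by the model) illustrating
# why right-to-left replacement keeps the remaining offsets valid.
demo = "Email alice@x.com or call 555-0189."
demo_spans = [
   {"label": "private_email", "score": 0.99, "text": "alice@x.com", "start": 6, "end": 17},
   {"label": "private_phone", "score": 0.98, "text": "555-0189", "start": 26, "end": 34},
]
print(redact_text(demo, demo_spans))  # Email [PRIVATE_EMAIL] or call [PRIVATE_PHONE].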

sample_texts = [
   "My name is Alice Smith and my email is alice.smith@example.com. Call me at +1 415 555 0189.",
   "Patient Rohan Mehta visited on 2025-04-11 and lives at 221B Baker Street, London.",
   "Use API key sk-test-51HxYzDemoSecret987 and send the invoice to billing@example.com.",
   "The public website is https://example.com but Jane Doe's private portal is https://portal.example.com/jane-doe.",
   "Account number 123456789012 was linked to Ahmed Khan on 12 March 2024.",
   "This sentence has no private information and should mostly remain unchanged."
]


reports = []
for i, text in enumerate(sample_texts, 1):
   report = privacy_report(text, min_score=0.50)
   report["example_id"] = i
   reports.append(report)


for r in reports:
   print("n" + "=" * 100)
   print("Example:", r["example_id"])
   print("Original:", r["original_text"])
   print("Redacted:", r["redacted_text"])
   print("Detected spans:")
   print(json.dumps(r["spans"], indent=2, ensure_ascii=False))


rows = []
for r in reports:
   for s in r["spans"]:
       rows.append({
           "example_id": r["example_id"],
           "label": s["label"],
           "score": s["score"],
           "detected_text": s["text"],
           "start": s["start"],
           "end": s["end"],
           "original_text": r["original_text"],
           "redacted_text": r["redacted_text"]
       })


df = pd.DataFrame(rows)
display(df)

We create sample inputs and run them through the pipeline to test detection and redaction. We collect structured results and print both the original and redacted text for comparison. We also convert the output into a data frame for easier analysis.
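
As a small extension, we can pivot the same data frame into a per-example count of each detected label. This assumes df was populated above, so we guard it the same way the plotting section does:

if len(df):
   # Per-example counts of each detected PII label.
   summary = df.groupby(["example_id", "label"]).size().unstack(fill_value=0)
   display(summary)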

json_path = OUT_DIR / "privacy_filter_reports.json"
csv_path = OUT_DIR / "privacy_filter_spans.csv"


with open(json_path, "w", encoding="utf-8") as f:
   json.dump(reports, f, indent=2, ensure_ascii=False)


df.to_csv(csv_path, index=False)


print("nSaved JSON:", json_path)
print("Saved CSV:", csv_path)


if len(df):
   label_counts = df["label"].value_counts()
   plt.figure(figsize=(10, 5))
   label_counts.plot(kind="bar")
   plt.title("Detected PII Categories")
   plt.xlabel("PII Category")
   plt.ylabel("Detected Span Count")
   plt.xticks(rotation=35, ha="right")
   plt.tight_layout()
   plt.show()


   plt.figure(figsize=(10, 5))
   df["score"].plot(kind="hist", bins=10)
   plt.title("Detection Confidence Distribution")
   plt.xlabel("Confidence Score")
   plt.ylabel("Frequency")
   plt.tight_layout()
   plt.show()


def compare_thresholds(text, thresholds=(0.30, 0.50, 0.70, 0.90)):
   spans = detect_pii(text)
   results = []
   for threshold in thresholds:
       kept = [s for s in spans if s["score"] >= threshold]
       results.append({
           "threshold": threshold,
           "span_count": len(kept),
           "redacted_text": redact_text(text, spans, min_score=threshold)
       })
   return pd.DataFrame(results)


threshold_demo = compare_thresholds(sample_texts[0])
display(threshold_demo)

We save the processed output in JSON and CSV formats for reuse. We visualize the detected PII categories and the confidence distribution with plots. We also analyze how changing the threshold affects detection and redaction.
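
A natural refinement is per-label thresholds instead of one global cutoff, for example a lower bar for secret, where a miss is costly, and a higher bar for private_date, where false positives are more common. The threshold values below are illustrative assumptions, not tuned settings:

# Hypothetical per-label thresholds; the values are illustrative, not tuned.
PER_LABEL_MIN_SCORE = {"secret": 0.30, "account_number": 0.40, "private_date": 0.70}

def redact_with_label_thresholds(text, default_min_score=0.50):
   spans = detect_pii(text)
   kept = [s for s in spans
           if s["score"] >= PER_LABEL_MIN_SCORE.get(s["label"], default_min_score)]
   # Reuse redact_text; min_score=0.0 keeps every span we pre-filtered.
   return redact_text(text, kept, min_score=0.0)

print(redact_with_label_thresholds(sample_texts[2]))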

long_document = """
Customer Support Transcript:
Agent: Hello, may I confirm your name?
Customer: My name is PSP.
Agent: Thanks. Could you confirm your email?
Customer: psp@example.com.
Agent: And your phone number?
Customer: +91 xxxxx xxxxx.
Agent: Your service address is 45 MG Road, Bengaluru, Karnataka.
Customer: Yes. Also, my backup email is psp.backup@example.com.
Agent: Please do not share passwords or OTPs.
Customer: The temporary token I received is ghp_demoSecretToken123456.
"""


long_report = privacy_report(long_document, min_score=0.50)


print("nLONG DOCUMENT REDACTION")
print("=" * 100)
print(long_report["redacted_text"])
print("nStructured spans:")
print(json.dumps(long_report["spans"], indent=2, ensure_ascii=False))


def pii_audit_table(texts, min_score=0.50):
   audit_rows = []
   for idx, text in enumerate(texts, 1):
       result = privacy_report(text, min_score=min_score)
       labels = Counter([s["label"] for s in result["spans"]])
       audit_rows.append({
           "id": idx,
           "original_chars": len(text),
           "redacted_chars": len(result["redacted_text"]),
           "span_count": result["span_count"],
           "labels_found": dict(labels),
           "redacted_text": result["redacted_text"]
       })
   return pd.DataFrame(audit_rows)


audit_df = pii_audit_table(sample_texts + [long_document], min_score=0.50)
display(audit_df)


audit_path = OUT_DIR / "privacy_filter_audit.csv"
audit_df.to_csv(audit_path, index=False)
print("Saved audit CSV:", audit_path)


custom_text = input("\nEnter your own text for PII redaction, or press Enter to skip:\n")


if custom_text.strip():
   custom_report = privacy_report(custom_text, min_score=0.50)
   print("nOriginal:")
   print(custom_report["original_text"])
   print("nRedacted:")
   print(custom_report["redacted_text"])
   print("nSpans:")
   print(json.dumps(custom_report["spans"], indent=2, ensure_ascii=False))
else:
   print("Skipped custom input.")


print("nTutorial complete.")

We run the pipeline on a long, realistic document to check its robustness. We build an audit-style summary that shows the statistics and categories of detected PII. We also accept custom user input so the privacy filter can be tried interactively.
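
Finally, to fold the filter into a batch data pipeline, we can map privacy_report over a pandas column. A minimal sketch, where the "message" column name is a hypothetical stand-in for whatever field holds free text:

# Batch redaction over a pandas column (the "message" column is hypothetical).
batch_df = pd.DataFrame({"message": sample_texts})
batch_df["redacted"] = batch_df["message"].apply(
   lambda t: privacy_report(t, min_score=0.50)["redacted_text"]
)
display(batch_df)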

In conclusion, we have developed a robust and extensible privacy-filtering workflow that goes beyond simple detection. We systematically inspected model predictions, applied confidence thresholds, and compared different redaction strategies to understand their impact. We also produced structured reports, visualized detection patterns, and exported results in JSON and CSV formats for downstream use and integration. This approach lets us build reliable privacy protections into data pipelines, ensuring that sensitive information is consistently identified and handled responsibly while preserving the usability of the underlying data.

