
Microsoft's Presidio: A Step-by-Step Guide to Detecting and Anonymizing PII in Free-Form Text

In this tutorial, we will explore how to use Microsoft's Presidio, an open-source framework for detecting, analyzing, and anonymizing personally identifiable information (PII) in free-form text. Built on top of the efficient spaCy NLP library, Presidio is both lightweight and modular, making it easy to integrate into real-time applications and pipelines.

We will cover how to:

  • Set up and install the required Presidio packages
  • Detect common PII entities such as names, phone numbers, and credit card details
  • Define custom recognizers for domain-specific entities (e.g., PAN, Aadhaar)
  • Create and register a custom anonymizer (such as hashing or pseudonymization)
  • Reuse anonymization mappings for consistent re-anonymization

Installing libraries

To start with Presidio, you will need to install the following key libraries:

  • presidio-analyzer: This is the core library for detecting PII entities in text using built-in and custom recognizers.
  • presidio-anonymizer: This library provides tools to anonymize (e.g., redact, replace, hash) the detected PII using configurable operators.
  • spaCy NLP model (en_core_web_lg): Presidio uses spaCy under the hood for NLP tasks such as named entity recognition. The en_core_web_lg model offers high-accuracy results and is recommended for English-language PII detection.

pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

You may need to restart the session to load the libraries if you are using Jupyter/Colab.
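Before moving on, it can help to confirm that the packages are actually visible to the current interpreter. The following is a minimal sketch using only the standard library; the helper name `check_installation` is hypothetical, and the module names are assumed to be the import names that correspond to the pip packages above.

```python
from importlib.util import find_spec

# Import names assumed to correspond to the packages installed above
REQUIRED_MODULES = ["presidio_analyzer", "presidio_anonymizer", "spacy", "en_core_web_lg"]


def check_installation(modules=REQUIRED_MODULES):
    """Return {module_name: bool} saying whether each module can be located."""
    status = {}
    for name in modules:
        try:
            status[name] = find_spec(name) is not None
        except (ImportError, ValueError):
            # Treat any lookup problem as "not available"
            status[name] = False
    return status


for module_name, available in check_installation().items():
    print(f"{module_name}: {'found' if available else 'MISSING'}")
```

If any entry is reported as missing after installation, restarting the notebook session is usually the first thing to try.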

Presidio Analyzer

Basic PII Detection

In this block, we initialize the Presidio Analyzer engine and run a basic analysis to detect a U.S. phone number in a sample text. We also suppress lower-level log warnings from the Presidio library for cleaner output.

The AnalyzerEngine loads spaCy's NLP pipeline and the predefined recognizers to scan the input text for sensitive entities. In this example, we specify PHONE_NUMBER as the target entity.

import logging
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)

from presidio_analyzer import AnalyzerEngine

# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(text="My phone number is 212-555-5555",
                           entities=["PHONE_NUMBER"],
                           language="en")
print(results)
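Each result printed above describes an entity type together with the start and end character offsets of the match and a confidence score. To make the offsets concrete, here is a small standard-library sketch that locates the same phone number with a deliberately simplified regex and slices it back out of the text. The `Span` class and `find_phone_spans` function are hypothetical stand-ins for illustration, not Presidio's own classes or detection logic.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for an analyzer result (no score, for brevity)."""
    entity_type: str
    start: int
    end: int


def find_phone_spans(text):
    # Simplified assumption: only matches the NNN-NNN-NNNN layout
    pattern = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
    return [Span("PHONE_NUMBER", m.start(), m.end()) for m in pattern.finditer(text)]


text = "My phone number is 212-555-5555"
spans = find_phone_spans(text)
for span in spans:
    # Offsets are half-open: text[start:end] is exactly the matched value
    print(span, "->", text[span.start:span.end])
```

The same slicing, `text[result.start:result.end]`, applies to the results returned by the analyzer.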

Creating a Custom PII Recognizer with a Deny List (Academic Titles)

This code block shows how to create a custom PII recognizer in Presidio using a simple deny list, which is ideal for detecting fixed terms such as academic titles (e.g., "Dr.", "Prof.").
While this tutorial covers only the deny-list approach, Presidio also supports regex-based patterns, NLP models, and external recognizers. For those advanced methods, refer to the official documentation: Adding Custom Recognizers.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, RecognizerRegistry

# Step 1: Create a custom pattern recognizer using deny_list
academic_title_recognizer = PatternRecognizer(
    supported_entity="ACADEMIC_TITLE",
    deny_list=["Dr.", "Dr", "Professor", "Prof."]
)

# Step 2: Add it to a registry
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizer(academic_title_recognizer)

# Step 3: Create analyzer engine with the updated registry
analyzer = AnalyzerEngine(registry=registry)

# Step 4: Analyze text
text = "Prof. John Smith is meeting with Dr. Alice Brown."
results = analyzer.analyze(text=text, language="en")

for result in results:
    print(result)
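Conceptually, a deny list is a set of fixed strings to look for in the text. The sketch below shows one way such matching could be done in plain Python by turning the list into a single regular expression. It is an illustration under that assumption and not a description of how Presidio implements deny lists internally; `build_deny_list_pattern` and `find_terms` are hypothetical names.

```python
import re


def build_deny_list_pattern(deny_list):
    # Try longer terms first so "Dr." is preferred over "Dr"
    terms = sorted(deny_list, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    # The lookarounds stop a term from matching inside a longer word
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")


def find_terms(text, deny_list, entity_type="ACADEMIC_TITLE"):
    pattern = build_deny_list_pattern(deny_list)
    return [(entity_type, m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


text = "Prof. John Smith is meeting with Dr. Alice Brown."
matches = find_terms(text, ["Dr.", "Dr", "Professor", "Prof."])
for match in matches:
    print(match)
```

Escaping each term with `re.escape` matters here because entries such as "Dr." contain a dot, which would otherwise match any character.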

Presidio Anonymizer

This code block shows how to use the Presidio Anonymizer engine to anonymize PII entities detected in a given text. In this example, we manually define two PERSON entities using RecognizerResult, simulating the output of the Presidio Analyzer. These entities represent the names “Bond” and “James Bond” in the sample text.

We use the “replace” operator to substitute both names with a placeholder value (“BIP”), which hides the sensitive data. This is done by passing an OperatorConfig with the desired anonymization strategy (replace) to the AnonymizerEngine.

This pattern can easily be extended to apply other built-in operators such as “redact” and “hash”, or custom pseudonymization strategies.

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

# Initialize the engine:
engine = AnonymizerEngine()

# Invoke the anonymize function with the text, 
# analyzer results (potentially coming from presidio-analyzer) and
# Operators to get the anonymization output:
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})},
)

print(result)
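The start and end offsets in the example (11 to 15 and 17 to 27) are character positions in the original string. A rough plain-Python sketch of offset-based replacement is shown below; it illustrates the idea only and is not Presidio's implementation. The function name `replace_spans` is hypothetical, and the spans are assumed not to overlap.

```python
def replace_spans(text, spans, new_value):
    """Replace each (start, end) span in text with new_value.

    Spans are processed from the end of the string backwards so that
    earlier offsets stay valid after each substitution.
    """
    result = text
    for start, end in sorted(spans, reverse=True):
        result = result[:start] + new_value + result[end:]
    return result


text = "My name is Bond, James Bond"
anonymized = replace_spans(text, [(11, 15), (17, 27)], "BIP")
print(anonymized)  # My name is BIP, BIP
```

Working right to left avoids having to recompute offsets when the replacement has a different length than the original value.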

Custom Entity Recognition, Hash-Based Anonymization, and Consistent Re-Anonymization with Presidio

In this example, we take Presidio a step further by demonstrating:

  • ✅ Defining custom PII entities (e.g., Aadhaar and PAN numbers) using regex-based PatternRecognizers
  • 🔐 Anonymizing sensitive data using a custom hash-based operator (ReAnonymizer)
  • ♻️ Re-anonymizing the same values consistently across multiple texts by maintaining a mapping of original to hashed values

We implement a custom ReAnonymizer operator that checks whether a given value has already been hashed and reuses the same output to maintain consistency. This is especially useful when anonymized data needs to retain some utility, for example, linking records by pseudonymous IDs.

Define a Custom Hash-Based Anonymizer (ReAnonymizer)

This block defines a custom operator called ReAnonymizer that uses SHA-256 hashing to anonymize entities and ensures that the same input always gets the same anonymized output by storing the hashes in a shared mapping.

from presidio_anonymizer.operators import Operator, OperatorType
import hashlib
from typing import Dict

class ReAnonymizer(Operator):
    """
    Anonymizer that replaces text with a reusable SHA-256 hash,
    stored in a shared mapping dict.
    """

    def operate(self, text: str, params: Dict = None) -> str:
        entity_type = params.get("entity_type", "DEFAULT")
        mapping = params.get("entity_mapping")

        if mapping is None:
            raise ValueError("Missing `entity_mapping` in params")

        # Check if already hashed
        if entity_type in mapping and text in mapping[entity_type]:
            return mapping[entity_type][text]

        # Hash the value with SHA-256 and store it so later calls reuse it
        hashed = hashlib.sha256(text.encode("utf-8")).hexdigest()
        mapping.setdefault(entity_type, {})[text] = hashed
        return hashed

    def validate(self, params: Dict = None) -> None:
        if "entity_mapping" not in params:
            raise ValueError("You must pass an 'entity_mapping' dictionary.")

    def operator_name(self) -> str:
        return "reanonymizer"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize
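The core of this operator is a lookup-then-hash step over a shared dictionary keyed by entity type. That logic can be tried out on its own, without Presidio, as in the sketch below. The function name `pseudonymize` is hypothetical, and using the full SHA-256 hex digest as the replacement token is an assumption made for illustration.

```python
import hashlib


def pseudonymize(value, entity_type, mapping):
    """Return a stable SHA-256 token for value, caching it in mapping."""
    bucket = mapping.setdefault(entity_type, {})
    if value not in bucket:
        # First time this value is seen: hash it and remember the result
        bucket[value] = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return bucket[value]


mapping = {}
first = pseudonymize("ABCDE1234F", "IND_PAN", mapping)
second = pseudonymize("ABCDE1234F", "IND_PAN", mapping)
other = pseudonymize("1234-5678-9123", "AADHAAR", mapping)

print(first == second)  # True: same value, same token
print(first == other)   # False: different values, different tokens
```

Because SHA-256 is deterministic, the mapping is not strictly required for consistency; what it adds is a record of which original value produced which token. An unsalted hash of a short, structured identifier can be guessed by brute force, so a keyed hash (for example via the standard `hmac` module) may be a better fit where that matters.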

Define Custom PII Recognizers for PAN and Aadhaar Numbers

We define two custom regex-based PatternRecognizers, one for Indian PAN numbers and one for Aadhaar numbers. These will detect the custom PII entities in your text.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern

# Define custom recognizers
pan_recognizer = PatternRecognizer(
    supported_entity="IND_PAN",
    name="PAN Recognizer",
    patterns=[Pattern(name="pan", regex=r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", score=0.8)],
    supported_language="en"
)

aadhaar_recognizer = PatternRecognizer(
    supported_entity="AADHAAR",
    name="Aadhaar Recognizer",
    patterns=[Pattern(name="aadhaar", regex=r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b", score=0.8)],
    supported_language="en"
)
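Regular expressions like these are easy to get subtly wrong, so it can be worth checking them in isolation before wiring them into a recognizer. The sketch below exercises the two patterns with Python's `re` module, written with `\b` word boundaries and `\d` digit classes. The constant names are hypothetical, and the patterns only check the shape of the identifiers, not whether they are valid PAN or Aadhaar numbers.

```python
import re

# Five uppercase letters, four digits, one uppercase letter
PAN_REGEX = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
# Three groups of four digits, optionally separated by a hyphen or space
AADHAAR_REGEX = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b")

sample = "My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123."
pan_hits = PAN_REGEX.findall(sample)
aadhaar_hits = AADHAAR_REGEX.findall(sample)

print(pan_hits)      # ['ABCDE1234F']
print(aadhaar_hits)  # ['1234-5678-9123']
```

Without the backslashes, `b` and `d` would be treated as literal letters and neither pattern would match the sample text.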

Set Up the Analyzer and Anonymizer Engines

Here we set up the AnalyzerEngine, register the custom recognizers, and add the custom anonymizer to the AnonymizerEngine.

from presidio_anonymizer import AnonymizerEngine, OperatorConfig

# Initialize analyzer and register custom recognizers
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(pan_recognizer)
analyzer.registry.add_recognizer(aadhaar_recognizer)

# Initialize anonymizer and add custom operator
anonymizer = AnonymizerEngine()
anonymizer.add_anonymizer(ReAnonymizer)

# Shared mapping dictionary for consistent re-anonymization
entity_mapping = {}

Analyze and Anonymize Input Texts

We analyze two separate texts that both contain the same PAN and Aadhaar values. The custom operator ensures they are anonymized consistently across both inputs.

from pprint import pprint

# Example texts
text1 = "My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123."
text2 = "His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F."

# Analyze and anonymize first text
results1 = analyzer.analyze(text=text1, language="en")
anon1 = anonymizer.anonymize(
    text1,
    results1,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)

# Analyze and anonymize second text
results2 = analyzer.analyze(text=text2, language="en")
anon2 = anonymizer.anonymize(
    text2,
    results2,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)

View Anonymization Results and Mapping

Finally, we print both anonymized outputs and inspect the mapping used internally to keep the hashes consistent across values.

print("📄 Original 1:", text1)
print("🔐 Anonymized 1:", anon1.text)
print("📄 Original 2:", text2)
print("🔐 Anonymized 2:", anon2.text)

print("\n📦 Mapping used:")
pprint(entity_mapping)



I am a civil engineering graduate (2022) from Jamia Millia Islamia, New Delhi, and I have a keen interest in data science, especially neural networks and their applications in various areas.
