Getting Began with Microsoft’s Presidio: A Step-by-Step Information to Detecting and Anonymizing Personally Identifiable Data PII in Textual content

On this tutorial, we’ll discover use Microsoft’s Presidio, an open-source framework designed for detecting, analyzing, and anonymizing personally identifiable data (PII) in free-form textual content. Constructed on high of the environment friendly spaCy NLP library, Presidio is each light-weight and modular, making it simple to combine into real-time purposes and pipelines.

We are going to cowl :

Arrange and set up the mandatory Presidio packages

Detect widespread PII entities resembling names, cellphone numbers, and bank card particulars

Outline customized recognizers for domain-specific entities (e.g., PAN, Aadhaar)

Create and register customized anonymizers (like hashing or pseudonymization)

Reuse anonymization mappings for constant re-anonymization

Putting in the libraries

To get began with Presidio, you’ll want to put in the next key libraries:

presidio-analyzer: That is the core library accountable for detecting PII entities in textual content utilizing built-in and customized recognizers.

presidio-anonymizer: This library gives instruments to anonymize (e.g., redact, substitute, hash) the detected PII utilizing configurable operators.

spaCy NLP mannequin (en_core_web_lg): Presidio makes use of spaCy below the hood for pure language processing duties like named entity recognition. The en_core_web_lg mannequin gives high-accuracy outcomes and is advisable for English-language PII detection.

pip set up presidio-analyzer presidio-anonymizer
python -m spacy obtain en_core_web_lg

You would possibly must restart the session to put in the libraries, if you’re utilizing Jupyter/Colab.

Presidio Analyzer

Fundamental PII Detection

On this block, we initialize the Presidio Analyzer Engine and run a fundamental evaluation to detect a U.S. cellphone quantity from a pattern textual content. We additionally suppress lower-level log warnings from the Presidio library for cleaner output.

The AnalyzerEngine hundreds spaCy’s NLP pipeline and predefined recognizers to scan the enter textual content for delicate entities. On this instance, we specify PHONE_NUMBER because the goal entity.

import logging
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)

from presidio_analyzer import AnalyzerEngine

# Arrange the engine, hundreds the NLP module (spaCy mannequin by default) and different PII recognizers
analyzer = AnalyzerEngine()

# Name analyzer to get outcomes
outcomes = analyzer.analyze(textual content="My cellphone quantity is 212-555-5555",
                           entities=["PHONE_NUMBER"],
                           language="en")
print(outcomes)

Making a Customized PII Recognizer with a Deny Checklist (Educational Titles)

This code block exhibits create a customized PII recognizer in Presidio utilizing a easy deny listing, supreme for detecting fastened phrases like educational titles (e.g., “Dr.”, “Prof.”). The recognizer is added to Presidio’s registry and utilized by the analyzer to scan enter textual content.
Whereas this tutorial covers solely the deny listing strategy, Presidio additionally helps regex-based patterns, NLP fashions, and exterior recognizers. For these superior strategies, seek advice from the official docs: Including Customized Recognizers.

Presidio Analyzer

Fundamental PII Detection

import logging
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)

from presidio_analyzer import AnalyzerEngine

# Arrange the engine, hundreds the NLP module (spaCy mannequin by default) and different PII recognizers
analyzer = AnalyzerEngine()

# Name analyzer to get outcomes
outcomes = analyzer.analyze(textual content="My cellphone quantity is 212-555-5555",
                           entities=["PHONE_NUMBER"],
                           language="en")
print(outcomes)

Making a Customized PII Recognizer with a Deny Checklist (Educational Titles)

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, RecognizerRegistry

# Step 1: Create a customized sample recognizer utilizing deny_list
academic_title_recognizer = PatternRecognizer(
    supported_entity="ACADEMIC_TITLE",
    deny_list=["Dr.", "Dr", "Professor", "Prof."]
)

# Step 2: Add it to a registry
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizer(academic_title_recognizer)

# Step 3: Create analyzer engine with the up to date registry
analyzer = AnalyzerEngine(registry=registry)

# Step 4: Analyze textual content
textual content = "Prof. John Smith is assembly with Dr. Alice Brown."
outcomes = analyzer.analyze(textual content=textual content, language="en")

for lead to outcomes:
    print(end result)

Presidio Anonymizer

This code block demonstrates use the Presidio Anonymizer Engine to anonymize detected PII entities in a given textual content. On this instance, we manually outline two PERSON entities utilizing RecognizerResult, simulating output from the Presidio Analyzer. These entities symbolize the names “Bond” and “James Bond” within the pattern textual content.

We use the “substitute” operator to substitute each names with a placeholder worth (“BIP”), successfully anonymizing the delicate information. That is finished by passing an OperatorConfig with the specified anonymization technique (substitute) to the AnonymizerEngine.

This sample may be simply prolonged to use different built-in operations like “redact”, “hash”, or customized pseudonymization methods.

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

# Initialize the engine:
engine = AnonymizerEngine()

# Invoke the anonymize operate with the textual content, 
# analyzer outcomes (doubtlessly coming from presidio-analyzer) and
# Operators to get the anonymization output:
end result = engine.anonymize(
    textual content="My title is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("substitute", {"new_value": "BIP"})},
)

print(end result)

Customized Entity Recognition, Hash-Based mostly Anonymization, and Constant Re-Anonymization with Presidio

On this instance, we take Presidio a step additional by demonstrating:

✅ Defining customized PII entities (e.g., Aadhaar and PAN numbers) utilizing regex-based PatternRecognizers

🔐 Anonymizing delicate information utilizing a customized hash-based operator (ReAnonymizer)

♻️ Re-anonymizing the identical values persistently throughout a number of texts by sustaining a mapping of unique → hashed values

We implement a customized ReAnonymizer operator that checks if a given worth has already been hashed and reuses the identical output to protect consistency. That is notably helpful when anonymized information must retain some utility — for instance, linking data by pseudonymous IDs.

Outline a Customized Hash-Based mostly Anonymizer (ReAnonymizer)

This block defines a customized Operator referred to as ReAnonymizer that makes use of SHA-256 hashing to anonymize entities and ensures the identical enter at all times will get the identical anonymized output by storing hashes in a shared mapping.

from presidio_anonymizer.operators import Operator, OperatorType
import hashlib
from typing import Dict

class ReAnonymizer(Operator):
    """
    Anonymizer that replaces textual content with a reusable SHA-256 hash,
    saved in a shared mapping dict.
    """

    def function(self, textual content: str, params: Dict = None) -> str:
        entity_type = params.get("entity_type", "DEFAULT")
        mapping = params.get("entity_mapping")

        if mapping is None:
            increase ValueError("Lacking `entity_mapping` in params")

        # Examine if already hashed
        if entity_type in mapping and textual content in mapping[entity_type]:
            return mapping[entity_type][text]

        # Hash and retailer
        hashed = ""
        mapping.setdefault(entity_type, {})[text] = hashed
        return hashed

    def validate(self, params: Dict = None) -> None:
        if "entity_mapping" not in params:
            increase ValueError("You could move an 'entity_mapping' dictionary.")

    def operator_name(self) -> str:
        return "reanonymizer"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize

Outline Customized PII Recognizers for PAN and Aadhaar Numbers

We outline two customized regex-based PatternRecognizers — one for Indian PAN numbers and one for Aadhaar numbers. These will detect customized PII entities in your textual content.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Sample

# Outline customized recognizers
pan_recognizer = PatternRecognizer(
    supported_entity="IND_PAN",
    title="PAN Recognizer",
    patterns=[Pattern(name="pan", regex=r"b[A-Z]{5}[0-9]{4}[A-Z]b", rating=0.8)],
    supported_language="en"
)

aadhaar_recognizer = PatternRecognizer(
    supported_entity="AADHAAR",
    title="Aadhaar Recognizer",
    patterns=[Pattern(name="aadhaar", regex=r"bd{4}[- ]?d{4}[- ]?d{4}b", rating=0.8)],
    supported_language="en"
)

Set Up Analyzer and Anonymizer Engines

Right here we arrange the Presidio AnalyzerEngine, register the customized recognizers, and add the customized anonymizer to the AnonymizerEngine.

from presidio_anonymizer import AnonymizerEngine, OperatorConfig

# Initialize analyzer and register customized recognizers
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(pan_recognizer)
analyzer.registry.add_recognizer(aadhaar_recognizer)

# Initialize anonymizer and add customized operator
anonymizer = AnonymizerEngine()
anonymizer.add_anonymizer(ReAnonymizer)

# Shared mapping dictionary for constant re-anonymization
entity_mapping = {}

Analyze and Anonymize Enter Texts

We analyze two separate texts that each embrace the identical PAN and Aadhaar values. The customized operator ensures they’re anonymized persistently throughout each inputs.

from pprint import pprint

# Instance texts
text1 = "My PAN is ABCDE1234F and Aadhaar quantity is 1234-5678-9123."
text2 = "His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F."

# Analyze and anonymize first textual content
results1 = analyzer.analyze(textual content=text1, language="en")
anon1 = anonymizer.anonymize(
    text1,
    results1,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)

# Analyze and anonymize second textual content
results2 = analyzer.analyze(textual content=text2, language="en")
anon2 = anonymizer.anonymize(
    text2,
    results2,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)

View Anonymization Outcomes and Mapping

Lastly, we print each anonymized outputs and examine the mapping used internally to keep up constant hashes throughout values.

print("📄 Unique 1:", text1)
print("🔐 Anonymized 1:", anon1.textual content)
print("📄 Unique 2:", text2)
print("🔐 Anonymized 2:", anon2.textual content)

print("n📦 Mapping used:")
pprint(entity_mapping)

Try the Codes. All credit score for this analysis goes to the researchers of this venture. Additionally, be happy to observe us on Twitter and don’t neglect to affix our 100k+ ML SubReddit and Subscribe to our E-newsletter.

I’m a Civil Engineering Graduate (2022) from Jamia Millia Islamia, New Delhi, and I’ve a eager curiosity in Information Science, particularly Neural Networks and their utility in numerous areas.