Anonymizing Production Data for Data Science with Mimesis

# Introduction
Production data is often subject to significant privacy and compliance restrictions. For this reason, anonymizing such data becomes critical in almost every real-world data science project that involves launching a data-driven product, service, or solution.
Mimesis is an open source Python library that stands out for its ability to generate realistic fake data very efficiently. It runs locally and is free to use, which makes it a practical building block for data pipelines. This article shows how to use the library to mask sensitive production data, based on a step-by-step example that you can easily try in your IDE or in a notebook environment.
# Step-by-Step Process
If you are new to Mimesis, you may need to install it in your Python environment with a command like `pip install mimesis`.
Remember to add `!` at the beginning of the pip command if you are working in a Google Colab notebook or a similar environment.
Now we are ready to start! We will consider a scenario built around a subscription system for a software product that offers several tiers. For convenience, we will create a small toy dataset that contains data about customers and their subscription type. Some of the dataset's variables hold very sensitive data, as you can see below:
```python
import pandas as pd

# Creation of a mock "production" customer dataset
production_data = {
    'user_id': [101, 102, 103, 104],
    'real_name': ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'Diana Prince'],
    'email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
    'phone': ['555-0100', '555-0101', '555-0102', '555-0103'],
    'subscription_tier': ['Premium', 'Basic', 'Basic', 'Enterprise']
}
df = pd.DataFrame(production_data)

print("--- Original Sensitive Data ---")
print(df.head())
```
While subscription tiers are not sensitive data in our example, names, emails, and phone numbers are. Mimesis offers providers: built-in generators that produce fake data suited to a particular kind of information. Since our records describe individuals, we can import and use the Person class, a provider that, given a specific locale such as English and an optional random seed, can be used to create fake identities that stand in for sensitive personal data:
```python
from mimesis import Person
from mimesis.locales import Locale

# Initializing a Person provider for English locales
person = Person(locale=Locale.EN, seed=42)
```
From this point forward, anonymizing personally identifiable information (PII) is very simple. All that is required is to replace the sensitive columns, which we specify ourselves, with newly generated data from the Mimesis Person provider. For each sensitive column of the DataFrame that holds the dataset, we call the appropriate Mimesis method once per row to create a synthetic substitute for the original value:
```python
# 1. Replacing real names with fake, realistic names
df['real_name'] = [person.full_name() for _ in range(len(df))]

# 2. Replacing real emails with fake ones
df['email'] = [person.email() for _ in range(len(df))]

# 3. Replacing real phone numbers
df['phone'] = [person.telephone() for _ in range(len(df))]

# 4. Renaming the column to reflect that it is no longer the real name
df.rename(columns={'real_name': 'anon_name'}, inplace=True)
```
Note above that the Mimesis Person class provides dedicated methods for generating full names, emails, and phone numbers, among others. In addition, the name column is renamed to indicate that the names in the updated dataset are no longer the originals but anonymized values.
We now confirm the results by viewing the transformed DataFrame. The sensitive PII fields have changed completely: they have been overwritten with artificial data that looks legitimate, while the structure of the dataset and the information that matters for downstream analysis, such as subscription_tier, remain fully intact.
```python
print("\n--- Anonymized Data for Data Science Analyses ---")
print(df.head())
```
Output:
```
--- Anonymized Data for Data Science Analyses ---
   user_id         anon_name              email            phone
0      101    Anthony Reilly  [email protected]     +13312271333
1      102           Kai Day  [email protected]  +1-205-759-3586
2      103  Cleveland Osborn  [email protected]     +13691067988
3      104       Zack Holder  [email protected]  +1-574-481-3676

  subscription_tier
0           Premium
1             Basic
2             Basic
3        Enterprise
```
That's it! In a few simple steps we have anonymized several sensitive fields of the kind commonly found in real-world production data science projects and analyses, all for free, because Mimesis is open source.
To finish, here are a few points worth keeping in mind about the anonymization procedure we have just walked through:
- We changed the columns directly in the DataFrame. Depending on your context, consider whether this is the right approach, or whether you should store the new information in a separate DataFrame if there is a risk of losing the original data.
- Mimesis works in a type-compatible manner, so the generated data matches the expected data types.
- Seeding helps keep the generated information consistent across different runs and helps reproducibility.
# Wrapping up
In this article, we showed how to use Mimesis, a Python library for generating realistic fake data, to convert a sensitive production dataset into an anonymized version that can be used safely for further analysis without exposing private information such as real people's PII.
Iván Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning and LLMs. He trains and guides others in using AI in the real world.



