5 Useful Python Scripts for Automating Data Cleansing


# Introduction
As a data scientist, you know that machine learning models, statistical dashboards, and business reports all depend on accurate, consistent, and well-formatted data. But here's the uncomfortable truth: data cleaning consumes a large portion of a project's time. Data scientists and analysts spend more time cleaning and preparing data than actually analyzing it.
The raw data you get is dirty. It has missing values scattered all over the place, duplicate records, inconsistent formats, outliers that distort your models, and text fields full of typos and inconsistencies. Cleaning this data manually is tedious, error-prone, and inefficient.
This article covers five Python scripts specifically designed to perform common and time-consuming data cleaning tasks that you often run into in real-world projects.
🔗 Link to the code on GitHub
# 1. Missing Value Handler
Pain point: Your dataset has missing values everywhere: some columns are 90% complete, while others are mostly empty. You need to decide what to do for each: drop rows, fill with the mean, forward-fill time series, or use more sophisticated imputation. Doing this manually for each column is tedious and inconsistent.
What the script does: Automatically analyzes missing value patterns across your dataset, recommends appropriate handling strategies based on data type and missingness patterns, and applies the selected imputation methods. It produces a detailed report showing what was missing and how it was handled.
How it works: The script scans all columns to calculate missing-value percentages and patterns, determines each column's data type (numeric, categorical, datetime), and applies the appropriate technique:
- mean/median imputation for numeric data,
- mode imputation for categorical data,
- interpolation for time series.
It can detect and handle Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR) patterns differently, and it logs all changes for reproducibility.
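The per-column strategy selection described above can be sketched in a few lines of pandas. This is a minimal illustration, not the actual script: the function name `handle_missing` and the `time_series_cols` parameter are assumptions made for the example.

```python
import pandas as pd

def handle_missing(df: pd.DataFrame, time_series_cols=()) -> tuple:
    """Fill missing values per column based on its type; log what was done."""
    out, report = df.copy(), {}
    for col in out.columns:
        pct = out[col].isna().mean() * 100
        if pct == 0:
            continue
        if col in time_series_cols:
            out[col] = out[col].interpolate()                    # time series: interpolation
            strategy = "interpolate"
        elif pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())        # numeric: median
            strategy = "median"
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])  # categorical: mode
            strategy = "mode"
        report[col] = f"{pct:.0f}% missing, filled with {strategy}"
    return out, report
```

Run on a frame with gaps in a numeric and a text column, this fills each with the median and mode respectively, and the report dictionary records what changed for reproducibility.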
⏩ Get the missing value handler script
# 2. Duplicate Record Detector and Resolver
Pain point: Your data has duplicates, but they are not always exact matches. Sometimes it's the same customer with slightly different spellings of their name, or the same record entered twice with minor variations. Finding these fuzzy duplicates and deciding which record to keep requires manually examining thousands of rows.
What the script does: Identifies both exact and fuzzy duplicate records using configurable matching rules. It groups similar records together, scores their similarity, and either flags them for review or merges them automatically based on survivorship rules you define, such as keep newest, keep most complete, and more.
How it works: The script first finds exact duplicates using hash-based comparison for speed. Then it applies fuzzy matching algorithms such as Levenshtein distance and Jaro-Winkler similarity to key fields to find near-duplicates. Records are grouped into duplicate clusters, and survivorship rules determine which values are preserved when records are merged. A detailed report shows all duplicate groups found and the actions taken.
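A toy version of the fuzzy pass can be built with the standard library. The article's script is described as using Levenshtein and Jaro-Winkler; this sketch substitutes `difflib.SequenceMatcher`, which works without extra dependencies, and the 0.85 threshold is an assumption.

```python
import pandas as pd
from difflib import SequenceMatcher

def find_fuzzy_duplicates(df: pd.DataFrame, key: str, threshold: float = 0.85):
    """Return (row_i, row_j, score) pairs whose `key` values look like
    near-duplicates. SequenceMatcher stands in for Levenshtein/Jaro-Winkler."""
    values = df[key].astype(str).str.lower().tolist()
    pairs = []
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            score = SequenceMatcher(None, values[i], values[j]).ratio()
            if score >= threshold:
                pairs.append((i, j, round(score, 2)))
    return pairs
```

Pairwise comparison is O(n²), which is why a hash-based exact pass comes first and fuzzy matching is reserved for key fields; production versions typically also add blocking to limit the number of comparisons.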
⏩ Get the duplicate detector script
# 3. Data Type Fixer and Standardizer
Pain point: Your CSV import converted everything to strings. Dates are in five different formats. Numbers have currency symbols and thousands separators. Boolean values are represented as “Yes/No”, “Y/N”, “1/0”, and “True/False”, all in the same column. Getting consistent data types requires writing custom parsing logic for each dirty column.
What the script does: Automatically detects the target data type of each column, standardizes formats, and converts everything to the appropriate types. It handles dates in multiple formats, cleans numeric strings, normalizes boolean representations, and validates the results. It provides a conversion report that shows what was changed.
How it works: The script samples values from each column to infer the target type using pattern matching and heuristics. Then it applies the appropriate parser: dateutil for flexible date parsing, regex-based numeric extraction, and mapping dictionaries for boolean normalization. Failed conversions are flagged along with the problematic values for manual review.
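The individual converters are short with pandas. A sketch under stated assumptions: the mapping table and function names are invented for illustration, and where the article's script infers types automatically, here each converter is called explicitly.

```python
import pandas as pd

BOOL_MAP = {"yes": True, "y": True, "true": True, "1": True,
            "no": False, "n": False, "false": False, "0": False}

def clean_numeric(s: pd.Series) -> pd.Series:
    """Drop currency symbols and thousands separators, then convert."""
    stripped = s.astype(str).str.replace(r"[^\d.\-]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")  # failures become NaN for review

def clean_boolean(s: pd.Series) -> pd.Series:
    """Map the many textual spellings of True/False onto real booleans."""
    return s.astype(str).str.strip().str.lower().map(BOOL_MAP)

def clean_dates(s: pd.Series) -> pd.Series:
    """Parse dates element-wise so mixed formats in one column still work."""
    return s.apply(lambda v: pd.to_datetime(v, errors="coerce"))
```

Using `errors="coerce"` turns unparseable values into NaN/NaT rather than raising, which is what makes it possible to collect the failures and flag them for manual review.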
⏩ Get the data type fixer script
# 4. Outlier Detector
Pain point: Your numeric data has outliers that will skew your analysis. Some are data entry errors, some are legitimate extreme values you want to keep, and some are ambiguous. You need to identify them, understand their impact, and decide how to handle each case: winsorize, cap, remove, or flag for review.
What the script does: Finds outliers using multiple statistical methods such as IQR, Z-score, and Isolation Forest; visualizes their distribution and impact; and applies configurable treatment strategies. It distinguishes between univariate and multivariate outliers, and generates reports showing which outliers were found, their counts, and how they were handled.
How it works: The script computes outlier boundaries using the method(s) you select, flags values that exceed those limits, and applies the chosen treatment: removal, capping at percentiles, winsorization, or imputation with boundary values. For multivariate outliers, Isolation Forest or Mahalanobis distance is used. All outliers are logged with their original values for audit purposes.
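The IQR rule and winsorization mentioned above fit in a few lines. This sketch covers only the univariate case, and the function names are illustrative, not the script's actual API.

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def winsorize(s: pd.Series, lower: float = 0.05, upper: float = 0.95) -> pd.Series:
    """Cap extreme values at the given percentiles instead of dropping them."""
    lo, hi = s.quantile([lower, upper])
    return s.clip(lo, hi)
```

Before applying a treatment, the rows flagged by the mask can be written to a log together with their original values, which is what makes the cleanup auditable.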
⏩ Get the outlier detector script
# 5. Text and General Data Cleaner
Pain point: Your text fields are a mess. Names have inconsistent capitalization, addresses use different abbreviations (St. vs Street vs ST), product descriptions contain HTML tags and special characters, and free-text fields have leading/trailing whitespace everywhere. Cleaning text data requires dozens of regex patterns and string operations applied consistently.
What the script does: Automatically cleans and normalizes text data: standardizes case, removes unwanted characters, expands or standardizes abbreviations, strips HTML, normalizes whitespace, and handles Unicode issues. Configurable cleaning pipelines let you apply different rules to different column types (names, addresses, descriptions, and the like).
How it works: The script builds a text transformation pipeline that can be configured per column type. It handles case normalization, whitespace cleanup, special character removal, abbreviation standardization using lookup dictionaries, and Unicode normalization. Each change is logged, and before/after samples are provided for verification.
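A minimal pipeline of the steps listed above, using only pandas and the standard library. The abbreviation table is a tiny illustrative example, not the script's actual lookup dictionary.

```python
import unicodedata
import pandas as pd

# Illustrative lookup table; a real one would be much larger
ABBREVIATIONS = {r"\bst\b\.?": "street", r"\bave\b\.?": "avenue"}

def clean_text(s: pd.Series, expand_abbrev: bool = False) -> pd.Series:
    s = s.astype(str).apply(lambda v: unicodedata.normalize("NFKC", v))  # Unicode
    s = s.str.replace(r"<[^>]+>", " ", regex=True)   # strip HTML tags
    s = s.str.lower()                                # case normalization
    if expand_abbrev:
        for pattern, full in ABBREVIATIONS.items():
            s = s.str.replace(pattern, full, regex=True)
    return s.str.replace(r"\s+", " ", regex=True).str.strip()  # whitespace
```

Keeping each step as a separate vectorized operation is what makes the pipeline configurable: a column type is just a list of which steps to apply, and logging before/after samples between steps is straightforward.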
⏩ Get the text cleaner script
# Conclusion
These five scripts address the most time-consuming data cleaning challenges you'll face in real-world projects. Here's a quick summary:
- The missing value handler analyzes and imputes missing data intelligently
- The duplicate detector finds exact and fuzzy duplicates and resolves them
- The data type fixer standardizes formats and converts columns to the appropriate types
- The outlier detector identifies and treats statistical anomalies
- The text cleaner normalizes dirty string data consistently
Each script is designed to be modular, so you can use them individually or combine them into a complete data cleaning pipeline. Start with the script that addresses your biggest pain point, test it on a sample of your data, customize its parameters for your specific use case, and gradually build out your automated cleaning workflow.
Happy data cleaning!
Bala Priya C is an engineer and technical writer from India. She loves working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she works on learning and sharing her knowledge with the engineering community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.



