
Creating a simple data quality DSL in Python


Getting started

Data validation code in Python is often a pain to maintain. Business rules get buried in nested if statements, validation logic gets mixed with error handling, and adding new checks often means digging through procedural functions to find the right place for the code. There are, of course, data validation libraries you could use, but here we'll focus on building something simple and useful in plain Python.

A Domain-Specific Language (DSL) builds a vocabulary tailored directly to a problem, in this case data validation. Instead of writing generic Python code, you create specialized functions and classes that express validation rules in the same terms you use to think about the problem.

For data validation, this means rules read like business requirements: "Customer age must be between 18 and 120" or "email addresses must have a valid domain." The DSL handles the mechanics of checking data and reporting violations, while you focus on describing what valid data looks like. The result is validation logic that is readable, easy to maintain and test, and easy to extend. So, let's start coding!

🔗 Link to the code on GitHub

Why Create a DSL?

Consider validating customer data with Python:

def validate_customers(df):
    errors = []
    if df['customer_id'].duplicated().any():
        errors.append("Duplicate IDs")
    if (df['age'] < 0).any():
        errors.append("Negative ages")
    if not df['email'].str.contains('@').all():
        errors.append("Invalid emails")
    return errors

This approach hardcodes the validation logic, mixes business rules with error handling, and becomes unwieldy as the rules grow. Instead, we'll write a DSL that separates these concerns and creates reusable validation components.

Instead of writing procedural validation functions, the DSL lets you express rules that read like business requirements:

# Traditional approach
if df['age'].min() < 0 or df['age'].max() > 120:
    raise ValueError("Invalid ages found")

# DSL approach  
validator.add_rule(Rule("Valid ages", between('age', 0, 120), "Ages must be 0-120"))

The DSL approach separates what is validated (the business rules) from how failures are reported (the error handling). This makes the validation logic easy to test, reusable, and readable even by non-programmers.

Creating sample data

Start by creating a small but realistic e-commerce customer dataset that contains common quality issues:

import pandas as pd

customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 103, 105],
    'email': ['john@example.com', 'invalid-email', '', 'mary@example.org', 'jane@@example.com'],
    'age': [25, -5, 35, 200, 28],
    'total_spent': [250.50, 1200.00, 0.00, -50.00, 899.99],
    'join_date': ['2023-01-15', '2023-13-45', '2023-02-20', '2023-02-20', '']
}) # Note: 2023-13-45 is an intentionally malformed date.

This data contains duplicate customer IDs, invalid email formats, impossible ages, negative spending amounts, and malformed or missing dates. That gives us plenty to exercise our validation rules against.

Writing the Validation Logic

// Creating a Rule Class

Let's start with a simple Rule class that encapsulates the validation logic:

class Rule:
    def __init__(self, name, condition, error_msg):
        self.name = name
        self.condition = condition
        self.error_msg = error_msg
    
    def check(self, df):
        # The condition function returns True for VALID rows.
        # We use ~ (bitwise NOT) to select the rows that VIOLATE the condition.
        violations = df[~self.condition(df)]
        if not violations.empty:
            return {
                'rule': self.name,
                'message': self.error_msg,
                'violations': len(violations),
                'sample_rows': violations.head(3).index.tolist()
            }
        return None

The condition parameter accepts any function that takes a DataFrame and returns a boolean Series marking the valid rows. The tilde operator (~) inverts this boolean Series to identify the violations. When violations exist, the check method returns detailed information including the rule name, the error message, the violation count, and a sample of row indices for debugging.

This design separates the validation logic from the error reporting. The condition function captures only the business rule, while the Rule class handles the error details consistently.
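To see the mechanics in isolation, here is a minimal standalone sketch (independent of the Rule class) of how a condition function and the ~ operator combine to locate the violating rows:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, -5, 200]})

# A condition function returns True for VALID rows
condition = lambda d: d['age'].between(0, 120)

# ~ inverts the mask, so indexing selects the VIOLATING rows
violations = df[~condition(df)]
print(violations.index.tolist())  # → [1, 2]
```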

// Managing Multiple Rules

Next, let's code a DataValidator class that manages a collection of rules:

class DataValidator:
    def __init__(self):
        self.rules = []
    
    def add_rule(self, rule):
        self.rules.append(rule)
        return self # Enables method chaining
    
    def validate(self, df):
        results = []
        for rule in self.rules:
            violation = rule.check(df)
            if violation:
                results.append(violation)
        return results

The add_rule method returns self, enabling method chaining. The validate method runs every rule independently and collects the violation reports. This ensures that one failing rule does not prevent the others from running.
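Since add_rule returns self, rules can be registered as one fluent chain. Here is a condensed sketch (the two classes are restated only so the snippet runs on its own):

```python
import pandas as pd

# Condensed restatements of the Rule and DataValidator classes above,
# repeated here only so this snippet is self-contained
class Rule:
    def __init__(self, name, condition, error_msg):
        self.name, self.condition, self.error_msg = name, condition, error_msg

    def check(self, df):
        bad = df[~self.condition(df)]
        return {'rule': self.name, 'violations': len(bad)} if not bad.empty else None

class DataValidator:
    def __init__(self):
        self.rules = []

    def add_rule(self, rule):
        self.rules.append(rule)
        return self  # returning self is what makes chaining possible

    def validate(self, df):
        return [v for r in self.rules if (v := r.check(df)) is not None]

df = pd.DataFrame({'age': [25, -5], 'total_spent': [10.0, -3.0]})

# Because add_rule returns self, registration reads as one fluent chain
issues = (DataValidator()
          .add_rule(Rule("Non-negative ages", lambda d: d['age'] >= 0, "bad age"))
          .add_rule(Rule("Non-negative spend", lambda d: d['total_spent'] >= 0, "bad spend"))
          .validate(df))
print([i['rule'] for i in issues])  # → ['Non-negative ages', 'Non-negative spend']
```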

// Creating Readable Condition Helpers

Recall that when we construct a Rule, we also need a condition function. This can be any function that takes a DataFrame and returns a boolean Series. While plain lambda functions work, they are not very readable. So let's write helper functions that build a readable validation vocabulary:

def not_null(column):
    return lambda df: df[column].notna()

def unique_values(column):
    return lambda df: ~df.duplicated(subset=[column], keep=False)

def between(column, min_val, max_val):
    return lambda df: df[column].between(min_val, max_val)

Each function returns a lambda that produces a pandas boolean Series.

  • The not_null helper uses pandas' notna() method to find non-null values.
  • The unique_values helper uses duplicated(..., keep=False) so that every occurrence of a duplicated value is flagged, ensuring an accurate count.
  • The between helper uses pandas' between() method, which performs an inclusive range check on both bounds.
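For example, running the unique_values helper directly shows the keep=False behavior, where every copy of a duplicated ID is flagged, not just the later ones:

```python
import pandas as pd

def unique_values(column):
    return lambda df: ~df.duplicated(subset=[column], keep=False)

df = pd.DataFrame({'customer_id': [101, 102, 103, 103, 105]})

# keep=False marks BOTH copies of the duplicated ID 103 as invalid
mask = unique_values('customer_id')(df)
print(mask.tolist())  # → [True, True, False, False, True]
```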

For pattern matching, regular expressions are the natural fit:

import re

def matches_pattern(column, pattern):
    return lambda df: df[column].str.match(pattern, na=False)

The na=False parameter ensures missing values are treated as validation failures rather than matches, which is usually the desired behavior for required fields.
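The effect is easy to demonstrate with a missing value; the pattern below is a simplified, hypothetical one used only for this demo:

```python
import pandas as pd

def matches_pattern(column, pattern):
    return lambda df: df[column].str.match(pattern, na=False)

# A simplified, hypothetical pattern just for this demo
df = pd.DataFrame({'email': ['a@b.com', None]})

# na=False turns the missing value into a failed check instead of NaN
mask = matches_pattern('email', r'^[^@]+@[^@]+$')(df)
print(mask.tolist())  # → [True, False]
```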

Building a Validator for the Sample Data

Now let's build a validator for the customer data to see how the DSL works in practice:

validator = DataValidator()

validator.add_rule(Rule(
   "Unique customer IDs", 
   unique_values('customer_id'),
   "Customer IDs must be unique across all records"
))

validator.add_rule(Rule(
   "Valid email format",
   matches_pattern('email', r'^[^@\s]+@[^@\s]+\.[^@\s]+$'),
   "Email addresses must contain @ symbol and domain"
))

validator.add_rule(Rule(
   "Reasonable customer age",
   between('age', 13, 120),
   "Customer age must be between 13 and 120 years"
))

validator.add_rule(Rule(
   "Non-negative spending",
   lambda df: df['total_spent'] >= 0,
   "Total spending amount cannot be negative"
))

Each rule follows the same pattern: a descriptive name, a validation condition, and an error message.

  • The first rule applies the unique_values helper to check for duplicate customer IDs.
  • The second rule uses a regular expression to verify email formats. The pattern requires at least one character before and after the @ sign, plus a domain extension.
  • The third rule applies the between range helper, setting plausible age limits for customers.
  • The last rule uses an inline lambda to flag negative total_spent values.

Notice how each rule reads almost like a business requirement. The validator collects these rules and can run them all against any DataFrame with the same column names:

issues = validator.validate(customers)

print("Validation Results:")
for issue in issues:
    print(f"❌ Rule: {issue['rule']}")
    print(f"   Problem: {issue['message']}")
    print(f"   Violations: {issue['violations']}")
    print(f"   Affected rows: {issue['sample_rows']}")
    print()

The output clearly identifies each problem and its location in the dataset, which makes debugging straightforward. With the sample data, you get the following result:

Validation Results:
❌ Rule: Unique customer IDs
   Problem: Customer IDs must be unique across all records
   Violations: 2
   Affected rows: [2, 3]

❌ Rule: Valid email format
   Problem: Email addresses must contain @ symbol and domain
   Violations: 3
   Affected rows: [1, 2, 4]

❌ Rule: Reasonable customer age
   Problem: Customer age must be between 13 and 120 years
   Violations: 2
   Affected rows: [1, 3]

❌ Rule: Non-negative spending
   Problem: Total spending amount cannot be negative
   Violations: 1
   Affected rows: [3]

Adding Cross-Column Validation

Real business rules often involve relationships between columns. Custom functions can handle this kind of multi-column validation logic:

def high_spender_email_required(df):
    high_spenders = df['total_spent'] > 500
    has_valid_email = df['email'].str.contains('@', na=False)
    # Passes if: (Not a high spender) OR (Has a valid email)
    return ~high_spenders | has_valid_email

validator.add_rule(Rule(
    "High Spenders Need Valid Email",
    high_spender_email_required,
    "Customers spending over $500 must have valid email addresses"
))

This rule applies boolean logic: high-spending customers must have proper email addresses, while low-spending customers can have any contact information. The expression ~high_spenders | has_valid_email reads as "not a high spender, or has a valid email," which lets low spenders pass regardless of their email status.
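The three interesting cases can be checked directly with a standalone sketch (made-up rows, independent of the validator):

```python
import pandas as pd

df = pd.DataFrame({
    'total_spent': [600.0, 600.0, 100.0],
    'email': ['a@b.com', 'no-at-sign', None],
})
high_spenders = df['total_spent'] > 500
has_valid_email = df['email'].str.contains('@', na=False)

# High spender with email passes; high spender without email fails;
# low spender passes even with a missing email
print((~high_spenders | has_valid_email).tolist())  # → [True, False, True]
```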

Handling Date Validation

Date validation needs to be handled carefully, since date parsing can fail:

def valid_date_format(column, date_format="%Y-%m-%d"):
    def check_dates(df):
        # pd.to_datetime with errors="coerce" turns invalid dates into NaT (Not a Time)
        parsed_dates = pd.to_datetime(df[column], format=date_format, errors="coerce")
        # A row is valid if the original value is not null AND the parsed date is not NaT
        return df[column].notna() & parsed_dates.notna()
    return check_dates

validator.add_rule(Rule(
    "Valid Join Dates",
    valid_date_format('join_date'),
    "Join dates must follow YYYY-MM-DD format"
))

Validation passes only when the original value is not null and the parsed date is valid (i.e., not NaT). No try-except block is needed: errors="coerce" tells pd.to_datetime to handle malformed strings gracefully by turning them into NaT, which is then caught by parsed_dates.notna().
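Running the parsing step on the three kinds of input from the sample data confirms the behavior:

```python
import pandas as pd

dates = pd.Series(['2023-01-15', '2023-13-45', ''])
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# The well-formed date passes; the impossible month and the empty
# string both coerce to NaT and therefore fail
valid = dates.notna() & parsed.notna()
print(valid.tolist())  # → [True, False, False]
```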

Writing a Decorator for Pipeline Integration

For production pipelines, you can use the decorator pattern to integrate validation cleanly:

def validate_dataframe(validator):
    def decorator(func):
        def wrapper(df, *args, **kwargs):
            issues = validator.validate(df)
            if issues:
                error_details = [f"{issue['rule']}: {issue['violations']} violations" for issue in issues]
                raise ValueError(f"Data validation failed: {'; '.join(error_details)}")
            return func(df, *args, **kwargs)
        return wrapper
    return decorator

# The decorator needs a validator instance in scope; uncomment once one
# (like the 'validator' built earlier) is available:
# @validate_dataframe(validator)
def process_customer_data(df):
    return df.groupby('age').agg({'total_spent': 'sum'})

This decorator validates data before any processing begins, preventing corrupted data from propagating through the pipeline. It raises a descriptive error that lists each failed rule with its violation count. The decorator line is commented out above as a reminder that a validator instance must be defined and in scope wherever the decorator is applied.
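To see the failure path end to end, here is a compact sketch using a stub validator (a hypothetical stand-in for DataValidator) that always reports one violation:

```python
import pandas as pd

def validate_dataframe(validator):
    def decorator(func):
        def wrapper(df, *args, **kwargs):
            issues = validator.validate(df)
            if issues:
                details = [f"{i['rule']}: {i['violations']} violations" for i in issues]
                raise ValueError(f"Data validation failed: {'; '.join(details)}")
            return func(df, *args, **kwargs)
        return wrapper
    return decorator

class StubValidator:  # hypothetical stand-in that always fails
    def validate(self, df):
        return [{'rule': 'demo rule', 'violations': 1}]

@validate_dataframe(StubValidator())
def process(df):
    return df

try:
    process(pd.DataFrame({'x': [1]}))
except ValueError as e:
    print(e)  # → Data validation failed: demo rule: 1 violations
```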

Extending the Pattern

You can extend the DSL to include additional validation rules as needed:

# Statistical outlier detection
def within_standard_deviations(column, std_devs=3):
    # Valid if absolute difference from mean is within N standard deviations
    return lambda df: abs(df[column] - df[column].mean()) <= std_devs * df[column].std()

# Referential integrity across datasets
def foreign_key_exists(column, reference_df, reference_column):
    # Valid if value in column is present in the reference_column of the reference_df
    return lambda df: df[column].isin(reference_df[reference_column])

# Custom business logic
def profit_margin_reasonable(df):
    # Ensures 0 <= margin <= 1
    margin = (df['revenue'] - df['cost']) / df['revenue']
    return (margin >= 0) & (margin <= 1)

Each extension follows the same pattern: build the validation logic as functions that return boolean Series.
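For instance, the foreign_key_exists helper can be exercised against a small reference table:

```python
import pandas as pd

def foreign_key_exists(column, reference_df, reference_column):
    # Valid if the value exists in the reference table's column
    return lambda df: df[column].isin(reference_df[reference_column])

orders = pd.DataFrame({'customer_id': [101, 999]})
customers = pd.DataFrame({'customer_id': [101, 102, 103]})

# Customer 999 has no matching record, so its row is flagged invalid
mask = foreign_key_exists('customer_id', customers, 'customer_id')(orders)
print(mask.tolist())  # → [True, False]
```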

Here's an example of how you could use the DSL we've built, assuming its classes and helpers live in a module called data_quality_dsl:

import pandas as pd
from data_quality_dsl import DataValidator, Rule, unique_values, between, matches_pattern

# Sample data
df = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'email': ['alice@example.com', 'invalid', 'bob@example.org', ''],
    'age': [25, -5, 30, 150]
})

# Build validator
validator = DataValidator()
validator.add_rule(Rule("Unique users", unique_values('user_id'), "User IDs must be unique"))
validator.add_rule(Rule("Valid emails", matches_pattern('email', r'^[^@]+@[^@]+\.[^@]+$'), "Invalid email format"))
validator.add_rule(Rule("Reasonable ages", between('age', 0, 120), "Age must be 0-120"))

# Run validation
issues = validator.validate(df)
for issue in issues:
    print(f"❌ {issue['rule']}: {issue['violations']} violations")

Wrapping Up

This DSL, though simple, works because it aligns with how data scientists think about validation. The rules express business requirements in easy-to-understand terms, while still letting us use pandas efficiently and flexibly.

The separation of concerns makes the validation logic easy to test, reuse, and maintain. The approach requires no external dependencies beyond pandas and has a gentle learning curve for anyone already familiar with pandas functionality.

This is something I've put together over a few coding sessions and several cups of coffee (yes!). You can use this version as a starting point and build something really cool on top of it. Happy coding!

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. She also creates engaging resource overviews and coding tutorials.
