
How to use simple data contracts in Python as a data scientist

Let's be honest: we've all been there.

It's Friday afternoon. You trained the model, validated it, and deployed the pipeline. The metrics look green. You close your laptop for the weekend and enjoy the break.

On Monday morning, you are greeted by a “Pipeline Failed” message when you get to work. What's going on? Everything was fine when you deployed it.

The truth is that the culprit can be many things. Maybe the upstream engineering team changed the user_id column from an integer to a string. Or maybe price suddenly contains negative numbers. Or my personal favorite: the column name was changed from created_at to createdAt (camelCase strikes again!).

The industry calls this schema drift. I call it a ruined Monday.
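To make this concrete, here is a tiny hypothetical sketch (plain Python, no real pipeline; the function and field names are made up) of how a silent upstream type change slips past naive code:

```python
# Hypothetical sketch: the same filtering logic before and after an
# upstream team changes user_id from int to str (schema drift).

def high_value_users(rows):
    # Naive logic that assumes price is a trustworthy non-negative number
    return [r["user_id"] for r in rows if r["price"] > 100]

before = [{"user_id": 1, "price": 150}, {"user_id": 2, "price": 50}]
print(high_value_users(before))   # [1]

# After the drift: ids are now strings, and a negative price sneaks in
after = [{"user_id": "1", "price": 150}, {"user_id": "2", "price": -150}]
print(high_value_users(after))    # ['1'] -- still "works", just silently wrong
```

Nothing crashes. The bad data simply flows downstream, which is exactly the problem.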

Recently, people have been talking a lot about data contracts. Often, this involves someone selling you an expensive SaaS platform or a complex architecture. But if you're just a data scientist or engineer trying to keep your Python pipelines from exploding, you don't need the enterprise bloat.


The tool: Pandera

Let's go through how to create a simple data contract in Python using the Pandera library. It is an open-source Python library that lets you define schemas as class objects. It will feel very similar to Pydantic (if you've used FastAPI), but it's built specifically for DataFrames.

To get started, you can simply install pandera using pip:

pip install pandera

A real-life example: a marketing leads feed

Let's look at a classic situation. A third-party vendor sends you a CSV file of advertising leads.

Here is what we expect the data to look like:

  1. id: Integer (must be unique).
  2. email: String (must actually look like an email).
  3. signup_date: A valid datetime object.
  4. lead_score: Float between 0.0 and 1.0.

And here is the messy reality of the raw data we actually receive:

import pandas as pd
import numpy as np

# Simulating incoming data that MIGHT break our pipeline
data = {
    "id": [101, 102, 103, 104],
    "email": ["[email protected]", "[email protected]", "INVALID_EMAIL", "[email protected]"],
    "signup_date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
    "lead_score": [0.5, 0.8, 1.5, -0.1] # Note: 1.5 and -0.1 are out of bounds!
}

df = pd.DataFrame(data)

If you feed this dataset into a model that expects scores between 0 and 1, your predictions will be garbage. If you try to join on id and there are duplicates, your row count will explode. Dirty data leads to dirty data science!
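To see the join problem in miniature, here's a toy sketch (plain Python lists standing in for a DataFrame merge; the values are made up) of how a single duplicated id inflates the row count:

```python
# Toy illustration of duplicate keys "exploding" a join: every duplicate
# on the right side multiplies the matching rows on the left.

leads = [(101, "alice"), (102, "bob")]          # 2 rows
scores = [(101, 0.5), (102, 0.8), (102, 1.5)]   # 102 appears twice!

joined = [
    (lead_id, name, score)
    for (lead_id, name) in leads
    for (score_id, score) in scores
    if lead_id == score_id
]
print(len(joined))  # 3 rows out of 2 leads: the count grew
```

One duplicate turned 2 rows into 3; at warehouse scale, the same mechanism turns millions into billions.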

Step 1: Define the contract

Instead of writing a dozen if-statement data quality checks, we define a SchemaModel. This is our contract.

import pandera as pa
from pandera.typing import Series

class LeadsContract(pa.SchemaModel):
    # 1. Check data types and existence
    id: Series[int] = pa.Field(unique=True, ge=0) 
    
    # 2. Check formatting using regex
    email: Series[str] = pa.Field(str_matches=r"[^@]+@[^@]+\.[^@]+")
    
    # 3. Coerce types (convert string dates to datetime objects automatically)
    signup_date: Series[pd.Timestamp] = pa.Field(coerce=True)
    
    # 4. Check business logic (bounds)
    lead_score: Series[float] = pa.Field(ge=0.0, le=1.0)

    class Config:
        # This ensures strictness: if an extra column appears, or one is missing, throw an error.
        strict = True

See the code above for a general idea of how Pandera sets up a contract. You can worry about the details later by looking at the Pandera documentation.

Step 2: Enforce the contract

Now, we need to apply the contract we created to our data. The most obvious way to do this is to run LeadsContract.validate(df). This works, but it fails fast: it stops at the first error it finds. In production, you usually want to know everything that's wrong with the file, not just the first row.

We can enable “lazy” validation to catch all errors at once.

try:
    # lazy=True means "find all errors before crashing"
    validated_df = LeadsContract.validate(df, lazy=True)
    print("Data passed validation! Proceeding to ETL...")
    
except pa.errors.SchemaErrors as err:
    print("⚠️ Data Contract Breached!")
    print(f"Total errors found: {len(err.failure_cases)}")
    
    # Let's look at the specific failures
    print("\nFailure Report:")
    print(err.failure_cases[['column', 'check', 'failure_case']])

Output

If you run the code above, you will not get the usual cryptic KeyError. You will get a specific report explaining exactly why the contract was broken:

⚠️ Data Contract Breached!
Total errors found: 3

Failure Report:
       column                     check   failure_case
0       email               str_matches  INVALID_EMAIL
1  lead_score     less_than_or_equal_to            1.5
2  lead_score  greater_than_or_equal_to           -0.1

In a real pipeline, you can log this report to a file and set up alerts so you are notified whenever the contract is broken.
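As a sketch of that idea (plain Python logging; the tuples stand in for rows of Pandera's err.failure_cases, and the logger name is made up):

```python
import io
import logging

# Capture breach reports in a stream; in production this would be a file
# or a log aggregator that triggers your alerts.
log_stream = io.StringIO()
logger = logging.getLogger("data_contracts")
logger.setLevel(logging.WARNING)
logger.addHandler(logging.StreamHandler(log_stream))

def report_breach(failure_cases):
    # Each tuple mimics one row of the failure report: (column, check, value)
    for column, check, value in failure_cases:
        logger.warning("contract breach: column=%s check=%s value=%r",
                       column, check, value)

report_breach([("lead_score", "less_than_or_equal_to", 1.5)])
print(log_stream.getvalue().strip())
```

Swapping the StringIO handler for a FileHandler (or your alerting system of choice) is a one-line change.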


Why is this important?

This method changes the dynamics of your work.

Without a contract, your code fails deep inside the transformation logic (or worse, it doesn't fail, and you write bad data to the warehouse). You then spend hours tracing mysterious NaN values.

With a contract:

  1. Fail fast: The pipeline stops at the door. Bad data never reaches your core logic.
  2. Clear communication: You can send the failure report back to the data provider and say, “Rows 3 and 4 violate the schema. Please fix.”
  3. Documentation: The LeadsContract class acts as living documentation. Newcomers to the project don't have to guess what the columns represent; they can just read the code. You also avoid maintaining a separate data contract document in SharePoint or a wiki that quickly goes stale.

The “fancier” solutions

You can definitely go deeper. You can combine this with Airflow, push metrics to a dashboard, or use tools like Great Expectations for more complex statistical checks.

But in 90% of the use cases I see, a simple validation step at the beginning of your Python script is enough to let you sleep well on a Friday night.

Start small. Define the schema of your most important dataset, wrap it in a try/except block, and see how many headaches it saves you this week. Once this simple method no longer fits, that's when I'd look at tools more specific to data contracts.

If you are interested in AI, data science, or data engineering, please follow me or connect on LinkedIn.
