Build a Data Cleaning and Validation Pipeline in Under 50 Lines of Python


Image by Author | Ideogram
Real-world data is messy. When you pull data from APIs or analyze datasets collected in the wild, you'll run into duplicates, missing values, and invalid entries. Instead of rewriting the same cleaning code for every project, a well-designed pipeline saves time and keeps your data science projects consistent.
In this article, we'll build a practical data cleaning and validation pipeline that handles common quality issues while reporting exactly what was fixed. By the end, you'll have a tool that can clean datasets and validate them against business rules in just a few lines of code.
🔗 Link to the code on GitHub
Why Build a Data Cleaning Pipeline?
Think of a data pipeline as an assembly line in a factory. Each step performs one specific job, and the output of one step becomes the input to the next. This approach makes your code modular, testable, and reusable across different projects.
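To make the assembly-line idea concrete, here is a minimal sketch with made-up step functions (the names `strip_whitespace`, `drop_empty`, and `run_pipeline` are illustrative, not part of the pipeline we build below):

```python
# Each step takes data in and passes transformed data out,
# just like stations on an assembly line.
def strip_whitespace(rows):
    return [r.strip() for r in rows]

def drop_empty(rows):
    return [r for r in rows if r]

def run_pipeline(data, steps):
    # The output of one step becomes the input of the next
    for step in steps:
        data = step(data)
    return data

print(run_pipeline(['  a ', '', 'b '], [strip_whitespace, drop_empty]))  # ['a', 'b']
```

Because each step is an ordinary function, you can test it in isolation and reorder or swap steps without touching the rest of the pipeline.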


A simple data cleaning pipeline
Image by Author | diagrams.net (draw.io)
Our pipeline will handle three core responsibilities:
- Cleaning: Remove duplicates and handle missing values (treat this as a starting point; you can add as many cleaning steps as needed)
- Validation: Ensure the data meets business rules and constraints
- Reporting: Track what changes were made during processing
Setting Up the Environment
Make sure you're using a recent version of Python. If you're working locally, create a virtual environment and install the required packages:
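The exact commands were not preserved in this draft; a typical setup looks like the following (package names assumed from the code used later in the article):

```shell
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate

# Install the libraries the pipeline uses
pip install pandas pydantic
```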
You can also use Google Colab or a similar environment if you prefer.
Defining the Validation Schema
Before we can validate data, we need to define what "valid" looks like. We'll use Pydantic, a Python library that uses type hints to validate data.
from typing import Optional
from pydantic import BaseModel, field_validator

class DataValidator(BaseModel):
    name: str
    age: Optional[int] = None
    email: Optional[str] = None
    salary: Optional[float] = None

    @field_validator('age')
    @classmethod
    def validate_age(cls, v):
        if v is not None and (v < 0 or v > 100):
            raise ValueError('Age must be between 0 and 100')
        return v

    @field_validator('email')
    @classmethod
    def validate_email(cls, v):
        if v and '@' not in v:
            raise ValueError('Invalid email format')
        return v
This schema defines the expected structure of our data using Pydantic's syntax. To use the @field_validator decorator, you also need the @classmethod decorator. The validation logic ensures that ages fall within sensible bounds and that emails contain the '@' symbol.
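As a quick sanity check, here is how the schema behaves on a valid and an invalid record. The class is repeated so the snippet runs standalone, and Pydantic v2 is assumed; the sample values are invented for illustration:

```python
from typing import Optional
from pydantic import BaseModel, ValidationError, field_validator

class DataValidator(BaseModel):
    name: str
    age: Optional[int] = None
    email: Optional[str] = None
    salary: Optional[float] = None

    @field_validator('age')
    @classmethod
    def validate_age(cls, v):
        if v is not None and (v < 0 or v > 100):
            raise ValueError('Age must be between 0 and 100')
        return v

    @field_validator('email')
    @classmethod
    def validate_email(cls, v):
        if v and '@' not in v:
            raise ValueError('Invalid email format')
        return v

# A clean record validates and normalizes into a plain dict
ok = DataValidator(name='Jane Smith', age=30, email='jane@example.com')
print(ok.model_dump())

# An out-of-range age raises a ValidationError instead of slipping through
try:
    DataValidator(name='Bad Row', age=150)
except ValidationError as e:
    print(e.errors()[0]['msg'])
```

Note that fields left out of a record simply fall back to their `None` defaults; only the declared constraints are enforced.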
Creating the Pipeline Class
Our main class encapsulates all the cleaning and validation logic:
class DataPipeline:
    def __init__(self):
        self.cleaning_stats = {'duplicates_removed': 0, 'nulls_handled': 0, 'validation_errors': 0}
The constructor initializes a statistics dictionary to track the changes made during processing. This gives you visibility into data quality and keeps a record of the cleaning operations applied, which we'll report later.
Writing the Data Cleaning Method
Let's add a clean_data method to handle common quality issues such as missing values and duplicate records:
    def clean_data(self, df: pd.DataFrame) -> pd.DataFrame:
        initial_rows = len(df)

        # Remove duplicates
        df = df.drop_duplicates()
        self.cleaning_stats['duplicates_removed'] = initial_rows - len(df)

        # Handle missing values (record how many nulls we're about to fill)
        self.cleaning_stats['nulls_handled'] = int(df.isnull().sum().sum())
        numeric_columns = df.select_dtypes(include=[np.number]).columns
        df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())

        string_columns = df.select_dtypes(include=['object']).columns
        df[string_columns] = df[string_columns].fillna('Unknown')

        return df
This method handles different data types intelligently. Missing numeric values are filled with the column median (more robust to outliers than the mean), while text columns get a placeholder value. Duplicate removal happens first so it doesn't skew our statistics.
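To see why the median is the safer default here, consider a toy salary column with one extreme outlier (values invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'salary': [30000, 32000, 31000, None, 900000]})

# The outlier drags the mean far above what most rows earn...
mean_fill = df['salary'].fillna(df['salary'].mean())
# ...while the median stays representative of typical values.
median_fill = df['salary'].fillna(df['salary'].median())

print(mean_fill[3])    # 248250.0
print(median_fill[3])  # 31500.0
```

Filling the gap with 248250 would invent a salary no one in the data actually earns; 31500 is a far more plausible stand-in.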
Adding Validation with Error Handling
The validation step processes each row individually, collecting both the valid records and detailed error information:
    def validate_data(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, List[Dict]]:
        valid_rows = []
        errors = []

        for idx, row in df.iterrows():
            try:
                validated_row = DataValidator(**row.to_dict())
                valid_rows.append(validated_row.model_dump())
            except ValidationError as e:
                errors.append({'row': idx, 'errors': str(e)})

        self.cleaning_stats['validation_errors'] = len(errors)
        return pd.DataFrame(valid_rows), errors
This row-by-row approach ensures that a single bad record doesn't crash the entire pipeline. Valid rows continue through the process while errors are captured for review. This matters in production settings, where you need to process what you can while flagging problems.
Orchestrating the Pipeline
The process method ties everything together:
    def process(self, df: pd.DataFrame) -> Dict[str, Any]:
        cleaned_df = self.clean_data(df.copy())
        validated_df, validation_errors = self.validate_data(cleaned_df)

        return {
            'cleaned_data': validated_df,
            'validation_errors': validation_errors,
            'stats': self.cleaning_stats
        }
The return value is a comprehensive dictionary containing the cleaned data, any validation errors, and the processing statistics.
Putting It All Together
Here's how you'd use the pipeline in practice:
# Create sample messy data
sample_data = pd.DataFrame({
    'name': ['Tara Jamison', 'Jane Smith', 'Lucy Lee', None, 'Clara Clark', 'Jane Smith'],
    'age': [25, -5, 25, 35, 150, -5],
    'email': ['[email protected]', 'invalid-email', '[email protected]', '[email protected]', '[email protected]', 'invalid-email'],
    'salary': [50000, 60000, 50000, None, 75000, 60000]
})

pipeline = DataPipeline()
result = pipeline.process(sample_data)
The pipeline removes the duplicate record, fills the missing name with 'Unknown', fills the missing salary with the median, and flags the rows with invalid ages and email formats as validation errors.
🔗 You can find the complete script on GitHub.
Extending the Pipeline
This pipeline serves as a foundation you can build on. Consider these enhancements for your specific needs:
Custom cleaning rules: Add methods for domain-specific cleaning, such as standardizing phone numbers or addresses.
Configurable validation: Make the Pydantic schema configurable so the same pipeline can handle different data types.
Advanced error handling: Implement retry logic for transient errors, or automatic fixes for common mistakes.
Performance optimization: For large datasets, consider vectorized operations or parallel processing.
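As one sketch of a custom cleaning rule, here is a hypothetical `standardize_phone` helper for US-style numbers (this function is not part of the original pipeline; it shows the kind of method you might add):

```python
import pandas as pd

def standardize_phone(s: pd.Series) -> pd.Series:
    """Keep digits only, then format exact 10-digit numbers as XXX-XXX-XXXX."""
    digits = s.fillna('').astype(str).str.replace(r'\D', '', regex=True)
    # Numbers that aren't exactly 10 digits are left as bare digits for review
    return digits.str.replace(r'^(\d{3})(\d{3})(\d{4})$', r'\1-\2-\3', regex=True)

phones = pd.Series(['(555) 123-4567', '555.987.6543', '12345'])
print(standardize_phone(phones).tolist())  # ['555-123-4567', '555-987-6543', '12345']
```

A rule like this slots naturally into clean_data as just another step, and its statistics could be tracked in cleaning_stats the same way.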
Wrapping Up
Data pipelines aren't just about cleaning individual datasets; they're about building reliable, maintainable systems.
This pipeline ensures consistency across your projects and makes it easy to adjust business rules as requirements change. Start with this basic pipeline and customize it for your specific needs.
The key is having a reliable, reusable system that handles the mundane tasks so you can focus on extracting insights from clean data. Happy data cleaning!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. She also creates engaging resource overviews and coding tutorials.



