Your First Job as a Data Engineer at a New Company? Make the ETL Pipeline Testable

7 minutes read

Imagine joining a new company as a data engineer. You inherit several ETL pipelines and are responsible for maintaining them. What challenges do you expect to face?

In general, you may face the following problems:

Incremental schema changes: Engineering teams may add or drop fields, change data types, or rename columns. When the source schema changes without notice, ETL jobs can fail. Worse, the pipeline may keep running and silently load corrupted or invalid values into downstream tables.
Data quality issues: Sometimes ETL jobs don't fail at all. They run and finish with a success status, but the loaded data is incorrect: it contains duplicate or missing records.
Lack of documentation: Legacy pipelines may have little documentation, or the existing documentation may be out of date, so you cannot be sure whether they still match the current business logic.
Volume growth and performance degradation: Data volume increases as the business grows. An ETL pipeline developed for a small historical dataset can easily slow down, stall, or fail when processing large volumes.
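The first two problems can often be caught early with cheap checks. The sketch below is a minimal illustration in plain Python and is not taken from any inherited pipeline: the function names, the expected column list, and the sample rows are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical column contract for a source table; adjust to your own pipeline.
EXPECTED_COLUMNS = ["Model_Name", "Prompt_Tokens", "Completion_Tokens"]


def detect_schema_drift(expected: List[str], actual: List[str]) -> Dict[str, Set[str]]:
    """Return the columns that disappeared from, or were added to, the source."""
    return {
        "missing": set(expected) - set(actual),
        "unexpected": set(actual) - set(expected),
    }


def find_duplicate_rows(rows: List[Tuple[str, ...]]) -> List[Tuple[str, ...]]:
    """Return each row that occurs more than once, in first-seen order."""
    counts = Counter(rows)
    return [row for row, count in counts.items() if count > 1]


if __name__ == "__main__":
    # Hypothetical source header where one column was swapped for another.
    print(detect_schema_drift(EXPECTED_COLUMNS, ["Model_Name", "Prompt_Tokens", "Cost"]))
    # Hypothetical sample rows containing one exact duplicate.
    sample = [("GPT-5.5", "1200", "300"), ("GPT-5.5", "1200", "300")]
    print(find_duplicate_rows(sample))
```

Checks like these are deliberately small. In a real pipeline they would more likely run against a Spark DataFrame's columns and rows, but the idea is the same: fail loudly instead of loading bad data silently.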

An automated testing workflow can help you address these issues. Why? Because a structured workflow helps you quickly understand the important aspects of an ETL pipeline: its business logic, data transformation algorithms, data types, and the data issues it needs to handle. Test patterns are also reusable, so you don't have to design a new workflow every time you inherit a different ETL pipeline.

In today's article, I will focus on automated testing in data engineering, including environment configuration and practical workflows. Finally, I'll also discuss how AI-assisted coding can speed up the workflow and improve productivity.

Make the Environment Work

If you are building an automated testing workflow for the first time, setting up the environment may take some time. There are different tools and workflows for data engineers to set up a test environment. But if you follow my steps below, the process will be easy and smooth.

First, you only need to install three things: Docker Desktop, VS Code, and the Dev Containers extension.

In your testing workflow, Docker creates lightweight, isolated, and repeatable test environments. It allows you to run data infrastructure (for example, databases, data pipelines, and orchestration engines) directly on a local machine or within a Continuous Integration (CI) pipeline. With Docker, you can run your integration and data validation tests consistently across platforms without polluting your local operating system.

Visual Studio Code (VS Code) is the central development environment for writing, debugging, running, and testing automated data pipelines. As a data engineer, you may have used it for some of your projects, or you may be more familiar with PyCharm or IntelliJ IDEA. Personally, I prefer VS Code because of its lightweight architecture, extension ecosystem, and hybrid notebook/script workflow. AI-native editors like Cursor and Windsurf are quickly gaining popularity among developers, which I will discuss later in this article.

I assume you already have Python, Poetry, and Java installed. Open your VS Code terminal and run the following commands to check their versions and make sure they are up to date. If you haven't installed them yet, you can do so from the terminal as well.

python --version

java -version

poetry --version
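If you prefer to check all three at once, a small script can report which tools are on your PATH. This is only a sketch using the standard library. The tool names are taken from the commands above, and on some systems the Python executable is named python3 instead, so treat the list as an assumption to adjust.

```python
import shutil
from typing import Dict, List, Optional

# Executable names assumed from the commands above; adjust for your system.
REQUIRED_TOOLS = ["python", "java", "poetry"]


def locate_tools(tools: List[str]) -> Dict[str, Optional[str]]:
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in locate_tools(REQUIRED_TOOLS).items():
        print(f"{tool}: {path if path else 'NOT FOUND on PATH'}")
```

The script only tells you whether a tool can be found. It does not check versions, so the three commands above are still the way to confirm those.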

The Dev Containers extension lets you use a Docker container as a full-featured development environment. It standardizes the environment across the team and allows you to test data ingestion logic locally without using cloud services. Installing it is straightforward: open the Extensions view in VS Code by pressing Ctrl+Shift+X (Windows/Linux) or Cmd+Shift+X (Mac), search for 'Dev Containers' in the search bar, and click “Install”.

But the Dev Containers extension doesn't know how to build your specific environment on its own. It needs a guide. That guide is the .devcontainer folder: the devcontainer.json file inside it tells the Dev Containers extension:

Which Docker image to download.
Which ports should be forwarded.
Which VS Code extensions should be installed inside the container.
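To make the three items above concrete, here is a minimal sketch that generates such a file from Python. The property names (image, forwardPorts, customizations.vscode.extensions) are standard devcontainer.json fields, but every value is an illustrative assumption: the project name is hypothetical, and the image, port, and extension are examples, not a recommended setup. Spark also needs Java, which this sketch does not provision.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def build_devcontainer_config() -> Dict[str, Any]:
    """Build a minimal devcontainer.json payload (all values are examples)."""
    return {
        "name": "etl-pipeline-tests",  # hypothetical project name
        # Which Docker image to download (example image; pick your own).
        "image": "mcr.microsoft.com/devcontainers/python:3.11",
        # Which ports should be forwarded (4040 is the default Spark UI port).
        "forwardPorts": [4040],
        # Which VS Code extensions should be installed inside the container.
        "customizations": {"vscode": {"extensions": ["ms-python.python"]}},
    }


def write_devcontainer(project_root: Path) -> Path:
    """Write .devcontainer/devcontainer.json under project_root."""
    target = project_root / ".devcontainer" / "devcontainer.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(build_devcontainer_config(), indent=2))
    return target


if __name__ == "__main__":
    # Write into a throwaway directory so the demo never touches a real project.
    print(write_devcontainer(Path(tempfile.mkdtemp())).read_text())
```

In practice you would usually write the JSON by hand or let VS Code generate it. The script form is used here only to keep the example runnable.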

There are two ways to get a .devcontainer folder. If you are new to these tools, you can use VS Code's built-in tooling: when you select a Python or data engineering template, VS Code generates the folder automatically. If you have more experience with this type of project, you can also write it from scratch to meet your team's testing requirements. The .devcontainer folder can be committed and pushed to Git, along with your source code and the source data you plan to test.

To make your life easier, you can clone the Git repository and open that folder with VS Code.

git clone

The last step is to reopen the project in the container. Why is it important? Because when you click “Reopen in Container”, VS Code restarts its backend inside the container: it launches a Docker container and mounts your local project folder directly inside it. Your source code and source data for this ETL pipeline are then accessible from the Docker environment, and you can run your tests safely in an isolated sandbox. Sounds cool? Now your environment is configured and you are ready to start testing your ETL pipelines.

Let the Tests Tell You What the System Does

When I inherit an unfamiliar ETL pipeline, my first question is not: “How does the code work?” Instead, I ask: “What behavior is expected of the system?” Tests often answer that question faster than source code.

Imagine that the company you are joining uses LLMs such as GPT-5.5, Claude 4.6 and Gemini 3 Pro and the finance team wants to track the use of AI across teams.

Pseudo sample data created by the author

The table above shows part of the CSV data that should be ingested. Column names must be normalized by replacing spaces with underscores so that downstream systems can refer to the fields consistently. For example, 'Model Name' should become 'Model_Name'. You are given ingest.py, which defines the column sanitization and data ingestion functions, and ai_cost_ingest.py, which calls these functions on a folder.

# ingest.py
import logging
from typing import List

from pyspark.sql import SparkSession


def sanitize_columns(columns: List[str]) -> List[str]:
    return [column.replace(" ", "_") for column in columns]


def run(spark: SparkSession, ingest_path: str, transformation_path: str) -> None:
    logging.info("Reading text file from: %s", ingest_path)

    # .csv() selects Spark's built-in CSV reader, so no explicit format() is needed;
    # header=True treats the first row as column names.
    input_df = spark.read.option("header", True).csv(ingest_path)

    renamed_columns = sanitize_columns(input_df.columns)

    ref_df = input_df.toDF(*renamed_columns)

    ref_df.write.parquet(transformation_path)

# ai_cost_ingest.py
import logging
import sys

from pyspark.sql import SparkSession

from data_ingestions.ai_cost import ingest

LOG_FILENAME = "project.log"
APP_NAME = "AI_Cost Pipeline: Ingest"

if __name__ == "__main__":
    logging.basicConfig(filename=LOG_FILENAME, level=logging.INFO)
    logging.info(sys.argv)

    if len(sys.argv) != 3:
        logging.warning("Input source and output path are required")
        sys.exit(1)

    spark = SparkSession.builder.appName(APP_NAME).getOrCreate()
    logging.info("Application Initialized: %s", spark.sparkContext.appName)

    # First argument: input CSV path; second argument: output Parquet path.
    input_path = sys.argv[1]
    output_path = sys.argv[2]
    ingest.run(spark, input_path, output_path)

    logging.info("Application Done: %s", spark.sparkContext.appName)
    spark.stop()

First, you need to understand what these functions do. You may ask: “What exactly does sanitize_columns() do? Does it handle leading spaces, trailing spaces, and internal spaces?” With these questions in mind, you write tests like this:

from data_ingestions.ai_cost import ingest

def test_should_sanitize_nothing() -> None:
    no_whitespace_columns = ["Model"]

    actual = ingest.sanitize_columns(no_whitespace_columns)
    expected = no_whitespace_columns
    assert expected == actual

def test_should_sanitize_whitespace_outside() -> None:
    # Leading and trailing spaces are replaced with underscores, not trimmed.
    outer_whitespace_columns = [" Prompt Tokens "]

    actual = ingest.sanitize_columns(outer_whitespace_columns)
    expected = ["_Prompt_Tokens_"]
    assert expected == actual

def test_should_sanitize_whitespace_in_between() -> None:
    inner_whitespace_columns = ["Prompt Tokens"]

    actual = ingest.sanitize_columns(inner_whitespace_columns)
    expected = ["Prompt_Tokens"]
    assert expected == actual

These tests exercise sanitize_columns() directly, without launching Spark or processing files. This is an example of unit testing.

Unit Testing

A unit test validates a small piece of logic in isolation. Unit tests are usually fast, deterministic, and independent of external systems.
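Because unit tests are this cheap, it is easy to cover several cases in one table. The sketch below restates the three cases in a table-driven form and adds one extra case (consecutive spaces) that is my own assumption about an edge worth pinning down. sanitize_columns() is copied here so the example runs without Spark or the project package.

```python
from typing import List, Tuple


def sanitize_columns(columns: List[str]) -> List[str]:
    # Copied from ingest.py so this sketch is self-contained.
    return [column.replace(" ", "_") for column in columns]


# (input columns, expected output) pairs.
CASES: List[Tuple[List[str], List[str]]] = [
    (["Model"], ["Model"]),                      # nothing to sanitize
    ([" Prompt Tokens "], ["_Prompt_Tokens_"]),  # leading and trailing spaces
    (["Prompt Tokens"], ["Prompt_Tokens"]),      # internal space
    (["Model  Name"], ["Model__Name"]),          # extra case: two consecutive spaces
]


def run_cases() -> int:
    """Check every case and return how many were verified."""
    for given, expected in CASES:
        actual = sanitize_columns(given)
        assert actual == expected, f"{given!r}: expected {expected!r}, got {actual!r}"
    return len(CASES)


if __name__ == "__main__":
    print(f"{run_cases()} cases passed")
```

The table records how the function currently behaves. Whether outer spaces should become underscores or be trimmed is a business question the tests only bring to the surface.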

Integration Testing

Unit tests tell you whether a small piece of logic works properly. But they cannot answer this question: “Does the whole pipeline work when all the parts are connected together?”

For a data engineer, this usually means:

Reading files
Starting Spark
Applying transformations
Writing output
Validating the results

To check the whole pipeline, we need integration testing, which captures the behavior of the system end to end. Integration testing is very helpful during onboarding because it defines what the system is supposed to do, regardless of how the implementation evolves over time.

For the AI_cost data ingestion project, you can use integration testing to help ensure that:

Inputs come as CSV files.
Spark is used for data processing.
Column names are sanitized.
Data values remain unchanged.
The output is written in Parquet format.
The complete ingestion workflow succeeds.

import csv
import os
import tempfile
from pathlib import Path
from typing import List, Tuple

from pyspark.sql import SparkSession

from data_ingestions.ai_cost import ingest

def test_should_sanitize_column_names(
    spark_session: SparkSession,
) -> None:

    given_ingest_folder, given_transform_folder = (
        __create_ingest_and_transform_folders()
    )

    input_csv_path = given_ingest_folder + "input.csv"

    csv_content = [
        [
            "Model Name",
            "Prompt Tokens",
            " Completion Tokens "
        ],
        [
            "GPT-5.5",
            "1200",
            "300"
        ],
        [
            "Gemini 3 Pro",
            "900",
            "250"
        ],
    ]

    __write_csv_file(input_csv_path, csv_content)

    ingest.run(
        spark_session,
        input_csv_path,
        given_transform_folder
    )

    actual = spark_session.read.parquet(
        given_transform_folder
    )

    expected = spark_session.createDataFrame(
        [
            ["GPT-5.5", "1200", "300"],
            ["Gemini 3 Pro", "900", "250"]
        ],
        [
            "Model_Name",
            "Prompt_Tokens",
            "_Completion_Tokens_"
        ]
    )

    assert expected.collect() == actual.collect()

Let AI Learn the ETL Pipeline Before Execution

Imagine you are reviewing a non-standard ETL pipeline that contains hundreds or thousands of lines of PySpark code. Understanding the code and writing tests can take hours or days. Today, tools like Cursor, Windsurf, and GitHub Copilot can help speed up this process.

Take Cursor as an example. As an AI assistant, it can analyze the entire repository and generate descriptions of individual modules, functions, and data flows. It can also generate initial versions of unit tests and integration tests. To maximize its productivity, you need to ask the right questions as a data engineer. Here are some sample questions you can ask:

What is the purpose of this ETL function?
What input and output formats does this pipeline expect?
Which functions are responsible for data validation?
Which cases are not yet covered by tests?

AI can suggest test cases, but it cannot determine whether those tests meet the business needs and strategies of the company. Understanding the pipeline, validating assumptions, and reviewing code is still your responsibility. AI accelerates the work rather than replacing engineering judgment. Use it to save time on understanding and testing the ETL pipeline so you can focus on high-value data engineering work like designing data structures, building scalable data platforms, and enabling data-driven decision making.

nimda 3 weeks ago
