I Tried to Schedule My ETL Pipeline. Here's What I Didn't Expect.



I mentioned that scheduling was the next wall I would walk through.

So here I am, I guess, heading toward it.

But before I get into what happened, let me provide some context for anyone stumbling across this for the first time.

I am a systems analyst who decided to switch to data engineering. Instead of just taking courses and collecting certifications, I decided to learn by building things and writing about them publicly. Every article in this series documents something I actually built, decisions I made, things that broke, and what I learned from them.

The first article was my 12-month self-study guide, where I laid out a plan for how I was going to handle this transition. The second covered building my first ETL pipeline from scratch using the GitHub API, as a complete beginner. In the third, I took that pipeline and made it more production-ready by adding SQLite storage, duplicate handling, and Google Drive persistence, all within Google Colab.

This is the fourth article. It picks up where the third left off.

I expected to spend most of my time choosing a scheduling tool and configuring it. What I didn't expect was that before I could think about scheduling, I had to deal with something more basic. My pipeline couldn't run outside Google Colab. And until that changed, no scheduler in the world could help me.

This is the story of what really happened.

First Wall: My Pipeline Lived in Colab

Before I even got to scheduling, I wanted to understand exactly what I would need to automate my pipeline. So I took a good look at my code for the first time with that question in mind.

Here's what the load step looked like:

conn = sqlite3.connect('/content/drive/MyDrive/github_repos.db')

That path, /content/drive/MyDrive/, is only available within Google Colab. It's the mount point Colab provides when you connect your Drive to the notebook. Without Colab, that path doesn't exist. If any scheduler tries to run this script, it will crash on the spot.

Interestingly, my code had no google.colab imports. There were no Colab-specific libraries. Just one hard-coded path that I had written without really thinking about it. The dependency was in the path, not in the code.

This was the first thing I did not expect. I thought the challenge would be learning the scheduling tool. Instead, the first lesson was that my environment was part of my pipeline, and I hadn't noticed.

The fix was easy. Instead of hard-coding the Colab path, I made the database path configurable with an environment variable:
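To make the failure concrete, here is a minimal sketch of what happens when SQLite is asked to open a database in a folder that doesn't exist. The helper name can_open_db is hypothetical and not part of the original pipeline; it only wraps sqlite3.connect and reports whether the call succeeded.

```python
import sqlite3


def can_open_db(path):
    """Return True if SQLite can open (or create) a database at `path`."""
    try:
        conn = sqlite3.connect(path)
    except sqlite3.OperationalError:
        # Raised when the file cannot be opened, e.g. the parent folder is missing.
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # On a machine without the Colab Drive mount, this folder normally
    # does not exist, so the connection attempt fails.
    print(can_open_db("/content/drive/MyDrive/github_repos.db"))
```

In the original script there is no try/except around the connection, so the same error would stop the run immediately.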

import os
import sqlite3

# Read the database path from the environment; fall back to a local file.
DB_PATH = os.environ.get('DB_PATH', 'github_repos.db')
conn = sqlite3.connect(DB_PATH)

Now the script uses whatever path is set in the environment. If nothing is set, it falls back to creating a local github_repos.db file in the folder the script is run from. One change, and the pipeline was no longer bound to Colab.
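One possible extension of this idea, sketched below, is to also create the parent folder when a custom path is supplied, so a fresh machine doesn't fail on the first run. The function name open_database and its parameters are assumptions made for illustration; the original script only uses the two lines shown above.

```python
import os
import sqlite3
from pathlib import Path


def open_database(env_var="DB_PATH", default="github_repos.db"):
    """Open the SQLite database at the path named by an environment variable."""
    db_path = Path(os.environ.get(env_var, default))
    # Create the parent folder if needed so a custom path works on a clean machine.
    db_path.parent.mkdir(parents=True, exist_ok=True)
    return sqlite3.connect(str(db_path))
```

With this in place, setting DB_PATH to something like data/github_repos.db before running the script would create the data folder on demand, while leaving the default behavior unchanged.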

Running It Without Colab First

Before setting up any scheduler, I wanted to make sure the script ran on its own. So I saved it as pipeline.py and built a requirements.txt with the two libraries it needs:

requests
pandas

And ran it from my terminal:

python pipeline.py

It printed:

Pipeline complete. Duplicates handled.

And a file called github_repos.db appeared in my folder. The same pipeline I had been using in Colab now worked as a plain Python script, anywhere.

That felt like a bigger deal than I expected. Not because the change was complicated; it wasn't. But because I realized I had been thinking of my pipeline as a notebook, when what I actually had was a script that happened to live inside one.

Choosing a Scheduling Tool

At this point I had a standalone script. Now I needed something to run it on a schedule.

I looked at a few options. APScheduler lets you define schedules within your Python code, which run while the process is alive but stop when you close your terminal. That's not really scheduling, that's just a loop. Airflow is the industry standard for orchestrating pipelines, but it requires a server, a metadata database, and a web interface. That's a lot of infrastructure for where I am right now.

GitHub Actions sits in the middle. It's free, it runs on GitHub's servers, the schedule is defined in code, and it doesn't require me to maintain any infrastructure. The tradeoff is that it's designed for CI/CD workflows, not pipeline orchestration, so it has limitations around complex dependencies and monitoring. But for a pipeline at my stage, it's a viable option.

I also want to be honest: tools like Airflow exist for a reason. When the pipeline grows, when you have dependencies between tasks, when you need visibility into what's running and what's failed, you need proper orchestration. GitHub Actions is not that. But it's a good first step, and understanding where its limits are is part of learning what those heavier tools solve.
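To illustrate the "just a loop" point, here is a rough sketch of an in-process schedule written with the standard library only. This is not APScheduler's API; run_every and its parameters are hypothetical names, and the sleep function is injectable purely so the loop can be exercised without waiting.

```python
import time


def run_every(interval_seconds, job, max_runs=None, sleep=time.sleep):
    """Call `job` repeatedly, pausing `interval_seconds` between calls."""
    runs = 0
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        # Only pause if another run is still to come.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```

A call such as run_every(86400, job) with no max_runs would block the terminal forever, and the schedule disappears the moment that process ends, which is the limitation described above.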

Setting up GitHub Actions

GitHub Actions works with workflow files, which are YAML files that you place in a specific folder in your repository. The folder structure looks like this:

github-etl/
├── .github/
│   └── workflows/
│       └── schedule.yml
├── pipeline.py
└── requirements.txt

Here is the full workflow file I created:

name: Run ETL Pipeline

on:
  schedule:
    - cron: '0 9 * * *'
  workflow_dispatch:

jobs:
  run-pipeline:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run pipeline
        run: python pipeline.py

Let me go through what each part does.

cron: '0 9 * * *' is the schedule itself. Cron is a time-based task scheduling format that has existed in Unix systems for decades. The five fields represent the minute, hour, day of the month, month, and day of the week. So 0 9 * * * means: at minute 0 of the 9th hour, every day of the month, every month, every day of the week. In other words, 9am UTC every day.
workflow_dispatch adds a manual trigger. This means you can also start the workflow with the click of a button on GitHub, without having to wait for the scheduled time. This is useful for testing.
runs-on: ubuntu-latest tells GitHub to spin up a fresh Linux machine for each run. Every time the workflow starts, GitHub creates a clean environment, installs your dependencies, runs your script, and tears everything down. There is no persistent machine sitting somewhere running your code. It is ephemeral.
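The five cron fields described above can be sketched in Python as follows. This is a deliberately simplified matcher written for illustration, not how GitHub evaluates schedules: it supports only '*' and plain integers (no ranges, lists, or steps, and day of week 0 to 6 with Sunday as 0), and the names parse_cron and cron_matches are hypothetical.

```python
from datetime import datetime

FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expr):
    """Split a five-field cron expression into a dict of field name -> text."""
    parts = expr.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(FIELD_NAMES, parts))


def cron_matches(expr, moment):
    """Check whether `moment` matches `expr` (supports only '*' and integers)."""
    fields = parse_cron(expr)
    actual = {
        "minute": moment.minute,
        "hour": moment.hour,
        "day_of_month": moment.day,
        "month": moment.month,
        # cron counts Sunday as 0; Python's weekday() counts Monday as 0.
        "day_of_week": (moment.weekday() + 1) % 7,
    }

    def ok(name):
        return fields[name] == "*" or int(fields[name]) == actual[name]

    if not (ok("minute") and ok("hour") and ok("month")):
        return False
    dom_set = fields["day_of_month"] != "*"
    dow_set = fields["day_of_week"] != "*"
    if dom_set and dow_set:
        # Standard cron fires when either day field matches if both are restricted.
        return ok("day_of_month") or ok("day_of_week")
    return ok("day_of_month") and ok("day_of_week")
```

For example, cron_matches("0 9 * * *", datetime(2024, 1, 1, 9, 0)) evaluates to True, while the same expression checked against 9:01 does not match. The datetime here is treated as already being in UTC, since that is the timezone the workflow schedule uses.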

The steps are straightforward. Checkout code pulls your code from the repository onto the runner. Set up Python installs the version you specify. Install dependencies runs pip install -r requirements.txt. Then Run pipeline executes your script.

What happened when I ran it

After pushing the workflow file to GitHub, I went to the Actions tab in my repository and triggered it manually using the Run workflow button that workflow_dispatch enables.

It ran. Twenty-seven seconds from start to finish. The pipeline pulled data from the GitHub API, transformed it, and loaded it into SQLite, all on a GitHub server, without me having to do anything after clicking a button.

I got one warning on the first run:

Node.js 20 actions are deprecated...

This was because I had used older versions of the checkout and setup-python actions. The fix was updating actions/checkout@v3 to actions/checkout@v4 and actions/setup-python@v4 to actions/setup-python@v5. After that, the run completed cleanly.

What I actually learned

Going into this, I thought scheduling was about choosing the right tool. What I found was that scheduling forced me to think about something I hadn't really thought about before: portability.

A pipeline that only works in one place is not really a pipeline. It's a platform-bound script. Getting it scheduled meant making it portable first, and making it portable meant understanding what it really depended on.

The hard-coded path was trivial to fix. But catching it changed the way I think about writing pipeline code going forward. Every time I write a path or an assumption or an environment-specific value, I now ask whether that thing will exist outside of the context in which I'm building it.

Another thing I've learned is that scheduling and orchestration are separate problems. GitHub Actions handles scheduling well. It doesn't handle things like retrying failed runs with backoff, alerting when something goes wrong, visualizing pipeline dependencies, or managing dependencies between multiple pipelines. Those are orchestration problems, and they are what tools like Airflow are designed to solve.
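That question can be partly automated. As a hedged sketch, the snippet below scans Python source for string literals that start with an environment-specific prefix such as /content/. The function find_hardcoded_paths and its default prefix list are assumptions made for illustration, and a check like this only catches literal strings, not paths assembled at runtime.

```python
import ast


def find_hardcoded_paths(source, prefixes=("/content/",)):
    """Return (line, value) pairs for string literals starting with a prefix."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.startswith(tuple(prefixes)):
                hits.append((node.lineno, node.value))
    return sorted(hits)
```

Running it over the text of pipeline.py before committing would flag the Colab path from the original load step while ignoring a relative name like github_repos.db.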
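Retrying a failed run is one of the gaps listed above. For a sense of what an orchestrator does on your behalf, here is a minimal sketch of retrying a task with increasing delays between attempts (often called exponential backoff). The name retry_with_backoff and its parameters are hypothetical; this is not Airflow's API, just the underlying idea in plain Python.

```python
import time


def retry_with_backoff(task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run `task`, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: let the failure surface.
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

A script could wrap its API call this way, but it would still lack the alerting and dependency tracking that a real orchestrator provides.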

I'm not there yet. But now I understand why those tools exist in a way that I didn't before.

What's Next

The pipeline now runs daily at 9am UTC. Data is being collected. And I'm starting to notice something: when you have a pipeline that runs every day, you start to care about the data it produces in a different way.

Are all records clean? Are there repos coming in with missing fields? Does the viral flag really mean anything, if I defined it in a way that makes almost everything a "No"?

Those are data quality questions. And they are the next wall I'll walk through.

This is part of my ongoing series documenting my transition from systems analyst to data engineer. If you've been following along, thank you. If this is your first article in the series, the previous ones are linked below.

From Data Analyst to Data Engineer: My 12-Month Self-Study Roadmap

I Built My First ETL Pipeline as a Complete Beginner. Here's the Way.

I thought Data Engineering was just Writing Documentation. I was wrong.

Connect with me on LinkedIn, YouTube, and Twitter.

nimda 1 week ago