Machine Learning

I thought Data Engineering was just Writing Documentation. I was wrong.

The ETL pipeline, I thought I had a good grasp of what data engineering really is. You extract data from somewhere, clean it up, upload it somewhere useful. ETL. Simple enough.

In context, I'm a data analyst trying to transition to a data engineer. I've been documenting that journey publicly, starting with the 12-month self-study guide I put together earlier this year. The latest step in that journey was building my first ETL pipeline from scratch using the GitHub API, which I wrote about here on TDS. That pipe worked. Pulled data, cleaned it, saved it to CSV. I was happy about it.

So I decided to go ahead and make it “production ready” as the internet likes to say. What happened next really surprised me. Not because things were broken, but because of what the brokenness revealed.

Original Pipe

The first pipe was basic, which was fine because it was the point. Extract data from the GitHub API, do a little cleaning, save everything to a CSV file. It worked very well for what it was: a learning activity. But a CSV file and a one-time script is not how data engineering works in the real world. I wanted to find out what the “real world” meant in practice, so I decided to run the pipeline further to see what happened.

Here's the actual complete pipe for anyone who hasn't read the previous article:

import requests
from datetime import datetime, timedelta

url = "
params = {
    "q": "language:python created:>2025-04-22",
    "sort": "stars",
    "order": "desc",
    "per_page": 30
}

response = requests.get(url, params=params)
data = response.json()

import pandas as pd
repos = []
for repo in data['items']:
    repos.append({
        "name": repo['name'],
        "owner": repo['owner']['login'],
        "stars": repo['stargazers_count'],
        "forks": repo['forks_count'],
        "language": repo['language'],
        "description": repo['description'],
        "url": repo['html_url'],
        "created_at": repo['created_at']
    })

df = pd.DataFrame(repos)
df_clean = df.dropna(subset=['description'])
df_clean = df_clean.copy()
df_clean['viral'] = df_clean['stars'].apply(lambda x: 'Yes' if x > 50000 else 'No')
df_clean = df_clean.sort_values('stars', ascending=False).reset_index(drop=True)

df_clean.to_csv('github_trending_repos.csv', index=False)
print("Pipeline complete. File saved.")

It's simple, readable, and effective. But when you try to run it more than once, or come back to it the next day, cracks start to appear.

Wall One: The Pipe Had No Memory

The first development was straightforward. Instead of saving to a CSV file, I uploaded the data to a SQLite database. SQLite is still just a single file, but it behaves like a real database. You can question it, check what is already in it, and build on it well. It felt like a small change. It wasn't like that.

I ran the pipeline once and got 22 repos. Then I ran it a second time without changing anything and checked the database.

Total rows: 44 Unique repos: 22 Duplicates: 22

Honestly, I didn't expect it. I thought it was possible but I never thought it would happen. But I'm glad, because it was the first time I watched my pipe break. And what it revealed was simple but important: the text had no memory. Every time it ran, it would start over completely and blindly paste anything it found. No warning, no error. Just “Pipe line complete” as if everything was fine.

This is where I discovered the concept called idempotency.
Idempotency is a fancy word for simple mindedness. If something has already happened, it should not happen again. In the context of a data pipeline, it means that running your pipeline once or running it ten times should always produce the same result. No extra lines, no duplication, no silent corruption of your data.

The adjustment was straightforward in principle. Before inserting anything into the database, the pipeline now checks to see if that record already exists. If it does, it removes it first, then installs the new version. One small change in thinking, but it completely changes how reliable your pipeline is.

And here's the part that stuck with me: the basic script would never have thought of this on its own. You have to build it intentionally. That is no longer writing. That's engineering.

Wall Two: Data Disappeared Overnight

The second wall was unprofessional and restless.

When I closed the Colab at night and came back the next day, there was this uneasy feeling. I had to run everything again from scratch and hope nothing was broken, even though everything had worked fine the day before. The database I had carefully built had just passed. And I remembered that before this project, I had actually struggled to get my first ETL pipeline file. I spent some time looking for it before I finally found it. That feeling of almost losing your job stays with you.

I knew there had to be a better way. A real pipeline cannot depend on someone to rerun it every morning. The data must live in some heavy space beyond the session.

The fix here was to mount Google Drive directly within Colab and point the database connection there instead of pointing to the temporary Colab environment. One line change:

conn = sqlite3.connect('/content/drive/MyDrive/github_repos.db')

The database now resides on Google Drive. Close the session, restart the runtime, open a completely new notebook. Data is pending.

But this correction reveals something great. If persistent data already needs to think about where things live and how they live, what if you need the pipeline to run automatically every day without you ever touching it?

Wall Three: No One Can Press Running Forever

The third wall was the one that really pleased me, surprised me.

Even if the duplication problem is solved and the database is safely in Google Drive, someone still has to open a notebook and press run. Colab is not a server. It is an interactive environment. It doesn't wake up at 3am, pull new data from the GitHub API, and go back to sleep. That's not what it was built for.

And when I thought about it from a real-world perspective, that clicked immediately. In a real company, no one is sitting there awake at midnight waiting to run a script. The pipeline should work for itself. It's on schedule. Honestly. Whether someone is watching or not.

This is where tools like Apache Airflow, Prefect, and cloud-based cron jobs come into play. These are not Python scripts. The systems reside on servers, manage schedules, catch failures, send alerts when something goes wrong, and keep a history of every run.

Planning is the concept I'm most excited to dig into next, because that's where data engineering starts to feel like real infrastructure work.

Navigating Between Major Changes

Let me walk through the three actual changes I made to the pipeline and what each one does.

1. Converting CSV to SQLite

# Before
df_clean.to_csv('github_trending_repos.csv', index=False)

# After
conn = sqlite3.connect('github_repos.db')
df_clean.to_sql('repos', conn, if_exists='append', index=False)
conn.close()

Saving to CSV is ideal for one-time analysis. But CSV is just a text file. You can't query it easily, and it doesn't scale well as your data grows. SQLite is a real database, which means you can run SQL queries on it, check what's already inside it, and build on top of it. Same simplicity, much more power.

2. To fix the duplicate problem

cursor.execute('''
    DELETE FROM repos WHERE url IN (SELECT url FROM repos_temp)
''')
cursor.execute('''
    INSERT INTO repos SELECT * FROM repos_temp
''')

This is an idempotency correction. Before installing anything, the pipeline checks if that repo already exists in the database using its URL as a unique identifier. If it does, it removes it first, then installs the new version. So no matter how many times the pipeline runs, you always end up with clean, non-duplicated data.

3. Persistent data on Google Drive

# Before
conn = sqlite3.connect('github_repos.db')

# After
conn = sqlite3.connect('/content/drive/MyDrive/github_repos.db')

This is one line but it changes everything. Instead of saving the database to Colab's temporary location where it disappears when the session expires, we save it directly to Google Drive. Close the notebook, restart the runtime, come back tomorrow. Your data is still there waiting for you.

Here is the full development pipeline that includes all three changes:

import requests
import pandas as pd
import sqlite3
from datetime import datetime, timedelta

# Extract
yesterday = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d')

url = "
params = {
    "q": f"language:python created:>{yesterday}",
    "sort": "stars",
    "order": "desc",
    "per_page": 30
}

response = requests.get(url, params=params)
data = response.json()

# Transform
repos = []
for repo in data['items']:
    repos.append({
        "name": repo['name'],
        "owner": repo['owner']['login'],
        "stars": repo['stargazers_count'],
        "forks": repo['forks_count'],
        "language": repo['language'],
        "description": repo['description'],
        "url": repo['html_url'],
        "created_at": repo['created_at']
    })

df = pd.DataFrame(repos)
df_clean = df.dropna(subset=['description'])
df_clean = df_clean.copy()
df_clean['viral'] = df_clean['stars'].apply(lambda x: 'Yes' if x > 50000 else 'No')
df_clean = df_clean.sort_values('stars', ascending=False).reset_index(drop=True)

# Load
conn = sqlite3.connect('/content/drive/MyDrive/github_repos.db')
cursor = conn.cursor()

cursor.execute('''
    CREATE TABLE IF NOT EXISTS repos (
        name TEXT, owner TEXT, stars INTEGER, forks INTEGER,
        language TEXT, description TEXT, url TEXT,
        created_at TEXT, viral TEXT, loaded_at TEXT
    )
''')

df_clean['loaded_at'] = datetime.now().strftime('%Y-%m-%d')
df_clean.to_sql('repos_temp', conn, if_exists='replace', index=False)

cursor.execute('''
    DELETE FROM repos WHERE url IN (SELECT url FROM repos_temp)
''')
cursor.execute('''
    INSERT INTO repos SELECT * FROM repos_temp
''')

conn.commit()
conn.close()
print("Pipeline complete. Duplicates handled.")

So What Is Data Engineering, Really?

If you had asked me what data engineering was after building my first ETL pipeline, I would have said it was mostly about writing documentation. Extract, convert, upload. Repeat. That's what it looked like from the outside.

But after pushing that pipe forward and watching it break in three different ways, I think about it differently now. Data engineering is about building reliable systems, not just functional documentation. There is a difference. The script does what you say, and when you say it. The system handles failures, remembers what it has already done, persists data beyond a single session, and runs on a schedule without anyone watching.

Powerlessness, persistence, planning. None of these concepts came up while I was using my pipe once in a notebook. They only show up when I try to make it work as a reality.

And in a real company, you can't afford this wrong. The data your pipeline produces is used to make decisions. If it's full of duplicates, or disappears overnight, or only works when someone remembers to press a button, that's not a data pipeline. That is an obligation.

I still have a lot to learn. Planning the next wall I walk. But going into it now I know that data engineering was never just about writing documentation. I had to break a few things to figure that out.

This is an ongoing series on data engineering. Follow along as I document every step of the journey, including the parts that go wrong.

Connect with me on LinkedIn, YouTube, and Twitter.

Source link

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button