I Built My First ETL Pipeline as a Complete Beginner. Here's the Way.

nimda May 25, 2026

0 9 6 minutes read

I Built My First ETL Pipeline as a Complete Beginner. Here's the Way.

of my data engineering journey series. In part one, I shared my 12-month guide to transitioning from data analyst to data engineer. This is where the real building begins.

When I published my first article about my data engineering journey, something unexpected happened. People are happy about it. They were strangers who came and said they were happy to follow. That was good.

But it came with pressure.

Suddenly this wasn't just my goal that I could leave behind when things got tough. People were watching. People were in the same boat. And that accountability is, frankly, part of why you're reading this right now.

So I had to move. And like anyone starting a new career, the first thing I did was look for resources. There are tons of online tutorials for data engineering. YouTube videos, tutorials, written guides. More than you can handle.

But I couldn't just apply the theory. I needed to build something. Something real, with real data, that finally worked.

So I closed the tutorial and opened the Google Colab notebook instead. I found the GitHub API documentation and decided to build my first ETL pipeline from scratch. No hand holding. Just me, some Python, and a goal.

This article is that experience fully documented. The code, the confusion, the small victories, and what I learned by doing.

First, what is ETL?

Before I get into what I built, let me quickly explain what ETL means because I had to look this up for myself not long ago.

ETL stands for Extract, Transform, Load. It is one of the most important concepts in data engineering.

Take it out it means going somewhere to get data. API, database, website, file. You extract raw information from the source.
Convert means cleaning and shaping that data. Removing bad rows, adding new columns, rearranging them to make them really useful.
Upload it means saving the cleaned data somewhere. Database, data warehouse, simple CSV file.

That's all. Those three steps, done in sequence, are the data pipeline. Everything else in data engineering, Airflow, Spark, Databricks, are very complex ways to do those three same things at scale.

I'm at the beginning of my road map, so I'll keep it simple. Pure Python, no orchestration tools yet. But the nature of the problem is the same.

The one that doesn't exist

I extracted data from the GitHub API, mainly the most starred Python repositories created in the last 30 days. Then I cleaned it up, added a new column, and saved the output as a CSV file.
It's easy. Of course. Totally mine.

Here's how it went.

Step 1: Extract

The first thing I had to do was figure out how to talk to the GitHub API. An API is a door that a company or platform opens so that developers can request data from it programmatically, without having to copy and paste anything.

GitHub has a free, public API. No account or paid plan is required for basic search.

Here is the code I wrote to extract the data:

import requests

url = "

params = {
    "q": "language:python created:>2025-04-22",
    "sort": "stars",
    "order": "desc",
    "per_page": 30
}

response = requests.get(url, params=params)
data = response.json()

print(response.status_code)
print(data.keys())

I will tell the truth. This episode confused me at first. I requests the library was new to me. I params dictionary about that q the syntax felt strange. I don't immediately know what it is .json() I did it because I needed to.

Let me backtrack a bit.

requests.get() the way you knock on GitHub's door and ask for something. I url is the address of what you are asking for. I
params the dictionary is the specific question you are asking. In this case: “give me Python repos, sorted by star, created after April 22nd, showing 30 results.”
.json() converts the GitHub response from raw text into a Python dictionary you can actually work with.

When I ran it, I got this:

200 
dict_keys(['total_count', 'incomplete_results', 'items'])

I 200 it means success. That's the internet's way of saying “your request worked.” If you see 403 or 404, something went wrong.
The dictionary has three keys. total_count tells you how many repos matched the search. incomplete_results tells you if GitHub should cut anything short. Again items that's where the real data resides.

Then I ran the second block to peek inside:

print("Total matches on GitHub:", data['total_count'])
print("Repos returned:", len(data['items']))

first_repo = data['items'][0]
print("nFirst repo name:", first_repo['name'])
print("Stars:", first_repo['stargazers_count'])
print("Language:", first_repo['language'])
print("URL:", first_repo['html_url'])

Output:

Total matches on GitHub: 9228201
Repos returned: 30

First repo name: skills
Stars: 139136
Language: Python
URL:

The first result was the Anthropic repo with 139k stars. Real data. Live. I was pulled by the code I wrote.

Extract is done.

Step 2: Convert

Now I had 30 repos sitting in the Python directory, each a nested dictionary with a bunch of fields. Much that I didn't need. The transformation step is where you take that raw, messy data and shape it into something clean and meaningful.

First I extracted the fields I only cared about and loaded them into a Pandas data frame:

import pandas as pd

repos = []

for repo in data['items']:
    repos.append({
        "name": repo['name'],
        "owner": repo['owner']['login'],
        "stars": repo['stargazers_count'],
        "forks": repo['forks_count'],
        "language": repo['language'],
        "description": repo['description'],
        "url": repo['html_url'],
        "created_at": repo['created_at']
    })

df = pd.DataFrame(repos)
df.head()

Seeing that data frame appear was the perfect “wow” moment. I went from a wall of JSON to a clean, readable table with labeled columns in a few rows.

After that I made three changes:

# Drop rows where description is missing
df_clean = df.dropna(subset=['description'])

# Add a viral flag for repos with over 50k stars
df_clean = df_clean.copy()
df_clean['viral'] = df_clean['stars'].apply(lambda x: 'Yes' if x > 50000 else 'No')

# Sort by stars descending
df_clean = df_clean.sort_values('stars', ascending=False).reset_index(drop=True)

print("Before cleaning:", len(df))
print("After cleaning:", len(df_clean))

Output:

Before cleaning: 30 
After cleaning: 29

One repo had no definition and crashed. The virus column appeared clean. The data is now sorted and organized.
The Transform is done.

Step 3: Upload

The last step. Take clean data and save it somewhere. I kept this simple and uploaded it to a CSV file:

df_clean.to_csv('github_trending_repos.csv', index=False)

print("Pipeline complete. File saved.")
print(f"{len(df_clean)} repos loaded into github_trending_repos.csv")

Output:

Pipeline complete. File saved.
29 repos loaded into github_trending_repos.csv

I extracted the file and opened it. A clean spreadsheet with 29 rows and 9 columns. Real GitHub data, configured and saved with a pipeline I built from scratch.

Load done.

How this really felt

Before this, whenever I wanted data to work with, I would look for a public data set that someone had already cleaned and uploaded. Kaggle, Google Dataset Search, anywhere. I have always been a consumer of data prepared by others.

This changed something for me.

The moment I realized I could just point Python to an API I was curious about and extract live data myself, the possibilities felt completely different. I am not limited to existing data sets. I can create a pipeline that creates a dataset.

That's a different kind of power. And it's one of the things that drew me to data engineering in the first place.

What's next

This pipe is simple in design. I'm at the beginning of my road map and won't pretend to be using Airflow or Spark yet. But the premise is real. Extract, Convert, Upload. It works. I built it. I understand it.

The next step is to make it stronger. Schedule it to run every day. Save the output to a SQLite database instead of a plain CSV. Start tracking how repos are trending over time.

And finally, organize everything with Airflow. But that's a future topic.

At the moment, the most important thing I have proven to myself is that building teaches you things you will never see. I spent weeks in the teaching world and I didn't budge. I spent one afternoon building it, and I understand ETL better than any video made it sound.

Stop watching. Start building.

This is part two of my ongoing data engineering series. Follow along as I document every step of the journey, including the parts that go wrong. Feel free to check out my in-depth take on ETL on my YouTube channel below.

Connect with me on LinkedIn, YouTube, and Twitter.

Source link

nimda May 25, 2026

0 9 6 minutes read