
From Data Analyst to Data Engineer: My 12-Month Self-Study Roadmap

Part of me started this journey because data engineering is one of the hottest and highest-paying jobs right now. I won't pretend that wasn't a factor.

But there is more to it than that.

I have been studying data analytics for a long time. SQL, Power BI, Python (Pandas, NumPy, a little Polars), data cleaning, EDA. You name it, I've worked with it. And I truly enjoy it. But somewhere along the way, I became curious about what happens before the data lands on my desk. How did it get there? Who built those pipelines? What does the infrastructure behind all of this look like?

That curiosity planted a seed.

Then AI started doing a lot of what I do, quickly and easily. And it does it well. But it also got me thinking: if AI can handle the analytics, what is my edge? What can I build and understand more deeply? I work as an IT System Analyst, and while I enjoy the work, I realized I wasn't challenging myself as much as I wanted to. I was ready for more.

The final push came from Data With Baraa's video, where he lays out a comprehensive guide to data engineering. Something about seeing it organized and broken down made it feel real and possible. So here I am.

I'm going to study data engineering in public. And this article is the beginning of that journey.

Also, a quick note: I'm not affiliated with Data With Baraa in any way. I'm just sharing my personal journey, and I hope it helps.

Why Data Engineering Specifically

I want to spend a moment here because I think this question deserves a real answer.

Data analysis taught me how to work with data after it has arrived. Clean it, explore it, visualize it, pull insights from it. That skill set is genuinely valuable. But the deeper I went, the more I kept hitting the same wall. The data I was working with had already been collected and delivered by someone else. Someone built the pipeline that brought it to me. Someone decided how it was stored, how it was structured, how often it was refreshed.

I wanted to be that person.

Data engineering lives upstream of analytics. It's about building the systems that make analysis possible in the first place. Data pipelines, storage architecture, workflow orchestration, big data processing. These are the foundations everything else is built on. And honestly, that kind of infrastructure work appeals to me in a way that pure analytics no longer does.

There is also a practical angle. Data engineering roles are consistently ranked among the highest paying in the data industry. And as AI tools get better at automating the analytics layer, the need for people who can build and maintain reliable data infrastructure will only grow. I'd rather build the pipelines than just use them.

And one more thing. The startup I work at doesn't use any of the tools I'm going to learn. Which means every hour I put into this is self-directed. There is no team to learn from, no existing projects to lean on. Just me, the internet, and whatever I can build on my own. That is a challenge I chose on purpose.

Why Am I Doing This In Public?

Writing about what I learn is something I believe in deeply. It forces you to understand something before you explain it. It keeps you accountable. And over time, it creates something a resume alone cannot.

But I'm going to be honest about my fears, because I think that's the point of doing this publicly.

I have shiny object syndrome. There, I said it. I've dabbled in graphic design, animation, writing, marketing, and IT before landing on data. There is always something new and exciting catching my attention. Data engineering could easily be replaced by the next flashy thing on my feed if I'm not intentional about it.

Consistency is another. I work a 9-to-5 where I barely touch the tools I'm about to learn. There is no reinforcement from my environment, no colleague I can bounce Airflow questions off. I'm building this entirely on my own time, outside of my work commitments.

And balance. Three to four hours a day is the goal. Some days that will feel easy. Some days it will feel impossible.

Publishing this journey is my accountability plan. If I go silent, you'll know I've slipped. And I'd rather not slip.

What I'm Starting With

I'm not starting from scratch, which helps. I already have beginner-to-intermediate SQL from my data analysis work, the basics of Python, and some experience with Pandas. That gives me a foundation to build on rather than starting from zero.

Here's the full learning stack, roughly the way I'll tackle it.

1. SQL: Deeper Than Analytics

I know SQL. But analytics SQL and engineering SQL are different animals. I'll dig into query optimization, indexing, working with very large datasets, and writing SQL that's built to perform rather than just to explore. If you've only ever used SQL to pull and filter data, there's a whole other layer to understand.

Why first: Everything in data engineering ultimately touches SQL. Getting sharp here before layering on more complicated tools makes the whole journey easier.
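To make that concrete, here's a rough sketch of the mindset shift, using Python's built-in sqlite3 module and a made-up orders table: not just getting the answer, but adding an index and checking the query plan to see how the answer is computed.

```python
import sqlite3

# Toy in-memory database with a hypothetical orders table, just for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(10_000)],
)

query = "SELECT SUM(amount) FROM orders WHERE customer_id = 42"

# Analytics mindset: run the query, take the number.
print(conn.execute(query).fetchone())

# Engineering mindset: ask how the query runs. Without an index this is a full scan.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# Add an index and check again: the plan switches from a scan to an index search.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```

A few thousand rows in SQLite won't feel slow either way; the point is building the habit of reading query plans before the tables get big.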

2. Python: From Notebooks to Production-Ready

I have the basics. Pandas, NumPy, some Polars. But the Python I've been writing lives mostly in notebooks. It's messy, it's exploratory, it's not built to last. The goal now is to write cleaner, better-organized, reusable code. Functions, modules, error handling, scripting. The kind of Python you can actually put into a pipeline.

Why it matters: Python is the glue that holds many modern data engineering stacks together. Airflow uses it. PySpark is built on top of it. Comfort here is non-negotiable.
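As a rough sketch of the difference I mean (the sales.csv file and its columns are made up), here's notebook-style logic pulled into a small, reusable function with error handling and logging instead of a string of loose cells:

```python
import logging
from pathlib import Path

import pandas as pd

logger = logging.getLogger(__name__)


def load_and_clean(path: str) -> pd.DataFrame:
    """Load a CSV, drop duplicates, and normalize column names."""
    csv_path = Path(path)
    if not csv_path.exists():
        raise FileNotFoundError(f"Input file not found: {csv_path}")

    df = pd.read_csv(csv_path)
    df = df.drop_duplicates()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    logger.info("Loaded %d rows from %s", len(df), csv_path)
    return df


if __name__ == "__main__":
    # "sales.csv" is a placeholder; any CSV would work here.
    cleaned = load_and_clean("sales.csv")
    print(cleaned.head())
```

Nothing fancy, but a function like this can be imported, tested, and dropped into a pipeline, which a notebook cell can't.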

3. Git and GitHub: Version Control Done Right

I'll be honest. My Git knowledge is currently “copy the command, hope it works.” That has to change. Version control matters far more when you work as an engineer than when you only analyze. I'll learn branching, pull requests, and how to manage code properly across projects.

Why it matters: Every project I build from here on goes on GitHub. It's a portfolio, it's a habit, and it's how real teams work.

4. Apache Spark and PySpark: Big Data Processing

This is where things get really interesting. Apache Spark is one of the most widely used engines for large-scale data processing. PySpark is its Python API, which means I can use a language I'm already somewhat comfortable with to work with distributed data at scale.

Jumping from Pandas to Spark is a paradigm shift. Pandas works on a single machine. Spark is designed to run across a cluster. Learning to think in that distributed way is one of the skills that separates data engineers from analysts.
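To give a feel for that shift, here's a minimal PySpark sketch (the file path and column names are placeholders): the same kind of groupby you'd write in Pandas, but expressed against a DataFrame that Spark can distribute across a cluster and that only executes when an action like show() is called.

```python
from pyspark.sql import SparkSession, functions as F

# Local session for practice; in production this would point at a cluster.
spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

# Placeholder path and columns, just to show the shape of the API.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Transformations are lazy: Spark builds a plan but runs nothing yet.
revenue_by_customer = (
    orders
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)

revenue_by_customer.show()  # The action that actually triggers the job
```

The syntax looks close to Pandas, but the lazy evaluation and the cluster underneath are where the new thinking starts.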

Why it matters: If you want to work with big data in a production environment, Spark is almost unavoidable. It shows up constantly in job descriptions and is the core of the Databricks ecosystem I'll be building on.

5. Apache Airflow: Orchestrating Data Pipelines

Data pipelines don't run themselves. You need something to schedule them, monitor them, and handle failures gracefully. This is where workflow orchestration tools come in, and Airflow is my pick.

I considered several options here. Databricks Workflows is great if you're already deep in the Databricks ecosystem. Azure Data Factory makes sense for Azure-heavy environments. But Airflow is free, open source, cloud-agnostic, and widely used across the industry. It also teaches you the core concepts of orchestration in a way that transfers to other tools. Starting with Airflow felt like the right call, especially since I'm trying to keep costs down.
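For a sense of what orchestration looks like, here's a minimal sketch of an Airflow DAG with made-up task names and placeholder functions: two Python tasks wired together so the extract step always runs before the transform step, on a daily schedule. (The schedule argument is Airflow 2.4+ style; older 2.x versions use schedule_interval.)

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull data from a source system.
    print("extracting data")


def transform():
    # Placeholder: clean and reshape the extracted data.
    print("transforming data")


with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Declare the dependency: extract must succeed before transform runs.
    extract_task >> transform_task
```

The scheduler handles retries, timing, and failure alerts; my job is to define the tasks and how they depend on each other.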

Why it matters: Orchestration is what turns a collection of scripts into a real pipeline. Understanding Airflow means understanding how production data pipelines are run.

6. Databricks: The Data Platform

At some point you need to pick a data platform and go deep on it. I'm going with Databricks. It's built on top of Spark, it's in demand, and it has a free Community Edition that lets you practice without paying for cloud credits.

The other options are strong as well. Snowflake is a clean, fast SQL environment that many companies love. BigQuery is Google's fully managed, serverless option and is great if you're on Google Cloud. But Databricks sits at the intersection of big data, machine learning, and data engineering in a way that fits where I want to go. It makes a lot of sense for my goals.

Why it matters: Employers want hands-on platform experience. Depth in one matters more than knowing a little about all of them.

How I'm Structuring the 12 Months

The honest answer is that this could take more than 12 months. And I'm fine with that. I'd rather take 15 months and understand what I'm doing than rush through 12 and come out weak on the fundamentals.

My approach is to go through each skill in turn and not move on until I've built something with what I just learned. Tutorials are fine for guidance, but projects are where the real learning happens. My plan is to document each stage here on Towards Data Science: concepts, projects, frustrations, and wins.

To track progress, I'm using the Notion roadmap from Data With Baraa as my backbone. It breaks each skill into main topics and lets me see where I am without being overwhelmed by the full picture at once.

As for the time commitment, three to four hours a day is the goal. Some of that will be structured learning. Some will be building. Some will be writing about what I just learned, which is its own kind of learning.

What Does Success Look Like

Finding a high-paying data engineering role is a goal. That's true and I won't dress it up.

But beyond that, I want to be an honest voice in this space. Someone who builds things worth talking about, documents the journey without glossing over the hard parts, and maybe makes the path clearer for whoever comes after me.

Learning and writing feed off each other. The portfolio becomes the evidence. The writing becomes the record. That's the idea.

It Starts Today

This article is my first official entry. I'm not waiting until I feel ready or until everything is perfectly planned. I'm starting now, writing as I go, and letting the process be public and ugly.

If you're somewhere on the same path, whether you're an analyst thinking about engineering, someone in IT wondering what's next, or someone trying to build skills that hold value in an AI-accelerated world, follow along.

I think we have a lot to talk about. I'll also share what I learn on my YouTube channel, so feel free to subscribe below and follow along.


This is the first article in an ongoing series documenting my data engineering journey. I'll be posting regularly about my progress, the projects I'm building, and everything I'm learning along the way.

And if you're on the same journey and want access to the Notion template, you can access it here.

Follow along on my journey below.

YouTube

Medium

LinkedIn

Twitter

