Docker for Python & Data Projects: A Beginner's Guide

# Introduction
Python and data projects have a dependency problem. Between Python versions, virtual environments, system-level packages, and operating system differences, getting someone else's code to work on your machine can sometimes take longer than understanding the code itself.
Docker solves this by packaging your code and its entire environment (the Python version, dependencies, system libraries) into a single artifact called an image. From one image you can start identical containers on your laptop, a colleague's machine, and a cloud server. You stop debugging environment issues and start shipping.
In this article, you will learn Docker through practical examples with a focus on data projects: containerizing a script, serving a machine learning model with FastAPI, building a multi-service pipeline with Docker Compose, and scheduling jobs with a cron container.
# What is required
Before working with the examples, you will need:
- Docker and Docker Compose installed on your operating system. Follow the official installation guide for your platform.
- Familiarity with the command line and Python.
- Some practice writing a Dockerfile, building an image, and running a container from that image.
You don't need deep knowledge of Docker to follow along; each example explains what is happening as it goes.
# Containerizing a Python Script with Pinned Dependencies
Let's start with the most common use case: you have a Python script and a requirements.txt, and you want them to run reliably anywhere.
We will create a data cleaning script that reads the raw sales CSV file, removes duplicates, fills in missing values, and writes the cleaned version to disk.
// Project Structure
The project is organized as follows:
```
data-cleaner/
├── Dockerfile
├── requirements.txt
├── clean_data.py
└── data/
    └── raw_sales.csv
```
// Writing the Script
Here is the data cleaning script, which uses pandas to do the heavy lifting:
```python
# clean_data.py
import pandas as pd

INPUT_PATH = "data/raw_sales.csv"
OUTPUT_PATH = "data/cleaned_sales.csv"

print("Reading data...")
df = pd.read_csv(INPUT_PATH)
print(f"Rows before cleaning: {len(df)}")

# Drop duplicate rows
df = df.drop_duplicates()

# Fill missing numeric values with the column median
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())

# Fill missing text values with 'Unknown'
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna("Unknown")

print(f"Rows after cleaning: {len(df)}")
df.to_csv(OUTPUT_PATH, index=False)
print(f"Cleaned file saved to {OUTPUT_PATH}")
```
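If you don't have a raw_sales.csv on hand, a few lines of standard-library Python can generate a small one to test with. The column names here are made up for illustration; the cleaning script works with whatever columns your file has:

```python
import csv
import os

os.makedirs("data", exist_ok=True)

# A tiny dataset with a duplicate row and missing values,
# so every cleaning step has something to do
rows = [
    {"order_id": 1, "region": "North", "revenue": "120.5"},
    {"order_id": 1, "region": "North", "revenue": "120.5"},  # duplicate
    {"order_id": 2, "region": "South", "revenue": ""},       # missing number
    {"order_id": 3, "region": "", "revenue": "98.0"},        # missing text
]

with open("data/raw_sales.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["order_id", "region", "revenue"])
    writer.writeheader()
    writer.writerows(rows)
```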
// Pinning Dependencies
Pinning exact versions is important. Without it, pip install pandas may install different versions on different machines; pinned versions ensure that everyone gets the same behavior. You can specify exact versions in the requirements.txt file like this:
```
pandas==2.2.0
openpyxl==3.1.2
```
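If you want to verify at runtime that an environment actually matches your pins, the standard library is enough. This is a sketch; the helper name check_pins is my own, not part of any library:

```python
from importlib import metadata

def check_pins(requirements_text):
    """Return a (name, pinned, installed) tuple for every mismatch."""
    mismatches = []
    for line in requirements_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that isn't an exact pin
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, pinned = line.split("==")
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = "not installed"
        if installed != pinned:
            mismatches.append((name, pinned, installed))
    return mismatches

# An empty list means every pin matches the installed version
print(check_pins("pandas==2.2.0\nopenpyxl==3.1.2"))
```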
// Defining a Dockerfile
This Dockerfile builds a small image containing the cleaning script and its dependencies:
```dockerfile
# Use a slim Python 3.11 base image
FROM python:3.11-slim

# Set the working directory inside the container
WORKDIR /app

# Copy and install dependencies first (for layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the script into the container
COPY clean_data.py .

# Default command to run when the container starts
CMD ["python", "clean_data.py"]
```
A few things are worth explaining here. We use python:3.11-slim instead of the full Python image because it's much smaller and leaves out packages you don't need.
We copy requirements.txt before copying the rest of the code, and this is intentional. Docker builds images in layers and caches each one. If you only change clean_data.py, Docker will not re-install all your dependencies on the next build; it reuses the cached pip layer and skips straight to copying your updated script. That small ordering decision can save you minutes of rebuild time.
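Layer caching works best when the build context stays small and stable. A .dockerignore file next to the Dockerfile keeps data files and local clutter out of the build entirely; the entries below are typical examples, so adjust them for your project:

```
# .dockerignore
data/
__pycache__/
*.pyc
.git/
.venv/
```

Excluding data/ also keeps the build context small, so docker build doesn't upload your data sets to the Docker daemon on every build.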
// Building and Running
Build the image, then run the container with your local data folder mounted:
```shell
# Build the image and tag it
docker build -t data-cleaner .

# Run it, mounting your local data/ folder into the container
docker run --rm -v $(pwd)/data:/app/data data-cleaner
```
The -v $(pwd)/data:/app/data flag mounts your local data/ folder at /app/data inside the container. This is how the script reads your CSV and how the cleaned output gets written back to your machine. Nothing is baked into the image, and the data stays on your file system.
The --rm flag automatically removes the container once it finishes. Since this is a one-off script, there is no reason to keep a stopped container lying around.
# Serving a Machine Learning Model with FastAPI
You've trained a model and want to make it available over HTTP so other services can send data and get predictions back. FastAPI works well for this: it's fast, lightweight, and handles input validation through Pydantic.
// Project Structure
The project separates the model artifact from the application code:
```
ml-api/
├── Dockerfile
├── requirements.txt
├── app.py
└── model.pkl
```
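The guide assumes model.pkl already exists; in practice it would come from training, for example a scikit-learn regressor dumped with pickle. If you just want to exercise the API, you can pickle a minimal stand-in that exposes the same predict() interface. One caveat with a custom class like this: the class must also be importable wherever the pickle is loaded, since pickle stores classes by reference. Real scikit-learn models don't hit this problem because the library is installed in the image:

```python
import pickle

class StubModel:
    """Placeholder exposing the predict() interface app.py expects."""
    def __init__(self, value):
        self.value = value

    def predict(self, rows):
        # Return one constant prediction per input row
        return [self.value for _ in rows]

with open("model.pkl", "wb") as f:
    pickle.dump(StubModel(1234.5), f)
```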
// Writing the Application
The following application loads the model once at startup and exposes a /predict endpoint:
```python
# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pickle

app = FastAPI(title="Sales Forecast API")

# Load the model once at startup
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class PredictRequest(BaseModel):
    region: str
    month: int
    marketing_spend: float
    units_in_stock: int

class PredictResponse(BaseModel):
    region: str
    predicted_revenue: float

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest):
    try:
        features = [[
            request.month,
            request.marketing_spend,
            request.units_in_stock,
        ]]
        prediction = model.predict(features)
        return PredictResponse(
            region=request.region,
            predicted_revenue=round(float(prediction[0]), 2),
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```
The PredictRequest class handles input validation. If someone sends a request with a missing field, or a string where a number is expected, FastAPI rejects it with an explicit error message before your model code ever runs. The model is loaded once at startup, not on every request, which keeps response times fast.
The /health endpoint is a small but important addition: Docker, load balancers, and cloud platforms use it to check whether your service is up and running.
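Since the app already exposes /health, you can let Docker poll it automatically. This HEALTHCHECK instruction is a sketch you could append to the Dockerfile; it uses Python's own urllib so no extra tools are needed in the slim image, and the intervals are illustrative:

```dockerfile
# Mark the container unhealthy if /health stops responding
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
```

With this in place, docker ps shows the container as healthy or unhealthy, and orchestrators can restart it automatically.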
// Defining a Dockerfile
This Dockerfile bakes the model directly into the image so that the container is self-contained:
```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model and the app together
COPY model.pkl .
COPY app.py .

EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```
The model.pkl file is baked into the image at build time, which means the container is self-contained: you don't need to mount anything when you run it. The --host 0.0.0.0 flag tells Uvicorn to listen on all network interfaces inside the container, not just localhost. Without it, you would not be able to reach the API from outside the container.
// Building and Running
Build the image and start the API server:
```shell
docker build -t ml-api .
docker run --rm -p 8000:8000 ml-api
```
Test it with curl:
```shell
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"region": "North", "month": 3, "marketing_spend": 5000.0, "units_in_stock": 320}'
```
# Building a Multi-Service Pipeline with Docker Compose
Real data projects rarely involve a single process. You may need a database, a script that loads data from it, and a dashboard that it reads from — all working together.
Docker Compose allows you to define and run multiple containers as a single application. Each service has its own container, but they all share a private network so they can talk to each other.
// Project Structure
The pipeline puts each service in its own subdirectory:
```
pipeline/
├── docker-compose.yml
├── loader/
│   ├── Dockerfile
│   ├── requirements.txt
│   └── load_data.py
└── dashboard/
    ├── Dockerfile
    ├── requirements.txt
    └── app.py
```
// Defining the Compose File
This configuration file declares all three services and wires them together with health checks and a shared DATABASE_URL environment variable:
```yaml
# docker-compose.yml
version: "3.9"

services:
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: analytics
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U admin -d analytics"]
      interval: 5s
      retries: 5

  loader:
    build: ./loader
    depends_on:
      db:
        condition: service_healthy
    environment:
      DATABASE_URL: postgresql://admin:secret@db:5432/analytics

  dashboard:
    build: ./dashboard
    depends_on:
      db:
        condition: service_healthy
    ports:
      - "8501:8501"
    environment:
      DATABASE_URL: postgresql://admin:secret@db:5432/analytics

volumes:
  pgdata:
```
// Writing the Loader Script
This script briefly waits for the database, then loads the CSV into the sales table using SQLAlchemy:
```python
# loader/load_data.py
import os
import time

import pandas as pd
from sqlalchemy import create_engine

DATABASE_URL = os.environ["DATABASE_URL"]

# Give the DB a moment to be fully ready
time.sleep(3)

engine = create_engine(DATABASE_URL)
df = pd.read_csv("sales_data.csv")
df.to_sql("sales", engine, if_exists="replace", index=False)
print(f"Loaded {len(df)} rows into the sales table.")
```
Let's take a closer look at the Compose file. Each service runs in its own container, but they are all on the same Docker-managed network, so they can reach each other using the service name as the hostname. The loader connects to db:5432, not localhost, because db is the service name and Docker handles the DNS resolution automatically.
The health check on the PostgreSQL service is important. On its own, depends_on only waits for the container to start, not for PostgreSQL to be ready to accept connections. The health check uses pg_isready to make sure the database is up before the loader tries to connect. The pgdata volume persists the database between runs; stopping and restarting the pipeline will not delete your data.
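Even with the health check, the fixed time.sleep(3) in the loader is fragile. A small retry helper is more robust; this is a sketch (the retry function is my own, not tied to SQLAlchemy):

```python
import time

def retry(fn, attempts=10, delay=1.0):
    """Call fn until it succeeds, sleeping between failures."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            time.sleep(delay)
    raise last_error

# In load_data.py you could wrap the first connection attempt, e.g.:
#   engine = create_engine(DATABASE_URL)
#   retry(lambda: engine.connect().close())
```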
// Starting Everything
Bring up all the services with a single command:
```shell
docker compose up --build
```
To stop everything, use:
```shell
docker compose down
```
# Scheduling Jobs with a Cron Container
Sometimes you need a script to run on a schedule. Maybe it pulls data from an API every hour and writes it to a database or a file. You don't want to set up a full orchestration system like Airflow for something this simple. A cron container does the job cleanly.
// Project Structure
The project includes a crontab file alongside the script and Dockerfile:
```
data-fetcher/
├── Dockerfile
├── requirements.txt
├── fetch_data.py
└── crontab
```
// Writing the Fetch Script
This script uses requests to call an API endpoint and saves the results as a time-stamped CSV:
```python
# fetch_data.py
import os
from datetime import datetime

import pandas as pd
import requests

API_URL = "https://example.com/api/sales"  # placeholder: point this at your API
OUTPUT_DIR = "/app/output"

os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"[{datetime.now()}] Fetching data...")
response = requests.get(API_URL, timeout=10)
response.raise_for_status()
data = response.json()

df = pd.DataFrame(data["records"])
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
output_path = f"{OUTPUT_DIR}/sales_{timestamp}.csv"
df.to_csv(output_path, index=False)
print(f"[{datetime.now()}] Saved {len(df)} records to {output_path}")
```
// Defining Crontab
The crontab schedules the script to run every hour and redirects all output to the log file:
```
# Run every hour, on the hour (full interpreter path, since cron runs with a minimal PATH)
0 * * * * /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1
```
The >> /var/log/fetch.log 2>&1 part redirects both standard output and standard error to a log file, so you can check what happened after the fact.
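Cron's five fields are minute, hour, day of month, month, and day of week. A couple of common variations on the schedule (illustrative; the full interpreter path is used because cron runs with a minimal PATH):

```
# Every 15 minutes
*/15 * * * * /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1

# Every day at 02:30
30 2 * * * /usr/local/bin/python /app/fetch_data.py >> /var/log/fetch.log 2>&1
```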
// Defining a Dockerfile
This Dockerfile installs cron, registers a schedule, and keeps it running in the foreground:
```dockerfile
FROM python:3.11-slim

# Install cron
RUN apt-get update && apt-get install -y cron && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY fetch_data.py .
COPY crontab /etc/cron.d/fetch-job

# Set correct permissions and register the crontab
RUN chmod 0644 /etc/cron.d/fetch-job && crontab /etc/cron.d/fetch-job

# cron -f runs cron in the foreground, which is required for Docker
CMD ["cron", "-f"]
```
The cron -f flag is important here. Docker keeps a container alive only as long as its main process is running. If cron daemonized into the background (its default behavior), the main process would exit immediately and Docker would stop the container. The -f flag keeps cron in the foreground, which keeps the container alive.
// Building and Running
Build the image and start the container in detached mode:
```shell
docker build -t data-fetcher .
docker run -d --name fetcher -v $(pwd)/output:/app/output data-fetcher
```
Check logs anytime:
```shell
docker exec fetcher cat /var/log/fetch.log
```
The output folder is mounted from your local machine, so the CSV files persist on your file system even though the script runs inside the container.
# Wrapping up
I hope you found this Docker article useful. Docker doesn't have to be complicated. Start with the first example, replace your own script with your dependencies, and get comfortable with the build cycle. Once you do that, other patterns follow naturally. Docker is a good fit if:
- You need reproducible environments across machines or team members
- You share scripts or models with specific dependency requirements
- You build multi-resource systems that need to work together reliably
- You want to deploy anywhere without environment conflicts
That said, you don't always need to use Docker for all your Python work. It may be overkill if:
- You are only doing a quick, exploratory analysis for yourself
- Your script has no external dependencies beyond the standard library
- You are early in a project and your requirements are changing quickly
If you'd like to continue, check out 5 Easy Steps to Mastering Data Science Docker.
Enjoy coding!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.



