Think Your Python Code Is Running Slow? Stop Guessing and Start Measuring

I was working on a script the other day, and it was driving me nuts. It worked, sure, but it was slow. Really slow. I had a feeling it could be much faster, but the catch was that I didn't know where to start.
My first thought was to start fixing things. Maybe I could improve the data loading. Or rewrite that one loop? But I caught myself. I've fallen into that trap before, spending hours “optimizing” a piece of code only to find that it makes no difference to the overall runtime. Donald Knuth had a point when he said, “Premature optimization is the root of all evil.”
I decided to take the direct route. Instead of guessing, I was going to find out for sure. I needed to profile the code to get hard data on which tasks were consuming the most clock cycles.
In this article, I will walk you through the exact process I used. We're going to take a deliberately slow Python script and use two nifty tools to identify its bottlenecks with surgical precision.
The first of these tools is cProfile, a powerful profiler built into Python. The second is snakeviz, a brilliant tool that converts the profiler's output into an interactive visual map.
Setting up the development environment
Before we start coding, let's set up our development environment. A good practice is to create a separate Python environment where you can install whatever software you need, knowing that nothing you do there will affect the rest of your system. I'll use conda for this, but you can use any method you're familiar with.
# Create our test environment
conda create -n profiling_lab python=3.11 -y
# Now activate it
conda activate profiling_lab
Now that we have our environment set up, we need to install snakeviz for visualization and numpy for the example script. cProfile ships with Python, so there is nothing extra to install for it. Since we'll be running our scripts in Jupyter Notebook, we'll install that too.
# Install our visualization tool and numpy
pip install snakeviz numpy jupyter
Now type jupyter notebook at your command prompt. Jupyter should open in your browser. If that doesn't happen automatically, look at the console output of the jupyter notebook command. Near the bottom of it, there will be a URL that you can copy and paste into your browser to launch Jupyter Notebook.
Your URL will be different from mine, but it will follow the same general pattern.
With our tools ready, it's time to look at the code we're going to modify.
Our “Problem” script
To properly test our profiling tools, we need a script with clear performance problems. I wrote a simple program that deliberately burns time in three different ways: CPU-heavy math, memory-heavy string building, and a loop dominated by function-call overhead, making it a perfect candidate for our investigation.
# run_all_systems.py
import time
import math

# ===================================================================
CPU_ITERATIONS = 34552942
STRING_ITERATIONS = 46658100
LOOP_ITERATIONS = 171796964
# ===================================================================

# --- Task 1: A Calibrated CPU-Bound Bottleneck ---
def cpu_heavy_task(iterations):
    print(" -> Running CPU-bound task...")
    result = 0
    for i in range(iterations):
        result += math.sin(i) * math.cos(i) + math.sqrt(i)
    return result

# --- Task 2: A Calibrated Memory/String Bottleneck ---
def memory_heavy_string_task(iterations):
    print(" -> Running Memory/String-bound task...")
    report = ""
    chunk = "report_item_abcdefg_123456789_"
    for i in range(iterations):
        report += f"|{chunk}{i}"
    return report

# --- Task 3: A Calibrated "Thousand Cuts" Iteration Bottleneck ---
def simulate_tiny_op(n):
    pass

def iteration_heavy_task(iterations):
    print(" -> Running Iteration-bound task...")
    for i in range(iterations):
        simulate_tiny_op(i)
    return "OK"

# --- Main Orchestrator ---
def run_all_systems():
    print("--- Starting FINAL SLOW Balanced Showcase ---")
    cpu_result = cpu_heavy_task(iterations=CPU_ITERATIONS)
    string_result = memory_heavy_string_task(iterations=STRING_ITERATIONS)
    iteration_result = iteration_heavy_task(iterations=LOOP_ITERATIONS)
    print("--- FINAL SLOW Balanced Showcase Finished ---")
Step 1: Collecting Data with cProfile
Our first tool, cProfile, is a deterministic profiler built into Python's standard library. We can invoke it from code to run our script and record detailed statistics about every function call.
import cProfile, pstats, io
pr = cProfile.Profile()
pr.enable()
# Run the function you want to profile
run_all_systems()
pr.disable()
# Dump stats to a string and print the top 10 by cumulative time
s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats("cumtime")
ps.print_stats(10)
print(s.getvalue())
Here is the output.
--- Starting FINAL SLOW Balanced Showcase ---
-> Running CPU-bound task...
-> Running Memory/String-bound task...
-> Running Iteration-bound task...
--- FINAL SLOW Balanced Showcase Finished ---
275455984 function calls in 30.497 seconds
Ordered by: cumulative time
List reduced from 47 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
2 0.000 0.000 30.520 15.260 /home/tom/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3541(run_code)
2 0.000 0.000 30.520 15.260 {built-in method builtins.exec}
1 0.000 0.000 30.497 30.497 /tmp/ipykernel_173802/1743829582.py:41(run_all_systems)
1 9.652 9.652 14.394 14.394 /tmp/ipykernel_173802/1743829582.py:34(iteration_heavy_task)
1 7.232 7.232 12.211 12.211 /tmp/ipykernel_173802/1743829582.py:14(cpu_heavy_task)
171796964 4.742 0.000 4.742 0.000 /tmp/ipykernel_173802/1743829582.py:31(simulate_tiny_op)
1 3.891 3.891 3.892 3.892 /tmp/ipykernel_173802/1743829582.py:22(memory_heavy_string_task)
34552942 1.888 0.000 1.888 0.000 {built-in method math.sin}
34552942 1.820 0.000 1.820 0.000 {built-in method math.cos}
34552942 1.271 0.000 1.271 0.000 {built-in method math.sqrt}
We have a wall of numbers that can be hard to interpret. This is where snakeviz comes in.
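Before that, one useful aside: cProfile stats can also be saved to disk and reloaded later with pstats, which is handy when you want to compare runs or share results. A minimal, self-contained sketch using only the standard library (busy_work is a stand-in workload of my own, not from our example script):

```python
import cProfile
import pstats

def busy_work(n):
    # A stand-in workload so the sketch is self-contained
    return sum(i * i for i in range(n))

pr = cProfile.Profile()
pr.enable()
busy_work(100_000)
pr.disable()

# Save the raw stats to disk; snakeviz can open this .prof file directly
pr.dump_stats("profile.prof")

# Reload the same data later with pstats and print the top 5 entries
stats = pstats.Stats("profile.prof")
stats.sort_stats("cumtime").print_stats(5)
```

The .prof file is the same format snakeviz consumes, so dumping stats is also the bridge between cProfile and visualization outside a notebook.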
Step 2: Visualize the bottlenecks with snakeviz
This is where the magic happens. Snakeviz takes the output of our profiling run and turns it into an interactive, browser-based chart, making it easy to spot issues.
So let's use that tool to visualize what we have. Since I am using Jupyter Notebook, we need to load the snakeviz extension first.
%load_ext snakeviz
And we run it like this.
%%snakeviz
run_all_systems()
The output comes in two parts. First is a display like this.
What you see is an “icicle” chart. From top to bottom, it represents the call hierarchy:
Top: the notebook machinery that executed our code (IPython's run_code and the built-in exec).
Next: our run_all_systems function.
Below that: the individual task functions, with each block's width proportional to the time spent in it.
The memory-intensive processing component is not labeled in the chart. That is because the proportion of time associated with this task is much smaller than the times allocated to the other two intensive tasks. As a result, we see a very small, unlabeled block to the right of the cpu_heavy_task block.
Note that snakeviz also offers a second chart style called a sunburst chart. It looks like a pie chart made of concentric rings of arcs, where the time a function takes is represented by the length of its arc. The root function is the circle at the center, its callees form the ring around it, and so on outward. We will not use that style in this article.
Visual confirmation like this can be far more impactful than staring at a table of numbers. I no longer needed to guess; the data was staring me in the face.
The chart is immediately followed by a block of text describing the timings of the various parts of your code, much like the output of cProfile. I'm only showing the first dozen or so lines, as there were 30+ in total.
ncalls tottime percall cumtime percall filename:lineno(function)
----------------------------------------------------------------
1 9.581 9.581 14.3 14.3 1062495604.py:34(iteration_heavy_task)
1 7.868 7.868 12.92 12.92 1062495604.py:14(cpu_heavy_task)
171796964 4.717 2.745e-08 4.717 2.745e-08 1062495604.py:31(simulate_tiny_op)
1 3.848 3.848 3.848 3.848 1062495604.py:22(memory_heavy_string_task)
34552942 1.91 5.527e-08 1.91 5.527e-08 ~:0(&lt;built-in method math.sin&gt;)
34552942 1.836 5.313e-08 1.836 5.313e-08 ~:0(&lt;built-in method math.cos&gt;)
34552942 1.305 3.778e-08 1.305 3.778e-08 ~:0(&lt;built-in method math.sqrt&gt;)
1 0.02127 0.02127 31.09 31.09 &lt;string&gt;:1(&lt;module&gt;)
4 0.0001764 4.409e-05 0.0001764 4.409e-05 socket.py:626(send)
10 0.000123 1.23e-05 0.0004568 4.568e-05 iostream.py:655(write)
4 4.594e-05 1.148e-05 0.0002735 6.838e-05 iostream.py:259(schedule)
...
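One more note before fixing anything: you don't need a notebook to use snakeviz. Assuming the slow script is saved as run_all_systems.py and ends with a call to run_all_systems(), you can profile it and open the chart from the command line:

```shell
# Run the script under cProfile and write the stats to a file
python -m cProfile -o profile_output.prof run_all_systems.py

# Open the stats file in the browser-based snakeviz viewer
snakeviz profile_output.prof
```

The second command starts a small local web server and opens the same icicle chart in your browser.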
Step 3: Fixing the Bottlenecks
Of course, tools like cProfile and snakeviz don't tell you how to fix your performance issues, but now that I know exactly where the issues are, I can apply targeted fixes.
# final_showcase_fixed_v2.py
import time
import math
import numpy as np

# ===================================================================
CPU_ITERATIONS = 34552942
STRING_ITERATIONS = 46658100
LOOP_ITERATIONS = 171796964
# ===================================================================

# --- Fix 1: Vectorization for the CPU-Bound Task ---
def cpu_heavy_task_fixed(iterations):
    """
    Fixed by using NumPy to perform the complex math on an entire array
    at once, in highly optimized C code instead of a Python loop.
    """
    print(" -> Running CPU-bound task...")
    # Create an array of numbers from 0 to iterations-1
    i = np.arange(iterations, dtype=np.float64)
    # The same calculation, but vectorized, is orders of magnitude faster
    result_array = np.sin(i) * np.cos(i) + np.sqrt(i)
    return np.sum(result_array)

# --- Fix 2: Efficient String Joining ---
def memory_heavy_string_task_fixed(iterations):
    """
    Fixed by using a list comprehension and a single, efficient ''.join() call.
    This avoids creating millions of intermediate string objects.
    """
    print(" -> Running Memory/String-bound task...")
    chunk = "report_item_abcdefg_123456789_"
    # A list comprehension is fast and memory-efficient
    parts = [f"|{chunk}{i}" for i in range(iterations)]
    return "".join(parts)

# --- Fix 3: Eliminating the "Thousand Cuts" Loop ---
def iteration_heavy_task_fixed(iterations):
    """
    Fixed by recognizing the task can be a no-op or a bulk operation.
    In a real-world scenario, you would find a way to avoid the loop entirely.
    Here, we demonstrate the fix by simply removing the pointless loop.
    The goal is to show the cost of the loop itself was the problem.
    """
    print(" -> Running Iteration-bound task...")
    # The fix is to find a bulk operation or eliminate the need for the loop.
    # Since the original function did nothing, the fix is to do nothing, but faster.
    return "OK"

# --- Main Orchestrator ---
def run_all_systems():
    """
    The main orchestrator now calls the FAST versions of the tasks.
    """
    print("--- Starting FINAL FAST Balanced Showcase ---")
    cpu_result = cpu_heavy_task_fixed(iterations=CPU_ITERATIONS)
    string_result = memory_heavy_string_task_fixed(iterations=STRING_ITERATIONS)
    iteration_result = iteration_heavy_task_fixed(iterations=LOOP_ITERATIONS)
    print("--- FINAL FAST Balanced Showcase Finished ---")
Now we can rerun cProfile on the updated code.
import cProfile, pstats, io
pr = cProfile.Profile()
pr.enable()
# Run the function you want to profile
run_all_systems()
pr.disable()
# Dump stats to a string and print the top 10 by cumulative time
s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats("cumtime")
ps.print_stats(10)
print(s.getvalue())
Here is the output.
--- Starting FINAL FAST Balanced Showcase ---
-> Running CPU-bound task...
-> Running Memory/String-bound task...
-> Running Iteration-bound task...
--- FINAL FAST Balanced Showcase Finished ---
197 function calls in 6.063 seconds
Ordered by: cumulative time
List reduced from 52 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
2 0.000 0.000 6.063 3.031 /home/tom/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3541(run_code)
2 0.000 0.000 6.063 3.031 {built-in method builtins.exec}
1 0.002 0.002 6.063 6.063 /tmp/ipykernel_173802/1803406806.py:1(&lt;module&gt;)
1 0.402 0.402 6.061 6.061 /tmp/ipykernel_173802/3782967348.py:52(run_all_systems)
1 0.000 0.000 5.152 5.152 /tmp/ipykernel_173802/3782967348.py:27(memory_heavy_string_task_fixed)
1 4.135 4.135 4.135 4.135 /tmp/ipykernel_173802/3782967348.py:35(&lt;listcomp&gt;)
1 1.017 1.017 1.017 1.017 {method 'join' of 'str' objects}
1 0.446 0.446 0.505 0.505 /tmp/ipykernel_173802/3782967348.py:14(cpu_heavy_task_fixed)
1 0.045 0.045 0.045 0.045 {built-in method numpy.arange}
1 0.000 0.000 0.014 0.014 <__array_function__ internals>:177(sum)
That's a great result, and it shows the power of profiling: we spent our effort only on the parts of the code that mattered. To double-check, I also ran snakeviz on the updated code.
%%snakeviz
run_all_systems()

The most noticeable change is the reduction in total runtime, from about 30 seconds to about 6 seconds. That's a 5x speedup, achieved by addressing the three main bottlenecks visible in the “before” profile.
Let's look at them one by one.
1. The iteration_heavy_task function
Before (Problem)
In the first chart, the large bar on the left, iteration_heavy_task, is the single biggest bottleneck at 14.3 seconds.
- Why was it slow? This was a classic case of “death by a thousand cuts.” The function simulate_tiny_op did nothing, but it was called millions of times inside a pure Python loop. The overhead of the interpreter setting up and tearing down each of those function calls was the entire source of the slowness.
Repair
The fixed version, iteration_heavy_task_fixed, recognizes that the goal could be reached without a loop at all. In our demo, that meant removing a completely pointless loop; in a real-world application, it would mean finding a single “bulk” operation to replace the iterative one.
After (Result)
In the second chart, the iteration_heavy_task bar has disappeared completely. It now runs in a fraction of a second and is no longer visible on the chart. We successfully eliminated the 14.3-second cost.
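To convince yourself that bare function-call overhead really is the cost here, you can run a quick micro-benchmark. This is a sketch of my own, scaled down to one million iterations; exact timings will vary by machine, but the loop that makes a call per iteration is consistently slower than the loop that doesn't:

```python
import timeit

def tiny_op(n):
    pass  # Does nothing, just like simulate_tiny_op

def loop_with_calls(iterations):
    # One function call per iteration: pays call overhead every time
    for i in range(iterations):
        tiny_op(i)

def loop_without_calls(iterations):
    # The same loop with the call removed
    for i in range(iterations):
        pass

N = 1_000_000
t_calls = timeit.timeit(lambda: loop_with_calls(N), number=1)
t_plain = timeit.timeit(lambda: loop_without_calls(N), number=1)
print(f"with calls:    {t_calls:.3f}s")
print(f"without calls: {t_plain:.3f}s")
```

The difference between the two timings is pure call overhead, which is exactly what our profile attributed to simulate_tiny_op.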
2. The cpu_heavy_task function
Before (Problem)
The second big bottleneck, clearly visible as the big orange bar on the right, is cpu_heavy_task, which took 12.9 seconds.
- Why was it slow? Like the iteration function, this one was limited by the speed of Python's for loop. While the individual math operations were fast, the interpreter had to step through millions of iterations one at a time, which is very inefficient for numerical work.
Repair
The fix was vectorization using the NumPy library. Instead of a Python loop, cpu_heavy_task_fixed creates a NumPy array and performs the math operations (np.sin, np.cos, np.sqrt) on all elements at once. These operations run in highly optimized, precompiled C code, completely bypassing the slow Python interpreter loop.
After (Result)
Like the first bottleneck, the cpu_heavy_task bar has effectively disappeared from the “after” chart. Its runtime dropped from 12.9 seconds to roughly half a second.
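If you're nervous that vectorizing changes the answer, you can sanity-check the two implementations against each other at a smaller scale. A minimal sketch of my own, using 100,000 elements rather than the full 34 million:

```python
import math
import numpy as np

def loop_version(n):
    # The original pure-Python approach
    result = 0.0
    for i in range(n):
        result += math.sin(i) * math.cos(i) + math.sqrt(i)
    return result

def vectorized_version(n):
    # The NumPy approach: one array, one pass of compiled C code
    i = np.arange(n, dtype=np.float64)
    return float(np.sum(np.sin(i) * np.cos(i) + np.sqrt(i)))

n = 100_000
a = loop_version(n)
b = vectorized_version(n)
# The two results agree to within floating-point rounding
print(math.isclose(a, b, rel_tol=1e-9))
```

The tiny remaining difference comes from summation order (NumPy uses pairwise summation internally), not from the math itself.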
3. The memory_heavy_string_task function
Before (Problem)
In the first chart, memory_heavy_string_task was also slow, but its runtime was small compared to the other two problems, so it was relegated to a small, unlabeled block on the far right. It was a comparatively minor issue.
Repair
The fix for this task was to replace the inefficient report += “…” string concatenation with the idiomatic approach: build a list of all the string parts, then call “”.join() once at the end.
After (Result)
In the second chart, we see the flip side of our success. With the two 10+ second bottlenecks gone, memory_heavy_string_task_fixed is now the new dominant bottleneck, accounting for 4.34 seconds of the 5.22-second total runtime.
Snakeviz also lets us look inside this fixed function. The most important new contributor is the orange bar labeled &lt;listcomp&gt;, the list comprehension that builds the string parts.
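The reason join wins is that repeated += can build a brand-new string on every iteration (potentially quadratic work), while “”.join() allocates the final string once. A small sketch of my own showing the two approaches produce identical output (the item_ naming is mine, not from the original script):

```python
def concat_version(n):
    # Repeated += concatenation: may copy the growing string each time
    report = ""
    for i in range(n):
        report += f"|item_{i}"
    return report

def join_version(n):
    # Build the pieces lazily, then join them in a single pass
    return "".join(f"|item_{i}" for i in range(n))

print(concat_version(1000) == join_version(1000))  # both build the same string
```

One caveat: CPython can sometimes optimize += in place when the string has only one reference, so the quadratic blow-up isn't always visible at small sizes; join is the approach that is guaranteed to scale.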
Summary
This article has been a practical guide to identifying and resolving performance issues in Python code. The core argument: developers should use profiling tools to measure where time is actually spent, instead of relying on intuition or guesswork to find the source of slowness.
I demonstrated a workflow that uses two key tools:
- cProfile: Python's built-in profiler, used to collect detailed data about function calls and execution times.
- snakeviz: A visualization tool that turns cProfile data into an interactive “icicle” chart, making it easy to visually see which parts of the code are consuming the most time.
The article used a deliberately slow example script with three distinct bottlenecks:
- An iteration-bound task: a function called millions of times in a loop, demonstrating the cost of Python function-call overhead (“death by a thousand cuts”).
- CPU bound task: A loop that performs millions of calculations, highlighting the inefficiency of pure Python for heavy numerical work.
- A memory-bound task: a large string built inefficiently through repeated += concatenation.
By analyzing the snakeviz output, I identified these three problems and applied the recommended fixes.
- The iteration bottleneck was fixed by removing the unnecessary loop.
- The CPU bottleneck was solved by vectorisation using NumPy, which performs mathematical operations in fast, compiled C code.
- The memory bottleneck was fixed by collecting the string parts in a list and making a single, efficient “”.join() call.
These fixes produced a dramatic speedup, reducing the script's execution time from around 30 seconds to around 6 seconds. I concluded by showing that, even after the main problems are solved, the profiler can be run again to identify new, smaller bottlenecks: performance tuning is an iterative process guided by measurement.



