Make python up to 150× faster with c

before long, you'll hit a block when it comes to your code execution speed. If you've ever written a fancy heavy algorithm in Python, like string spacing, matrix math, or cryptographic washing, you'll know what I'm talking about.
Of course, there are times when external libraries like profit can help you, but what happens when the algorithm is there It's a good sequence? That was exactly my problem when I wanted to take a closer look at a particular algorithm, which determines the number of edits needed to convert one string to another.
I tried Python. I tried the profit. And then I turned to C, a language I first learned in college decades ago but haven't used in anger for about 15 years. This is where things get interesting.
I first had to answer this question, “Can you call C from Python?”. After some research, it quickly became clear that the answer was yes yes. In fact, it turns out that you can do it in several ways, and in this article, I will look at the three most common ways to do so.
From the simplest to the most complex, we will look at using,
- subprocess
- ctypes
- Python C Extensions
The algorithm that will be tested is called levenshtein distance (ld) algorithm. The leventein distance between two words is the minimum number of single letter rearrangements (insertions, deletions or substitutions) required to change one word. It is named after the Soviet Mathematician Vladimir Levenshtein, who defined the metric in 1965. There are applications in various tools, such as spell checkers and visual character recognition programs.
To give you a clear picture of what we are talking about, here are a few examples.
Calculate the LD between the words “Book” and “Black”.
- Book → Baok (substituting “A” for “O”),
- Baok → back (substituting “C” for “O”)
- Back → Black (add letter “l”)
So, the LD in this case is three.
Calculate the LD between the words “Superb” and “Super”.
- Superb → Super (Remove the letter “b”)
LD in this case is just one.
We will code the LD algorithm in Python and C, and set up benchmarks to test how long it takes to run it using Python code that takes time in Python.
Requirements
Since I was running this on MS Windows, I needed a way to Compile C programs. The easiest way I could find to do this was to download the visual studio tools for building 2022. This allows you to compile c programs from the command line.
To install, first go to the Visual Studio download page. On the second screen, you will see a search box. Typing “Build Tools” In the search field and click search. The search should return a screen that looks like this,
Click the download button and follow any installation instructions. Once installed, in the DOS prompt window, when you click the Plus button to open a new terminal, you should see the option to open “developer command to update VS 2022”.

Most of my Python code will be running in Jupyter Notebook, so you should set up a new development environment and install jupyter. Do that now if you want to follow. I'm using the UV tool for this part, but feel free to use whatever method you're most comfortable with.
c:> uv init pythonc
c:> cd pythonc
c:> uv venv pythonc
c:> source pythonc/bin/activate
(pythonc) c:> uv pip install jupyter
LD algorithm in c
We need different types of LD algorithm in C, depending on the method used to call it. This is a version of our first example, where we use subprocessing to call C to execute.
1 / subprocessing: lev_sub.c
#include
#include
#include
static int levenshtein(const char* a, const char* b) {
size_t n = strlen(a), m = strlen(b);
if (n == 0) return (int)m;
if (m == 0) return (int)n;
int* prev = (int*)malloc((m + 1) * sizeof(int));
int* curr = (int*)malloc((m + 1) * sizeof(int));
if (!prev || !curr) { free(prev); free(curr); return -1; }
for (size_t j = 0; j <= m; ++j) prev[j] = (int)j;
for (size_t i = 1; i <= n; ++i) {
curr[0] = (int)i; char ca = a[i - 1];
for (size_t j = 1; j <= m; ++j) {
int cost = (ca == b[j - 1]) ? 0 : 1;
int del = prev[j] + 1, ins = curr[j - 1] + 1, sub = prev[j - 1] + cost;
int d = del < ins ? del : ins; curr[j] = d < sub ? d : sub;
}
int* tmp = prev; prev = curr; curr = tmp;
}
int ans = prev[m]; free(prev); free(curr); return ans;
}
int main(int argc, char** argv) {
if (argc != 3) { fprintf(stderr, "usage: %s n", argv[0]); return 2; }
int d = levenshtein(argv[1], argv[2]);
if (d < 0) return 1;
printf("%dn", d);
return 0;
}
To compile this, start a new command prompt for VS Code 2022 and type the following to ensure that we optimize the integration of 64 architectures.
(pythonc) c:> "%VSINSTALLDIR%VCAuxiliaryBuildvcvarsall.bat" x64
Next, we can compile our C code using this command.
(pythonc) c:> cl /O2 /Fe:lev_sub.exe lev_sub.c
That will create a fake file.
Estimating the underlying code
In the jupyter notebook, type the following code, which will be frequently addressed throughout our code. It generates random substrings of length n and calculates the number of permutations required to modify the string string1.
# Sub-process benchmark
import time, random, string, subprocess
import numpy as np
EXE = r"lev_sub.exe"
def rnd_ascii(n):
return ''.join(random.choice(string.ascii_lowercase) for _ in range(n))
def lev_py(a: str, b: str) -> int:
n, m = len(a), len(b)
if n == 0: return m
if m == 0: return n
prev = list(range(m+1))
curr = [0]*(m+1)
for i, ca in enumerate(a, 1):
curr[0] = i
for j, cb in enumerate(b, 1):
cost = 0 if ca == cb else 1
curr[j] = min(prev[j] + 1, curr[j-1] + 1, prev[j-1] + cost)
prev, curr = curr, prev
return prev[m]
The following is the actual benchmark code and the results of the run. To run the C Codep, we update a subprocess that outputs the compiled C Codep file we created earlier and compares the time required to run it, comparing it to the pure Python method. We run each method against 2000 and 4000 sets of random words three times and take the fastest of those times.
def lev_subprocess(a: str, b: str) -> int:
out = subprocess.check_output([EXE, a, b], text=True)
return int(out.strip())
def bench(fn, *args, repeat=3, warmup=1):
for _ in range(warmup): fn(*args)
best = float("inf"); out_best = None
for _ in range(repeat):
t0 = time.perf_counter(); out = fn(*args); dt = time.perf_counter() - t0
if dt < best: best, out_best = dt, out
return out_best, best
if __name__ == "__main__":
cases = [(2000,2000),(4000, 4000)]
print("Benchmark: Pythonvs C (subprocess)n")
for n, m in cases:
a, b = rnd_ascii(n), rnd_ascii(m)
py_out, py_t = bench(lev_py, a, b, repeat=3)
sp_out, sp_t = bench(lev_subprocess, a, b, repeat=3)
print(f"n={n} m={m}")
print(f" Python : {py_t:.3f}s -> {py_out}")
print(f" Subproc : {sp_t:.3f}s -> {sp_out}n")
Here are the results.
Benchmark: Python vs C (subprocess)
n=2000 m=2000
Python : 1.276s -> 1768
Subproc : 0.024s -> 1768
n=4000 m=4000
Python : 5.015s -> 3519
Subproc : 0.050s -> 3519
That's a huge improvement in C runtime over Python.
2. Ctypes: lev.c
Ctypes are foreign worker interface (FFI) The Library is built directly on the Python standard library. Allows you to load and call functions from shared libraries written in C (DLLS on Windows, .so Linux files, .dylib on macos) directly from pythonwithout needing to write a complete extension module in C.
First, here is our C version of the LD algorithm, using Ctypes. It's almost like our Subprocess C function, with a line entry that lets us use Python to call the DLL after it's compiled.
/*
* lev.c
*/
#include
#include
/* below line includes this function in the
* DLL's export table so other programs can use it.
*/
__declspec(dllexport)
int levenshtein(const char* a, const char* b) {
size_t n = strlen(a), m = strlen(b);
if (n == 0) return (int)m;
if (m == 0) return (int)n;
int* prev = (int*)malloc((m + 1) * sizeof(int));
int* curr = (int*)malloc((m + 1) * sizeof(int));
if (!prev || !curr) { free(prev); free(curr); return -1; }
for (size_t j = 0; j <= m; ++j) prev[j] = (int)j;
for (size_t i = 1; i <= n; ++i) {
curr[0] = (int)i;
char ca = a[i - 1];
for (size_t j = 1; j <= m; ++j) {
int cost = (ca == b[j - 1]) ? 0 : 1;
int del = prev[j] + 1;
int ins = curr[j - 1] + 1;
int sub = prev[j - 1] + cost;
int d = del < ins ? del : ins;
curr[j] = d < sub ? d : sub;
}
int* tmp = prev; prev = curr; curr = tmp;
}
int ans = prev[m];
free(prev); free(curr);
return ans;
}
When using Ctypes to call C in Python, we need to convert our C code into a Dynamic Link Library (DLL) instead of an executable. Here is the build command you need for that.
(pythonc) c:> cl /O2 /LD lev.c /Fe:lev.dll
Standardizing Ctypes code
I leave the LEV_PY and rnd_ascui The Python functions in this Code Snippet are similar to those in the previous example. Type this in your notebook.
#ctypes benchmark
import time, random, string, ctypes
import numpy as np
DLL = r"lev.dll"
levdll = ctypes.CDLL(DLL)
levdll.levenshtein.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
levdll.levenshtein.restype = ctypes.c_int
def lev_ctypes(a: str, b: str) -> int:
return int(levdll.levenshtein(a.encode('utf-8'), b.encode('utf-8')))
def bench(fn, *args, repeat=3, warmup=1):
for _ in range(warmup): fn(*args)
best = float("inf"); out_best = None
for _ in range(repeat):
t0 = time.perf_counter(); out = fn(*args); dt = time.perf_counter() - t0
if dt < best: best, out_best = dt, out
return out_best, best
if __name__ == "__main__":
cases = [(2000,2000),(4000, 4000)]
print("Benchmark: Python vs NumPy vs C (ctypes)n")
for n, m in cases:
a, b = rnd_ascii(n), rnd_ascii(m)
py_out, py_t = bench(lev_py, a, b, repeat=3)
ct_out, ct_t = bench(lev_ctypes, a, b, repeat=3)
print(f"n={n} m={m}")
print(f" Python : {py_t:.3f}s -> {py_out}")
print(f" ctypes : {ct_t:.3f}s -> {ct_out}n")
And the results?
Benchmark: Python vs C (ctypes)
n=2000 m=2000
Python : 1.258s -> 1769
ctypes : 0.019s -> 1769
n=4000 m=4000
Python : 5.138s -> 3521
ctypes : 0.035s -> 3521
We have very similar results to the first example.
3 / Python C Extensions: Lev_CEXT.C
When using Python C Extensions, there is a lot of work involved. First, let's examine the C code. The basic algorithm does not change. It's just that we need to add a little more customization to allow the code to be called from Python. It uses the CHYthoni API (Python.h) Python arguments, run the C code, and return the result as a Python number.
Work sevalex_lev acts as a wrapper. It concatenates two arguments from Python (Pyarg_ParSeSepuple), calling a C function Lev_impl To round up the distance, you handle out-of-memory errors, and return the result as a python integer (Pyloong_fromlong). The table methods register this function under the name “levenhtein”, so it can be called from Python code. Finally, pyinit_levext defines and initializes the module prominenceTo do that import into Python with the Levent Levet command.
#include
#include
#include
static int lev_impl(const char* a, const char* b) {
size_t n = strlen(a), m = strlen(b);
if (n == 0) return (int)m;
if (m == 0) return (int)n;
int* prev = (int*)malloc((m + 1) * sizeof(int));
int* curr = (int*)malloc((m + 1) * sizeof(int));
if (!prev || !curr) { free(prev); free(curr); return -1; }
for (size_t j = 0; j <= m; ++j) prev[j] = (int)j;
for (size_t i = 1; i <= n; ++i) {
curr[0] = (int)i; char ca = a[i - 1];
for (size_t j = 1; j <= m; ++j) {
int cost = (ca == b[j - 1]) ? 0 : 1;
int del = prev[j] + 1, ins = curr[j - 1] + 1, sub = prev[j - 1] + cost;
int d = del < ins ? del : ins; curr[j] = d < sub ? d : sub;
}
int* tmp = prev; prev = curr; curr = tmp;
}
int ans = prev[m]; free(prev); free(curr); return ans;
}
static PyObject* levext_lev(PyObject* self, PyObject* args) {
const char *a, *b;
if (!PyArg_ParseTuple(args, "ss", &a, &b)) return NULL;
int d = lev_impl(a, b);
if (d < 0) { PyErr_SetString(PyExc_MemoryError, "alloc failed"); return NULL; }
return PyLong_FromLong(d);
}
static PyMethodDef Methods[] = {
{"levenshtein", levext_lev, METH_VARARGS, "Levenshtein distance"},
{NULL, NULL, 0, NULL}
};
static struct PyModuleDef mod = { PyModuleDef_HEAD_INIT, "levext", NULL, -1, Methods };
PyMODINIT_FUNC PyInit_levext(void) { return PyModule_Create(&mod); }
Because we're not just building this runtime but a traditional Python module, we have to build a separate code.
This type of module must be compiled against Python headers and tightly linked to the Python runtime so it behaves like a built-in Python module.
To achieve this, we created a Python module called Stup.Py, which imports the SECATSOLS library to simplify this process. It uses:
- Getting right includes Python.h methods
- It passes the appropriate assembler and assembly flags
- Generate a .pyd file with the appropriate naming convention for your Python version and platform
To do this manually with cl The Compiler command will be messy and error prone, because you have to specify all those methods and flags manually.
Here is the code we need.
from setuptools import setup, Extension
setup(
name="levext",
version="0.1.0",
ext_modules=[Extension("levext", ["lev_cext.c"], extra_compile_args=["/O2"])],
)
We run it using standard python on the command line, as shown here.
(pythonc) c:> python setup.py build_ext --inplace
#output
running build_ext
copying buildlib.win-amd64-cpython-312levext.cp312-win_amd64.pyd ->
Estimating Python C Code
Now, here is the Python code to call C. C.
# c-ext benchmark
import time, random, string
import numpy as np
import levext # make sure levext.cp312-win_amd64.pyd is built & importable
def lev_extension(a: str, b: str) -> int:
return levext.levenshtein(a, b)
def bench(fn, *args, repeat=3, warmup=1):
for _ in range(warmup): fn(*args)
best = float("inf"); out_best = None
for _ in range(repeat):
t0 = time.perf_counter(); out = fn(*args); dt = time.perf_counter() - t0
if dt < best: best, out_best = dt, out
return out_best, best
if __name__ == "__main__":
cases = [(2000, 2000), (4000, 4000)]
print("Benchmark: Python vs NumPy vs C (C extension)n")
for n, m in cases:
a, b = rnd_ascii(n), rnd_ascii(m)
py_out, py_t = bench(lev_py, a, b, repeat=3)
ex_out, ex_t = bench(lev_extension, a, b, repeat=3)
print(f"n={n} m={m} ")
print(f" Python : {py_t:.3f}s -> {py_out}")
print(f" C ext : {ex_t:.3f}s -> {ex_out}n")
Here is the output.
Benchmark: Python vs C (C extension)
n=2000 m=2000
Python : 1.204s -> 1768
C ext : 0.010s -> 1768
n=4000 m=4000
Python : 5.039s -> 3526
C ext : 0.033s -> 3526
So this gave quick results. Type C shows more than 150 times faster than pure Python in the second test case above.
Not too shabby.
But what about Numpy?
Some of you may be wondering why the profit margin was not used. Yes, inunus is interesting for vectomed array functions, such as dot products, but not all pure mapping algorithms for finding power. Calculating elenshein distances is a step-by-step process, so inuppy doesn't offer much help. In those cases, it drops to c nge more, ctypesor a traditional native extension It offers real RealTime SpeedSUPS while still being cheap from Python.
PS. I ran some additional tests using the code that would help in using inunce, and it was no surprise that inughpy was as fast as CODE CODE. That's to be expected if the benefit uses C under the hood and has many years of development and optimization behind it.
To put it briefly
The article explores how Python developers can overcome performance bottlenecks in broad tasks, such as computing The levenshtein distance– The same algorithm -By combining C Code in Python. While libraries like Numpy speed up fuzzy functions, sequential algorithms like levenhtein remain limited in Numpy's performance.
To deal with this, I showed Three assembly patterns ranging from simple to more advanced, allowing you to call fast CORD code from Python.
Subprocess. Integrate C code into the implementation (eg with GCC or GCC build tools or virtual This is easy to set up and shows great speed compared to pure Python.
ctypes.You use Ctypes which allow Python directly and call functions from their shared libraries without needing to write a complete Python Extension module. This makes it much easier and faster to compile critical C code to run in Python, avoiding looking at external processes while still keeping your code in Python.
Python C Extensions . Write a full Python extension in C using the CHYthon API (Python.h). This requires a lot of setup but provides faster performance and more integration, allowing you to call C functions as if they were Python functions.
The benchmarks show that the levennein algorithm has been implemented more than 100 × very quickly rather than pure Python. While the external library is like It's politeExcels Vectorised Numbers that work, it does not greatly improve the performance of sequential algorithms such as levenhtein, making C integration a better choice in these cases.
If you're hitting performance limits in Python, heavy-loading in C can provide significant speed improvements and is worth considering. You can start simple with a subprocess, then move to ctypes or full C extensions for harder integration and better performance.
I have described only the three most popular ways to compile C code with Python, but there are a few other ways that I recommend you read if you are reading this article if you like this article if you like this article if you like this article if you like this article if you like this article.


