
We Benchmarked DuckDB, SQLite, and Pandas on 1M Rows: Here's What Happened

DuckDB vs SQLite vs Pandas | Image by the author

Introduction

There are many data processing tools today. They all claim – as they always do – to be the best and the right choice. But is that true? There are two main needs these tools should satisfy: they should make everyday data analysis easy, and they should hold up under the pressure of large datasets.

To determine the best tool among DuckDB, SQLite, and Pandas, we examined them under exactly these conditions.

First, we stuck to everyday analysis tasks: summing values, grouping by category, filtering, and multi-column aggregations. This reflects how analysts work with real datasets, as opposed to contrived conditions designed to show off a tool's strengths.

Second, we ran those operations on a Kaggle dataset with more than 1 million rows. That size is the sweet spot: small enough to work on a single machine, yet large enough for memory pressure and query speed to produce clear differences between tools.

Let's see how those tests turned out.

The Dataset We Used

// Dataset overview

We used a bank transaction dataset from Kaggle. It contains over 1 million rows across five columns:

Column name         Description
Date                The date the transaction occurred
Domain              The business category or type (e.g., retail, restaurant)
Location            The geographic region (e.g., Goa, Mathura)
Value               The transaction amount
Transaction_count   The total number of transactions on that day

The dataset was generated synthetically using Python. While it may differ from real banking data, its size and structure are sufficient to test and compare performance differences between tools.
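
For reference, a table of this shape can be generated in a few lines of Python. The sketch below is illustrative only – the category lists and numeric ranges are our assumptions, not the actual script behind the Kaggle dataset.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000_000

# Illustrative generator: the column names match the benchmark dataset,
# but the categories and numeric ranges are assumed for the example.
synthetic = pd.DataFrame({
    "Date": pd.to_datetime("2022-01-01")
            + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
    "Domain": rng.choice(["RETAIL", "RESTAURANT", "MEDICAL", "EDUCATION"], n),
    "Location": rng.choice(["Goa", "Mathura", "Delhi", "Mumbai"], n),
    "Value": rng.integers(100, 10_000, n),
    "Transaction_count": rng.integers(1, 500, n),
})

# Writing a million rows to .xlsx is slow; CSV is the faster option.
synthetic.to_excel("bankdataset.xlsx", index=False)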

// Exploring the dataset with Pandas

We used Pandas to load the dataset in a Jupyter notebook and check its general structure, size, and missing values. Here is the code.

import pandas as pd

# Load the dataset and check its shape and missing values
df = pd.read_excel('bankdataset.xlsx')

print("Dataset shape:", df.shape)
print("Missing values:\n", df.isnull().sum())

df.head()

Here's the output.


If you want a quick reference for common operations when exploring datasets, check out this cheat sheet of essential Pandas functions.
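
Most of those routine checks are one-liners. As a quick illustration on this dataset (assuming the df loaded above):

# Routine first-pass inspection
df.info()                            # column dtypes and non-null counts
df.describe()                        # summary statistics for numeric columns
df['Domain'].value_counts().head()   # frequency of each business category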

Before running the benchmarks, let's look at how to set up the environment.

Setting Up the Benchmark Environment

All three tools – DuckDB, SQLite, and Pandas – were set up and run in the same Jupyter notebook to keep the test fair. This ensured consistent conditions for both execution timing and memory measurement.

First, we installed and loaded the required packages.

Here are the tools we needed:

  • pandas: for standard DataFrame operations
  • duckdb: for running SQL directly on the DataFrame
  • sqlite3: for managing the in-memory SQL database
  • time: for capturing execution time
  • memory_profiler: for measuring memory usage

# Install if any of them are not in your environment
!pip install duckdb --quiet

import pandas as pd
import duckdb
import sqlite3
import time
from memory_profiler import memory_usage

Let's now prepare the data in a format that can be shared across all three tools.

// Loading the data with Pandas

We'll use Pandas to load the dataset once, then share or register it with DuckDB and SQLite.

df = pd.read_excel('bankdataset.xlsx')

df.head()

Here is the output to confirm.


// Registering the data with DuckDB

DuckDB can query Pandas DataFrames directly. You don't have to convert anything – just register the DataFrame and query it. Here is the code.

# Register DataFrame as a DuckDB table
duckdb.register("bank_data", df)

# Query via DuckDB
duckdb.query("SELECT * FROM bank_data LIMIT 5").to_df()

Here's the output.


// Preparing the data for SQLite

Since SQLite doesn't read Excel files directly, we started by writing the Pandas DataFrame into an in-memory database. Then we ran a simple query to check the data.

conn_sqlite = sqlite3.connect(":memory:")

df.to_sql("bank_data", conn_sqlite, index=False, if_exists="replace")

pd.read_sql_query("SELECT * FROM bank_data LIMIT 5", conn_sqlite)

Here's the output.


How We Benchmarked the Tools

We ran the same four queries in DuckDB, SQLite, and Pandas to compare their performance. Each query represents a common analysis task that reflects how these tools are used in the real world.

// Ensuring a consistent setup

All three tools worked from the same in-memory dataset:

  • Pandas queried the DataFrame directly
  • DuckDB ran SQL queries directly against the registered DataFrame
  • SQLite kept a copy of the DataFrame in an in-memory database and ran SQL queries against it

This approach ensured that all three tools used the same data and ran under the same runtime conditions.

// Measuring execution time

To track execution time, each query was wrapped in a simple start/end timer using Python's time module. Only the query execution is timed; data loading and setup steps are excluded.

// Tracking memory usage

Alongside execution time, memory usage shows how each engine scales to large datasets.

memory_usage from memory_profiler is sampled immediately before and after each query to measure the additional RAM used.
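
This before-and-after pattern repeats for every engine and query below. It could equally be wrapped in a small helper; the sketch here uses a benchmark function of our own naming (not part of the original notebook) to make the measurement logic explicit:

import time
from memory_profiler import memory_usage

def benchmark(engine, query_name, fn, results):
    # Sample process memory (MB), time only the query call, then sample again
    mem_before = memory_usage(-1)[0]
    start = time.time()
    fn()  # the query itself; loading and setup happen outside this call
    end = time.time()
    mem_after = memory_usage(-1)[0]
    results.append({
        "engine": engine,
        "query": query_name,
        "time": round(end - start, 4),
        "memory": round(mem_after - mem_before, 4),
    })

In the sections that follow, the same steps are written out inline so each measurement stays visible.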

// Benchmark queries

Each engine was tested on the same everyday analysis tasks:

  1. Total transaction value: summing the Value column
  2. Group by domain: aggregating transaction counts for each domain
  3. Filter by location: filtering rows by location before aggregating
  4. Group by domain and location: multi-column aggregation computing an average

Benchmark Results

// Query 1: Total transaction value

Here we measure how Pandas, DuckDB, and SQLite perform when summing the Value column across the full dataset.

// Pandas performance

We calculate the total transaction value using .sum() on the Value column. Here is the code.

pandas_results = []

def pandas_q1():
    return df['Value'].sum()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q1()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Total transaction value",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
pandas_results

Here's the output.


// DuckDB performance

We calculate the total transaction value using a SQL SUM aggregation. Here is the code.

duckdb_results = []

def duckdb_q1():
    return duckdb.query("SELECT SUM(value) FROM bank_data").to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q1()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Total transaction value",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
duckdb_results

Here's the output.


// SQLite performance

We calculate the total transaction value by summing the value column. Here is the code.

sqlite_results = []

def sqlite_q1():
    return pd.read_sql_query("SELECT SUM(value) FROM bank_data", conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q1()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Total transaction value",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
sqlite_results

Here's the output.


// Performance analysis

Now let's compare execution time and memory usage. Here is the code.

import matplotlib.pyplot as plt


all_q1 = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)

fig, axes = plt.subplots(1, 2, figsize=(10,4))

all_q1.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
all_q1.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")

plt.tight_layout()
plt.show()

Here's the output.


Pandas is by far the fastest and most memory-efficient here, finishing almost instantly with minimal RAM. DuckDB is slower and uses more memory but still performs reasonably, while SQLite is both slow and very heavy in terms of memory use.

// Query 2: Group by domain

Here we measure how Pandas, DuckDB, and SQLite perform when grouping transactions by Domain and summing their counts.

// Pandas performance

We calculate the total transaction count per domain using .groupby() on the Domain column. Here is the code.

def pandas_q2():
    return df.groupby('Domain')['Transaction_count'].sum()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q2()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Group by domain",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})
[p for p in pandas_results if p["query"] == "Group by domain"]

Here's the output.


// DuckDB performance

We calculate the total transaction count per domain using SQL GROUP BY on the domain column. Here is the code.

def duckdb_q2():
    return duckdb.query("""
        SELECT domain, SUM(transaction_count) 
        FROM bank_data 
        GROUP BY domain
    """).to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q2()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Group by domain",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in duckdb_results if p["query"] == "Group by domain"]

Here's the output.


// SQLite performance

We calculate the total transaction count per domain using SQL GROUP BY on the in-memory table. Here is the code.

def sqlite_q2():
    return pd.read_sql_query("""
        SELECT domain, SUM(transaction_count) AS total_txn
        FROM bank_data
        GROUP BY domain
    """, conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q2()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Group by domain",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in sqlite_results if p["query"] == "Group by domain"]

Here's the output.


// Performance analysis

Now let's compare execution time and memory usage. Here is the code.

import pandas as pd
import matplotlib.pyplot as plt

groupby_results = [r for r in (pandas_results + duckdb_results + sqlite_results) 
                   if "Group by" in r["query"]]

df_groupby = pd.DataFrame(groupby_results)

fig, axes = plt.subplots(1, 2, figsize=(10,4))

df_groupby.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
df_groupby.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")

plt.tight_layout()
plt.show()

Here's the output.


DuckDB is the fastest by a wide margin, Pandas trails it with more memory use, while SQLite is heavy on both time and memory.

// Query 3: Filter by location (Goa)

Here we measure how Pandas, DuckDB, and SQLite perform when filtering the dataset for Location = 'Goa' and summing the transaction values.

// Pandas performance

We filter the rows where Location == 'Goa' and sum their values. Here is the code.

def pandas_q3():
    return df[df['Location'] == 'Goa']['Value'].sum()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q3()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Filter by location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in pandas_results if p["query"] == "Filter by location"]

Here's the output.


// DuckDB performance

We filter the transactions where location = 'Goa', then calculate their total value. Here is the code.

def duckdb_q3():
    return duckdb.query("""
        SELECT SUM(value) 
        FROM bank_data 
        WHERE location = 'Goa'
    """).to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q3()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Filter by location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in duckdb_results if p["query"] == "Filter by location"]

Here's the output.


// SQLite performance

We filter the transactions where location = 'Goa' and sum their values. Here is the code.

def sqlite_q3():
    return pd.read_sql_query("""
        SELECT SUM(value) AS total_value
        FROM bank_data
        WHERE location = 'Goa'
    """, conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q3()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Filter by location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in sqlite_results if p["query"] == "Filter by location"]

Here's the output.


// Performance analysis

Now let's compare execution time and memory usage. Here is the code.

import pandas as pd
import matplotlib.pyplot as plt

filter_results = [r for r in (pandas_results + duckdb_results + sqlite_results)
                  if r["query"] == "Filter by location"]

df_filter = pd.DataFrame(filter_results)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

df_filter.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False, title="Execution Time (s)")
df_filter.plot(x="engine", y="memory", kind="barh", color="salmon", ax=axes[1], legend=False, title="Memory Usage (MB)")

plt.tight_layout()
plt.show()

Here's the output.


DuckDB is the fastest and most efficient; Pandas is slower with high memory usage; and SQLite is slow but light on memory.

// Query 4: Group by domain and location

// Pandas performance

We calculate the average transaction value grouped by both Domain and Location. Here is the code.

def pandas_q4():
    return df.groupby(['Domain', 'Location'])['Value'].mean()

mem_before = memory_usage(-1)[0]
start = time.time()
pandas_q4()
end = time.time()
mem_after = memory_usage(-1)[0]

pandas_results.append({
    "engine": "Pandas",
    "query": "Group by domain & location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in pandas_results if p["query"] == "Group by domain & location"]

Here's the output.


// DuckDB performance

We calculate the average transaction value grouped by both domain and location. Here is the code.

def duckdb_q4():
    return duckdb.query("""
        SELECT domain, location, AVG(value) AS avg_value
        FROM bank_data
        GROUP BY domain, location
    """).to_df()

mem_before = memory_usage(-1)[0]
start = time.time()
duckdb_q4()
end = time.time()
mem_after = memory_usage(-1)[0]

duckdb_results.append({
    "engine": "DuckDB",
    "query": "Group by domain & location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in duckdb_results if p["query"] == "Group by domain & location"]

Here's the output.


// SQLite performance

We calculate the average transaction value grouped by both domain and location. Here is the code.

def sqlite_q4():
    return pd.read_sql_query("""
        SELECT domain, location, AVG(value) AS avg_value
        FROM bank_data
        GROUP BY domain, location
    """, conn_sqlite)

mem_before = memory_usage(-1)[0]
start = time.time()
sqlite_q4()
end = time.time()
mem_after = memory_usage(-1)[0]

sqlite_results.append({
    "engine": "SQLite",
    "query": "Group by domain & location",
    "time": round(end - start, 4),
    "memory": round(mem_after - mem_before, 4)
})

[p for p in sqlite_results if p["query"] == "Group by domain & location"]

Here's the output.


// Performance analysis

Now let's compare execution time and memory usage. Here is the code.

import pandas as pd
import matplotlib.pyplot as plt

gdl_results = [r for r in (pandas_results + duckdb_results + sqlite_results)
               if r["query"] == "Group by domain & location"]

df_gdl = pd.DataFrame(gdl_results)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

df_gdl.plot(x="engine", y="time", kind="barh", ax=axes[0], legend=False,
            title="Execution Time (s)")
df_gdl.plot(x="engine", y="memory", kind="barh", ax=axes[1], legend=False,
            title="Memory Usage (MB)", color="salmon")

plt.tight_layout()
plt.show()

Here's the output.


DuckDB handles the multi-column group-by fastest with moderate memory use, Pandas is slower with the highest memory use, and SQLite is by far the slowest with substantial memory use.

The Final Comparison Across All Queries

We compared the three engines head-to-head on both memory and speed. Let's look at execution time first. Here is the code.

import pandas as pd
import matplotlib.pyplot as plt

all_results = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)

measure_order = [
    "Total transaction value",
    "Group by domain",
    "Filter by location",
    "Group by domain & location",
]
engine_colors = {"Pandas": "#1f77b4", "DuckDB": "#ff7f0e", "SQLite": "#2ca02c"}

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()

for i, q in enumerate(measure_order):
    d = all_results[all_results["query"] == q]
    axes[i].barh(d["engine"], d["time"], 
                 color=[engine_colors[e] for e in d["engine"]])
    for y, v in enumerate(d["time"]):
        axes[i].text(v, y, f" {v:.3f}", va="center")
    axes[i].set_title(q, fontsize=10)
    axes[i].set_xlabel("Seconds")

fig.suptitle("Per-Measure Comparison — Execution Time", fontsize=14)
plt.tight_layout()
plt.show()

Here's the output.


This chart shows that DuckDB consistently posts the lowest times across almost all queries, except for the total transaction value query, where Pandas wins; SQLite is consistently the slowest across the board. Next, let's look at memory. Here is the code.

import pandas as pd
import matplotlib.pyplot as plt

all_results = pd.DataFrame(pandas_results + duckdb_results + sqlite_results)

measure_order = [
    "Total transaction value",
    "Group by domain",
    "Filter by location",
    "Group by domain & location",
]
engine_colors = {"Pandas": "#1f77b4", "DuckDB": "#ff7f0e", "SQLite": "#2ca02c"}

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()

for i, q in enumerate(measure_order):
    d = all_results[all_results["query"] == q]
    axes[i].barh(d["engine"], d["memory"], 
                 color=[engine_colors[e] for e in d["engine"]])
    for y, v in enumerate(d["memory"]):
        axes[i].text(v, y, f" {v:.1f}", va="center")
    axes[i].set_title(q, fontsize=10)
    axes[i].set_xlabel("MB")

fig.suptitle("Per-Measure Comparison — Memory Usage", fontsize=14)
plt.tight_layout()
plt.show()

Here's the output.


This chart shows that SQLite swings between best and worst on memory, Pandas spikes hardest in a couple of cases, while DuckDB stays steady across all queries. Overall, DuckDB proves to be the most balanced choice, delivering fast performance with moderate memory usage. Pandas shows extremes – sometimes fast, sometimes heavy – while SQLite struggles on speed and often lands on the wrong side of memory too.

Nate Rosidi is a data scientist and works in product strategy. He's also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
