
Integrating DuckDB & Python: An Analytics Guide


Image by Author

DuckDB is a fast, in-process analytical database designed for modern data analysis. It runs directly from your Python script, which means no separate server is needed, and it handles complex queries well thanks to its columnar storage and vectorized execution.

As understanding how to work with data becomes ever more important, today I want to show you how to build a Python workflow with DuckDB and explore its key features.

Let's dive in!

What is DuckDB?

DuckDB is a free, open-source, in-process OLAP database built for fast, local analytics. Unlike traditional databases that run as external services, DuckDB runs inside your application, with no server required. As an OLAP system, DuckDB stores data in columns (not rows, as OLTP systems do), which makes it highly efficient for analytical queries such as joins, aggregations, and groupings.

Think of DuckDB as a lightweight, analytics-optimized version of SQLite that combines the simplicity of a local database with the power of modern data warehousing. And this leads us to the next natural question...

What are the main DuckDB features?

Blazing-fast analytical queries

DuckDB delivers impressive performance on OLAP workloads, often surprising users who are used to traditional databases such as PostgreSQL. Unlike conventional OLAP systems that can be sluggish because of the volume of data they process, DuckDB relies on a columnar, vectorized execution engine. This design makes good use of the CPU cache and speeds up analytical queries.

Native SQL support + seamless language integration

DuckDB offers full support for complex SQL queries and exposes APIs in many languages, including Java, C, and C++. Its tight integration with Python and R makes it well suited to interactive data analysis. You can write queries directly in your preferred environment, with additional SQL syntax enhancements.

And the best part is that DuckDB is fully self-contained, with no external dependencies or setup headaches.

Free and open source

DuckDB is fully open source and actively maintained by a growing community of contributors. This supports rapid feature development and bug fixes. And yes, it's free to use. While future licensing changes are always a possibility, for now you get a powerful analytics engine at zero cost.

Now that we know its main features, let's get started with it!

Getting started with DuckDB

The DuckDB installation process depends slightly on your environment, but overall it is quick and easy. As DuckDB is an embedded database engine with no server requirements or external dependencies, setup usually takes just a few lines of code. You can find the complete installation guide in the official DuckDB documentation.

Prerequisites

Before diving in, make sure you have the following:

  • Python 3.13 or later
  • Basic understanding of SQL and data analysis in Python

You can easily install DuckDB in your environment by running the following command:

pip install duckdb

Working with DuckDB in Python

Once you have installed DuckDB, it is very easy to get started. You import DuckDB into your environment, then connect to an existing database or create a new one if needed.

For example:

import duckdb 
connection = duckdb.connect()

If no database file is passed to the connect() method, DuckDB creates a new in-memory database by default. That said, the simplest way to start running SQL queries is to use the sql() method directly.

# Source: Basic API usage
import duckdb
duckdb.sql('SELECT 42').show()

Running this command uses the global in-memory DuckDB instance within the Python module and returns a relation, a symbolic representation of the query.

Importantly, the query itself is not executed until you explicitly request the result, as shown below:

# Source: Execute SQL
results = duckdb.sql('SELECT 42').fetchall()
print(results)

"""
[(42,)]
"""

Now let's work with some real data. DuckDB supports a wide range of file formats, including CSV, JSON, and Parquet, and loading them is simple.

You can see how straightforward it is in the example below:

# Source: Python API
import duckdb
duckdb.read_csv('example.csv')               # read a CSV file into a Relation
duckdb.read_parquet('example.parquet')       # read a Parquet file into a Relation
duckdb.read_json('example.json')             # read a JSON file into a Relation
duckdb.sql('SELECT * FROM "example.csv"')    # directly query a CSV file

Working with external data sources in DuckDB

One of DuckDB's standout features is its ability to query external data files directly, without importing them into a database or loading the entire dataset into memory. Unlike traditional databases that require data to be ingested first, DuckDB supports an execution model that queries files in place and reads only the data each query requires.

This approach brings a few important benefits:

  • Minimal memory usage: Only the relevant parts of the file are read into memory.
  • No import/export overhead: Query your data in place, with no need to move or duplicate it.
  • Streamlined workflows: Query multiple files and formats with a single SQL statement.

To demonstrate DuckDB in use, we will work with a simple CSV file that you can obtain from the following Kaggle link.

To query the data, we can simply write a query that points to our file path.

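import duckdb
source = "path/to/your_file.csv"  # placeholder: path to the CSV file downloaded from Kaggle
# Query data directly from a CSV file
result = duckdb.query(f"SELECT * FROM '{source}'").fetchall()
print(result)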

We can now easily manipulate the data using SQL-like logic directly with DuckDB.

Filtering rows

To focus on specific subsets of the data, use the WHERE clause. It filters rows based on conditions built with comparison operators (>, <, =, <>, and so on) and logical operators (AND, OR, NOT) for more complex expressions.


# Select only the months with more than 500 total passengers
result = duckdb.query(f"SELECT * FROM '{source}' WHERE total_passengers > 500").fetchall()
print(result)

Sorting the results

Use the ORDER BY clause to sort the results by one or more columns. The default order is ascending (ASC), but you can specify descending (DESC). To sort by multiple columns, separate them with commas.

# Sort months by number of passengers
sorted_result = duckdb.query(f"SELECT * FROM '{source}' ORDER BY total_passengers DESC").fetchall()
print("\nMonths sorted by total traffic:")
print(sorted_result)

Adding calculated columns

Create new columns in your query using expressions and the AS keyword. Use arithmetic operators or built-in functions to transform the data. These columns appear in the results but do not affect the original file.

# Express the total number of passengers in thousands
bonus_result = duckdb.query(f"""
   SELECT
       month,
       total_passengers,
       total_passengers/1000 AS traffic_in_thousands
   FROM '{source}'
""").fetchall()
print("\nTraffic in thousands:")
print(bonus_result)

Using CASE expressions

For more complex transformations, SQL provides the CASE expression. It works much like if-else statements in programming languages, letting you apply conditional logic in your queries.

segmented_result = duckdb.query(f"""
   SELECT
       month,
       total_passengers,
       CASE
           WHEN total_passengers >= 100 THEN 'HIGH'
           WHEN total_passengers >= 50 THEN 'MEDIUM'
           ELSE 'LOW'
       END AS affluency
   FROM '{source}'
""").fetchall()
print("\nMonths by passenger affluency:")
print(segmented_result)

Conclusion

DuckDB is a high-performance OLAP database built for data professionals who need to explore and analyze large datasets efficiently. Its SQL engine runs complex analytical queries directly in your environment, with no separate server needed. With seamless support for Python, R, Java, C++, and more, DuckDB fits naturally into your existing workflow, whatever your preferred language.

You can check the full code in the following GitHub repository.

Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and currently works in data science applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes on all things AI, covering the applications of the ongoing explosion in the field.
