
Faster Is Not Always Better: Choosing the Right Strategy for Inserting Data into PostgreSQL with Python (+Benchmarks)

It is absolutely possible to insert 2M records per second into Postgres. But instead of chasing micro-benchmarks, this article steps back to ask the more important question: what actually determines which insert strategy fits our work?

We will look at 5 ways to insert data into Postgres using Python. The goal is not just to measure insert speed and crown a winner, but to understand the trade-offs between throughput, safety, and ease of maintenance.

By the end, you will understand:

  • the strengths and weaknesses of the ORM, Core, and driver levels
  • where performance actually matters
  • how to choose the right tool without over-engineering

Why fast inserts matter

High-volume insert workloads appear everywhere:

  • loading millions of records
  • synchronizing data with external APIs
  • backfilling analytics tables
  • importing events into a warehouse

Small inefficiencies add up quickly. Converting a 3-minute insert job into a 10-second one can reduce system load, free up resources, and improve overall throughput.

That said, faster doesn't automatically mean better. If the workload is small, sacrificing clarity and safety for small gains rarely pays off.

Understanding when performance actually matters is the real goal.


Which tools do we insert with?

To talk to our Postgres database we need a database driver. In our case this is psycopg3, with SQLAlchemy layered on top. Here are the quick differences:

psycopg3 (driver)

psycopg3 is a low-level PostgreSQL driver for Python. It is a very thin layer with little overhead that talks directly to Postgres.
The trade-off is responsibility: you write the SQL yourself, manage batching, and handle correctness on your own.

SQLAlchemy

SQLAlchemy sits on top of database drivers such as psycopg3 and offers two layers:

1) SQLAlchemy Core
This is the SQL expression and execution layer. It is database-agnostic, which means you write Python expressions and Core translates them to SQL in the appropriate database dialect (PostgreSQL / SQL Server / SQLite) and binds parameters safely.
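A minimal Core sketch, assuming an illustrative `users` table and database URL:

```python
# Sketch: SQLAlchemy Core insert from a list of dicts.
# The users table and database URL are illustrative assumptions.
from sqlalchemy import Column, MetaData, String, Table, create_engine, insert

metadata = MetaData()
users = Table(
    "users",
    metadata,
    Column("name", String),
    Column("email", String),
)

def insert_users(url: str, rows: list[dict]) -> None:
    # e.g. url = "postgresql+psycopg://user:pass@localhost/benchmark"
    engine = create_engine(url)
    with engine.begin() as conn:  # one transaction for the whole batch
        conn.execute(insert(users), rows)  # executemany-style batch insert
```

Core compiles the same `insert(users)` expression to the dialect of whatever engine you hand it, and parameter binding is handled for you.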

2) SQLAlchemy ORM
The ORM is built on top of Core and abstracts even more. You write Python classes for tables, and it keeps track of object state and manages relationships. The ORM is the most productive and safe, but all that bookkeeping introduces overhead, especially in bulk operations.

In summary:
All three exist on a spectrum. On one end there is the ORM, which takes a lot of work off your hands and provides a lot of safety at the cost of overhead. On the other end there is the driver: bare bones, but it delivers raw speed. Core sits in the middle and gives you a good balance of safety, performance and control.

Simply put:

  • the ORM lets you use Core more easily
  • Core lets you use the driver more safely and database-agnostically

Benchmark

To keep the benchmark fair:

  • Each method receives data in its own way
    (ORM objects, Core dictionaries, driver tuples)
  • only the time spent moving data from Python to Postgres is measured
  • no method is penalized for data-conversion work
  • the database runs on the same machine as our Python script; this prevents the benchmark from being bottlenecked by network latency

The goal is not to “crown the fastest insert” but to understand what each method does well.
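The measurement rule above can be sketched as a small harness; the insert function here is a stand-in for whichever of the five methods is being timed:

```python
# Sketch: measure only the time spent moving data to the database.
# Data preparation happens before the clock starts.
import time

def timed_insert(insert_fn, prepared_batch) -> float:
    start = time.perf_counter()
    insert_fn(prepared_batch)  # only the insert call is measured
    return time.perf_counter() - start
```

Each method receives `prepared_batch` already converted to its native shape (objects, dicts, or tuples), so conversion cost never leaks into the measurement.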

Insert times by batch size for the 5 different methods

1) Is faster always better?

Which is better: a Ferrari or a Jeep?

This depends on the problem you are trying to solve.
If you are driving through a forest, take the Jeep. If you want to be first across the finish line, the Ferrari wins.

The same applies to inserts. Shaving 300 milliseconds off a 10-second insert may not justify the added complexity and risk. In other cases, that gain is absolutely worth it.

In some cases, the fastest method on paper turns out to be very slow once you count:

  • maintenance cost
  • correctness guarantees
  • cognitive load

2) What is your starting point?

The right insert strategy is less about row counts and more about what your data already looks like.

ORM, Core and driver are not competing tools. They are designed for different purposes:

Method → Purpose:

  • ORM (add_all): business logic, correctness, small batches
  • ORM (bulk_save_objects): ORM objects at scale
  • Core (execute): structured data, light abstraction
  • Driver (executemany): raw rows, high throughput
  • Driver (COPY): bulk import, ETL, firehose workloads

The ORM excels in complex CRUD applications where correctness and safety are paramount. Think websites and APIs: performance is often “good enough” and clarity matters most.

Core shines in situations where you want control without writing raw SQL. Think data ingestion, batch operations, analytics pipelines and performance-sensitive services such as ETL jobs.
You know exactly what SQL you want, but you don't want to handle connections or dialect differences yourself.

The driver is optimized for high performance: very large writes such as inserting millions of rows of ML training data, bulk loads, database maintenance, migrations, or low-latency import services.

The driver minimizes abstraction and Python overhead and gives you maximum power. The downside is that you write the SQL by hand, which makes mistakes easier.


3) Don't judge by snapshots

The ORM is not slow. COPY is not magic.

Performance issues arise when we force data into abstractions it wasn't designed for:

  • using Core with SQLAlchemy ORM objects → slow due to conversion overhead
  • using the ORM with tuples → awkward and fragile
  • a full ORM in an ETL process → wasted overhead

Sometimes going down to a lower level is what genuinely reduces the work.


When to choose?

Rule of thumb:

Context → Use it when…

  • ORM: building an application (correctness and productivity)
  • Core: moving or transforming data (balance of safety and speed)
  • Driver: pushing the limits of performance (raw power and full control)

The conclusion

In data and AI systems, performance is rarely limited by the database. It is limited by how well our code matches the state of the data and the abstractions we choose.

ORM, Core, and driver-level APIs span a spectrum from high-level safety to low-level power. They are all excellent tools when used in the context for which they were designed.

The real challenge is not knowing what is fast; it's choosing the right tool for your situation.


I hope this article was as clear as I intended, but if not, please let me know what I can clarify further. In the meantime, check out my other articles on all kinds of programming topics.

Enjoy coding!

— Mike

PS: Like what I do? Follow me!
