DuckDB: Fast Analytics Without Moving Data


DuckDB reached 1.0 in June 2024 and shipped 1.1 that September. After years of being “that curious thing researchers use”, the project has reached a maturity point where it starts to displace tools nobody expected it to compete with: pandas, local Spark and, in many cases, the cloud data warehouse. This article is an attempt to explain, without hype, where it genuinely fits and where it does not.

What it is and why it matters

The short description is easy: DuckDB is to analytics what SQLite is to transactional databases. An embedded library, no server, no daemon, no credentials, loaded inside your application or notebook process and exposing a complete SQL engine. The long description is more interesting, because what makes it different is not the shape but the internal design.

DuckDB is columnar and vectorised. Instead of processing row by row, it groups values into vectors of a few thousand elements and applies each operation over the whole vector. That lets it exploit the CPU’s SIMD instructions, keep data warm in L1/L2 cache and, in practice, get close to the theoretical limit of the hardware. For analytical workloads — aggregations, joins, windows, massive scans — that architectural choice is the difference between seconds and minutes.

The OLAP vs OLTP distinction, again

It helps to remember why databases are not interchangeable. An OLTP system like PostgreSQL or MySQL is designed for short, concurrent transactions that touch few rows each: insert an order, update a balance, read a user profile. The row is the natural unit because you usually want it whole. B-tree indexes, the transaction log and MVCC are all optimised for that workload.

An analytical engine lives at the opposite extreme. Queries read millions of rows but only three or four columns, joins are large, aggregations sweep the entire dataset and write concurrency is low or non-existent. Storing data by columns means reading only what the query asks for, compressing much better — values within a column look alike — and applying vectorised operations. That is why DuckDB beats Postgres in these scenarios: not because Postgres is slow, but because it is optimised for something else.

The classic table illustrates the magnitude:

| Task | pandas | DuckDB | Local Spark |
|---|---|---|---|
| 1 GB CSV query | 30 s | 2 s | 10 s |
| 10 GB Parquet aggregation | OOM | 5 s | 20 s |
| Join of two 100 M-row tables | OOM | 15 s | 40 s |

On a decent laptop, DuckDB resolves in seconds what pandas cannot even load and what Spark takes tens of seconds to coordinate between executors.

What “local analytics” means in practice

The phrase local analytics sounds modest but describes an important shift in stance. For years, the default answer to “I have data to analyse” was to push it into a warehouse — Snowflake, BigQuery, Redshift — even if it was ten gigabytes. You paid for storage, you paid for compute per query, you paid for egress when you needed the data back and you paid for the complexity of moving data that fit perfectly in the memory of a modern machine.

DuckDB inverts that logic. A laptop with 32 GB of RAM and an NVMe SSD is capable of running queries over hundreds of gigabytes of Parquet, streaming from disk, without filling memory, without a cluster, without a bill. Reading files directly is the part that surprises people the first time: you point at a .parquet, at a glob of several, at an s3:// path or even at an https:// URL and the engine resolves access, schema inference, predicate push-downs and column projection for you. Data is not moved into an internal table; it is read where it lives. This changes the workflow completely: exploration, prototyping, ETL, CI validations, ML preprocessing and derived-dataset generation stop being infrastructure projects and become short scripts.

How it fits into a real project

The canonical example is replacing pandas during data exploration and preparation. The API is deliberately friendly to the surrounding ecosystem: you can run SQL over a pandas DataFrame, over a Polars one, over an Arrow RecordBatch or over files on disk, and return the result in any of those formats, almost always without copying memory.

```python
import duckdb

# Open an in-memory connection and query remote Parquet directly
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

df = con.execute("""
    SELECT category, SUM(amount) AS total, COUNT(*) AS n
    FROM 's3://bucket/sales/*.parquet'
    WHERE date >= '2024-01-01'
    GROUP BY category
    ORDER BY total DESC
""").df()
```

That single block covers three ideas that used to need separate tools: streaming reads from a data lake, analytical aggregation across several files and delivery of the result as a DataFrame ready to plot. Extensions (httpfs, spatial, json, sqlite, postgres, icu) install in one line and broaden the engine without touching the base.

In more formal teams, the dbt-duckdb adapter has become a popular choice for local development: you compile, test and materialise models against a .duckdb file on your machine before promoting to production. And if the volume grows, MotherDuck offers the same engine in the cloud with near-identical syntax, which lets you scale without rewriting.

Where DuckDB is not the answer

Honesty asks us to name the limits. DuckDB is not multi-user with concurrent writes: one writer at a time and many readers. It is not a transactional engine for end-user applications; Postgres or SQLite still occupy that slot. It is not real-time streaming in the style of Flink or ksqlDB. And when a production dataset climbs past the hundreds of gigabytes and is consulted by several teams with an SLA, a managed warehouse is still the right choice: you pay to not operate the infrastructure yourself.

Conclusion

What makes DuckDB interesting at this moment of maturity is not that it is fast — fast engines have existed for decades — but that it has removed the friction between having an analytical problem and solving it. There is no cluster to stand up, no warehouse to provision, no data to move, no invoice to watch. It installs with pip, reads files where they already sit and returns results in seconds. For the vast majority of workloads below a hundred gigabytes — which is where most real analysis in small and mid-sized companies actually lives — that combination makes the alternatives suddenly look disproportionate. It does not replace the warehouse; it redefines where one begins to be necessary.
