DuckDB: Fast Analytics Without Moving Data


DuckDB reached 1.0 in June 2024 and shipped 1.1 that September. After years of being “that curious thing researchers use”, the project has reached a maturity point where it starts to displace tools nobody expected it to compete with: pandas, local Spark and, in many cases, the cloud data warehouse. This article is an attempt to explain, without hype, where it genuinely fits and where it does not.

What it is and why it matters

The short description is easy: DuckDB is to analytics what SQLite is to transactional databases. An embedded library, no server, no daemon, no credentials, loaded inside your application or notebook process and exposing a complete SQL engine. The long description is more interesting, because what makes it different is not the shape but the internal design.

DuckDB is columnar and vectorised. Instead of processing row by row, it groups values into vectors of a few thousand elements and applies each operation over the whole vector. That lets it exploit the CPU’s SIMD instructions, keep data warm in L1/L2 cache and, in practice, get close to the theoretical limit of the hardware. For analytical workloads — aggregations, joins, windows, massive scans — that architectural choice is the difference between seconds and minutes.

The OLAP vs OLTP distinction, again

It helps to remember why databases are not interchangeable. An OLTP system like PostgreSQL or MySQL is designed for short, concurrent transactions that touch few rows each: insert an order, update a balance, read a user profile. The row is the natural unit because you usually want it whole. B-tree indexes, the transaction log and MVCC are all optimised for that workload.

An analytical engine lives at the opposite extreme. Queries read millions of rows but only three or four columns, joins are large, aggregations sweep the entire dataset and write concurrency is low or non-existent. Storing data by columns means reading only what the query asks for, compressing much better — values within a column look alike — and applying vectorised operations. That is why DuckDB beats Postgres in these scenarios: not because Postgres is slow, but because it is optimised for something else.

The classic table illustrates the magnitude:

| Task | pandas | DuckDB | Local Spark |
|---|---|---|---|
| 1 GB CSV query | 30 s | 2 s | 10 s |
| 10 GB Parquet aggregation | OOM | 5 s | 20 s |
| Join of two 100 M-row tables | OOM | 15 s | 40 s |

On a decent laptop, DuckDB resolves in seconds what pandas cannot even load and what Spark takes tens of seconds to coordinate between executors.

What “local analytics” means in practice

The phrase local analytics sounds modest but describes an important shift in stance. For years, the default answer to “I have data to analyse” was to push it into a warehouse — Snowflake, BigQuery, Redshift — even if it was ten gigabytes. You paid for storage, you paid for compute per query, you paid for egress when you needed the data back and you paid for the complexity of moving data that fit perfectly in the memory of a modern machine.

DuckDB inverts that logic. A laptop with 32 GB of RAM and an NVMe SSD is capable of running queries over hundreds of gigabytes of Parquet, streaming from disk, without filling memory, without a cluster, without a bill. Reading files directly is the part that surprises people the first time: you point at a .parquet, at a glob of several, at an s3:// path or even at an https:// URL and the engine resolves access, schema inference, predicate push-downs and column projection for you. Data is not moved into an internal table; it is read where it lives. This changes the workflow completely: exploration, prototyping, ETL, CI validations, ML preprocessing and derived-dataset generation stop being infrastructure projects and become short scripts.

How it fits into a real project

The canonical example is replacing pandas during data exploration and preparation. The API is deliberately friendly to the surrounding ecosystem: you can run SQL over a pandas DataFrame, over a Polars one, over an Arrow RecordBatch or over files on disk, and return the result in any of those formats, almost always without copying memory.

```python
import duckdb

# Open an in-memory connection and query remote Parquet directly
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

df = con.execute("""
    SELECT category, SUM(amount) AS total, COUNT(*) AS n
    FROM 's3://bucket/sales/*.parquet'
    WHERE date >= '2024-01-01'
    GROUP BY category
    ORDER BY total DESC
""").df()
```

That single block covers three ideas that used to need separate tools: streaming reads from a data lake, analytical aggregation across several files and delivery of the result as a DataFrame ready to plot. Extensions (httpfs, spatial, json, sqlite, postgres, icu) install in one line and broaden the engine without touching the base.

In more formal teams, the dbt-duckdb adapter has become a popular choice for local development: you compile, test and materialise models against a .duckdb file on your machine before promoting to production. And if the volume grows, MotherDuck offers the same engine in the cloud with near-identical syntax, which lets you scale without rewriting.

Where DuckDB is not the answer

Honesty asks us to name the limits. DuckDB is not multi-user with concurrent writes: one writer at a time and many readers. It is not a transactional engine for end-user applications; Postgres or SQLite still occupy that slot. It is not real-time streaming in the style of Flink or ksqlDB. And when a production dataset climbs past the hundreds of gigabytes and is consulted by several teams with an SLA, a managed warehouse is still the right choice: you pay to not operate the infrastructure yourself.

Conclusion

What makes DuckDB interesting at this moment of maturity is not that it is fast — fast engines have existed for decades — but that it has removed the friction between having an analytical problem and solving it. There is no cluster to stand up, no warehouse to provision, no data to move, no invoice to watch. It installs with pip, reads files where they already sit and returns results in seconds. For the vast majority of workloads below a hundred gigabytes — which is where most real analysis in small and mid-sized companies actually lives — that combination makes the alternatives suddenly look disproportionate. It does not replace the warehouse; it redefines where one begins to be necessary.
