---
# Documentation: https://sourcethemes.com/academic/docs/managing-content/

title: "What is DuckDB"
subtitle: ""
summary: "DuckDB is an open-source embedded database specialized for OLAP. It requires no server and runs with just pip install. It can query CSV, Parquet, and JSON directly with SQL. Its columnar storage delivers an analytical environment faster than Pandas and lighter than Spark on a single machine."
tags: ["DuckDB", "OLAP", "Data Analysis"]
categories: ["DuckDB", "Data Analysis"]
url: what-is-duckdb
date: 2026-03-01
featured: false
draft: false

# Featured image
# To use, add an image named `featured.jpg/png` to your pages folder.
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
image:
  caption: ""
  focal_point: ""
  preview_only: false
---

## Introduction

[DuckDB](https://duckdb.org/) is an open-source embedded database specialized for analytical queries (OLAP). It requires no server setup and installs with a single command: `pip install duckdb`. Created at CWI (Centrum Wiskunde & Informatica) in the Netherlands and first released in 2019, it ranks among the most talked-about open-source databases as of 2025.

## Overview and Positioning of DuckDB

If you had to describe DuckDB in one phrase, it would be "**the OLAP version of SQLite**." Like SQLite, it is serverless and runs as a single file, but the operations it excels at are fundamentally different.

| Item | SQLite | DuckDB |
|------|--------|--------|
| Primary use | App-embedded DB (OLTP) | Data analysis (OLAP) |
| Storage format | Row-oriented | Column-oriented |
| Strengths | Point queries, row inserts/updates | Aggregation, filtering, JOINs on large datasets |
| Read speed (aggregation) | Slow | Very fast |
| Supported formats | SQLite's own file format | CSV, Parquet, JSON, Arrow |
| Language support | Many | Python, R, Go, Java, Rust, etc. |

## OLTP vs OLAP — Why They Are Fundamentally Different

Database processing patterns fall broadly into two categories.

**OLTP (Online Transaction Processing)**

- Quickly insert, update, and retrieve individual rows
- Examples: e-commerce order processing, bank transfers
- Row-oriented storage excels

**OLAP (Online Analytical Processing)**

- Aggregate and analyze large volumes of rows at once
- Examples: monthly sales summaries, user behavior analysis
- Column-oriented storage excels


DuckDB is designed for the latter: OLAP workloads. It excels at operations such as reading only a few specific columns out of 100 million rows and aggregating them quickly.
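
The difference between the two layouts can be sketched in plain Python. This is a toy illustration with made-up order data, not how DuckDB or SQLite store data internally; real engines add compression, vectorized execution, and on-disk formats.

```python
# Made-up sample data for illustration.
rows = [
    {"order_id": 1, "customer": "alice", "amount": 120},
    {"order_id": 2, "customer": "bob", "amount": 80},
    {"order_id": 3, "customer": "alice", "amount": 200},
]


def find_order(order_id):
    """Row-oriented access: each record is kept together, so a point lookup
    returns every field of one order at once (the OLTP pattern)."""
    for row in rows:
        if row["order_id"] == order_id:
            return row
    return None


# Column-oriented layout: the values of each column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column reads only that column's values (the OLAP pattern).
total_amount = sum(columns["amount"])

print(find_order(2))
print(total_amount)
```

With three rows the difference is invisible, but when a table has many columns and millions of rows, skipping the columns a query does not need is where a columnar engine saves most of its work.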

## Why DuckDB Is Getting So Much Attention Now

### The Limitations of Pandas

Pandas is synonymous with data analysis in Python, but its limitations become apparent with larger datasets.

- **Memory constraints**: The entire dataset must fit in memory
- **Processing speed**: No column-level optimization, primarily single-threaded
- **Complex queries**: Combinations of GroupBy and JOIN are cumbersome to express

DuckDB addresses each of these problems. It queries Parquet and CSV files directly on disk, runs queries in parallel across CPU cores on columnar data, and handles datasets that exceed available memory through out-of-core processing.

### The Heaviness of Spark

Apache Spark is the go-to for big data, but it is often overkill for single-machine use cases.

- Cluster setup is required to get its full benefit
- JVM overhead
- Slow startup

DuckDB installs with just `pip install duckdb` and, for data that fits on a single machine, can deliver analytical speeds comparable to Spark.

## Key Use Cases

| Use Case | Example |
|----------|---------|
| Local data exploration | Interactively explore CSV/Parquet in Jupyter Notebook |
| ETL pipelines | Intermediate processing for data transformation and cleansing |
| BI and reporting | Fully local data transformation with dbt + DuckDB |
| Cloud storage analysis | Query files on S3/GCS directly without downloading |
| Embedded analytics | Embed a SQL engine within an application |
| AI/ML preprocessing | Transform and aggregate data for vector databases |

## Getting Started

Here are the steps for installation and basic verification.

**Installation**

```bash
pip install duckdb
```

No additional dependencies are needed; this is all it takes. You can start executing SQL immediately after installation. The check below calls `fetchdf()`, which returns a Pandas DataFrame, so it also needs `pandas` installed.

```bash
$ python3 -c "
import duckdb
result = duckdb.sql('SELECT 42 AS answer, version() AS duckdb_version').fetchdf()
print(result)
"
```

Output:

```text
   answer duckdb_version
0      42         v1.4.4
```

The CLI version works the same way.

```bash
$ duckdb --version
v1.4.4 (Andium) 6ddac802ff

$ duckdb -c "SELECT 'Hello, DuckDB!' AS greeting;"
┌────────────────┐
│    greeting    │
│    varchar     │
├────────────────┤
│ Hello, DuckDB! │
└────────────────┘
```

There is no server to start; a single command is enough to run SQL.

## DuckDB Ecosystem

DuckDB is more than a standalone tool: it integrates with a wide range of surrounding tools.

```text
DuckDB
  ├── Input
  │     ├── CSV / TSV (read_csv_auto)
  │     ├── Parquet (read_parquet)
  │     ├── JSON (read_json_auto)
  │     ├── S3 / GCS (httpfs extension)
  │     ├── Azure Blob (azure extension)
  │     └── Apache Iceberg (iceberg extension)
  ├── Output
  │     ├── Pandas DataFrame
  │     ├── Polars DataFrame
  │     └── Apache Arrow
  └── Integration Tools
        ├── dbt (dbt-duckdb adapter)
        ├── MotherDuck (cloud version)
        └── DuckDB-WASM (browser version)
```

## Summary

The reasons to use DuckDB are clear.

- No server required: get started with just `pip install duckdb`
- Query CSV / Parquet / JSON directly with SQL
- Faster than Pandas, lighter than Spark
- Runs well even in free-tier environments

In the next article, we will go from installation through basic operations, working hands-on with both the CLI and the Python API.
