Introduction
Apache Iceberg is an open table format designed for large-scale analytical workloads, providing features such as schema management, time travel, and partition management.
DuckDB added write support for Iceberg tables via its Iceberg extension starting with v1.4.0, and extended support to UPDATE and DELETE operations in v1.4.2. This means DuckDB alone can now handle the entire lifecycle of Iceberg tables.
In this article, I use DuckDB v1.4.4 to verify the following:
- Reading tables with iceberg_scan()
- Inspecting metadata with iceberg_snapshots() / iceberg_metadata()
- Time travel using snapshot_from_id / snapshot_from_timestamp
- CREATE TABLE / INSERT / UPDATE / DELETE via a REST catalog
For official documentation, see Iceberg Extension – DuckDB.
Setting Up the Iceberg Extension
INSTALL iceberg;
LOAD iceberg;
The extension is automatically downloaded on first run. If already installed, LOAD iceberg; is all you need.
The available functions are as follows:
SELECT function_name, function_type
FROM duckdb_functions()
WHERE function_name LIKE 'iceberg%'
ORDER BY function_name;
┌──────────────────────────┬───────────────┐
│      function_name       │ function_type │
├──────────────────────────┼───────────────┤
│ iceberg_metadata         │ table         │
│ iceberg_scan             │ table         │
│ iceberg_snapshots        │ table         │
│ iceberg_table_properties │ table         │
│ iceberg_to_ducklake      │ table         │
└──────────────────────────┴───────────────┘
Reading Tables with iceberg_scan
For test data, I created a local Iceberg table using PyIceberg and inserted fruit pricing data in two batches.
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, IntegerType, StringType, DoubleType
import pyarrow as pa
catalog = SqlCatalog(
'local',
**{
'uri': 'sqlite:////tmp/iceberg_test/catalog.db',
'warehouse': 'file:///tmp/iceberg_test/warehouse',
}
)
schema = Schema(
NestedField(field_id=1, name='id', field_type=IntegerType(), required=True),
NestedField(field_id=2, name='name', field_type=StringType()),
NestedField(field_id=3, name='price', field_type=DoubleType()),
NestedField(field_id=4, name='category', field_type=StringType()),
)
table = catalog.create_table('default.products', schema=schema)
# Batch 1 (snapshot 1)
pa_schema = pa.schema([
pa.field('id', pa.int32(), nullable=False),
pa.field('name', pa.string()),
pa.field('price', pa.float64()),
pa.field('category', pa.string()),
])
table.append(pa.table({
'id': pa.array([1, 2, 3, 4, 5], type=pa.int32()),
'name': pa.array(['Apple', 'Banana', 'Cherry', 'Date', 'Elderberry']),
'price': pa.array([1.50, 0.80, 3.00, 5.00, 8.00]),
'category': pa.array(['Fruit', 'Fruit', 'Fruit', 'Fruit', 'Fruit']),
}, schema=pa_schema))
# Batch 2 (snapshot 2)
table.append(pa.table({
'id': pa.array([6, 7, 8], type=pa.int32()),
'name': pa.array(['Fig', 'Grape', 'Honeydew']),
'price': pa.array([4.50, 2.20, 6.80]),
'category': pa.array(['Fruit', 'Fruit', 'Melon']),
}, schema=pa_schema))
In DuckDB, you read the table by specifying the path to the metadata JSON file.
LOAD iceberg;
SELECT * FROM iceberg_scan(
'/tmp/iceberg_test/warehouse/default/products/metadata/00002-ce5e4a45-0c63-4322-9af4-9f0416977efd.metadata.json'
)
ORDER BY id;
┌───────┬────────────┬────────┬──────────┐
│  id   │    name    │ price  │ category │
│ int32 │  varchar   │ double │ varchar  │
├───────┼────────────┼────────┼──────────┤
│     1 │ Apple      │    1.5 │ Fruit    │
│     2 │ Banana     │    0.8 │ Fruit    │
│     3 │ Cherry     │    3.0 │ Fruit    │
│     4 │ Date       │    5.0 │ Fruit    │
│     5 │ Elderberry │    8.0 │ Fruit    │
│     6 │ Fig        │    4.5 │ Fruit    │
│     7 │ Grape      │    2.2 │ Fruit    │
│     8 │ Honeydew   │    6.8 │ Melon    │
└───────┴────────────┴────────┴──────────┘
Standard SQL syntax works as-is.
SELECT category, COUNT(*) AS count, ROUND(AVG(price), 2) AS avg_price, MAX(price) AS max_price
FROM iceberg_scan('/tmp/iceberg_test/warehouse/default/products/metadata/00002-...metadata.json')
GROUP BY category
ORDER BY category;
┌──────────┬───────┬───────────┬───────────┐
│ category │ count │ avg_price │ max_price │
├──────────┼───────┼───────────┼───────────┤
│ Fruit    │     7 │      3.57 │       8.0 │
│ Melon    │     1 │       6.8 │       6.8 │
└──────────┴───────┴───────────┴───────────┘
Inspecting Snapshots and Metadata
iceberg_snapshots()
SELECT sequence_number, snapshot_id, timestamp_ms
FROM iceberg_snapshots('/tmp/iceberg_test/warehouse/default/products/metadata/00002-...metadata.json');
┌─────────────────┬─────────────────────┬─────────────────────────┐
│ sequence_number │     snapshot_id     │      timestamp_ms       │
├─────────────────┼─────────────────────┼─────────────────────────┤
│               1 │ 6271676225834788955 │ 2026-03-02 07:09:33.521 │
│               2 │ 6235383406797039698 │ 2026-03-02 07:09:33.546 │
└─────────────────┴─────────────────────┴─────────────────────────┘
Two snapshots were recorded, corresponding to the two append() calls.
iceberg_metadata()
This function lets you inspect details at the data file level.
SELECT manifest_sequence_number, status, file_path, file_format, record_count
FROM iceberg_metadata('/tmp/iceberg_test/warehouse/default/products/metadata/00002-...metadata.json')
ORDER BY manifest_sequence_number;
┌──────────────────────────┬────────┬──────────────────────────────────────────────────┬─────────────┬──────────────┐
│ manifest_sequence_number │ status │                    file_path                     │ file_format │ record_count │
├──────────────────────────┼────────┼──────────────────────────────────────────────────┼─────────────┼──────────────┤
│                        1 │ ADDED  │ file:///tmp/.../data/00000-0-92d0c825-...parquet │ PARQUET     │            5 │
│                        2 │ ADDED  │ file:///tmp/.../data/00000-0-ca64ef58-...parquet │ PARQUET     │            3 │
└──────────────────────────┴────────┴──────────────────────────────────────────────────┴─────────────┴──────────────┘
The Parquet files corresponding to each snapshot are registered with an ADDED status. This reveals Iceberg’s internal structure: a hierarchy of snapshots, manifests, and data files.
Iceberg Table
├── metadata/
│   ├── 00000-....metadata.json  ← Table definition
│   ├── 00001-....metadata.json  ← Snapshot 1
│   ├── 00002-....metadata.json  ← Snapshot 2 (latest)
│   ├── snap-xxxx.avro           ← Manifest list
│   └── xxxx-m0.avro             ← Manifest (data file listing)
└── data/
    ├── 00000-0-xxxx.parquet     ← Batch 1 (5 rows)
    └── 00000-0-yyyy.parquet     ← Batch 2 (3 rows)
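This hierarchy can be sketched as plain Python data structures (an illustrative model, not PyIceberg's actual classes): each snapshot references a manifest list, each manifest lists data files, and reading the table at a snapshot means collecting every data file reachable from it. The IDs mirror the snapshot listing above; the file names are placeholders.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    record_count: int

@dataclass
class Manifest:
    # A manifest lists the data files tracked for the table
    data_files: list

@dataclass
class Snapshot:
    snapshot_id: int
    # The manifest list covers ALL manifests visible at this snapshot,
    # not only the ones this snapshot added
    manifest_list: list

def files_at(snapshot):
    """Collect every data file reachable from a snapshot."""
    return [f for m in snapshot.manifest_list for f in m.data_files]

# Mirror the example table: batch 1 (5 rows), then batch 2 (3 rows)
m1 = Manifest([DataFile("data/00000-0-xxxx.parquet", 5)])
m2 = Manifest([DataFile("data/00000-0-yyyy.parquet", 3)])
snap1 = Snapshot(6271676225834788955, [m1])
snap2 = Snapshot(6235383406797039698, [m1, m2])  # latest: sees both batches

print(sum(f.record_count for f in files_at(snap1)))  # 5
print(sum(f.record_count for f in files_at(snap2)))  # 8
```

This is why time travel is cheap in Iceberg: older snapshots keep pointing at the data files they could see, so reading the past is just walking a different manifest list.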
Time Travel
Specifying a Snapshot with snapshot_from_id
-- Snapshot 1 (5 rows)
SELECT * FROM iceberg_scan(
'/tmp/iceberg_test/warehouse/default/products/metadata/00002-...metadata.json',
snapshot_from_id=6271676225834788955
)
ORDER BY id;
┌───────┬────────────┬────────┬──────────┐
│  id   │    name    │ price  │ category │
├───────┼────────────┼────────┼──────────┤
│     1 │ Apple      │    1.5 │ Fruit    │
│     2 │ Banana     │    0.8 │ Fruit    │
│     3 │ Cherry     │    3.0 │ Fruit    │
│     4 │ Date       │    5.0 │ Fruit    │
│     5 │ Elderberry │    8.0 │ Fruit    │
└───────┴────────────┴────────┴──────────┘
While the latest snapshot contains 8 rows, this query retrieved the 5 rows from the state at snapshot 1.
Specifying a Timestamp with snapshot_from_timestamp
-- Specify a point in time before the second INSERT
SELECT * FROM iceberg_scan(
'/tmp/iceberg_test/warehouse/default/products/metadata/00002-...metadata.json',
snapshot_from_timestamp='2026-03-02 07:09:33.530'
)
ORDER BY id;
You can time travel by specifying a timestamp without needing to look up the snapshot ID. The most recent snapshot at or before the specified timestamp is used.
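The selection rule — most recent snapshot at or before the given timestamp — can be sketched in plain Python (an illustrative model, not the extension's actual code), using the two snapshot timestamps listed earlier:

```python
from datetime import datetime

# (snapshot_id, committed_at) pairs, as returned by iceberg_snapshots()
snapshots = [
    (6271676225834788955, datetime(2026, 3, 2, 7, 9, 33, 521000)),
    (6235383406797039698, datetime(2026, 3, 2, 7, 9, 33, 546000)),
]

def snapshot_from_timestamp(snapshots, ts):
    """Return the most recent snapshot at or before ts,
    or None if ts predates the table's first snapshot."""
    candidates = [(sid, t) for sid, t in snapshots if t <= ts]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]

# A point between the two appends resolves to snapshot 1
ts = datetime(2026, 3, 2, 7, 9, 33, 530000)
print(snapshot_from_timestamp(snapshots, ts))  # 6271676225834788955
```

A timestamp earlier than the first snapshot matches nothing, which is why querying a table "before it existed" fails rather than returning an empty result.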
Connecting to a REST Catalog and Write Operations
Write operations require an Iceberg REST catalog. Here, I used tabulario/iceberg-rest running locally for verification.
docker run -d --name iceberg-rest \
-p 8181:8181 \
-v /tmp/iceberg_warehouse:/tmp/iceberg_warehouse \
-e CATALOG_WAREHOUSE=/tmp/iceberg_warehouse \
tabulario/iceberg-rest:latest
To connect from DuckDB, specify TYPE ICEBERG and ENDPOINT in the ATTACH statement.
LOAD iceberg;
ATTACH '' AS iceberg_rest (
TYPE ICEBERG,
ENDPOINT 'http://localhost:8181',
AUTHORIZATION_TYPE NONE
);
Once connected, you can operate on tables using standard DuckDB SQL syntax.
CREATE TABLE / INSERT
CREATE SCHEMA IF NOT EXISTS iceberg_rest.demo;
CREATE TABLE iceberg_rest.demo.sales (
order_id INTEGER,
product VARCHAR,
quantity INTEGER,
amount DOUBLE,
region VARCHAR
);
INSERT INTO iceberg_rest.demo.sales VALUES
(1, 'Laptop', 2, 2400.00, 'East'),
(2, 'Mouse', 10, 150.00, 'West'),
(3, 'Keyboard', 5, 375.00, 'East'),
(4, 'Monitor', 3, 1200.00, 'North'),
(5, 'Headset', 8, 480.00, 'West');
SELECT * FROM iceberg_rest.demo.sales ORDER BY order_id;
┌──────────┬──────────┬──────────┬────────┬────────┐
│ order_id │ product  │ quantity │ amount │ region │
├──────────┼──────────┼──────────┼────────┼────────┤
│        1 │ Laptop   │        2 │ 2400.0 │ East   │
│        2 │ Mouse    │       10 │  150.0 │ West   │
│        3 │ Keyboard │        5 │  375.0 │ East   │
│        4 │ Monitor  │        3 │ 1200.0 │ North  │
│        5 │ Headset  │        8 │  480.0 │ West   │
└──────────┴──────────┴──────────┴────────┴────────┘
UPDATE / DELETE (v1.4.2 and later)
-- Update the price
UPDATE iceberg_rest.demo.sales SET amount = 2600.00 WHERE order_id = 1;
-- Delete by region
DELETE FROM iceberg_rest.demo.sales WHERE region = 'West';
SELECT * FROM iceberg_rest.demo.sales ORDER BY order_id;
┌──────────┬──────────┬──────────┬────────┬────────┐
│ order_id │ product  │ quantity │ amount │ region │
├──────────┼──────────┼──────────┼────────┼────────┤
│        1 │ Laptop   │        2 │ 2600.0 │ East   │
│        3 │ Keyboard │        5 │  375.0 │ East   │
│        4 │ Monitor  │        3 │ 1200.0 │ North  │
└──────────┴──────────┴──────────┴────────┴────────┘
The two rows in the West region were deleted, and the Laptop’s amount was updated from 2400.0 to 2600.0.
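Notably, Iceberg can serve UPDATE and DELETE without rewriting every data file: format v2 allows delete files that mark individual rows as removed, and readers merge them in at scan time (merge-on-read). Whether DuckDB emits delete files or rewrites affected data files is an implementation detail of the extension; the merge-on-read read path itself can be sketched as follows (file name and positions are made up for illustration):

```python
# Sketch of Iceberg v2 positional deletes: a delete file records
# (data_file_path, row_position) pairs, and the reader skips those
# positions when scanning the data file.

data_file = "data/00000-0-abc.parquet"
rows = [  # (position, row) pairs as they sit in the data file
    (0, {"order_id": 2, "product": "Mouse", "region": "West"}),
    (1, {"order_id": 5, "product": "Headset", "region": "West"}),
    (2, {"order_id": 3, "product": "Keyboard", "region": "East"}),
]

# DELETE ... WHERE region = 'West' yields positional delete entries
position_deletes = {(data_file, 0), (data_file, 1)}

def scan(path, rows, deletes):
    """Yield only the rows whose positions are not marked deleted."""
    return [row for pos, row in rows if (path, pos) not in deletes]

visible = scan(data_file, rows, position_deletes)
print([r["product"] for r in visible])  # ['Keyboard']
```

Either way, the original data files are never modified in place, which is what keeps earlier snapshots readable for time travel.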
Checking Snapshots After Operations
SELECT sequence_number, snapshot_id, timestamp_ms
FROM iceberg_snapshots(iceberg_rest.demo.sales)
ORDER BY sequence_number;
┌─────────────────┬─────────────────────┬─────────────────────────┐
│ sequence_number │     snapshot_id     │      timestamp_ms       │
├─────────────────┼─────────────────────┼─────────────────────────┤
│               1 │ 3015870820437673917 │ 2026-03-02 07:12:20.701 │
│               2 │ 8210928633681103167 │ 2026-03-02 07:12:27.015 │
│               3 │ 7873958590094721501 │ 2026-03-02 07:12:32.734 │
└─────────────────┴─────────────────────┴─────────────────────────┘
Three snapshots were recorded, corresponding to the INSERT, UPDATE, and DELETE operations. Using snapshot_from_id, you can revert to the state before any of these operations.
-- State right after INSERT (5 rows, amount=2400)
SELECT * FROM iceberg_scan(iceberg_rest.demo.sales, snapshot_from_id=3015870820437673917)
ORDER BY order_id;
┌──────────┬──────────┬──────────┬────────┬────────┐
│ order_id │ product  │ quantity │ amount │ region │
├──────────┼──────────┼──────────┼────────┼────────┤
│        1 │ Laptop   │        2 │ 2400.0 │ East   │
│        2 │ Mouse    │       10 │  150.0 │ West   │
│        3 │ Keyboard │        5 │  375.0 │ East   │
│        4 │ Monitor  │        3 │ 1200.0 │ North  │
│        5 │ Headset  │        8 │  480.0 │ West   │
└──────────┴──────────┴──────────┴────────┴────────┘
Summary
Here is a summary of the Iceberg extension capabilities verified with DuckDB v1.4.4.
| Operation | Function / Syntax | Notes |
|---|---|---|
| Read | iceberg_scan() | Specify metadata JSON path |
| Snapshot listing | iceberg_snapshots() | File path or table reference |
| Data file inspection | iceberg_metadata() | Manifest and Parquet file listing |
| Time travel (by ID) | snapshot_from_id= | Parameter of iceberg_scan |
| Time travel (by time) | snapshot_from_timestamp= | Parameter of iceberg_scan |
| Create table | CREATE TABLE | Via REST catalog, v1.4.0+ |
| Insert data | INSERT INTO | Via REST catalog, v1.4.0+ |
| Update | UPDATE | Via REST catalog, v1.4.2+ |
| Delete | DELETE | Via REST catalog, v1.4.2+ |
Read operations work without a REST catalog by specifying a direct path to the metadata JSON. Write operations require a connection to a REST catalog. On AWS, Amazon S3 Tables and Amazon SageMaker Lakehouse can be used as Iceberg REST catalogs.