I like to spend time on LinkedIn reading through posts from companies about news, new releases, funding, new products, etc. During a recent perusal, I saw something about “DuckDB.”
DuckDB is often briefly described as SQLite for analytic workloads. While SQLite is an embeddable, row-based, B-tree-indexed data store well suited for OLTP workloads, DuckDB is an embeddable, column-based data store that uses vectorized processing to optimize OLAP workloads. You could also think of it as an embedded OLAP database. The table below contrasts the two workload types:
| OLTP | OLAP |
| --- | --- |
| For your operational workloads | For your analytic workloads |
| Shorter queries | Longer queries for complex questions |
| Tables are more highly normalized | Tables are de-normalized |
| Typically implemented as row-oriented data stores | Typically implemented as column-oriented data stores |
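To make that contrast concrete, here is a small sketch of the two query shapes using DuckDB's Python API (the `orders` table and its columns are made up purely for illustration):

```python
import duckdb

# In-memory database; a hypothetical "orders" table for illustration only
con = duckdb.connect()
con.execute(
    "CREATE TABLE orders (id INTEGER, customer VARCHAR, "
    "amount DOUBLE, placed_at TIMESTAMP)"
)

# OLTP-shaped query: a short point lookup that touches a single row
con.execute("SELECT * FROM orders WHERE id = 42").fetchall()

# OLAP-shaped query: an aggregate that scans a couple of columns across
# every row -- the access pattern a column store like DuckDB is built for
con.execute(
    "SELECT customer, sum(amount) FROM orders GROUP BY customer"
).fetchall()
```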
I decided to use their very clever WASM shell, which runs DuckDB entirely in the browser.
Then I did some basic SQL to check it out:
```
duckdb> select count(*) from userdata1.parquet;
┌──────────────┐
│ count_star() │
╞══════════════╡
│         1000 │
└──────────────┘
Elapsed: 1 ms

duckdb> select first_name, last_name, email from userdata1.parquet where country = 'Nigeria';
┌────────────┬───────────┬─────────────────────────────┐
│ first_name ┆ last_name ┆ email                       │
╞════════════╪═══════════╪═════════════════════════════╡
│ Emily      ┆ Stewart   ┆ [email protected] │
│ Annie      ┆ Torres    ┆ [email protected]           │
│ William    ┆ Green     ┆ [email protected]             │
│ Jack       ┆ Medina    ┆ [email protected]         │
│ Jeremy     ┆ Bennett   ┆ [email protected]            │
│ Carlos     ┆ Day       ┆ [email protected]          │
│ Ryan       ┆ Mills     ┆ [email protected]         │
│ Betty      ┆ Gibson    ┆ [email protected]         │
│ Wanda      ┆ Stanley   ┆ [email protected]     │
│ Evelyn     ┆ Spencer   ┆ [email protected]       │
│ George     ┆ Howard    ┆ [email protected]              │
└────────────┴───────────┴─────────────────────────────┘
Elapsed: 2 ms
```
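Note that in both queries DuckDB is reading the parquet file in place; there was no import or load step. The same trick works from Python, as in this minimal sketch (which assumes userdata1.parquet sits in the current working directory):

```python
import duckdb

# Query the parquet file directly -- no table creation or data load needed
print(duckdb.sql("select count(*) from 'userdata1.parquet'").fetchall())
```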
You can even do an ‘explain’:
```
duckdb> explain select first_name, last_name, email from userdata1.parquet where country = 'Nigeria';
┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         first_name        │
│         last_name         │
│           email           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│        PARQUET_SCAN       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          country          │
│         first_name        │
│         last_name         │
│           email           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ Filters: country=Nigeria  │
│  AND country IS NOT NULL  │
└───────────────────────────┘
```
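The plan shows why this is fast: projection and filter pushdown. The PARQUET_SCAN node reads only the four columns the query references and applies the country predicate during the scan itself. As a rough sketch, you can pull the same plan out of Python like this (again assuming the file is in the working directory):

```python
import duckdb

# EXPLAIN comes back as (key, value) rows; the plan text is the second field
for row in duckdb.sql(
    "explain select first_name, last_name, email "
    "from 'userdata1.parquet' where country = 'Nigeria'"
).fetchall():
    print(row[1])
```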
A very nice feature of DuckDB is that if you are working with Python, for example, you can pull it in as a library with a simple `import duckdb`, and it runs inside your Python process. That makes it feel very integrated into your program, unlike working with a client/server database such as MySQL or Postgres. So, using our userdata1.parquet file, we could do something like this:
```python
import duckdb

# Connect to (or create) a persistent DuckDB database file
myconnector = duckdb.connect('myduckdb.duckdb')
cursor = myconnector.cursor()

# Create a table whose schema matches the parquet file
cursor.execute("""
    CREATE TABLE userdata(
        registration_dttm timestamp,
        id int,
        first_name varchar,
        last_name varchar,
        email varchar,
        gender varchar,
        ip_address varchar,
        cc varchar,
        country varchar,
        birthdate varchar,
        salary float,
        title varchar,
        comments varchar
    )
""")

# Load the parquet file into the new table
cursor.execute("COPY userdata FROM 'userdata1.parquet' (FORMAT PARQUET)")

# Run a simple count to confirm the load
print(cursor.execute('select count(*) from userdata').fetchall())

cursor.close()
myconnector.close()
```
In the code snippet shown above, we connect to the ‘myduckdb.duckdb’ database, create a table that matches our parquet file, copy the data into it, and then run a simple count query.
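Declaring the schema by hand works, but it is worth noting (this is a sketch, not part of the original snippet) that DuckDB can also infer the schema straight from the parquet file:

```python
import duckdb

# Let DuckDB infer the table schema from the parquet file itself;
# 'userdata2' is just an illustrative name, chosen to avoid clashing
# with the table created above
con = duckdb.connect('myduckdb.duckdb')
con.execute(
    "CREATE TABLE userdata2 AS "
    "SELECT * FROM read_parquet('userdata1.parquet')"
)
print(con.execute("select count(*) from userdata2").fetchall())
con.close()
```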
This is a really cool project. While I first became aware of the advantages of columnar data stores about eight years ago, initially through Sisense, I only started working with them more extensively in the past year. I'm especially excited by their WASM implementation and the clever things they did with it.