DuckDB vs. Bespoke Python
I recently encountered an interview problem that involved answering various analytical questions over a number of delimited data files. It mentioned running on a laptop, and asked the interviewee to consider scalability in terms of data volume and extensibility in terms of answering new questions. Like many interview problems, the suggested solution hadn't been updated in years and called for a bespoke Python solution.
There are many modern data processing libraries to choose from for this type of problem -
Polars, DuckDB, Pandas, or PySpark for Python,
scala-polars or Tablesaw for the JVM,
even gota for Go or Danfo.js for JavaScript.
Speaking from experience, the first three are very easy to use, and the first two are extremely fast on moderately sized datasets.
Let’s take a look at DuckDB.

Setup
I wanted to use non-proprietary data so that anybody could reproduce the results, so I downloaded this credit card transaction data generator. It generates data with plenty of interesting facets to exercise various queries. I generated four data sets. The raw data is in CSV files and takes up the following disk space:
➜ du -h data_*
31G data_100mm_txn
3.1G data_10mm_txn
314M data_1mm_txn
62G data_200mm_txn
The data generator doesn't provide an easy way to directly request a specific number of data records, only a number of customers, so I had to iterate a few times to get the data set sizes I wanted.
For roughly 1mm records, I landed on 1200 customers:
python datagen.py -n 1200 -o gen_data/data_1mm_txn 01-01-2023 12-31-2023
This produces a collection of CSV files with a total record count of 1,054,548.
I scaled 1200 up proportionally for the other data sets.
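The scaling step can be sketched as a small shell loop (hypothetical; as noted above, record counts don't track customer counts exactly, so the real runs took some iteration on the `-n` value):

```shell
# Scale the 1200-customer baseline toward each target record count.
# Echoes the datagen commands rather than running them, since each
# run takes a while and the -n value may need adjusting afterwards.
for scale in 1 10 100 200; do
  customers=$((1200 * scale))
  echo python datagen.py -n "$customers" -o "gen_data/data_${scale}mm_txn" 01-01-2023 12-31-2023
done
```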
A sample of one file:
➜ head data_1mm_txn/adults_2550_female_rural_0000-0239.csv
ssn|cc_num|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|acct_num|profile|trans_num|trans_date|trans_time|unix_time|category|amt|is_fraud|merchant|merch_lat|merch_long
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|bdadffbae7329592393b079c5827490e|2023-08-01|01:34:03|1690871643|grocery_pos|370.53|1|fraud_Kovacek, Dibbert and Ondricka|38.283713|-80.810153
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|629fbb975c2ee41a9345f01afba7e6cf|2023-08-01|01:24:17|1690871057|misc_net|339.88|1|fraud_Kuphal-Predovic|37.196506|-81.913850
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|b78ba4de847d0adfd821c6ccfe261477|2023-08-01|02:39:52|1690875592|misc_pos|873.23|1|fraud_Eichmann-Russel|37.519118|-81.765044
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|6aac07c0b9182547c34676ab24b32a81|2023-08-01|01:07:41|1690870061|shopping_pos|933.84|1|fraud_Beier-Hyatt|38.361891|-81.390966
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|a5cd41f69faaeffe2a0a3a7b58918e3b|2023-08-01|11:26:02|1690907162|shopping_net|1077.35|1|fraud_Kerluke-Abshire|38.148218|-81.294663
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|ec72a087177503da0a5eee37e1d236e8|2023-08-01|01:12:46|1690870366|grocery_pos|838.74|1|fraud_Doyle Ltd|37.900884|-81.505876
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|39cb621398e00c3059e1bf3d4527fcb3|2023-08-02|20:51:54|1691027514|shopping_pos|1002.48|1|fraud_Friesen Inc|38.328737|-81.958906
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|5707a4463909b44faf5c62740e70ba4a|2023-08-02|21:23:06|1691029386|health_fitness|577.38|1|fraud_Greenholt Ltd|37.390618|-80.675238
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|9af5738414987cc043e504067843ad61|2023-08-02|23:49:09|1691038149|kids_pets|585.89|1|fraud_Bernier and Sons|37.342443|-80.712102
These datasets are significantly wider than the data described in the original problem.
Data Exploration
The first advantage these libraries have is how quickly you can get to something useful. DuckDB can directly run SQL over a set of CSV files.
➜ duckdb -c "SELECT * FROM 'data_1mm_txn/*adults_*.csv' LIMIT 10;"
┌─────────────┬─────────────────┬─────────┬─────────┬─────────┬─────────────────────┬───────────┬───┬────────────┬──────────────┬────────┬──────────┬──────────────────────┬───────────┬────────────┐
│ ssn │ cc_num │ first │ last │ gender │ street │ city │ … │ unix_time │ category │ amt │ is_fraud │ merchant │ merch_lat │ merch_long │
│ varchar │ int64 │ varchar │ varchar │ boolean │ varchar │ varchar │ │ int64 │ varchar │ double │ int64 │ varchar │ double │ double │
├─────────────┼─────────────────┼─────────┼─────────┼─────────┼─────────────────────┼───────────┼───┼────────────┼──────────────┼────────┼──────────┼──────────────────────┼───────────┼────────────┤
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ false │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682148836 │ grocery_pos │ 5.83 │ 1 │ fraud_Cole PLC │ 33.084605 │ -84.59263 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ false │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682147498 │ grocery_pos │ 329.81 │ 1 │ fraud_Stracke-Lemke │ 32.909412 │ -83.336348 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ false │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682150282 │ misc_net │ 341.76 │ 1 │ fraud_Kuhn LLC │ 32.869572 │ -84.365167 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ false │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682146929 │ shopping_net │ 832.34 │ 1 │ fraud_Kuhic LLC │ 33.540181 │ -83.670669 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ false │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682139650 │ grocery_net │ 915.86 │ 1 │ fraud_Swift, Bradt… │ 34.058672 │ -83.259417 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ false │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682151402 │ misc_pos │ 915.74 │ 1 │ fraud_Gutmann-Upton │ 33.45225 │ -82.707481 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ false │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682306219 │ food_dining │ 552.77 │ 1 │ fraud_O'Hara-Wilde… │ 33.054957 │ -83.575527 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ false │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682305693 │ home │ 120.65 │ 1 │ fraud_Reilly LLC │ 33.433593 │ -83.425256 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ false │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682288784 │ food_dining │ 107.78 │ 1 │ fraud_Koss, McLaug… │ 34.060598 │ -83.46855 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ false │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682308774 │ food_dining │ 140.22 │ 1 │ fraud_Powlowski-We… │ 34.303291 │ -82.917523 │
├─────────────┴─────────────────┴─────────┴─────────┴─────────┴─────────────────────┴───────────┴───┴────────────┴──────────────┴────────┴──────────┴──────────────────────┴───────────┴────────────┤
│ 10 rows 26 columns (14 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Not bad! Notice that it figured out a bunch of stuff by itself:
- the fact that there is a header row
- the delimiter
- the type of each column
It does this by sampling records (20,480 by default). It did get a couple of things wrong:
- gender “F” was interpreted as a boolean (false)
- zip was interpreted as an int (really needs to be varchar because zips can start with 0)
- cc_num and acct_num may have the same problem
These issues are easy to fix (note how types were specified this time).
➜ duckdb -c "SELECT * FROM READ_CSV('data_1mm_txn/*adults_*.csv', types = {'gender': 'varchar', 'cc_num': 'varchar', 'acct_num': 'varchar', 'zip': 'varchar'}) LIMIT 10;"
┌─────────────┬─────────────────┬─────────┬─────────┬─────────┬─────────────────────┬───────────┬───┬────────────┬──────────────┬────────┬──────────┬──────────────────────┬───────────┬────────────┐
│ ssn │ cc_num │ first │ last │ gender │ street │ city │ … │ unix_time │ category │ amt │ is_fraud │ merchant │ merch_lat │ merch_long │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ │ int64 │ varchar │ double │ int64 │ varchar │ double │ double │
├─────────────┼─────────────────┼─────────┼─────────┼─────────┼─────────────────────┼───────────┼───┼────────────┼──────────────┼────────┼──────────┼──────────────────────┼───────────┼────────────┤
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ F │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682148836 │ grocery_pos │ 5.83 │ 1 │ fraud_Cole PLC │ 33.084605 │ -84.59263 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ F │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682147498 │ grocery_pos │ 329.81 │ 1 │ fraud_Stracke-Lemke │ 32.909412 │ -83.336348 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ F │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682150282 │ misc_net │ 341.76 │ 1 │ fraud_Kuhn LLC │ 32.869572 │ -84.365167 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ F │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682146929 │ shopping_net │ 832.34 │ 1 │ fraud_Kuhic LLC │ 33.540181 │ -83.670669 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ F │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682139650 │ grocery_net │ 915.86 │ 1 │ fraud_Swift, Bradt… │ 34.058672 │ -83.259417 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ F │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682151402 │ misc_pos │ 915.74 │ 1 │ fraud_Gutmann-Upton │ 33.45225 │ -82.707481 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ F │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682306219 │ food_dining │ 552.77 │ 1 │ fraud_O'Hara-Wilde… │ 33.054957 │ -83.575527 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ F │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682305693 │ home │ 120.65 │ 1 │ fraud_Reilly LLC │ 33.433593 │ -83.425256 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ F │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682288784 │ food_dining │ 107.78 │ 1 │ fraud_Koss, McLaug… │ 34.060598 │ -83.46855 │
│ 424-64-6205 │ 371793271040546 │ Tiffany │ Jones │ F │ 2181 Wheeler Ridges │ Good Hope │ … │ 1682308774 │ food_dining │ 140.22 │ 1 │ fraud_Powlowski-We… │ 34.303291 │ -82.917523 │
├─────────────┴─────────────────┴─────────┴─────────┴─────────┴─────────────────────┴───────────┴───┴────────────┴──────────────┴────────┴──────────┴──────────────────────┴───────────┴────────────┤
│ 10 rows 26 columns (14 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
We can also ask for a summarization of the data:
➜ duckdb -c "summarize SELECT * FROM READ_CSV('data_1mm_txn/*adults_*.csv', types = {'gender': 'varchar', 'cc_num': 'varchar', 'acct_num': 'varchar', 'zip': 'varchar'}, header=true);"
┌─────────────┬─────────────┬──────────────────────┬──────────────────────┬───────────────┬───┬────────────────────┬────────────────────┬────────────────────┬─────────┬─────────────────┐
│ column_name │ column_type │ min │ max │ approx_unique │ … │ q25 │ q50 │ q75 │ count │ null_percentage │
│ varchar │ varchar │ varchar │ varchar │ int64 │ │ varchar │ varchar │ varchar │ int64 │ decimal(9,2) │
├─────────────┼─────────────┼──────────────────────┼──────────────────────┼───────────────┼───┼────────────────────┼────────────────────┼────────────────────┼─────────┼─────────────────┤
│ ssn │ VARCHAR │ 001-38-1810 │ 899-49-1248 │ 1211 │ … │ │ │ │ 1051343 │ 0.00 │
│ cc_num │ VARCHAR │ 060400268763 │ 676394485863 │ 1196 │ … │ │ │ │ 1051343 │ 0.00 │
│ first │ VARCHAR │ Aaron │ Zachary │ 402 │ … │ │ │ │ 1051343 │ 0.00 │
│ last │ VARCHAR │ Acevedo │ Zimmerman │ 550 │ … │ │ │ │ 1051343 │ 0.00 │
│ gender │ VARCHAR │ F │ M │ 2 │ … │ │ │ │ 1051343 │ 0.00 │
│ street │ VARCHAR │ 0009 Roman Hills S… │ 9992 Jason Terrace… │ 1190 │ … │ │ │ │ 1051343 │ 0.00 │
│ city │ VARCHAR │ Abilene │ Yukon │ 842 │ … │ │ │ │ 1051343 │ 0.00 │
│ state │ VARCHAR │ AK │ WY │ 51 │ … │ │ │ │ 1051343 │ 0.00 │
│ zip │ VARCHAR │ 01010 │ 99654 │ 1130 │ … │ │ │ │ 1051343 │ 0.00 │
│ lat │ DOUBLE │ 20.7441 │ 61.5923 │ 1117 │ … │ 33.78882546408646 │ 38.508994491911075 │ 41.18956704202229 │ 1051343 │ 0.00 │
│ long │ DOUBLE │ -159.6694 │ -68.4121 │ 1135 │ … │ -98.23946762402772 │ -87.06472635623118 │ -79.77149230109092 │ 1051343 │ 0.00 │
│ city_pop │ BIGINT │ 199 │ 2906700 │ 890 │ … │ 19416 │ 65061 │ 241033 │ 1051343 │ 0.00 │
│ job │ VARCHAR │ Academic librarian │ Youth worker │ 534 │ … │ │ │ │ 1051343 │ 0.00 │
│ dob │ DATE │ 1928-10-13 │ 2008-11-22 │ 1183 │ … │ │ │ │ 1051343 │ 0.00 │
│ acct_num │ VARCHAR │ 10258404917 │ 999901662733 │ 1203 │ … │ │ │ │ 1051343 │ 0.00 │
│ profile │ VARCHAR │ adults_2550_female… │ young_adults_male_… │ 12 │ … │ │ │ │ 1051343 │ 0.00 │
│ trans_num │ VARCHAR │ 00001340d8e6b1766b… │ fffff990530a101168… │ 1080518 │ … │ │ │ │ 1051343 │ 0.00 │
│ trans_date │ DATE │ 2023-01-01 │ 2023-12-31 │ 366 │ … │ │ │ │ 1051343 │ 0.00 │
│ trans_time │ TIME │ 00:00:00 │ 23:59:59 │ 88400 │ … │ │ │ │ 1051343 │ 0.00 │
│ unix_time │ BIGINT │ 1672552808 │ 1704088796 │ 1009172 │ … │ 1682651789 │ 1690184881 │ 1698137297 │ 1051343 │ 0.00 │
│ category │ VARCHAR │ entertainment │ travel │ 14 │ … │ │ │ │ 1051343 │ 0.00 │
│ amt │ DOUBLE │ 1.0 │ 26810.71 │ 52142 │ … │ 9.175574441235733 │ 44.90613575342282 │ 82.69103793265246 │ 1051343 │ 0.00 │
│ is_fraud │ BIGINT │ 0 │ 1 │ 2 │ … │ 0 │ 0 │ 0 │ 1051343 │ 0.00 │
│ merchant │ VARCHAR │ fraud_Abbott-Rogahn │ fraud_Zulauf LLC │ 716 │ … │ │ │ │ 1051343 │ 0.00 │
│ merch_lat │ DOUBLE │ 19.845823 │ 62.589931 │ 1026312 │ … │ 33.69838307624862 │ 38.44309616618485 │ 41.26469799023738 │ 1051343 │ 0.00 │
│ merch_long │ DOUBLE │ -160.669149 │ -67.413594 │ 1053435 │ … │ -98.36918390322704 │ -86.93663111949701 │ -79.54040155192297 │ 1051343 │ 0.00 │
├─────────────┴─────────────┴──────────────────────┴──────────────────────┴───────────────┴───┴────────────────────┴────────────────────┴────────────────────┴─────────┴─────────────────┤
│ 26 rows 12 columns (10 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Evaluation
The problem had two main areas of evaluation - dealing with data scale, and extensibility to support changing/additional requirements.
Runtime Performance - Raw CSV
As we’ve seen, DuckDB can run queries directly on a set of CSV files. Let’s see how this performs with a query similar to the first one called for in the original problem.
➜ time duckdb -c "SELECT SUM(amt) FROM READ_CSV('data_1mm_txn/*adults_*.csv', header = true) WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
┌───────────┐
│ sum(amt) │
│ double │
├───────────┤
│ 160771.07 │
└───────────┘
duckdb -c 4.21s user 0.09s system 690% cpu 0.623 total
max memory: 303696 KB
➜ time duckdb -c "SELECT SUM(amt) FROM READ_CSV('data_10mm_txn/*adults_*.csv', header = true) WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
┌────────────────────┐
│ sum(amt) │
│ double │
├────────────────────┤
│ 1529945.0599999998 │
└────────────────────┘
duckdb -c 14.31s user 0.58s system 780% cpu 1.909 total
max memory: 1270608 KB
➜ time duckdb -c "SELECT SUM(amt) FROM READ_CSV('data_100mm_txn/*adults_*.csv', header = true) WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
┌───────────────────┐
│ sum(amt) │
│ double │
├───────────────────┤
│ 16001193.47999998 │
└───────────────────┘
duckdb -c 42.19s user 8.07s system 463% cpu 10.853 total
max memory: 6491968 KB
➜ time duckdb -c "SELECT SUM(amt) FROM READ_CSV('data_200mm_txn/*adults_*.csv', header = true) WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
┌───────────────────┐
│ sum(amt) │
│ double │
├───────────────────┤
│ 32126444.01999999 │
└───────────────────┘
duckdb -c 74.13s user 17.85s system 418% cpu 21.984 total
max memory: 10977920 KB
Not bad, but not great either - as a user, I really want these queries finishing in well under 5 seconds. I’m not too concerned about how much CPU and memory it uses as long as it successfully runs on my laptop. Note that it did use multiple CPUs in parallel, without me having to build that myself. The mediocre performance isn't too surprising - CSV is not optimized for ad-hoc querying (or really much of anything other than pulling into a spreadsheet). To summarize:
| Num Records | User Time (s) | Memory Used (GB) |
|---|---|---|
| 1mm | 4.21 | 0.29 |
| 10mm | 14.31 | 1.21 |
| 100mm | 42.19 | 6.19 |
| 200mm | 74.13 | 10.47 |
Runtime Performance - Native DuckDB
We can easily import data from CSV to a native DuckDB table file. Let's see if that is any better.
Load Data
➜ duckdb
v0.10.0 20b1486d11
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .timer on
D .open data_1mm_txn.db
D CREATE TABLE txn AS SELECT * FROM READ_CSV('data_1mm_txn/*adults_*.csv', types = {'gender': 'varchar', 'cc_num': 'varchar', 'acct_num': 'varchar', 'zip': 'varchar'}, header = true);
Run Time (s): real 1.674 user 7.352345 sys 0.190694
D .open data_10mm_txn.db
D CREATE TABLE txn AS SELECT * FROM READ_CSV('data_10mm_txn/*adults_*.csv', types = {'gender': 'varchar', 'cc_num': 'varchar', 'acct_num': 'varchar', 'zip': 'varchar'}, header = true);
Run Time (s): real 7.230 user 42.859433 sys 1.095903
D .open data_100mm_txn.db
D CREATE TABLE txn AS SELECT * FROM READ_CSV('data_100mm_txn/*adults_*.csv', types = {'gender': 'varchar', 'cc_num': 'varchar', 'acct_num': 'varchar', 'zip': 'varchar'}, header = true);
Run Time (s): real 31.078 user 262.523123 sys 10.148275
D .open data_200mm_txn.db
D CREATE TABLE txn AS SELECT * FROM READ_CSV('data_200mm_txn/*adults_*.csv', types = {'gender': 'varchar', 'cc_num': 'varchar', 'acct_num': 'varchar', 'zip': 'varchar'}, header = true);
Run Time (s): real 62.741 user 494.657874 sys 22.505040
The time taken to load is quite a bit longer than the time to run a single query, so query speed will need to improve dramatically to justify the import. Presumably, though, we would be querying far more often than loading large batches of new CSV data.
➜ ls -lh *.db
-rw-r--r-- 1 rmckay staff 4.9G Aug 6 17:35 data_100mm_txn.db
-rw-r--r-- 1 rmckay staff 713M Aug 6 17:33 data_10mm_txn.db
-rw-r--r-- 1 rmckay staff 48M Aug 6 17:32 data_1mm_txn.db
-rw-r--r-- 1 rmckay staff 9.4G Aug 6 17:37 data_200mm_txn.db
The resulting files are much smaller than the source CSV data.
Run Queries
➜ time duckdb data_1mm_txn.db "SELECT SUM(amt) FROM txn WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
┌───────────┐
│ sum(amt) │
│ double │
├───────────┤
│ 160771.07 │
└───────────┘
duckdb data_1mm_txn.db 0.04s user 0.01s system 123% cpu 0.039 total
max memory: 25664 KB
➜ time duckdb data_10mm_txn.db "SELECT SUM(amt) FROM txn WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
┌────────────────────┐
│ sum(amt) │
│ double │
├────────────────────┤
│ 1529945.0600000003 │
└────────────────────┘
duckdb data_10mm_txn.db 0.22s user 0.03s system 378% cpu 0.065 total
max memory: 94928 KB
➜ time duckdb data_100mm_txn.db "SELECT SUM(amt) FROM txn WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
┌────────────────────┐
│ sum(amt) │
│ double │
├────────────────────┤
│ 16001193.479999982 │
└────────────────────┘
duckdb data_100mm_txn.db 1.73s user 0.17s system 708% cpu 0.268 total
max memory: 670576 KB
➜ time duckdb data_200mm_txn.db "SELECT SUM(amt) FROM txn WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
┌───────────────────┐
│ sum(amt) │
│ double │
├───────────────────┤
│ 32126444.02000001 │
└───────────────────┘
duckdb data_200mm_txn.db 3.40s user 0.30s system 767% cpu 0.482 total
max memory: 1254576 KB
| Num Records | User Time (s) | Memory Used (GB) |
|---|---|---|
| 1mm | 0.04 | 0.024 |
| 10mm | 0.22 | 0.091 |
| 100mm | 1.73 | 0.640 |
| 200mm | 3.40 | 1.196 |
That was blazingly fast and used way less memory than the raw CSV approach.
Extensibility
It’s easy to see that pretty much any query our users could think of would be well supported. Just for fun, let’s run some stand-ins for the other queries called for in the original problem. We’ll stick with the 200mm dataset for all of these.
➜ duckdb data_200mm_txn.db
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D .timer on
D SELECT category, COUNT(*) FROM txn GROUP BY category;
┌────────────────┬──────────────┐
│ category │ count_star() │
│ varchar │ int64 │
├────────────────┼──────────────┤
│ misc_pos │ 13667090 │
│ shopping_pos │ 20627532 │
│ grocery_pos │ 19843040 │
│ food_dining │ 15616003 │
│ personal_care │ 14653811 │
│ shopping_net │ 15559261 │
│ health_fitness │ 12955427 │
│ home │ 19517464 │
│ gas_transport │ 17922579 │
│ misc_net │ 9632379 │
│ travel │ 6645464 │
│ kids_pets │ 18068316 │
│ grocery_net │ 8885726 │
│ entertainment │ 15113737 │
├────────────────┴──────────────┤
│ 14 rows 2 columns │
└───────────────────────────────┘
Run Time (s): real 0.482 user 4.546682 sys 0.016826
D SELECT SUM(amt) FROM txn WHERE category='travel' AND trans_date between '2023-06-01' AND '2023-07-31';
┌────────────────────┐
│ sum(amt) │
│ double │
├────────────────────┤
│ 2926518.8900000015 │
└────────────────────┘
Run Time (s): real 0.324 user 3.115719 sys 0.006838
D SELECT SUM(amt) FROM txn WHERE trans_date between '2023-06-01' AND '2023-06-30';
┌────────────────────┐
│ sum(amt) │
│ double │
├────────────────────┤
│ 1344176312.4700007 │
└────────────────────┘
Run Time (s): real 0.171 user 1.495202 sys 0.064187
D SELECT category, ROUND(SUM(amt), 2) AS total_amount
FROM txn
WHERE trans_date between '2023-06-01' AND '2023-08-31'
GROUP BY category
ORDER BY total_amount DESC
LIMIT 5;
┌───────────────┬───────────────┐
│ category │ total_amount │
│ varchar │ double │
├───────────────┼───────────────┤
│ shopping_pos │ 1398272714.98 │
│ entertainment │ 986193998.59 │
│ shopping_net │ 874592892.17 │
│ food_dining │ 567098508.99 │
│ misc_pos │ 144910582.16 │
└───────────────┴───────────────┘
Run Time (s): real 0.337 user 3.183682 sys 0.008045
I used DuckDB’s interactive mode this time. According to Activity Monitor, the process is sitting at 1.27GB of memory used.
Conclusion
The original problem mentioned running in AWS or on a user’s laptop. For both scenarios, I think distributing the data as a DuckDB database file and using DuckDB to query it would be an excellent solution. Certainly much, much better than CSV + bespoke summarizing logic written in Python. If you want to interview for big* data competence, look for familiarity with modern data libraries.
* most orgs' data isn't really that big