DuckDB vs. Bespoke Python

written in duckdb, python

I recently encountered an interview problem that involved answering various analytical questions over some number of delimited data files. It mentioned running on a laptop, and asked the interviewee to consider scalability in terms of data volume and extensibility in terms of answering new questions. Like many interview problems, the suggested solution hadn't been updated in a while (years), and was looking for a bespoke Python solution.

There are many modern data processing libraries to choose from for this type of problem - Polars, DuckDB, Pandas, or Pyspark for Python, scala-polars or Tablesaw for Java, even gota for golang or Danfo.js for javascript. Speaking from experience, the first 3 are very easy to use, and the first 2 are extremely fast on moderately sized datasets. Let’s take a look at DuckDB. DuckDB

Setup#

I wanted to use non-proprietary data so that anybody could reproduce the results. So I downloaded this credit card transaction data generator. It generates data with plenty of interesting facets to exercise various queries. I generated 4 data sets. The raw data is in csv files and takes up the following disk space:

 du -h data_*
31G     data_100mm_txn
3.1G    data_10mm_txn
314M    data_1mm_txn
62G     data_200mm_txn

The data generator doesn't provide an easy way to directly request a specific number of data records, only a number of customers, so I had to iterate a few times to get the data set sizes I wanted.
For roughly 1mm records, I landed on 1200 customers:

python datagen.py -n 1200 -o gen_data/data_1mm_txn 01-01-2023 12-31-2023

This produces a collection of csv files with a total record count of 1,054,548.
I just scaled 1200 up for the other data sets. A sample of one file:

➜ head data_1mm_txn/adults_2550_female_rural_0000-0239.csv
ssn|cc_num|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|acct_num|profile|trans_num|trans_date|trans_time|unix_time|category|amt|is_fraud|merchant|merch_lat|merch_long
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|bdadffbae7329592393b079c5827490e|2023-08-01|01:34:03|1690871643|grocery_pos|370.53|1|fraud_Kovacek, Dibbert and Ondricka|38.283713|-80.810153
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|629fbb975c2ee41a9345f01afba7e6cf|2023-08-01|01:24:17|1690871057|misc_net|339.88|1|fraud_Kuphal-Predovic|37.196506|-81.913850
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|b78ba4de847d0adfd821c6ccfe261477|2023-08-01|02:39:52|1690875592|misc_pos|873.23|1|fraud_Eichmann-Russel|37.519118|-81.765044
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|6aac07c0b9182547c34676ab24b32a81|2023-08-01|01:07:41|1690870061|shopping_pos|933.84|1|fraud_Beier-Hyatt|38.361891|-81.390966
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|a5cd41f69faaeffe2a0a3a7b58918e3b|2023-08-01|11:26:02|1690907162|shopping_net|1077.35|1|fraud_Kerluke-Abshire|38.148218|-81.294663
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|ec72a087177503da0a5eee37e1d236e8|2023-08-01|01:12:46|1690870366|grocery_pos|838.74|1|fraud_Doyle Ltd|37.900884|-81.505876
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|39cb621398e00c3059e1bf3d4527fcb3|2023-08-02|20:51:54|1691027514|shopping_pos|1002.48|1|fraud_Friesen Inc|38.328737|-81.958906
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|5707a4463909b44faf5c62740e70ba4a|2023-08-02|21:23:06|1691029386|health_fitness|577.38|1|fraud_Greenholt Ltd|37.390618|-80.675238
405-64-0484|6525324472099933|Maria|Porter|F|14626 Dawn Union Apt. 806|Smithers|WV|25186|38.1543|-81.278|881|Mechanical engineer|1999-03-18|599060683120|adults_2550_female_rural.json|9af5738414987cc043e504067843ad61|2023-08-02|23:49:09|1691038149|kids_pets|585.89|1|fraud_Bernier and Sons|37.342443|-80.712102

These datasets are significantly wider than the data described in the original problem.

Data Exploration#

The first advantage these libraries have is how quickly you can get to something useful. DuckDB can directly run SQL over a set of CSV files.

 duckdb -c "SELECT * FROM 'data_1mm_txn/*adults_*.csv' LIMIT 10;"
┌─────────────┬─────────────────┬─────────┬─────────┬─────────┬─────────────────────┬───────────┬───┬────────────┬──────────────┬────────┬──────────┬──────────────────────┬───────────┬────────────┐
│     ssn          cc_num        first    last    gender         street           city       unix_time     category     amt    is_fraud        merchant        merch_lat  merch_long │
│   varchar         int64       varchar  varchar  boolean        varchar         varchar        int64       varchar     double   int64          varchar          double      double   │
├─────────────┼─────────────────┼─────────┼─────────┼─────────┼─────────────────────┼───────────┼───┼────────────┼──────────────┼────────┼──────────┼──────────────────────┼───────────┼────────────┤
│ 424-64-6205  371793271040546  Tiffany  Jones    false    2181 Wheeler Ridges  Good Hope    1682148836  grocery_pos     5.83         1  fraud_Cole PLC        33.084605   -84.59263 │
│ 424-64-6205  371793271040546  Tiffany  Jones    false    2181 Wheeler Ridges  Good Hope    1682147498  grocery_pos   329.81         1  fraud_Stracke-Lemke   32.909412  -83.336348 │
│ 424-64-6205  371793271040546  Tiffany  Jones    false    2181 Wheeler Ridges  Good Hope    1682150282  misc_net      341.76         1  fraud_Kuhn LLC        32.869572  -84.365167 │
│ 424-64-6205  371793271040546  Tiffany  Jones    false    2181 Wheeler Ridges  Good Hope    1682146929  shopping_net  832.34         1  fraud_Kuhic LLC       33.540181  -83.670669 │
│ 424-64-6205  371793271040546  Tiffany  Jones    false    2181 Wheeler Ridges  Good Hope    1682139650  grocery_net   915.86         1  fraud_Swift, Bradt…   34.058672  -83.259417 │
│ 424-64-6205  371793271040546  Tiffany  Jones    false    2181 Wheeler Ridges  Good Hope    1682151402  misc_pos      915.74         1  fraud_Gutmann-Upton    33.45225  -82.707481 │
│ 424-64-6205  371793271040546  Tiffany  Jones    false    2181 Wheeler Ridges  Good Hope    1682306219  food_dining   552.77         1  fraud_O'Hara-Wilde…   33.054957  -83.575527 │
│ 424-64-6205  371793271040546  Tiffany  Jones    false    2181 Wheeler Ridges  Good Hope    1682305693  home          120.65         1  fraud_Reilly LLC      33.433593  -83.425256 │
│ 424-64-6205  371793271040546  Tiffany  Jones    false    2181 Wheeler Ridges  Good Hope    1682288784  food_dining   107.78         1  fraud_Koss, McLaug…   34.060598   -83.46855 │
│ 424-64-6205  371793271040546  Tiffany  Jones    false    2181 Wheeler Ridges  Good Hope    1682308774  food_dining   140.22         1  fraud_Powlowski-We…   34.303291  -82.917523 │
├─────────────┴─────────────────┴─────────┴─────────┴─────────┴─────────────────────┴───────────┴───┴────────────┴──────────────┴────────┴──────────┴──────────────────────┴───────────┴────────────┤
│ 10 rows                                                                                                                                                                     26 columns (14 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Not bad! Notice that it figured out a bunch of stuff by itself:

  • the fact that there is a header row
  • the delimiter
  • the type of each column

It does this by looking at a few (20,480 by default) records. It did get a couple of things wrong:

  • gender “F” was interpreted as a boolean (false)
  • zip was interpreted as an int (really needs to be varchar because zips can start with 0)
  • maybe the same problem with cc_num and acct_num

These issues are easy to fix (note how types were specified this time).

 duckdb -c "SELECT * FROM READ_CSV('data_1mm_txn/*adults_*.csv', types = {'gender': 'varchar', 'cc_num': 'varchar', 'acct_num': 'varchar', 'zip': 'varchar'}) LIMIT 10;"
┌─────────────┬─────────────────┬─────────┬─────────┬─────────┬─────────────────────┬───────────┬───┬────────────┬──────────────┬────────┬──────────┬──────────────────────┬───────────┬────────────┐
     ssn          cc_num        first    last    gender         street           city       unix_time     category     amt    is_fraud        merchant        merch_lat  merch_long 
   varchar        varchar      varchar  varchar  varchar        varchar         varchar        int64       varchar     double   int64          varchar          double      double   
├─────────────┼─────────────────┼─────────┼─────────┼─────────┼─────────────────────┼───────────┼───┼────────────┼──────────────┼────────┼──────────┼──────────────────────┼───────────┼────────────┤
 424-64-6205  371793271040546  Tiffany  Jones    F        2181 Wheeler Ridges  Good Hope    1682148836  grocery_pos     5.83         1  fraud_Cole PLC        33.084605   -84.59263 
 424-64-6205  371793271040546  Tiffany  Jones    F        2181 Wheeler Ridges  Good Hope    1682147498  grocery_pos   329.81         1  fraud_Stracke-Lemke   32.909412  -83.336348 
 424-64-6205  371793271040546  Tiffany  Jones    F        2181 Wheeler Ridges  Good Hope    1682150282  misc_net      341.76         1  fraud_Kuhn LLC        32.869572  -84.365167 
 424-64-6205  371793271040546  Tiffany  Jones    F        2181 Wheeler Ridges  Good Hope    1682146929  shopping_net  832.34         1  fraud_Kuhic LLC       33.540181  -83.670669 
 424-64-6205  371793271040546  Tiffany  Jones    F        2181 Wheeler Ridges  Good Hope    1682139650  grocery_net   915.86         1  fraud_Swift, Bradt   34.058672  -83.259417 
 424-64-6205  371793271040546  Tiffany  Jones    F        2181 Wheeler Ridges  Good Hope    1682151402  misc_pos      915.74         1  fraud_Gutmann-Upton    33.45225  -82.707481 
 424-64-6205  371793271040546  Tiffany  Jones    F        2181 Wheeler Ridges  Good Hope    1682306219  food_dining   552.77         1  fraud_O'Hara-Wilde…  │ 33.054957 │ -83.575527 │
 424-64-6205  371793271040546  Tiffany  Jones    F        2181 Wheeler Ridges  Good Hope    1682305693  home          120.65         1  fraud_Reilly LLC      33.433593  -83.425256 
 424-64-6205  371793271040546  Tiffany  Jones    F        2181 Wheeler Ridges  Good Hope    1682288784  food_dining   107.78         1  fraud_Koss, McLaug   34.060598   -83.46855 
 424-64-6205  371793271040546  Tiffany  Jones    F        2181 Wheeler Ridges  Good Hope    1682308774  food_dining   140.22         1  fraud_Powlowski-We   34.303291  -82.917523 
├─────────────┴─────────────────┴─────────┴─────────┴─────────┴─────────────────────┴───────────┴───┴────────────┴──────────────┴────────┴──────────┴──────────────────────┴───────────┴────────────┤
 10 rows                                                                                                                                                                     26 columns (14 shown) 
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

We can also ask for a summarization of the data:

 duckdb -c "summarize SELECT * FROM READ_CSV('data_1mm_txn/*adults_*.csv', types = {'gender': 'varchar', 'cc_num': 'varchar', 'acct_num': 'varchar', 'zip': 'varchar'}, header=true);"
┌─────────────┬─────────────┬──────────────────────┬──────────────────────┬───────────────┬───┬────────────────────┬────────────────────┬────────────────────┬─────────┬─────────────────┐
 column_name  column_type          min                   max           approx_unique           q25                 q50                 q75           count   null_percentage 
   varchar      varchar          varchar               varchar             int64              varchar             varchar             varchar         int64    decimal(9,2)   
├─────────────┼─────────────┼──────────────────────┼──────────────────────┼───────────────┼───┼────────────────────┼────────────────────┼────────────────────┼─────────┼─────────────────┤
 ssn          VARCHAR      001-38-1810           899-49-1248                    1211                                                                1051343             0.00 
 cc_num       VARCHAR      060400268763          676394485863                   1196                                                                1051343             0.00 
 first        VARCHAR      Aaron                 Zachary                         402                                                                1051343             0.00 
 last         VARCHAR      Acevedo               Zimmerman                       550                                                                1051343             0.00 
 gender       VARCHAR      F                     M                                 2                                                                1051343             0.00 
 street       VARCHAR      0009 Roman Hills S   9992 Jason Terrace            1190                                                                1051343             0.00 
 city         VARCHAR      Abilene               Yukon                           842                                                                1051343             0.00 
 state        VARCHAR      AK                    WY                               51                                                                1051343             0.00 
 zip          VARCHAR      01010                 99654                          1130                                                                1051343             0.00 
 lat          DOUBLE       20.7441               61.5923                        1117    33.78882546408646   38.508994491911075  41.18956704202229   1051343             0.00 
 long         DOUBLE       -159.6694             -68.4121                       1135    -98.23946762402772  -87.06472635623118  -79.77149230109092  1051343             0.00 
 city_pop     BIGINT       199                   2906700                         890    19416               65061               241033              1051343             0.00 
 job          VARCHAR      Academic librarian    Youth worker                    534                                                                1051343             0.00 
 dob          DATE         1928-10-13            2008-11-22                     1183                                                                1051343             0.00 
 acct_num     VARCHAR      10258404917           999901662733                   1203                                                                1051343             0.00 
 profile      VARCHAR      adults_2550_female   young_adults_male_              12                                                                1051343             0.00 
 trans_num    VARCHAR      00001340d8e6b1766b   fffff990530a101168         1080518                                                                1051343             0.00 
 trans_date   DATE         2023-01-01            2023-12-31                      366                                                                1051343             0.00 
 trans_time   TIME         00:00:00              23:59:59                      88400                                                                1051343             0.00 
 unix_time    BIGINT       1672552808            1704088796                  1009172    1682651789          1690184881          1698137297          1051343             0.00 
 category     VARCHAR      entertainment         travel                           14                                                                1051343             0.00 
 amt          DOUBLE       1.0                   26810.71                      52142    9.175574441235733   44.90613575342282   82.69103793265246   1051343             0.00 
 is_fraud     BIGINT       0                     1                                 2    0                   0                   0                   1051343             0.00 
 merchant     VARCHAR      fraud_Abbott-Rogahn   fraud_Zulauf LLC                716                                                                1051343             0.00 
 merch_lat    DOUBLE       19.845823             62.589931                   1026312    33.69838307624862   38.44309616618485   41.26469799023738   1051343             0.00 
 merch_long   DOUBLE       -160.669149           -67.413594                  1053435    -98.36918390322704  -86.93663111949701  -79.54040155192297  1051343             0.00 
├─────────────┴─────────────┴──────────────────────┴──────────────────────┴───────────────┴───┴────────────────────┴────────────────────┴────────────────────┴─────────┴─────────────────┤
 26 rows                                                                                                                                                          12 columns (10 shown) 
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Evaluation#

The problem had two main areas of evaluation - dealing with data scale, and extensibility to support changing/additional requirements.

Runtime Performance - Raw CSV#

As we’ve seen, duckdb can run queries directly on a set of csv files. Lets see how this performs with a query similar to the first one called for in the original problem.

 time duckdb -c "SELECT SUM(amt) FROM READ_CSV('data_1mm_txn/*adults_*.csv', header = true) WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
 ┌───────────┐
  sum(amt)  
  double    
 ├───────────┤
  160771.07 
 └───────────┘
 duckdb -c 4.21s user 0.09s system 690% cpu 0.623 total
 max memory: 303696 KB
 time duckdb -c "SELECT SUM(amt) FROM READ_CSV('data_10mm_txn/*adults_*.csv', header = true) WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
 ┌────────────────────┐
  sum(amt)           
  double             
 ├────────────────────┤
  1529945.0599999998 
 └────────────────────┘
 duckdb -c 14.31s user 0.58s system 780% cpu 1.909 total
 max memory: 1270608 KB
 time duckdb -c "SELECT SUM(amt) FROM READ_CSV('data_100mm_txn/*adults_*.csv', header = true) WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
 ┌───────────────────┐
  sum(amt)          
  double            
 ├───────────────────┤
  16001193.47999998 
 └───────────────────┘
 duckdb -c 42.19s user 8.07s system 463% cpu 10.853 total
 max memory: 6491968 KB
 time duckdb -c "SELECT SUM(amt) FROM READ_CSV('data_200mm_txn/*adults_*.csv', header = true) WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
 ┌───────────────────┐
  sum(amt)          
  double            
 ├───────────────────┤
  32126444.01999999 
 └───────────────────┘
 duckdb -c 74.13s user 17.85s system 418% cpu 21.984 total
 max memory: 10977920 KB

Not bad, but not great either - as a user, I really want these queries finishing in well under 5 seconds. I’m not too concerned about how much CPU and memory it uses as long as it successfully runs on my laptop. Note that it did use multiple CPUs in parallel, without me having to build that. It's not too surprising that the performance was poor - CSV is not optimized for ad-hoc querying (or really much of anything other than pulling into a spreadsheet). To summarize:

Num Records User Time (s) Memory Used (GB)
1mm 4.21 0.29
10mm 14.31 1.21
100mm 42.19 6.19
200mm 74.13 10.47

Runtime Performance - Native DuckDB#

We can easily import data from CSV to a native DuckDB table file. Let's see if that is any better.

Load Data#

 duckdb
v0.10.0 20b1486d11
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .timer on
D .open data_1mm_txn.db
D CREATE TABLE txn AS SELECT * FROM READ_CSV('data_1mm_txn/*adults_*.csv', types = {'gender': 'varchar', 'cc_num': 'varchar', 'acct_num': 'varchar'}, header = true);
Run Time (s): real 1.674 user 7.352345 sys 0.190694
D .open data_10mm_txn.db
D CREATE TABLE txn AS SELECT * FROM READ_CSV('data_10mm_txn/*adults_*.csv', types = {'gender': 'varchar', 'cc_num': 'varchar', 'acct_num': 'varchar'}, header = true);
Run Time (s): real 7.230 user 42.859433 sys 1.095903
D .open data_100mm_txn.db
D CREATE TABLE txn AS SELECT * FROM READ_CSV('data_100mm_txn/*adults_*.csv', types = {'gender': 'varchar', 'cc_num': 'varchar', 'acct_num': 'varchar'}, header = true);
Run Time (s): real 31.078 user 262.523123 sys 10.148275
D .open data_200mm_txn.db
D CREATE TABLE txn AS SELECT * FROM READ_CSV('data_200mm_txn/*adults_*.csv', types = {'gender': 'varchar', 'cc_num': 'varchar', 'acct_num': 'varchar'}, header = true);
Run Time (s): real 62.741 user 494.657874 sys 22.505040

The time taken to load is quite a bit longer than to run a single query, so we will want the query speed to be dramatically improved to justify it. Though presumably we would be querying much more frequently than loading massive amounts of more CSV data.

➜ ls -lh *.db
-rw-r--r-- 1 rmckay staff 4.9G Aug 6 17:35 data_100mm_txn.db
-rw-r--r-- 1 rmckay staff 713M Aug 6 17:33 data_10mm_txn.db
-rw-r--r-- 1 rmckay staff 48M Aug 6 17:32 data_1mm_txn.db
-rw-r--r-- 1 rmckay staff 9.4G Aug 6 17:37 data_200mm_txn.db

The resulting files are much smaller than the source CSV data.

Run Queries#

 time duckdb data_1mm_txn.db "SELECT SUM(amt) FROM txn WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
┌───────────┐
 sum(amt)  
 double    
├───────────┤
 160771.07 
└───────────┘
duckdb data_1mm_txn.db 0.04s user 0.01s system 123% cpu 0.039 total
max memory: 25664 KB

 time duckdb data_10mm_txn.db "SELECT SUM(amt) FROM txn WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
┌────────────────────┐
 sum(amt)           
 double             
├────────────────────┤
 1529945.0600000003 
└────────────────────┘
duckdb data_10mm_txn.db 0.22s user 0.03s system 378% cpu 0.065 total
max memory: 94928 KB

 time duckdb data_100mm_txn.db "SELECT SUM(amt) FROM txn WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
┌────────────────────┐
 sum(amt)           
 double             
├────────────────────┤
 16001193.479999982 
└────────────────────┘
duckdb data_100mm_txn.db 1.73s user 0.17s system 708% cpu 0.268 total
max memory: 670576 KB

 time duckdb data_200mm_txn.db "SELECT SUM(amt) FROM txn WHERE category='grocery_pos' AND trans_date between '2023-07-01' AND '2023-08-31';"
┌───────────────────┐
 sum(amt)          
 double            
├───────────────────┤
 32126444.02000001 
└───────────────────┘
duckdb data_200mm_txn.db 3.40s user 0.30s system 767% cpu 0.482 total
max memory: 1254576 KB
Num Records User Time (s) Memory Used (GB)
1mm 0.04 0.024
10mm 0.22 0.091
100mm 1.73 0.640
200mm 3.40 1.196

That was blazingly fast and used way less memory than the raw CSV approach.

Extensibility#

It’s easy to see that pretty much any query our users could think of would be easily supported. Just for fun lets run some stand-ins for the other queries called for in the original problem. We’ll just stick with the 200mm dataset for all of these.

 duckdb data_200mm_txn.db
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D .timer on
D SELECT category, COUNT(*) FROM txn GROUP BY category;
┌────────────────┬──────────────┐
 category        count_star() 
 varchar         int64        
├────────────────┼──────────────┤
 misc_pos        13667090     
 shopping_pos    20627532     
 grocery_pos     19843040     
 food_dining     15616003     
 personal_care   14653811     
 shopping_net    15559261     
 health_fitness  12955427     
 home            19517464     
 gas_transport   17922579     
 misc_net        9632379      
 travel          6645464      
 kids_pets       18068316     
 grocery_net     8885726      
 entertainment   15113737     
├────────────────┴──────────────┤
 14 rows 2 columns             
└───────────────────────────────┘
Run Time (s): real 0.482 user 4.546682 sys 0.016826
D SELECT SUM(amt) FROM txn WHERE category='travel' AND trans_date between '2023-06-01' AND '2023-07-31';
┌────────────────────┐
 sum(amt)           
 double             
├────────────────────┤
 2926518.8900000015 
└────────────────────┘
Run Time (s): real 0.324 user 3.115719 sys 0.006838
D SELECT SUM(amt) FROM txn WHERE trans_date between '2023-06-01' AND '2023-06-30';
┌────────────────────┐
 sum(amt)           
 double             
├────────────────────┤
 1344176312.4700007 
└────────────────────┘
Run Time (s): real 0.171 user 1.495202 sys 0.064187
D SELECT category, ROUND(SUM(amt), 2) AS total_amount
FROM txn
WHERE trans_date between '2023-06-01' AND '2023-08-31'
GROUP BY category
ORDER BY total_amount DESC
LIMIT 5;
┌───────────────┬───────────────┐
 category       total_amount  
 varchar        double        
├───────────────┼───────────────┤
 shopping_pos   1398272714.98 
 entertainment  986193998.59  
 shopping_net   874592892.17  
 food_dining    567098508.99  
 misc_pos       144910582.16  
└───────────────┴───────────────┘
Run Time (s): real 0.337 user 3.183682 sys 0.008045

I used the DuckDB’s interactive mode this time. It is sitting at 1.27GB memory used according to Activity Monitor.

Conclusion#

The original problem mentioned running in AWS or on a user’s laptop. For both scenarios, I think distributing the data as a DuckDB database file and using DuckDB to query it would be an excellent solution. Certainly much, much better than CSV + bespoke summarizing logic written in Python. If you want to interview for big* data competence, look for familiarity with modern data libraries.

* most orgs' data isn't really that big