Full support for querying Iceberg tables
Chris Atkins
Are you folks looking to support iceberg? I know there's the duckdb iceberg extension, but it's in a very unfinished state (it doesn't support catalogs, predicate pushdown, writes, ...).
We have a customer-facing web-app that mostly deals with pre-aggregated data that we keep in postgres, but some views need to drill down to very small slices of the raw data. The raw data lives in a partitioned iceberg table on S3 (with glue catalog).
I can query it with Athena of course, and I indeed tried doing that, but the latency was all over the place.
Rather than running trino ourselves, I ended up writing a tiny java api using the iceberg java libraries and the duckdb jdbc connector. Basically, for my queries I use the iceberg library to figure out which parquet files are relevant to scan, query those with read_parquet([ <the list of files> ]), and present the results to the client.
If duckdb/motherduck supported iceberg more robustly, we'd totally just throw away the little java service and use md!
We currently use the iceberg FindFiles helpers, specifically the builder: https://iceberg.apache.org/javadoc/1.5.2/org/apache/iceberg/FindFiles.Builder.html
We call its withRecordsMatching() method with the relevant filter expressions, then build the read_parquet() call from the matching files and swap it into our query.
Ideally I'd be able to just write a more boring
SELECT blah FROM my_iceberg_table WHERE a = 1 AND customer_id = 2 AND timestamp > :timestamp
or FROM iceberg_scan('my_table', catalog='glue', region='us-east-1').
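
For anyone curious, here is a minimal sketch of the query-building half of the workaround described above. It assumes the data file paths have already been collected from the iceberg scan planning step (FindFiles with withRecordsMatching()), which is left out here so the snippet runs with only the JDK. The class name, method names, and the {source} placeholder are hypothetical and are not taken from the actual service.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: not part of the iceberg or duckdb libraries.
final class ReadParquetQuery {
    private ReadParquetQuery() {}

    // Quote a path as a SQL string literal, doubling any embedded single quotes.
    static String quote(String path) {
        return "'" + path.replace("'", "''") + "'";
    }

    // Build read_parquet(['a', 'b', ...]) from the planned data file paths.
    static String tableFunction(List<String> paths) {
        if (paths.isEmpty()) {
            // An empty list means the filter matched no files; the caller
            // should return an empty result instead of querying.
            throw new IllegalArgumentException("no data files matched the filter");
        }
        return paths.stream()
                .map(ReadParquetQuery::quote)
                .collect(Collectors.joining(", ", "read_parquet([", "])"));
    }

    // Swap the table function into a query template at the {source} placeholder.
    static String render(String template, List<String> paths) {
        return template.replace("{source}", tableFunction(paths));
    }
}
```

The rendered string would then go to the duckdb jdbc connection as a prepared statement, with the remaining ? placeholders bound as usual. Only the file list is spliced into the SQL text, which is why the paths are quoted explicitly.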