Partition tables in MotherDuck storage
Ali Jahani
I am not a data engineer, I might be wrong about this but this was my experience. Please feel free to correct me or if you know a better way.
I was trying to follow https://www.youtube.com/watch?v=p_A3UYAm8Cg where it missed the partitioning functionality.
Imagine if I have an asset Foo and it is daily partitioned. In dagster, if I want to rematerialize Foo for a certain day, I have to make sure, first, I remove everything from that day in my table then insert all the data for that day in the table. querying and removing the data for those exact day might not be as efficient as removing the whole table for that day.
Having partitions saved in seperate tables would make implementation easier and less bug prone. Maybe somkething similar to hive partitioning in parquet files.
something similar to the case below(I noticed it is not allowed to have hyphen in the table name)
For tables like this I would like to do a query like:
"SELECT
FROM main" or "SELECT
FROM cases_ (some sort of regular expression)" when motherduck takes this query it should dynamically look for the tables with the same name under main or what matches the regular expression or those who have the property of same grouped and does a UNION for them.
ofcourse grouped tables schemas should be enforced to be always the same. I hope this helps. I would love to get your feedback.
J
Jan Soubusta
Deleting millions or even billions of rows is expensive.
Not sure how it works in MotherDuck(DuckDB), but in Vertica delete option creates so called delete vectors (fast operation), which must be resolved later when SELECTing from the same table, which is expensive, SELECT operations degrade with more and more delete vectors. Delete vectors are cleaned asynchronously, which is again very expensive operation, which can block user queries.
DROP PARTITION should be a simple catalog operation + deleting immutable files from the filesystem. No user queries should be affected.
MOVE PARTITION is valuable when archiving old partitions to different table.
It is valuable esp. when applying different maintenance rules on prod/archive tables, e.g. how to backup, how to cleanup, where to store (cold vs. hot storage), etc. Obviously, if such rules can be applied per partition, then it is not necessary to move partitions to different tables.