β The Problem: Large Snapshots with Too Many Partitions
In large dbt projects using snapshots (especially slowly changing dimensions or fact snapshots), each change can create a new record per primary key per day (or other grain), leading to:
- β οΈ Thousands or millions of tiny partitions
- π Poor performance for queries on recent data
- πΈ Higher storage costs
- π Slower incremental loads
π― The Goal: Keep history, but merge old partitions intelligently
We want to consolidate older partitions while preserving key metadata for historical analysis (e.g.,
dbt_valid_from
,dbt_valid_to
,dbt_updated_at
).
π§ͺ Example Scenario
Imagine you have a snapshot of fct_orders_snapshot
with the following structure:
order_id | status | dbt_valid_from | dbt_valid_to |
---|---|---|---|
1001 | shipped | 2024-01-01 | 2024-03-01 |
1001 | delayed | 2024-03-01 | 2024-05-01 |
1001 | shipped | 2024-05-01 | null |
Each of these records lives in a separate daily partition, like:
snapshot_date = '2024-01-01'
snapshot_date = '2024-03-01'
snapshot_date = '2024-05-01'
Over time, the table becomes fragmented.
β The Strategy: Partition Merging with Metadata Preservation
Merge old snapshot partitions into monthly or quarterly partitions, but keep the original change history columns.
π§± Step-by-Step Example in dbt
Letβs assume your snapshot is materialized as a table with partitioning by snapshot_date
.
1οΈβ£ Create a new dbt model: fct_orders_snapshot_merged.sql
{{ config(
materialized = 'table',
partition_by = {'field': 'snapshot_month', 'data_type': 'date'},
unique_key = ['order_id', 'dbt_valid_from'],
cluster_by = ['order_id']
) }}
with base as (
select *
from {{ ref('fct_orders_snapshot') }}
where snapshot_date < date_trunc('month', current_date - interval '3 months')
),
merged as (
select
order_id,
status,
dbt_valid_from,
dbt_valid_to,
dbt_updated_at,
date_trunc('month', dbt_valid_from) as snapshot_month
from base
)
select * from merged
This creates monthly merged partitions for data older than 3 months.
2οΈβ£ Keep recent snapshot data unmerged
Let recent changes remain partitioned daily in fct_orders_snapshot
.
Then union the merged and recent data in a final model:
-- fct_orders_snapshot_final.sql
with merged as (
select * from {{ ref('fct_orders_snapshot_merged') }}
),
recent as (
select * from {{ ref('fct_orders_snapshot') }}
where snapshot_date >= date_trunc('month', current_date - interval '3 months')
)
select * from merged
union all
select * from recent
β Result
Feature | Benefit |
---|---|
π¦ Merged partitions | Fewer partitions for old data (e.g., one per month) |
π Metadata preserved | Still have dbt_valid_from , dbt_valid_to , status , etc. |
β‘ Faster queries | Especially on recent data |
π° Lower storage cost | Fewer small files or partitions |
π Lifecycle ready | You can purge very old partitions if needed (e.g., >2 years) |
π Optional: Automate with dbt macros
You can write a macro that:
- Identifies eligible old partitions
- Merges them dynamically
- Updates metadata consistently
This is especially useful in BigQuery or Snowflake, where partition management is key to performance and cost.
β Summary
Pattern | Purpose |
---|---|
Partition merging | Consolidates old data to improve performance |
Metadata preservation | Keeps full change history (dbt_valid_from , etc.) |
Final view | Combines merged + recent data into one consistent table |
Storage win | Lower cost + better performance without losing audit trail |