dbt: how to implement partition merging with metadata preservation in a large fact table snapshot in dbt


βœ… The Problem: Large Snapshots with Too Many Partitions

In large dbt projects using snapshots (especially slowly changing dimensions or fact snapshots), each change can create a new record per primary key per day (or other grain), leading to:

  • ⚠️ Thousands or millions of tiny partitions
  • πŸ“‰ Poor performance for queries on recent data
  • πŸ’Έ Higher storage costs
  • 🐌 Slower incremental loads

🎯 The Goal: Keep history, but merge old partitions intelligently

We want to consolidate older partitions while preserving key metadata for historical analysis (e.g., dbt_valid_from, dbt_valid_to, dbt_updated_at).


πŸ§ͺ Example Scenario

Imagine you have a snapshot of fct_orders_snapshot with the following structure:

order_idstatusdbt_valid_fromdbt_valid_to
1001shipped2024-01-012024-03-01
1001delayed2024-03-012024-05-01
1001shipped2024-05-01null

Each of these records lives in a separate daily partition, like:

  • snapshot_date = '2024-01-01'
  • snapshot_date = '2024-03-01'
  • snapshot_date = '2024-05-01'

Over time, the table becomes fragmented.


βœ… The Strategy: Partition Merging with Metadata Preservation

Merge old snapshot partitions into monthly or quarterly partitions, but keep the original change history columns.


🧱 Step-by-Step Example in dbt

Let’s assume your snapshot is materialized as a table with partitioning by snapshot_date.

1️⃣ Create a new dbt model: fct_orders_snapshot_merged.sql

{{ config(
    materialized = 'table',
    partition_by = {'field': 'snapshot_month', 'data_type': 'date'},
    unique_key = ['order_id', 'dbt_valid_from'],
    cluster_by = ['order_id']
) }}

with base as (

    select *
    from {{ ref('fct_orders_snapshot') }}
    where snapshot_date < date_trunc('month', current_date - interval '3 months')

),

merged as (

    select
        order_id,
        status,
        dbt_valid_from,
        dbt_valid_to,
        dbt_updated_at,
        date_trunc('month', dbt_valid_from) as snapshot_month
    from base

)

select * from merged

This creates monthly merged partitions for data older than 3 months.


2️⃣ Keep recent snapshot data unmerged

Let recent changes remain partitioned daily in fct_orders_snapshot.
Then union the merged and recent data in a final model:

-- fct_orders_snapshot_final.sql

with merged as (
    select * from {{ ref('fct_orders_snapshot_merged') }}
),

recent as (
    select * from {{ ref('fct_orders_snapshot') }}
    where snapshot_date >= date_trunc('month', current_date - interval '3 months')
)

select * from merged
union all
select * from recent

βœ… Result

FeatureBenefit
πŸ“¦ Merged partitionsFewer partitions for old data (e.g., one per month)
πŸ” Metadata preservedStill have dbt_valid_from, dbt_valid_to, status, etc.
⚑ Faster queriesEspecially on recent data
πŸ’° Lower storage costFewer small files or partitions
πŸ”„ Lifecycle readyYou can purge very old partitions if needed (e.g., >2 years)

πŸ” Optional: Automate with dbt macros

You can write a macro that:

  • Identifies eligible old partitions
  • Merges them dynamically
  • Updates metadata consistently

This is especially useful in BigQuery or Snowflake, where partition management is key to performance and cost.


βœ… Summary

PatternPurpose
Partition mergingConsolidates old data to improve performance
Metadata preservationKeeps full change history (dbt_valid_from, etc.)
Final viewCombines merged + recent data into one consistent table
Storage winLower cost + better performance without losing audit trail

Leave a Reply

Your email address will not be published. Required fields are marked *