dbt: how to implement partition merging with metadata preservation in a large fact table snapshot in dbt


✅ The Problem: Large Snapshots with Too Many Partitions

In large dbt projects using snapshots (especially slowly changing dimensions or fact snapshots), each change can create a new record per primary key per day (or other grain), leading to:

  • ⚠️ Thousands or millions of tiny partitions
  • 📉 Poor performance for queries on recent data
  • 💸 Higher storage costs
  • 🐌 Slower incremental loads

🎯 The Goal: Keep history, but merge old partitions intelligently

We want to consolidate older partitions while preserving key metadata for historical analysis (e.g., dbt_valid_from, dbt_valid_to, dbt_updated_at).


🧪 Example Scenario

Imagine you have a snapshot of fct_orders_snapshot with the following structure:

order_idstatusdbt_valid_fromdbt_valid_to
1001shipped2024-01-012024-03-01
1001delayed2024-03-012024-05-01
1001shipped2024-05-01null

Each of these records lives in a separate daily partition, like:

  • snapshot_date = '2024-01-01'
  • snapshot_date = '2024-03-01'
  • snapshot_date = '2024-05-01'

Over time, the table becomes fragmented.


✅ The Strategy: Partition Merging with Metadata Preservation

Merge old snapshot partitions into monthly or quarterly partitions, but keep the original change history columns.


🧱 Step-by-Step Example in dbt

Let’s assume your snapshot is materialized as a table with partitioning by snapshot_date.

1️⃣ Create a new dbt model: fct_orders_snapshot_merged.sql

{{ config(
    materialized = 'table',
    partition_by = {'field': 'snapshot_month', 'data_type': 'date'},
    unique_key = ['order_id', 'dbt_valid_from'],
    cluster_by = ['order_id']
) }}

with base as (

    select *
    from {{ ref('fct_orders_snapshot') }}
    where snapshot_date < date_trunc('month', current_date - interval '3 months')

),

merged as (

    select
        order_id,
        status,
        dbt_valid_from,
        dbt_valid_to,
        dbt_updated_at,
        date_trunc('month', dbt_valid_from) as snapshot_month
    from base

)

select * from merged

This creates monthly merged partitions for data older than 3 months.


2️⃣ Keep recent snapshot data unmerged

Let recent changes remain partitioned daily in fct_orders_snapshot.
Then union the merged and recent data in a final model:

-- fct_orders_snapshot_final.sql

with merged as (
    select * from {{ ref('fct_orders_snapshot_merged') }}
),

recent as (
    select * from {{ ref('fct_orders_snapshot') }}
    where snapshot_date >= date_trunc('month', current_date - interval '3 months')
)

select * from merged
union all
select * from recent

✅ Result

FeatureBenefit
📦 Merged partitionsFewer partitions for old data (e.g., one per month)
🔍 Metadata preservedStill have dbt_valid_from, dbt_valid_to, status, etc.
⚡ Faster queriesEspecially on recent data
💰 Lower storage costFewer small files or partitions
🔄 Lifecycle readyYou can purge very old partitions if needed (e.g., >2 years)

🔁 Optional: Automate with dbt macros

You can write a macro that:

  • Identifies eligible old partitions
  • Merges them dynamically
  • Updates metadata consistently

This is especially useful in BigQuery or Snowflake, where partition management is key to performance and cost.


✅ Summary

PatternPurpose
Partition mergingConsolidates old data to improve performance
Metadata preservationKeeps full change history (dbt_valid_from, etc.)
Final viewCombines merged + recent data into one consistent table
Storage winLower cost + better performance without losing audit trail

Leave a Reply

Your email address will not be published. Required fields are marked *