dbt: how to implement partition merging with metadata preservation in a large fact table snapshot in dbt

✅ The Problem: Large Snapshots with Too Many Partitions

In large dbt projects using snapshots (especially slowly changing dimensions or fact snapshots), each change can create a new record per primary key per day (or other grain), leading to:

⚠️ Thousands or millions of tiny partitions
📉 Poor performance for queries on recent data
💸 Higher storage costs
🐌 Slower incremental loads

🎯 The Goal: Keep history, but merge old partitions intelligently

We want to consolidate older partitions while preserving key metadata for historical analysis (e.g., dbt_valid_from, dbt_valid_to, dbt_updated_at).

🧪 Example Scenario

Imagine you have a snapshot of fct_orders_snapshot with the following structure:

order_id	status	dbt_valid_from	dbt_valid_to
1001	shipped	2024-01-01	2024-03-01
1001	delayed	2024-03-01	2024-05-01
1001	shipped	2024-05-01	null

Each of these records lives in a separate daily partition, like:

snapshot_date = '2024-01-01'
snapshot_date = '2024-03-01'
snapshot_date = '2024-05-01'

Over time, the table becomes fragmented.

✅ The Strategy: Partition Merging with Metadata Preservation

Merge old snapshot partitions into monthly or quarterly partitions, but keep the original change history columns.

🧱 Step-by-Step Example in dbt

Let’s assume your snapshot is materialized as a table with partitioning by snapshot_date.

1️⃣ Create a new dbt model: `fct_orders_snapshot_merged.sql`

{{ config(
    materialized = 'table',
    partition_by = {'field': 'snapshot_month', 'data_type': 'date'},
    unique_key = ['order_id', 'dbt_valid_from'],
    cluster_by = ['order_id']
) }}

with base as (

    select *
    from {{ ref('fct_orders_snapshot') }}
    where snapshot_date < date_trunc('month', current_date - interval '3 months')

),

merged as (

    select
        order_id,
        status,
        dbt_valid_from,
        dbt_valid_to,
        dbt_updated_at,
        date_trunc('month', dbt_valid_from) as snapshot_month
    from base

)

select * from merged

This creates monthly merged partitions for data older than 3 months.

2️⃣ Keep recent snapshot data unmerged

Let recent changes remain partitioned daily in fct_orders_snapshot.
Then union the merged and recent data in a final model:

-- fct_orders_snapshot_final.sql

with merged as (
    select * from {{ ref('fct_orders_snapshot_merged') }}
),

recent as (
    select * from {{ ref('fct_orders_snapshot') }}
    where snapshot_date >= date_trunc('month', current_date - interval '3 months')
)

select * from merged
union all
select * from recent

✅ Result

Feature	Benefit
📦 Merged partitions	Fewer partitions for old data (e.g., one per month)
🔍 Metadata preserved	Still have `dbt_valid_from`, `dbt_valid_to`, `status`, etc.
⚡ Faster queries	Especially on recent data
💰 Lower storage cost	Fewer small files or partitions
🔄 Lifecycle ready	You can purge very old partitions if needed (e.g., >2 years)

🔁 Optional: Automate with dbt macros

You can write a macro that:

Identifies eligible old partitions
Merges them dynamically
Updates metadata consistently

This is especially useful in BigQuery or Snowflake, where partition management is key to performance and cost.

✅ Summary

Pattern	Purpose
Partition merging	Consolidates old data to improve performance
Metadata preservation	Keeps full change history (`dbt_valid_from`, etc.)
Final view	Combines merged + recent data into one consistent table
Storage win	Lower cost + better performance without losing audit trail