dbt: Hybrid Slowly Changing Dimension (SCD) Type1 & Type 2


๐Ÿง  What is a Hybrid SCD?

A Hybrid SCD combines:

  • Type 1 columns โ†’ overwrite values in place (no history)
  • Type 2 columns โ†’ track history by versioning rows with effective dates and surrogate keys

โœ… Why use Hybrid SCD?

FeatureBenefit
Type 1 โ†’ OverwriteFor non-critical attributes like phone number, email, etc.
Type 2 โ†’ VersioningFor business-critical changes like region, segment, or status
Hybrid โ†’ Combined modelAvoids splitting tables or duplicating unnecessary history

๐Ÿ“ฆ Example Use Case: Customer Dimension

We have a dim_customers table with these columns:

AttributeSCD TypeNotes
customer_idNatural Key
customer_skSurrogate Key (Primary Key)
nameType 1Typo corrections are fine
emailType 1Can be updated directly
regionType 2We want to track changes over time
is_activeType 2Used in reporting/audits
valid_fromStart timestamp for this version
valid_toEnd timestamp (null = current)
is_currentBoolean flag for current row

๐Ÿ› ๏ธ dbt Implementation: Hybrid SCD with Selective Versioning

โœ… Step 1: Define your model

๐Ÿ“ models/dim/dim_customers.sql

{{ config(
    materialized='incremental',
    unique_key='customer_id',
    incremental_strategy='insert_overwrite'
) }}

with source as (

  select
    customer_id,
    name,
    email,
    region,
    is_active,
    current_timestamp() as updated_at
  from {{ ref('stg_customers') }}

),

-- Get current version of customer from dimension
existing as (

  select *
  from {{ this }}
  where is_current = true

),

-- Join source with existing data
changes as (

  select
    src.*,
    dim.customer_sk,
    dim.region as old_region,
    dim.is_active as old_is_active,
    dim.name as old_name,
    dim.email as old_email
  from source src
  left join existing dim
    on src.customer_id = dim.customer_id

),

-- Identify rows needing Type 2 handling
scd_type_2 as (

  select *
  from changes
  where region != old_region or is_active != old_is_active

),

-- Identify rows needing Type 1 update only
scd_type_1 as (

  select *
  from changes
  where customer_sk is not null
    and (region = old_region and is_active = old_is_active)
    and (name != old_name or email != old_email)

),

final as (

  -- New rows for Type 2 changes
  select
    {{ dbt_utils.surrogate_key(['customer_id', 'updated_at']) }} as customer_sk,
    customer_id,
    name,
    email,
    region,
    is_active,
    updated_at as valid_from,
    null as valid_to,
    true as is_current
  from scd_type_2

  union all

  -- Updated rows for Type 1 changes
  select
    customer_sk,
    customer_id,
    name,
    email,
    region,
    is_active,
    valid_from,
    valid_to,
    is_current
  from {{ this }}
  where is_current = true
    and customer_id in (select customer_id from scd_type_1)

)

select * from final

โœ… Step 2: Benefits of this Hybrid SCD Strategy

AspectAdvantage
๐Ÿง  Logic separationClear Type 1 vs Type 2 distinction
๐Ÿงพ Efficient storageOnly version rows when necessary
๐Ÿ“Š Unified query surfaceOne dimension table, no need to join Type 1 and Type 2 tables
๐Ÿš€ Incremental-friendlyWorks well with insert_overwrite strategy in dbt
๐Ÿ” Easier maintenanceAll rules live in one model with clean logic blocks

โœ… Step 3: Downstream Usage

When querying, you typically filter for current records:

select *
from {{ ref('dim_customers') }}
where is_current = true

Or use valid_from and valid_to for historical views or point-in-time analysis.


๐Ÿงผ Summary

  • Hybrid SCD is a balanced approach to handling changing dimensions.
  • Use Type 2 versioning only when needed, and Type 1 updates otherwise.
  • dbt makes this efficient via:
    • incremental models
    • insert_overwrite strategies
    • Macros like surrogate_key

Leave a Reply

Your email address will not be published. Required fields are marked *