๐ง What is a Hybrid SCD?
A Hybrid SCD combines:
- Type 1 columns โ overwrite values in place (no history)
- Type 2 columns โ track history by versioning rows with effective dates and surrogate keys
โ Why use Hybrid SCD?
Feature | Benefit |
---|---|
Type 1 โ Overwrite | For non-critical attributes like phone number, email, etc. |
Type 2 โ Versioning | For business-critical changes like region, segment, or status |
Hybrid โ Combined model | Avoids splitting tables or duplicating unnecessary history |
๐ฆ Example Use Case: Customer Dimension
We have a dim_customers
table with these columns:
Attribute | SCD Type | Notes |
---|---|---|
customer_id | Natural Key | |
customer_sk | Surrogate Key (Primary Key) | |
name | Type 1 | Typo corrections are fine |
email | Type 1 | Can be updated directly |
region | Type 2 | We want to track changes over time |
is_active | Type 2 | Used in reporting/audits |
valid_from | – | Start timestamp for this version |
valid_to | – | End timestamp (null = current) |
is_current | – | Boolean flag for current row |
๐ ๏ธ dbt Implementation: Hybrid SCD with Selective Versioning
โ Step 1: Define your model
๐ models/dim/dim_customers.sql
{{ config(
materialized='incremental',
unique_key='customer_id',
incremental_strategy='insert_overwrite'
) }}
with source as (
select
customer_id,
name,
email,
region,
is_active,
current_timestamp() as updated_at
from {{ ref('stg_customers') }}
),
-- Get current version of customer from dimension
existing as (
select *
from {{ this }}
where is_current = true
),
-- Join source with existing data
changes as (
select
src.*,
dim.customer_sk,
dim.region as old_region,
dim.is_active as old_is_active,
dim.name as old_name,
dim.email as old_email
from source src
left join existing dim
on src.customer_id = dim.customer_id
),
-- Identify rows needing Type 2 handling
scd_type_2 as (
select *
from changes
where region != old_region or is_active != old_is_active
),
-- Identify rows needing Type 1 update only
scd_type_1 as (
select *
from changes
where customer_sk is not null
and (region = old_region and is_active = old_is_active)
and (name != old_name or email != old_email)
),
final as (
-- New rows for Type 2 changes
select
{{ dbt_utils.surrogate_key(['customer_id', 'updated_at']) }} as customer_sk,
customer_id,
name,
email,
region,
is_active,
updated_at as valid_from,
null as valid_to,
true as is_current
from scd_type_2
union all
-- Updated rows for Type 1 changes
select
customer_sk,
customer_id,
name,
email,
region,
is_active,
valid_from,
valid_to,
is_current
from {{ this }}
where is_current = true
and customer_id in (select customer_id from scd_type_1)
)
select * from final
โ Step 2: Benefits of this Hybrid SCD Strategy
Aspect | Advantage |
---|---|
๐ง Logic separation | Clear Type 1 vs Type 2 distinction |
๐งพ Efficient storage | Only version rows when necessary |
๐ Unified query surface | One dimension table, no need to join Type 1 and Type 2 tables |
๐ Incremental-friendly | Works well with insert_overwrite strategy in dbt |
๐ Easier maintenance | All rules live in one model with clean logic blocks |
โ Step 3: Downstream Usage
When querying, you typically filter for current records:
select *
from {{ ref('dim_customers') }}
where is_current = true
Or use valid_from
and valid_to
for historical views or point-in-time analysis.
๐งผ Summary
- Hybrid SCD is a balanced approach to handling changing dimensions.
- Use Type 2 versioning only when needed, and Type 1 updates otherwise.
- dbt makes this efficient via:
incremental
modelsinsert_overwrite
strategies- Macros like
surrogate_key