๐ฏ What is Data Masking?
Data masking is the process of obfuscating sensitive data while preserving its format and structure for development, testing, and analytics purposes. In dbt, we implement this through reusable macros that consistently apply masking rules across all models.
๐จ Why Use Macros for Data Masking?
โ Problems with Hardcoded Masking
-- Bad: Hardcoded masking scattered across models
-- models/customers.sql
SELECT
customer_id,
CONCAT(LEFT(email, 3), '****@****.com') as email, -- Inconsistent
customer_name
FROM raw_customers
-- models/orders.sql
SELECT
order_id,
SUBSTRING(email, 1, 2) || '***@***.com' as customer_email, -- Different logic!
order_total
FROM raw_orders
Issues:
- Inconsistent masking patterns
- Hard to maintain when rules change
- No central governance
- Difficult to test masking effectiveness
โ Benefits of Macro-Based Masking
- Consistency: Same masking logic everywhere
- Maintainability: Change once, update everywhere
- Testability: Validate masking patterns
- Documentation: Clear data protection measures
- Compliance: Easy to adapt to new regulations
๐ ๏ธ Complete Data Masking Implementation
Step 1: Core Masking Macros
-- macros/data_masking.sql
{% macro mask_email(email_column) %}
CASE
WHEN {{ email_column }} IS NULL THEN NULL
WHEN {{ email_column }} = '' THEN ''
ELSE CONCAT(
LEFT({{ email_column }}, 2),
'***@',
CASE
WHEN POSITION('@' IN {{ email_column }}) > 0
THEN RIGHT(SPLIT({{ email_column }}, '@')[OFFSET(1)], 3)
ELSE '***'
END,
'.com'
)
END
{% endmacro %}
{% macro mask_phone(phone_column) %}
CASE
WHEN {{ phone_column }} IS NULL THEN NULL
WHEN LENGTH({{ phone_column }}) < 4 THEN '***-***-****'
ELSE CONCAT(
'***-***-',
RIGHT({{ phone_column }}, 4)
)
END
{% endmacro %}
{% macro mask_ssn(ssn_column) %}
CASE
WHEN {{ ssn_column }} IS NULL THEN NULL
WHEN LENGTH({{ ssn_column }}) >= 4
THEN CONCAT('***-**-', RIGHT({{ ssn_column }}, 4))
ELSE '***-**-****'
END
{% endmacro %}
{% macro mask_credit_card(cc_column) %}
CASE
WHEN {{ cc_column }} IS NULL THEN NULL
WHEN LENGTH({{ cc_column }}) >= 4
THEN CONCAT('****-****-****-', RIGHT({{ cc_column }}, 4))
ELSE '****-****-****-****'
END
{% endmacro %}
{% macro mask_name(name_column, mask_type='partial') %}
CASE
WHEN {{ name_column }} IS NULL THEN NULL
WHEN '{{ mask_type }}' = 'full' THEN '***MASKED***'
WHEN '{{ mask_type }}' = 'first_letter'
THEN CONCAT(LEFT({{ name_column }}, 1), REPEAT('*', LENGTH({{ name_column }}) - 1))
ELSE -- partial (default)
CASE
WHEN LENGTH({{ name_column }}) <= 2 THEN '**'
ELSE CONCAT(LEFT({{ name_column }}, 2), REPEAT('*', LENGTH({{ name_column }}) - 2))
END
END
{% endmacro %}
{% macro mask_address(address_column) %}
CASE
WHEN {{ address_column }} IS NULL THEN NULL
ELSE CONCAT('*** ', 'MASKED ADDRESS ', '***')
END
{% endmacro %}
Step 2: Advanced Masking with Business Logic
-- macros/advanced_masking.sql
{% macro conditional_masking(column, condition, mask_value='***REDACTED***') %}
CASE
WHEN {{ condition }} THEN '{{ mask_value }}'
ELSE {{ column }}
END
{% endmacro %}
{% macro role_based_masking(column, user_role_column, allowed_roles=[]) %}
{% set roles_list = allowed_roles | join("','") %}
CASE
WHEN {{ user_role_column }} IN ('{{ roles_list }}') THEN {{ column }}
ELSE '***UNAUTHORIZED***'
END
{% endmacro %}
{% macro environment_based_masking(column, mask_in_dev=true) %}
{% if target.name in ['dev', 'development'] and mask_in_dev %}
{{ mask_email(column) if 'email' in column.lower()
else mask_phone(column) if 'phone' in column.lower()
else mask_name(column) }}
{% else %}
{{ column }}
{% endif %}
{% endmacro %}
{% macro dynamic_masking(column, data_type, sensitivity_level='medium') %}
{% if sensitivity_level == 'high' %}
'***HIGHLY_SENSITIVE***'
{% elif sensitivity_level == 'medium' %}
{% if data_type == 'email' %}
{{ mask_email(column) }}
{% elif data_type == 'phone' %}
{{ mask_phone(column) }}
{% elif data_type == 'name' %}
{{ mask_name(column, 'partial') }}
{% else %}
CONCAT('***', RIGHT({{ column }}, 2))
{% endif %}
{% else %}
{{ column }} -- Low sensitivity, no masking
{% endif %}
{% endmacro %}
Step 3: Masking Configuration System
# dbt_project.yml
vars:
data_masking:
enabled: true
environment_rules:
dev: 'mask_all'
staging: 'mask_pii_only'
prod: 'no_masking'
sensitivity_levels:
high:
- 'ssn'
- 'credit_card'
- 'bank_account'
medium:
- 'email'
- 'phone'
- 'full_name'
- 'address'
low:
- 'city'
- 'state'
- 'zip_code'
compliance_standards:
gdpr_enabled: true
hipaa_enabled: false
pci_dss_enabled: true
Step 4: Smart Masking Macro
-- macros/smart_masking.sql
{% macro apply_smart_masking(column_name, data_type=none, sensitivity=none) %}
{% set masking_config = var('data_masking', {}) %}
{% set environment = target.name %}
{% set masking_enabled = masking_config.get('enabled', false) %}
{% if not masking_enabled %}
{{ return(column_name) }}
{% endif %}
{% set env_rule = masking_config.get('environment_rules', {}).get(environment, 'no_masking') %}
{% if env_rule == 'no_masking' %}
{{ return(column_name) }}
{% endif %}
-- Auto-detect data type from column name if not provided
{% if not data_type %}
{% if 'email' in column_name.lower() %}
{% set data_type = 'email' %}
{% elif 'phone' in column_name.lower() %}
{% set data_type = 'phone' %}
{% elif 'ssn' in column_name.lower() %}
{% set data_type = 'ssn' %}
{% elif 'credit_card' in column_name.lower() or 'cc_' in column_name.lower() %}
{% set data_type = 'credit_card' %}
{% elif 'name' in column_name.lower() %}
{% set data_type = 'name' %}
{% elif 'address' in column_name.lower() %}
{% set data_type = 'address' %}
{% else %}
{% set data_type = 'generic' %}
{% endif %}
{% endif %}
-- Auto-detect sensitivity level
{% if not sensitivity %}
{% set high_sensitivity = masking_config.get('sensitivity_levels', {}).get('high', []) %}
{% set medium_sensitivity = masking_config.get('sensitivity_levels', {}).get('medium', []) %}
{% if data_type in high_sensitivity %}
{% set sensitivity = 'high' %}
{% elif data_type in medium_sensitivity %}
{% set sensitivity = 'medium' %}
{% else %}
{% set sensitivity = 'low' %}
{% endif %}
{% endif %}
-- Apply masking based on environment rule and sensitivity
{% if env_rule == 'mask_all' or (env_rule == 'mask_pii_only' and sensitivity in ['high', 'medium']) %}
{% if data_type == 'email' %}
{{ mask_email(column_name) }}
{% elif data_type == 'phone' %}
{{ mask_phone(column_name) }}
{% elif data_type == 'ssn' %}
{{ mask_ssn(column_name) }}
{% elif data_type == 'credit_card' %}
{{ mask_credit_card(column_name) }}
{% elif data_type == 'name' %}
{{ mask_name(column_name) }}
{% elif data_type == 'address' %}
{{ mask_address(column_name) }}
{% else %}
CONCAT('***', RIGHT({{ column_name }}, 2))
{% endif %}
{% else %}
{{ column_name }}
{% endif %}
{% endmacro %}
๐ Practical Implementation Examples
Example 1: Customer Data Model
-- models/marts/customers_masked.sql
{{ config(
materialized='view',
meta={
'data_classification': 'pii_sensitive',
'masking_applied': true
}
) }}
SELECT
customer_id,
-- โ
Consistent masking using macros
{{ apply_smart_masking('email') }} as email,
{{ apply_smart_masking('phone_number') }} as phone_number,
{{ apply_smart_masking('first_name') }} as first_name,
{{ apply_smart_masking('last_name') }} as last_name,
{{ apply_smart_masking('address_line_1') }} as address_line_1,
-- Non-sensitive data remains unchanged
city,
state,
country,
customer_segment,
created_date,
last_login_date
FROM {{ ref('stg_customers') }}
Example 2: Orders with Conditional Masking
-- models/marts/orders_with_privacy.sql
{{ config(materialized='table') }}
SELECT
order_id,
customer_id,
-- Conditional masking based on order status
{{ conditional_masking(
'customer_email',
'order_status = \'refunded\'',
'***REFUNDED_CUSTOMER***'
) }} as customer_email,
-- Role-based masking for payment info
{{ role_based_masking(
'payment_method',
'current_user_role',
['admin', 'finance_manager']
) }} as payment_method,
order_date,
order_total,
order_status
FROM {{ ref('int_orders_enriched') }}
Example 3: Multi-Environment Masking
-- models/staging/stg_employees.sql
{{ config(
materialized='view',
meta={
'contains_pii': true,
'masking_strategy': 'environment_based'
}
) }}
SELECT
employee_id,
-- Environment-aware masking
{% if target.name in ['dev', 'test'] %}
{{ mask_email('work_email') }} as work_email,
{{ mask_phone('work_phone') }} as work_phone,
{{ mask_name('full_name', 'first_letter') }} as full_name,
{{ mask_ssn('ssn') }} as ssn,
{% else %}
work_email,
work_phone,
full_name,
ssn,
{% endif %}
department,
job_title,
hire_date,
salary_band -- Keep salary band, not exact salary
FROM {{ source('hr_system', 'employees') }}
๐งช Testing Masked Data
Masking Validation Tests
-- tests/test_masking_effectiveness.sql
{% set masked_models = ['customers_masked', 'orders_with_privacy'] %}
{% for model in masked_models %}
-- Test: No real emails should exist in masked data
WITH email_check AS (
SELECT email
FROM {{ ref(model) }}
WHERE email LIKE '%@gmail.com'
OR email LIKE '%@yahoo.com'
OR email LIKE '%@hotmail.com'
)
SELECT
'{{ model }}' as model_name,
'Real emails found in masked data' as issue,
COUNT(*) as violation_count
FROM email_check
HAVING COUNT(*) > 0
{% if not loop.last %}UNION ALL{% endif %}
{% endfor %}
-- Test: Phone numbers should be masked
UNION ALL
SELECT
'customers_masked' as model_name,
'Unmasked phone numbers detected' as issue,
COUNT(*) as violation_count
FROM {{ ref('customers_masked') }}
WHERE phone_number NOT LIKE '***-***-%'
AND phone_number IS NOT NULL
HAVING COUNT(*) > 0
Data Quality Tests for Masked Fields
-- tests/test_masked_data_quality.sql
-- Test: Masked emails still follow expected format
SELECT
'Invalid masked email format' as test_name,
COUNT(*) as failures
FROM {{ ref('customers_masked') }}
WHERE email IS NOT NULL
AND email != ''
AND email NOT REGEXP r'^[a-zA-Z0-9]{1,2}\*{3}@[a-zA-Z0-9]+\.com$'
HAVING COUNT(*) > 0
UNION ALL
-- Test: Masked phone numbers preserve length information
SELECT
'Masked phone format inconsistent' as test_name,
COUNT(*) as failures
FROM {{ ref('customers_masked') }}
WHERE phone_number IS NOT NULL
AND phone_number NOT LIKE '***-***-____'
HAVING COUNT(*) > 0
UNION ALL
-- Test: All high-sensitivity fields are masked in dev
SELECT
'High sensitivity data not masked in dev' as test_name,
COUNT(*) as failures
FROM {{ ref('stg_employees') }}
WHERE '{{ target.name }}' = 'dev'
AND (ssn NOT LIKE '***-**-____' OR ssn IS NULL)
HAVING COUNT(*) > 0
๐ Masking Documentation and Governance
Data Catalog with Masking Info
# models/schema.yml
version: 2
models:
- name: customers_masked
description: |
Customer data with PII masking applied for non-production environments.
**Masking Rules Applied:**
- Email: Shows first 2 chars + ***@***.com
- Phone: Shows last 4 digits as ***-***-XXXX
- Names: Shows first 2 chars + asterisks
- Address: Completely masked
**Environment Behavior:**
- Dev/Test: All PII fields masked
- Staging: Medium/High sensitivity masked
- Production: No masking (full data access)
meta:
data_classification: 'pii_sensitive'
masking_applied: true
compliance_standards: ['GDPR', 'CCPA']
columns:
- name: customer_id
description: "Unique customer identifier (never masked)"
- name: email
description: "Customer email address"
meta:
masking_rule: "mask_email"
sensitivity_level: "medium"
original_format_preserved: true
- name: phone_number
description: "Customer phone number"
meta:
masking_rule: "mask_phone"
sensitivity_level: "medium"
last_4_digits_visible: true
Compliance Reporting
-- models/governance/masking_compliance_report.sql
{{ config(materialized='table') }}
WITH model_masking_status AS (
SELECT
'customers_masked' as model_name,
'customers' as source_table,
['email', 'phone_number', 'first_name', 'last_name', 'address_line_1'] as masked_columns,
'PII protection for customer data' as masking_purpose,
['GDPR', 'CCPA'] as compliance_standards
UNION ALL
SELECT
'orders_with_privacy' as model_name,
'orders' as source_table,
['customer_email', 'payment_method'] as masked_columns,
'Financial data protection' as masking_purpose,
['PCI_DSS'] as compliance_standards
),
environment_coverage AS (
SELECT
model_name,
CASE '{{ target.name }}'
WHEN 'dev' THEN 'Full masking active'
WHEN 'staging' THEN 'Partial masking active'
WHEN 'prod' THEN 'No masking (authorized access)'
END as environment_status
FROM model_masking_status
)
SELECT
m.model_name,
m.source_table,
ARRAY_TO_STRING(m.masked_columns, ', ') as protected_fields,
m.masking_purpose,
ARRAY_TO_STRING(m.compliance_standards, ', ') as compliance_met,
e.environment_status,
CURRENT_TIMESTAMP as report_generated_at
FROM model_masking_status m
JOIN environment_coverage e USING (model_name)
๐ Advanced Use Cases
Dynamic Masking Based on User Context
-- macros/context_aware_masking.sql
{% macro mask_by_user_context(column, user_department, user_role) %}
CASE
WHEN {{ user_role }} = 'admin' THEN {{ column }}
WHEN {{ user_department }} = 'finance' AND '{{ column }}'.contains('payment')
THEN {{ column }}
WHEN {{ user_department }} = 'marketing' AND '{{ column }}'.contains('email')
THEN {{ mask_email(column) }}
ELSE '***UNAUTHORIZED***'
END
{% endmacro %}
-- Usage in model
SELECT
customer_id,
{{ mask_by_user_context('email', 'current_user_dept', 'current_user_role') }} as email
FROM customers
Audit Trail for Masked Data Access
-- models/governance/masked_data_access_log.sql
{{ config(
materialized='incremental',
unique_key=['access_timestamp', 'user_id', 'model_accessed']
) }}
SELECT
CURRENT_TIMESTAMP as access_timestamp,
'{{ target.name }}' as environment,
SESSION_USER() as user_id,
'{{ this.identifier }}' as model_accessed,
'masked_data_query' as access_type,
ARRAY['email', 'phone'] as masked_fields_accessed
-- This would be expanded with actual session tracking
๐ก Best Practices Summary
โ Do’s
- Centralize masking logic in reusable macros
- Test masking effectiveness with automated tests
- Document masking rules clearly in schema files
- Use environment-aware masking strategies
- Implement role-based access controls
- Monitor and audit masked data access
โ Don’ts
- Hardcode masking directly in models
- Inconsistent masking patterns across models
- Forget to test masked data quality
- Over-mask non-sensitive data
- Under-mask in development environments
- Skip compliance documentation
๐ฏ Key Benefits Achieved
Consistency: Same masking rules applied everywhere Maintainability: Change masking logic in one place Testability: Validate masking with automated tests Compliance: Meet regulatory requirements systematically Flexibility: Adapt to different environments and roles Documentation: Clear audit trail of data protection measures
This macro-based approach transforms data masking from a scattered manual process into a systematic, governed, and maintainable data protection strategy! ๐ก๏ธ