dbt: Data Masking in dbt: Reusable Macros Best Practices

๐ŸŽฏ What is Data Masking?

Data masking is the process of obfuscating sensitive data while preserving its format and structure for development, testing, and analytics purposes. In dbt, we implement this through reusable macros that consistently apply masking rules across all models.

๐Ÿšจ Why Use Macros for Data Masking?

โŒ Problems with Hardcoded Masking

-- Bad: Hardcoded masking scattered across models
-- models/customers.sql
SELECT 
  customer_id,
  CONCAT(LEFT(email, 3), '****@****.com') as email,  -- Inconsistent
  customer_name
FROM raw_customers

-- models/orders.sql  
SELECT
  order_id,
  SUBSTRING(email, 1, 2) || '***@***.com' as customer_email,  -- Different logic!
  order_total
FROM raw_orders

Issues:

  • Inconsistent masking patterns
  • Hard to maintain when rules change
  • No central governance
  • Difficult to test masking effectiveness

โœ… Benefits of Macro-Based Masking

  • Consistency: Same masking logic everywhere
  • Maintainability: Change once, update everywhere
  • Testability: Validate masking patterns
  • Documentation: Clear data protection measures
  • Compliance: Easy to adapt to new regulations

๐Ÿ› ๏ธ Complete Data Masking Implementation

Step 1: Core Masking Macros

-- macros/data_masking.sql

{% macro mask_email(email_column) %}
  CASE 
    WHEN {{ email_column }} IS NULL THEN NULL
    WHEN {{ email_column }} = '' THEN ''
    ELSE CONCAT(
      LEFT({{ email_column }}, 2),
      '***@',
      CASE 
        WHEN POSITION('@' IN {{ email_column }}) > 0 
        THEN RIGHT(SPLIT({{ email_column }}, '@')[OFFSET(1)], 3)
        ELSE '***'
      END,
      '.com'
    )
  END
{% endmacro %}

{% macro mask_phone(phone_column) %}
  CASE 
    WHEN {{ phone_column }} IS NULL THEN NULL
    WHEN LENGTH({{ phone_column }}) < 4 THEN '***-***-****'
    ELSE CONCAT(
      '***-***-',
      RIGHT({{ phone_column }}, 4)
    )
  END
{% endmacro %}

{% macro mask_ssn(ssn_column) %}
  CASE 
    WHEN {{ ssn_column }} IS NULL THEN NULL
    WHEN LENGTH({{ ssn_column }}) >= 4 
    THEN CONCAT('***-**-', RIGHT({{ ssn_column }}, 4))
    ELSE '***-**-****'
  END
{% endmacro %}

{% macro mask_credit_card(cc_column) %}
  CASE 
    WHEN {{ cc_column }} IS NULL THEN NULL
    WHEN LENGTH({{ cc_column }}) >= 4
    THEN CONCAT('****-****-****-', RIGHT({{ cc_column }}, 4))
    ELSE '****-****-****-****'
  END
{% endmacro %}

{% macro mask_name(name_column, mask_type='partial') %}
  CASE 
    WHEN {{ name_column }} IS NULL THEN NULL
    WHEN '{{ mask_type }}' = 'full' THEN '***MASKED***'
    WHEN '{{ mask_type }}' = 'first_letter' 
    THEN CONCAT(LEFT({{ name_column }}, 1), REPEAT('*', LENGTH({{ name_column }}) - 1))
    ELSE -- partial (default)
      CASE 
        WHEN LENGTH({{ name_column }}) <= 2 THEN '**'
        ELSE CONCAT(LEFT({{ name_column }}, 2), REPEAT('*', LENGTH({{ name_column }}) - 2))
      END
  END
{% endmacro %}

{% macro mask_address(address_column) %}
  CASE 
    WHEN {{ address_column }} IS NULL THEN NULL
    ELSE CONCAT('*** ', 'MASKED ADDRESS ', '***')
  END
{% endmacro %}

Step 2: Advanced Masking with Business Logic

-- macros/advanced_masking.sql

{% macro conditional_masking(column, condition, mask_value='***REDACTED***') %}
  CASE 
    WHEN {{ condition }} THEN '{{ mask_value }}'
    ELSE {{ column }}
  END
{% endmacro %}

{% macro role_based_masking(column, user_role_column, allowed_roles=[]) %}
  {% set roles_list = allowed_roles | join("','") %}
  CASE 
    WHEN {{ user_role_column }} IN ('{{ roles_list }}') THEN {{ column }}
    ELSE '***UNAUTHORIZED***'
  END
{% endmacro %}

{% macro environment_based_masking(column, mask_in_dev=true) %}
  {% if target.name in ['dev', 'development'] and mask_in_dev %}
    {{ mask_email(column) if 'email' in column.lower() 
       else mask_phone(column) if 'phone' in column.lower()
       else mask_name(column) }}
  {% else %}
    {{ column }}
  {% endif %}
{% endmacro %}

{% macro dynamic_masking(column, data_type, sensitivity_level='medium') %}
  {% if sensitivity_level == 'high' %}
    '***HIGHLY_SENSITIVE***'
  {% elif sensitivity_level == 'medium' %}
    {% if data_type == 'email' %}
      {{ mask_email(column) }}
    {% elif data_type == 'phone' %}  
      {{ mask_phone(column) }}
    {% elif data_type == 'name' %}
      {{ mask_name(column, 'partial') }}
    {% else %}
      CONCAT('***', RIGHT({{ column }}, 2))
    {% endif %}
  {% else %}
    {{ column }}  -- Low sensitivity, no masking
  {% endif %}
{% endmacro %}

Step 3: Masking Configuration System

# dbt_project.yml
vars:
  data_masking:
    enabled: true
    environment_rules:
      dev: 'mask_all'
      staging: 'mask_pii_only'  
      prod: 'no_masking'
    
    sensitivity_levels:
      high:
        - 'ssn'
        - 'credit_card'
        - 'bank_account'
      medium:
        - 'email'
        - 'phone'
        - 'full_name'
        - 'address'
      low:
        - 'city'
        - 'state'
        - 'zip_code'
    
    compliance_standards:
      gdpr_enabled: true
      hipaa_enabled: false
      pci_dss_enabled: true

Step 4: Smart Masking Macro

-- macros/smart_masking.sql
{% macro apply_smart_masking(column_name, data_type=none, sensitivity=none) %}
  
  {% set masking_config = var('data_masking', {}) %}
  {% set environment = target.name %}
  {% set masking_enabled = masking_config.get('enabled', false) %}
  
  {% if not masking_enabled %}
    {{ return(column_name) }}
  {% endif %}
  
  {% set env_rule = masking_config.get('environment_rules', {}).get(environment, 'no_masking') %}
  
  {% if env_rule == 'no_masking' %}
    {{ return(column_name) }}
  {% endif %}
  
  -- Auto-detect data type from column name if not provided
  {% if not data_type %}
    {% if 'email' in column_name.lower() %}
      {% set data_type = 'email' %}
    {% elif 'phone' in column_name.lower() %}
      {% set data_type = 'phone' %}
    {% elif 'ssn' in column_name.lower() %}
      {% set data_type = 'ssn' %}
    {% elif 'credit_card' in column_name.lower() or 'cc_' in column_name.lower() %}
      {% set data_type = 'credit_card' %}
    {% elif 'name' in column_name.lower() %}
      {% set data_type = 'name' %}
    {% elif 'address' in column_name.lower() %}
      {% set data_type = 'address' %}
    {% else %}
      {% set data_type = 'generic' %}
    {% endif %}
  {% endif %}
  
  -- Auto-detect sensitivity level
  {% if not sensitivity %}
    {% set high_sensitivity = masking_config.get('sensitivity_levels', {}).get('high', []) %}
    {% set medium_sensitivity = masking_config.get('sensitivity_levels', {}).get('medium', []) %}
    
    {% if data_type in high_sensitivity %}
      {% set sensitivity = 'high' %}
    {% elif data_type in medium_sensitivity %}
      {% set sensitivity = 'medium' %}  
    {% else %}
      {% set sensitivity = 'low' %}
    {% endif %}
  {% endif %}
  
  -- Apply masking based on environment rule and sensitivity
  {% if env_rule == 'mask_all' or (env_rule == 'mask_pii_only' and sensitivity in ['high', 'medium']) %}
    {% if data_type == 'email' %}
      {{ mask_email(column_name) }}
    {% elif data_type == 'phone' %}
      {{ mask_phone(column_name) }}
    {% elif data_type == 'ssn' %}
      {{ mask_ssn(column_name) }}
    {% elif data_type == 'credit_card' %}
      {{ mask_credit_card(column_name) }}
    {% elif data_type == 'name' %}
      {{ mask_name(column_name) }}
    {% elif data_type == 'address' %}
      {{ mask_address(column_name) }}
    {% else %}
      CONCAT('***', RIGHT({{ column_name }}, 2))
    {% endif %}
  {% else %}
    {{ column_name }}
  {% endif %}
  
{% endmacro %}

๐Ÿ“Š Practical Implementation Examples

Example 1: Customer Data Model

-- models/marts/customers_masked.sql
{{ config(
  materialized='view',
  meta={
    'data_classification': 'pii_sensitive',
    'masking_applied': true
  }
) }}

SELECT
  customer_id,
  
  -- โœ… Consistent masking using macros
  {{ apply_smart_masking('email') }} as email,
  {{ apply_smart_masking('phone_number') }} as phone_number,
  {{ apply_smart_masking('first_name') }} as first_name,
  {{ apply_smart_masking('last_name') }} as last_name,
  {{ apply_smart_masking('address_line_1') }} as address_line_1,
  
  -- Non-sensitive data remains unchanged
  city,
  state,
  country,
  customer_segment,
  created_date,
  last_login_date

FROM {{ ref('stg_customers') }}

Example 2: Orders with Conditional Masking

-- models/marts/orders_with_privacy.sql
{{ config(materialized='table') }}

SELECT
  order_id,
  customer_id,
  
  -- Conditional masking based on order status
  {{ conditional_masking(
    'customer_email', 
    'order_status = \'refunded\'',
    '***REFUNDED_CUSTOMER***'
  ) }} as customer_email,
  
  -- Role-based masking for payment info
  {{ role_based_masking(
    'payment_method',
    'current_user_role', 
    ['admin', 'finance_manager']
  ) }} as payment_method,
  
  order_date,
  order_total,
  order_status

FROM {{ ref('int_orders_enriched') }}

Example 3: Multi-Environment Masking

-- models/staging/stg_employees.sql
{{ config(
  materialized='view',
  meta={
    'contains_pii': true,
    'masking_strategy': 'environment_based'
  }
) }}

SELECT
  employee_id,
  
  -- Environment-aware masking
  {% if target.name in ['dev', 'test'] %}
    {{ mask_email('work_email') }} as work_email,
    {{ mask_phone('work_phone') }} as work_phone,
    {{ mask_name('full_name', 'first_letter') }} as full_name,
    {{ mask_ssn('ssn') }} as ssn,
  {% else %}
    work_email,
    work_phone,
    full_name,
    ssn,
  {% endif %}
  
  department,
  job_title,
  hire_date,
  salary_band  -- Keep salary band, not exact salary

FROM {{ source('hr_system', 'employees') }}

๐Ÿงช Testing Masked Data

Masking Validation Tests

-- tests/test_masking_effectiveness.sql
{% set masked_models = ['customers_masked', 'orders_with_privacy'] %}

{% for model in masked_models %}
  
  -- Test: No real emails should exist in masked data
  WITH email_check AS (
    SELECT email
    FROM {{ ref(model) }}  
    WHERE email LIKE '%@gmail.com' 
       OR email LIKE '%@yahoo.com'
       OR email LIKE '%@hotmail.com'
  )
  SELECT 
    '{{ model }}' as model_name,
    'Real emails found in masked data' as issue,
    COUNT(*) as violation_count
  FROM email_check
  HAVING COUNT(*) > 0
  
  {% if not loop.last %}UNION ALL{% endif %}
{% endfor %}

-- Test: Phone numbers should be masked
UNION ALL

SELECT
  'customers_masked' as model_name,
  'Unmasked phone numbers detected' as issue,
  COUNT(*) as violation_count
FROM {{ ref('customers_masked') }}
WHERE phone_number NOT LIKE '***-***-%'
  AND phone_number IS NOT NULL
HAVING COUNT(*) > 0

Data Quality Tests for Masked Fields

-- tests/test_masked_data_quality.sql

-- Test: Masked emails still follow expected format
SELECT 
  'Invalid masked email format' as test_name,
  COUNT(*) as failures
FROM {{ ref('customers_masked') }}
WHERE email IS NOT NULL
  AND email != ''  
  AND email NOT REGEXP r'^[a-zA-Z0-9]{1,2}\*{3}@[a-zA-Z0-9]+\.com$'
HAVING COUNT(*) > 0

UNION ALL

-- Test: Masked phone numbers preserve length information
SELECT
  'Masked phone format inconsistent' as test_name,
  COUNT(*) as failures  
FROM {{ ref('customers_masked') }}
WHERE phone_number IS NOT NULL
  AND phone_number NOT LIKE '***-***-____'
HAVING COUNT(*) > 0

UNION ALL

-- Test: All high-sensitivity fields are masked in dev
SELECT
  'High sensitivity data not masked in dev' as test_name,
  COUNT(*) as failures
FROM {{ ref('stg_employees') }}
WHERE '{{ target.name }}' = 'dev'
  AND (ssn NOT LIKE '***-**-____' OR ssn IS NULL)
HAVING COUNT(*) > 0

๐Ÿ“‹ Masking Documentation and Governance

Data Catalog with Masking Info

# models/schema.yml
version: 2

models:
  - name: customers_masked
    description: |
      Customer data with PII masking applied for non-production environments.
      
      **Masking Rules Applied:**
      - Email: Shows first 2 chars + ***@***.com
      - Phone: Shows last 4 digits as ***-***-XXXX  
      - Names: Shows first 2 chars + asterisks
      - Address: Completely masked
      
      **Environment Behavior:**
      - Dev/Test: All PII fields masked
      - Staging: Medium/High sensitivity masked
      - Production: No masking (full data access)
      
    meta:
      data_classification: 'pii_sensitive'
      masking_applied: true
      compliance_standards: ['GDPR', 'CCPA']
      
    columns:
      - name: customer_id
        description: "Unique customer identifier (never masked)"
        
      - name: email  
        description: "Customer email address"
        meta:
          masking_rule: "mask_email"
          sensitivity_level: "medium"
          original_format_preserved: true
          
      - name: phone_number
        description: "Customer phone number" 
        meta:
          masking_rule: "mask_phone"
          sensitivity_level: "medium"
          last_4_digits_visible: true

Compliance Reporting

-- models/governance/masking_compliance_report.sql
{{ config(materialized='table') }}

WITH model_masking_status AS (
  SELECT
    'customers_masked' as model_name,
    'customers' as source_table,
    ['email', 'phone_number', 'first_name', 'last_name', 'address_line_1'] as masked_columns,
    'PII protection for customer data' as masking_purpose,
    ['GDPR', 'CCPA'] as compliance_standards
    
  UNION ALL
  
  SELECT  
    'orders_with_privacy' as model_name,
    'orders' as source_table,
    ['customer_email', 'payment_method'] as masked_columns,
    'Financial data protection' as masking_purpose,
    ['PCI_DSS'] as compliance_standards
),

environment_coverage AS (
  SELECT
    model_name,
    CASE '{{ target.name }}'
      WHEN 'dev' THEN 'Full masking active'
      WHEN 'staging' THEN 'Partial masking active'  
      WHEN 'prod' THEN 'No masking (authorized access)'
    END as environment_status
  FROM model_masking_status
)

SELECT
  m.model_name,
  m.source_table,
  ARRAY_TO_STRING(m.masked_columns, ', ') as protected_fields,
  m.masking_purpose,
  ARRAY_TO_STRING(m.compliance_standards, ', ') as compliance_met,
  e.environment_status,
  CURRENT_TIMESTAMP as report_generated_at

FROM model_masking_status m
JOIN environment_coverage e USING (model_name)

๐Ÿš€ Advanced Use Cases

Dynamic Masking Based on User Context

-- macros/context_aware_masking.sql
{% macro mask_by_user_context(column, user_department, user_role) %}
  CASE 
    WHEN {{ user_role }} = 'admin' THEN {{ column }}
    WHEN {{ user_department }} = 'finance' AND '{{ column }}'.contains('payment') 
    THEN {{ column }}
    WHEN {{ user_department }} = 'marketing' AND '{{ column }}'.contains('email')
    THEN {{ mask_email(column) }}
    ELSE '***UNAUTHORIZED***'
  END
{% endmacro %}

-- Usage in model
SELECT 
  customer_id,
  {{ mask_by_user_context('email', 'current_user_dept', 'current_user_role') }} as email
FROM customers

Audit Trail for Masked Data Access

-- models/governance/masked_data_access_log.sql
{{ config(
  materialized='incremental',
  unique_key=['access_timestamp', 'user_id', 'model_accessed']
) }}

SELECT
  CURRENT_TIMESTAMP as access_timestamp,
  '{{ target.name }}' as environment,
  SESSION_USER() as user_id,
  '{{ this.identifier }}' as model_accessed,
  'masked_data_query' as access_type,
  ARRAY['email', 'phone'] as masked_fields_accessed

-- This would be expanded with actual session tracking

๐Ÿ’ก Best Practices Summary

โœ… Do’s

  1. Centralize masking logic in reusable macros
  2. Test masking effectiveness with automated tests
  3. Document masking rules clearly in schema files
  4. Use environment-aware masking strategies
  5. Implement role-based access controls
  6. Monitor and audit masked data access

โŒ Don’ts

  1. Hardcode masking directly in models
  2. Inconsistent masking patterns across models
  3. Forget to test masked data quality
  4. Over-mask non-sensitive data
  5. Under-mask in development environments
  6. Skip compliance documentation

๐ŸŽฏ Key Benefits Achieved

Consistency: Same masking rules applied everywhere Maintainability: Change masking logic in one place Testability: Validate masking with automated tests Compliance: Meet regulatory requirements systematically Flexibility: Adapt to different environments and roles Documentation: Clear audit trail of data protection measures

This macro-based approach transforms data masking from a scattered manual process into a systematic, governed, and maintainable data protection strategy! ๐Ÿ›ก๏ธ

Leave a Reply

Your email address will not be published. Required fields are marked *