Advanced Data Warehousing

May 22, 2024

Snowflake Data Quality Metrics/Functions

In the world of data-driven decision-making, ensuring the accuracy and reliability of your information is crucial. Snowflake Data Metric Functions (DMFs) and Data Quality Metrics (DQMs) provide a robust mechanism to automate the monitoring of data quality, freeing you from the tedious task of manual checks. DMFs are functions, either built-in or user-defined, that evaluate specific aspects of your data, such as completeness, validity, uniqueness, and freshness.

Snowflake offers two types of DMFs:

  • System DMFs: Built-in functions for common data quality checks. For example, NULL_COUNT to detect missing values, UNIQUE_COUNT to measure cardinality (the number of unique values), DUPLICATE_COUNT to identify duplicates, and FRESHNESS to assess how recent your data is.

  • Custom DMFs: Allow you to define tailored validation rules to meet your specific business requirements. You can create DMFs to check email formats, validate data ranges, or enforce complex data integrity constraints.

A Practical Example: Sales Data Quality

Let's explore how to use DMFs in a practical scenario.

⛁ Sample Data Model: salesdb-data-model

Set up the SalesDB sample data model and proceed with the rest of the exercises. This model provides a robust foundation for tracking customers, buyers, clients, and sales opportunities. To demonstrate the power of DMFs, we'll insert both valid and invalid records into our tables.

Data Quality Issues Introduced

Customer Table:

  • Missing FirstName

  • Missing HomeLocation

  • Missing ZipCode

  • Duplicate Email

  • Invalid Email

  • Stale LoadDate

Opportunities Table:

  • Missing CustomerID

  • Missing ExpectedCloseDate

  • Missing Amount

  • Negative Amount

  • Duplicate Opportunity (same OpportunityID)

  • Invalid SalesStage

  • Stale LoadDate (for a "Closed Won" opportunity)

Explanation: The Customer table stores each customer's ID, name, email address, location, and zip code, along with a LoadDate column recording when the row was loaded. LoadDate defaults to the current timestamp on insert, which makes it a natural basis for freshness checks.
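To make this concrete, here's a minimal sketch of the Customer table and a few seed rows that introduce the issues listed above. The column names follow this post's examples; the actual SalesDB setup script may define the table differently.

>_ SQL

-- Minimal sketch (column names assumed from this post; the SalesDB model may differ)
CREATE OR REPLACE TABLE customer (
    customerid   NUMBER,
    firstname    VARCHAR,
    lastname     VARCHAR,
    email        VARCHAR,
    homelocation VARCHAR,
    zipcode      VARCHAR,
    loaddate     TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
);

-- One valid row, plus rows that introduce the issues listed above
INSERT INTO customer (customerid, firstname, lastname, email, homelocation, zipcode)
VALUES
    (1, 'Jane', 'Doe',   'jane.doe@example.com', 'Chicago', '60601'),  -- valid row
    (2, NULL,   'Smith', 'j.smith@example.com',  'Austin',  '73301'),  -- missing FirstName
    (3, 'Ann',  'Lee',   'jane.doe@example.com', 'Denver',  '80201'),  -- duplicate Email
    (4, 'Tom',  'Ray',   'tom.ray#example.com',  NULL,      NULL);     -- invalid Email, missing HomeLocation/ZipCode

-- A stale LoadDate can be simulated with an explicit old timestamp
INSERT INTO customer (customerid, firstname, lastname, email, homelocation, zipcode, loaddate)
VALUES (5, 'Old', 'Row', 'old.row@example.com', 'Boston', '02101', DATEADD(day, -30, CURRENT_TIMESTAMP()));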

Role-Based Access Control (RBAC) for DMF Management

Data quality is a shared responsibility, but different roles within your organization might have varying levels of control over DMFs. Snowflake's RBAC model allows you to fine-tune these permissions.

  • Account Administrators: Typically have full control over creating, modifying, and deleting DMFs. They also manage the schedules for DMF execution.

  • Data Engineers/Analysts: Might be granted privileges to view DMF results and potentially create or modify DMFs within their scope of responsibility.

  • Sales Representatives: Might have read-only access to DMF results to understand data quality issues affecting their territories.

Let's start by setting up the necessary permissions for a data_engineer role to manage DMFs. You can use GRANT and REVOKE statements to control access to DMFs:

>_ SQL

-- Setting up Data Metric Functions - RBAC
USE ROLE ACCOUNTADMIN;
ALTER SESSION SET TIMEZONE = 'America/Chicago'; -- Optional
GRANT DATABASE ROLE SNOWFLAKE.DATA_METRIC_USER TO ROLE data_engineer;
GRANT DATABASE ROLE SNOWFLAKE.USAGE_VIEWER TO ROLE data_engineer;
GRANT EXECUTE DATA METRIC FUNCTION ON ACCOUNT TO ROLE data_engineer;

This allows the data_engineer role to execute and monitor DMFs. (Creating new DMFs additionally requires the CREATE DATA METRIC FUNCTION privilege on the target schema.)

Testing System DMFs

The following statements use Snowflake's built-in Data Metric Functions (DMFs) to directly assess the quality of the data in the Customer table. They count the null, unique, and duplicate values in the specified columns (firstname and email in this case).

>_ SQL

-- Test System DMFs (Direct Execution)
SELECT SNOWFLAKE.CORE.NULL_COUNT(SELECT firstname FROM customer);
SELECT SNOWFLAKE.CORE.UNIQUE_COUNT(SELECT firstname FROM customer);
SELECT SNOWFLAKE.CORE.DUPLICATE_COUNT(SELECT email FROM customer);
SELECT SNOWFLAKE.CORE.FRESHNESS(SELECT loaddate FROM customer) / 60 AS freshness_minutes;

The last statement demonstrates the use of the FRESHNESS system DMF, which measures the age of data. It calculates the time difference in minutes between the current time and the latest timestamp in the LoadDate column. This information can be used to determine the freshness of the data in the table.

Creating Custom DMFs

In addition to the out-of-the-box system DMFs, custom DMFs let you encode validation rules specific to your business: email-format checks, range validation, or more complex data integrity constraints.

Now, let's create several custom DMFs to demonstrate their versatility:

>_ SQL

-- Custom DMF for Data Freshness (in hours)
CREATE OR REPLACE DATA METRIC FUNCTION data_freshness_hour(ARG_T TABLE (ARG_C TIMESTAMP_LTZ))
RETURNS NUMBER AS
'SELECT TIMEDIFF(hour, MAX(ARG_C), SNOWFLAKE.CORE.DATA_METRIC_SCHEDULED_TIME()) FROM ARG_T';

-- Custom DMF to Count Invalid Email Addresses
CREATE OR REPLACE DATA METRIC FUNCTION invalid_email_count(arg_t TABLE(arg_c1 VARCHAR))
RETURNS NUMBER
AS
$$
 SELECT COUNT(*) AS invalid_email_count
  FROM arg_t
 WHERE NOT REGEXP_LIKE(arg_c1, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')
$$;

-- Custom DMF to Count Invalid Sales Stages
CREATE OR REPLACE DATA METRIC FUNCTION invalid_stage_count(arg_t TABLE(arg_c1 VARCHAR))
RETURNS NUMBER
AS
$$
SELECT
  CASE
    WHEN COUNT(*) = 0 THEN NULL -- No rows in the table
    ELSE
      SUM(
        CASE
          WHEN arg_c1 IN ('Prospecting', 'Qualification', 'Proposal', 'Negotiation', 'Closed Won', 'Closed Lost') THEN 0
          ELSE 1
        END
      )
  END
FROM arg_t
$$;

-- Custom DMF to Count Composite Duplicates (e.g., First Name and Last Name)
CREATE OR REPLACE DATA METRIC FUNCTION COMPOSITE_DUPLICATE_COUNT(arg_t TABLE(arg_c1 VARCHAR, arg_c2 VARCHAR))
RETURNS NUMBER
AS
$$
 SELECT COUNT(*)
  FROM (SELECT ARG_C1, ARG_C2, COUNT(*) CNT
      FROM ARG_T
      GROUP BY ALL
      HAVING COUNT(*) > 1)
$$;

The code above defines our custom DMFs. Each one takes a table argument and returns a number for a specific quality check:

  • invalid_email_count: The number of invalid email addresses in the email column.

  • invalid_stage_count: The number of rows with invalid values in the salesstage column.

  • COMPOSITE_DUPLICATE_COUNT: The count of duplicate rows where both firstname and lastname are the same.

  • data_freshness_hour: The freshness of the data, measured as the time difference in hours since the last data load.

Testing Custom DMFs

You can test these custom DMFs by executing queries like:

>_ SQL

-- Test Custom DMFs (Direct Execution)
SELECT invalid_email_count(SELECT email FROM customer);
SELECT invalid_stage_count(SELECT salesstage FROM opportunities);
SELECT COMPOSITE_DUPLICATE_COUNT(SELECT firstname, lastname FROM customer);
SELECT data_freshness_hour(SELECT loaddate FROM customer);

Explanation: Take invalid_email_count as an example. It accepts a table as input (here, the email column, passed as arg_c1) and counts how many values fail to match a regular expression pattern for valid email formats.
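If you want to sanity-check the regular expression itself, you can run it against a few literal values first. A quick sketch (the sample addresses are illustrative, not part of the data model; note the doubled backslash required inside a single-quoted string):

>_ SQL

-- Sanity-check the email regex against literal values
SELECT v AS email,
       REGEXP_LIKE(v, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$') AS is_valid
FROM (VALUES ('jane.doe@example.com'), ('not-an-email'), ('bob@site')) AS t(v);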

Scheduling DMFs

To automate data quality checks, you can schedule DMFs to run periodically or when changes occur in the table. You can use cron expressions or trigger-based scheduling for flexibility.

  • Before attaching a DMF to a table or view, make sure the object has a data metric schedule defined; Snowflake requires the schedule to exist before a DMF can be attached. Here are the various ways to set the schedule:

>_ SQL

-- Example: run every 5 minutes
ALTER TABLE CUSTOMER SET DATA_METRIC_SCHEDULE = '5 MINUTE';

-- Example using cron syntax (same pattern: set the schedule, then associate the data metric function)
ALTER TABLE CUSTOMER
SET DATA_METRIC_SCHEDULE = 'USING CRON 0 12 * * MON,TUE,WED,THU,FRI EST';

-- Example: trigger on table changes
ALTER TABLE CUSTOMER
SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';

Attaching DMFs to Tables and Views

You can associate system and custom DMFs with tables using the ALTER TABLE statement. This allows Snowflake to automatically run these checks based on your schedule:

>_ SQL

-- Add/Attach Systems DMF to Tables (These tables have schedules)
ALTER TABLE CUSTOMER
ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.DUPLICATE_COUNT ON (email);

ALTER TABLE CUSTOMER
ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT ON (email);

ALTER TABLE CUSTOMER
ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.UNIQUE_COUNT ON (email);

ALTER TABLE CUSTOMER
ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.FRESHNESS ON (loaddate);
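The same syntax attaches the custom DMFs created earlier. A sketch, assuming the DMFs live in the current schema and reusing the five-minute schedule for the Opportunities table, which also needs a schedule of its own before anything can be attached:

>_ SQL

-- Add/Attach Custom DMFs to Tables
ALTER TABLE CUSTOMER
ADD DATA METRIC FUNCTION invalid_email_count ON (email);

ALTER TABLE CUSTOMER
ADD DATA METRIC FUNCTION composite_duplicate_count ON (firstname, lastname);

ALTER TABLE CUSTOMER
ADD DATA METRIC FUNCTION data_freshness_hour ON (loaddate);

-- Opportunities needs its own schedule before DMFs can be attached
ALTER TABLE OPPORTUNITIES SET DATA_METRIC_SCHEDULE = '5 MINUTE';
ALTER TABLE OPPORTUNITIES
ADD DATA METRIC FUNCTION invalid_stage_count ON (salesstage);

-- Verify what is attached via the Information Schema table function
SELECT *
FROM TABLE(INFORMATION_SCHEMA.DATA_METRIC_FUNCTION_REFERENCES(
    REF_ENTITY_NAME => 'salesdb.custs.customer', REF_ENTITY_DOMAIN => 'table'));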

Visualizing Data Quality Metrics

Snowflake offers multiple avenues to visualize the results of your DMFs:

  • Data Quality Monitoring UI: Snowflake's built-in interface provides an overview of DMF execution history, results, and trends. This is a convenient way to quickly assess the health of your data.

  • Snowsight Dashboards: Leverage Snowsight, Snowflake's data exploration tool, to create custom dashboards that visually represent DMF results using charts, graphs, and other visualizations.

  • Programmatic Access: You can directly query the DATA_QUALITY_MONITORING_RESULTS view using SQL to extract data for external reporting tools or custom applications.

By combining these visualization techniques, you can create a comprehensive data quality dashboard that keeps stakeholders informed and empowered to take action.

Visualizing Overall Data Quality Summary

Let's craft some SQL queries you can use within your Streamlit app to fetch and visualize the data quality metrics generated by your Snowflake DMFs.

This query aggregates results across all DMFs and tables, providing a high-level overview of data quality issues:

>_ SQL

-- Validate Monitoring Results
SELECT * 
 FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
ORDER BY MEASUREMENT_TIME DESC;

-- Validate Specific Database and Schema
SELECT 
    table_name, 
    metric_name, 
    COUNT(*) AS num_issues,
    SUM(CASE WHEN value > 0 THEN 1 ELSE 0 END) AS num_failures -- metric VALUE is numeric; non-zero means issues found
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE 
    TABLE_DATABASE = 'SALESDB' AND 
    TABLE_SCHEMA = 'CUSTS' 
GROUP BY table_name, metric_name;

Query 2: Detailed View of Failed Records (per Table)

This query drills down into specific tables and DMF checks:

>_ SQL

SELECT * 
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE 
    TABLE_DATABASE = 'SALESDB' AND 
    table_schema = 'CUSTS' AND 
    table_name = 'CUSTOMER' AND -- Replace with the desired table
    metric_name = 'INVALID_EMAIL_COUNT'; -- Replace with the desired DMF

Query 3: Data Quality Trends Over Time

This query aggregates DMF results by date, allowing you to visualize trends and identify patterns: 

>_ SQL

SELECT 
    DATE(measurement_time) AS execution_date,
    table_name, 
    metric_name, 
    SUM(value) AS measure_counts
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE 
    table_database = 'SALESDB' AND 
    table_schema = 'CUSTS' 
GROUP BY execution_date, table_name, metric_name
ORDER BY execution_date;

Streamlit Implementation

Here's a conceptual outline of how you might integrate these queries into a Streamlit app:

>_ Python

# Import python packages
import streamlit as st
from snowflake.snowpark.context import get_active_session
import pandas as pd
import plotly.express as px


# Initialize Snowflake session
session = get_active_session()


# Function to fetch data from Snowflake using the provided query
def fetch_data(session):
    query = """
    SELECT 
        DATE(measurement_time) AS execution_date,
        table_name, 
        metric_name, 
        SUM(value) AS measure_counts
    FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
    WHERE 
        table_database = 'SALESDB' AND 
        table_schema = 'CUSTS' 
    GROUP BY execution_date, table_name, metric_name
    ORDER BY execution_date;
    """
    return session.sql(query).to_pandas()

# Streamlit app
st.title("Snowflake Data Quality Metric Visualization")

with st.sidebar:
    st.subheader("Filter Options")
    tables = st.multiselect("Select Tables", ["Customer", "Opportunities"], default=["Customer", "Opportunities"])

# Fetch data from Snowflake
try:
    df = fetch_data(session)
    st.write("Data fetched successfully.")
except Exception as e:
    st.error(f"Error fetching data: {e}")
    df = pd.DataFrame()  # Ensure df is defined

# Filter data based on selected tables
if not df.empty:
    if 'TABLE_NAME' in df.columns:
        filtered_df = df[df["TABLE_NAME"].isin([t.upper() for t in tables])]  # match Snowflake's upper-case table names

        # Create line chart for overall trend
        fig_trend = px.line(filtered_df, x="EXECUTION_DATE", y="MEASURE_COUNTS", color="TABLE_NAME",
                            title="Data Quality Metric Trend")
        st.plotly_chart(fig_trend)

        # Create bar chart for breakdown by metric
        fig_breakdown = px.bar(filtered_df, x="METRIC_NAME", y="MEASURE_COUNTS", color="TABLE_NAME", barmode="group",
                               title="Data Quality Breakdown by Metric")
        st.plotly_chart(fig_breakdown)
    else:
        st.error("The column 'TABLE_NAME' is not present in the data.")
        st.write("DataFrame columns:", df.columns)
else:
    st.write("No data available for the selected tables.")

Application Output: the app renders a line chart of the overall metric trend and a grouped bar chart breaking results down by metric.

Monitoring DMF Credit Usage

While Snowflake DMFs are a valuable tool for maintaining data quality, it's important to be mindful of their resource consumption, as DMF execution consumes Snowflake credits. Fortunately, Snowflake provides visibility into the credits used by DMFs through the DATA_QUALITY_MONITORING_USAGE_HISTORY view in the ACCOUNT_USAGE schema.

Here's a query to gain insights into your DMF credit usage:

>_ SQL

-- Visibility into credits used by DMF
SELECT 
    DATEDIFF(second, start_time, end_time) AS DURATION, 
    *
FROM SNOWFLAKE.ACCOUNT_USAGE.DATA_QUALITY_MONITORING_USAGE_HISTORY
WHERE START_TIME >= CURRENT_TIMESTAMP - INTERVAL '3 days'
LIMIT 100;

By actively monitoring your DMF credit usage, you can ensure that your data quality efforts remain cost-effective while providing the insights you need to make informed decisions.

Conclusion

By proactively monitoring and addressing data quality issues with Snowflake DMFs, you build a foundation of trust in your sales data. This enables you to make informed decisions, drive revenue growth, and build a successful sales organization.
