Advanced Data Warehousing
May 22, 2024
Snowflake Data Quality Metrics/Functions
In the world of data-driven decision-making, ensuring the accuracy and reliability of your information is crucial. Snowflake Data Metric Functions (DMFs) and Data Quality Metrics (DQMs) provide a robust mechanism to automate the monitoring of data quality, freeing you from the tedious task of manual checks. DMFs are user-defined functions that evaluate specific aspects of your data, like completeness, validity, uniqueness, and freshness.
Snowflake offers two types of DMFs:
System DMFs: Built-in functions for common data quality checks. For example, NULL_COUNT to detect missing values, UNIQUE_COUNT to measure cardinality (the number of unique values), DUPLICATE_COUNT to identify duplicates, and FRESHNESS to assess how recent your data is.
Custom DMFs: Allow you to define tailored validation rules to meet your specific business requirements. You can create DMFs to check email formats, validate data ranges, or enforce complex data integrity constraints.
A Practical Example: Sales Data Quality
Let's explore how to use DMFs in a practical scenario.
⛁ Sample Data Model: salesdb-data-model
Setup the SalesDB sample data model and proceed with the rest of the exercises. This model provides a robust foundation for tracking customers, buyers, clients, and sales opportunities. To demonstrate the power of DMFs, we'll insert both valid and invalid records into our tables.
Data Quality Issues Introduced
Customer Table:
Missing FirstName
Missing HomeLocation
Missing ZipCode
Duplicate Email
Invalid Email
Stale LoadDate
Opportunities Table:
Missing CustomerID
Missing ExpectedCloseDate
Missing Amount
Negative Amount
Duplicate Opportunity (same OpportunityID)
Invalid SalesStage
Stale LoadDate (for a "Closed Won" opportunity)
Explanation: The customer table is a relational database table that stores information about customers, including their ID, name, email address, location, zip code, and the date and time when their data was loaded into the table. The LoadDate field has a default value that is set to the current timestamp when a new row is inserted. This can be used to track the freshness of the data.
Role-Based Access Control (RBAC) for DMF Management
Data quality is a shared responsibility, but different roles within your organization might have varying levels of control over DMFs. Snowflake's RBAC model allows you to fine-tune these permissions.
Account Administrators: Typically have full control over creating, modifying, and deleting DMFs. They also manage the schedules for DMF execution.
Data Engineers/Analysts: Might be granted privileges to view DMF results and potentially create or modify DMFs within their scope of responsibility.
Sales Representatives: Might have read-only access to DMF results to understand data quality issues affecting their territories.
Let's start by setting up the necessary permissions for a data_engineer role to manage DMFs. You can use GRANT and REVOKE statements to control access to DMFs:
>_ SQL
This allows the data_engineer to create, execute, and monitor DMFs.
Testing System DMFs
This part of the code uses Snowflake's built-in Data Metric Functions (DMFs) to directly assess the quality of the data in the Customer table. These DMFs count the number of null values, unique values, and duplicate values in the specified columns (firstname and email in this case).
>_ SQL
The last statement demonstrates the use of the FRESHNESS system DMF, which measures the age of data. It calculates the time difference in minutes between the current time and the latest timestamp in the LoadDate column. This information can be used to determine the freshness of the data in the table.
Creating Custom DMFs
In addition to out of the box System DMFs, custom DMFs allow you to define tailored validation rules to meet your specific business requirements. You can create DMFs to check email formats, validate data ranges, or enforce complex data integrity constraints.
Now, let's create several custom DMFs to demonstrate their versatility:
>_ SQL
The code above tests the functionality of the custom DMFs created earlier. Each line selects the result of calling a specific DMF on a column from the table. The functions return different values based on the specific test:
invalid_email_count: The number of invalid email addresses in the email column.
invalid_stage_count: The number of rows with invalid values in the salesstage column.
COMPOSITE_DUPLICATE_COUNT: The count of duplicate rows where both firstname and lastname are the same.
data_freshness_hour: The freshness of the data, measured as the time difference in minutes since the last data load.
Testing Custom DMFs
You can test these custom DMFs by executing queries like:
>_SQL
Explanation: Using the user-defined data metric function (DMF) called invalid_email_count as an example, it takes a table as input (specifically, a table containing email addresses in the arg_c1 column). The function then counts how many email addresses in the table are invalid by checking if they match a regular expression pattern for valid email formats.
Scheduling DMFs
To automate data quality checks, you can schedule DMFs to run periodically or when changes occur in the table. You can use cron expressions or trigger-based scheduling for flexibility.
Before attaching DMF to a table or view, ensure that the object (table/view) has a schedule defined for it. Snowflake requires a schedule to be attached to the object before DMF is attached. Here are the various ways to set the schedule:
>_ SQL
Attaching DMFs to Tables and Views
You can associate system and custom DMFs with tables using the ALTER TABLE statement. This allows Snowflake to automatically run these checks based on your schedule:
>_ SQL
Visualizing Data Quality Metrics
Snowflake offers multiple avenues to visualize the results of your DMFs:
Data Quality Monitoring UI: Snowflake's built-in interface provides an overview of DMF execution history, results, and trends. This is a convenient way to quickly assess the health of your data.
Snowsight Dashboards: Leverage Snowsight, Snowflake's data exploration tool, to create custom dashboards that visually represent DMF results using charts, graphs, and other visualizations.
Programmatic Access: You can directly query the DATA_QUALITY_MONITORING_RESULTS view using SQL to extract data for external reporting tools or custom applications.
By combining these visualization techniques, you can create a comprehensive data quality dashboard that keeps stakeholders informed and empowered to take action.
Visualizing Overall Data Quality Summary
Let's craft some SQL queries you can use within your Streamlit app to fetch and visualize the data quality metrics generated by your Snowflake DMFs.
This query aggregates results across all DMFs and tables, providing a high-level overview of data quality issues:
>_ SQL
Query 2: Detailed View of Failed Records (per Table)
This query drills down into specific tables and DMFs checks:
>_ SQL
Query 3: Data Quality Trends Over Time
This query aggregates DMF results by date, allowing you to visualize trends and identify patterns:
>_ SQL
Streamlit Implementation
Here's a conceptual outline of how you might integrate these queries into a Streamlit app:
>_ Python
Application Output
Monitoring DMF Credit Usage
While Snowflake DMFs are a valuable tool for maintaining data quality, it's important to be mindful of their resource consumption, as DMF execution consumes Snowflake credits. Fortunately, Snowflake provides visibility into the credits used by DMFs through the DATA_QUALITY_MONITORING_USAGE_HISTORY view in the ACCOUNT_USAGE schema.
Here's a query to gain insights into your DMF credit usage:
>_ SQL
By actively monitoring your DMF credit usage, you can ensure that your data quality efforts remain cost-effective while providing the insights you need to make informed decisions.
Conclusion
By proactively monitoring and addressing data quality issues with Snowflake DMFs, you build a foundation of trust in your sales data. This enables you to make informed decisions, drive revenue growth, and build a successful sales organization.