Advanced Data Warehousing
May 23, 2024
Snowflake Higher Order Functions (HoF)
Higher-order functions are functions that can take other functions as arguments or return functions as results. In the context of data analysis, higher-order functions are crucial as they enable a more abstract and expressive way to manipulate and process data. They allow for the creation of reusable and composable code, facilitating tasks such as filtering, mapping, reducing, and transforming datasets with concise and readable syntax. By utilizing higher-order functions, data analysts can write code that is more modular, easier to understand, and maintainable, ultimately leading to more efficient and effective data analysis workflows.
For example, when dealing with JSON data, FILTER can be used to extract or filter out specific JSON objects or arrays based on criteria defined within the JSON structure. Similarly, when data is stored in Snowflake arrays, TRANSFORM allows for selective inclusion or exclusion of array elements based on conditions, providing a streamlined approach to data processing.
Benefits of Higher Order Functions over Lateral Flatten
Readability: HOFs offer more concise and expressive syntax than LATERAL FLATTEN. They let you keep the focus on the transformation logic, avoiding the nested structure of LATERAL FLATTEN.
Performance: HOFs can often execute faster than LATERAL FLATTEN, especially with complex transformations. They operate directly on the array data without the need to create a temporary table.
Maintenance: HOFs lead to more maintainable code. The lambda expressions within HOFs clearly define the operation to be performed, making it easier to understand and modify the logic.
Quick Example:
A Practical Example: Sales Data Analysis
Let's explore how to use Snowflake Higher Order Functions (HoF) in a practical scenario.
⛁ Sample Data Model: salesdb-data-model
Setup the SalesDB sample data model and proceed with the rest of the exercises. This model provides a robust foundation for tracking customers, buyers, clients, and sales opportunities.
>_ SQL
Create and load records into the customer table.
>_ SQL
Let's delve into how Snowflake HoFs can supercharge your data workflow:
Task: You need to calculate the total revenue generated by each customer across all their orders and only show customers with revenue GREATER THAN 15000.
Without Higher Order Functions (HOF)
Verify that the records have been loaded into the table. Notice the last column is a Snowflake variant data type holding the Json payload.
>_ SQL
Let’s dial in and look at one specific record, where Customer ID is 1
>_ SQL
Notice one sample JSON payload has multiple records. This is an iterable we need to loop through to calculate the Total Revenue (i.e. Price * Quantity), and then FILTER where the value doesn’t meet the predicate (i.e. greater than 15000).
First we Flatten the records. To go from one record in a column, into multiple rows.
>_ SQL
Pull out the values of interest from the ARRAY record
>_ SQL
Calculate the total_revenue with basic math.
>_ SQL
FILTER records that don’t meet the predicate of GREATER THAN 15000
>_ SQL
Return the result to show the entire ARRAY of records that weren’t filtered out. Just for readability and formatting purposes.
>_ SQL
This is one way to achieve the task using the existing FLATTEN function. Next we would explore how to do that using the newer Higher Order Functions (HoF) in Snowflake.
With Higher Order Functions
For this task, we would use HoF to identify high-value customers based on the total purchase amount.
First, Verify the records in the variant column as JSON
>_ SQL
Using FILTER HoF to go through the iterable (i.e. order_detail) and apply a function to it that calculates price * quantity and checks the predicate. This only returns records that meet the check criteria (15000) in this case. Records that don’t meet that criteria are filtered or simply ignored in the result set.
>_ SQL
Notice, Null arrays are returned. Filter those out by adding a predicate in the where clause to check for array size.
>_ SQL
Comparison:
The HOF version is more concise and focuses on the transformation (UPPER) directly. It's also more flexible, as you can easily modify the lambda expression to perform other operations.
When to Use Each Approach:
HOFs: Preferred for simple transformations or filtering of array elements.
Lateral Flatten: Useful when you need to join the flattened data with other tables or perform complex aggregations that require a relational structure.
Additional Considerations:
Nested Arrays: HOFs are primarily designed for simple arrays. Lateral Flatten is more suitable for handling nested arrays.
Feature Support: Snowflake continues to enhance its HOF capabilities, so check the latest documentation for the most up-to-date functionality.
Resources
Snowflake Documentation: Querying Semi-structured Data | Snowflake Documentation