Classic AI / ML (Data Science)

Jun 1, 2024

Snowflake Cortex AI: ML Forecasting

In the world of data-driven business, accurate forecasting is a critical component of success. Snowflake Cortex, with its powerful machine learning capabilities, empowers you to predict future trends with confidence, be it sales, inventory, or patient visits to providers.


This guide will walk you through the process of building and evaluating sales forecast models using Snowflake Cortex, incorporating both simple and more sophisticated techniques.


Note: In addition to scripting, as demonstrated here, you can also create Cortex forecast models using the intuitive UI in Snowsight. This visual interface prompts you for input and automatically generates the corresponding SQL script, making it accessible for both technical and non-technical users.

A Practical Example: Sales Data Analysis

If you haven't done so already, run and set up the OrdersDB sample data model, then proceed with the rest of the exercises.

⛁ Sample Data Model: ordersdb-data-model

>_ SQL

-- Set the current database and schema
use schema ordersdb.public;

To practice building sales forecast models with Snowflake Cortex, we'll assume you have sales data available in tables named Sales_Order and Sales_Order_Item within your ORDERSDB database.
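
As a quick, optional sanity check that those source tables are in place (the LIKE pattern below simply matches both table names):

>_ SQL

-- Confirm the source tables from the OrdersDB model exist
SHOW TABLES LIKE 'SALES_ORDER%' IN SCHEMA ordersdb.public;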

Preparing and Exploring Your Sales Data

Create a view (daily_sales) to consolidate your sales data. This view will aggregate daily sales totals while keeping track of relevant attributes like shipping mode (SHIP_MODE) and return status (return_flag). The SQL code for this step is:

>_ SQL

CREATE OR REPLACE VIEW daily_sales AS (
    SELECT
        o.order_date AS date, si.SHIP_MODE, si.return_flag,
        ROUND(SUM(si.quantity * (si.extended_price - si.discount))/100000,2) AS total_sales
    FROM Sales_Order o
    JOIN Sales_Order_Item si ON o.order_id = si.order_id
    GROUP BY o.order_date, si.SHIP_MODE, si.return_flag
    ORDER BY 1
);
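
Before moving on, it's worth spot-checking the view; a minimal peek at a few rows (your values will differ):

>_ SQL

-- Spot-check the consolidated daily sales
SELECT * FROM daily_sales LIMIT 5;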

Getting Acquainted with Your Data:

Before diving into forecasting, take a moment to understand your data. Calculate key statistics: the minimum and maximum dates in your dataset, the range of sales values, and the average, median, and standard deviation. The results also show that the actual sales data runs through August 1998; we'll forecast forward from there.


This information provides a valuable baseline for evaluating the accuracy of your forecasts. You can achieve this using a simple SELECT query on your daily_sales view:

>_ SQL

SELECT
    MIN(date) as min_date,
    MAX(date) as max_date,
    MIN(total_sales) AS min_sales,
    MAX(total_sales) AS max_sales,
    AVG(total_sales) AS avg_sales,
    MEDIAN(total_sales) AS median_sales,
    STDDEV(total_sales) AS stddev_sales,
    COUNT(*) AS total_days
FROM
    daily_sales;

Splitting Data and Creating a Basic Forecast

To assess the accuracy of your forecast models, you need to split your data into a training set (used for building the model) and a testing set (used for evaluating the model's performance). A common practice is to use 80% of the data for training and 20% for testing. This step involves calculating a split date and creating views for your training (daily_sales_std_noreturn_train) and testing (daily_sales_std_noreturn_test) data:

>_ SQL

-- Persist the split date by dividing the dates into quintiles and selecting
-- the maximum date from the quintiles representing the first 80% of the data.
-- The training and testing views below read the split date from this split_info table.
CREATE OR REPLACE TABLE split_info AS
SELECT MAX(date) AS split_date
FROM (SELECT date, NTILE(5) OVER (ORDER BY date) AS quintile FROM daily_sales) ranked_dates
WHERE quintile < 5;

CREATE OR REPLACE VIEW daily_sales_std_noreturn_train AS
SELECT to_timestamp_ntz(DATE) as Date, total_sales
FROM daily_sales
WHERE SHIP_MODE = 'MAIL' AND return_flag = 'N' and to_timestamp_ntz(DATE) < (SELECT to_timestamp_ntz(split_date) FROM split_info)
ORDER BY date;  -- Use 80% for training

CREATE OR REPLACE VIEW daily_sales_std_noreturn_test AS
SELECT to_timestamp_ntz(DATE) as Date, total_sales
FROM daily_sales
WHERE SHIP_MODE = 'MAIL' AND return_flag = 'N' and to_timestamp_ntz(date) >= (SELECT to_timestamp_ntz(split_date) FROM split_info)
ORDER BY date;  -- Use 20% for testing

Building Your First Simple Univariate Forecasting Model:

Start by creating a simple forecast model, daily_sales_std_noreturn_model. This model uses no variables beyond the date and the sales values. The process includes defining the model, generating predictions, and storing them in a table (daily_sales_std_noreturn_forecasts). You can then visualize your forecasts against the actual sales data.

First, build a forecast model (daily_sales_std_noreturn_model) using the training data:

>_ SQL

CREATE SNOWFLAKE.ML.FORECAST daily_sales_std_noreturn_model (
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'daily_sales_std_noreturn_train'),
    TIMESTAMP_COLNAME => 'DATE',
    TARGET_COLNAME => 'TOTAL_SALES'
);
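
Training runs as part of the CREATE statement. If you want to confirm the model was registered, Snowflake provides a SHOW command for forecast objects:

>_ SQL

-- List forecast models in the current schema
SHOW SNOWFLAKE.ML.FORECAST;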

Generate predictions and store them in a table (daily_sales_std_noreturn_forecasts):

>_ SQL

CREATE OR REPLACE TABLE daily_sales_std_noreturn_forecasts AS
SELECT * FROM TABLE(daily_sales_std_noreturn_model!FORECAST(
    FORECASTING_PERIODS => 30, 
    CONFIG_OBJECT => {'prediction_interval': 0.8}
));
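
It can help to inspect the raw forecast output before charting; the TS, FORECAST, LOWER_BOUND, and UPPER_BOUND columns below are the same ones the visualization query relies on:

>_ SQL

-- Inspect the first few forecasted rows
SELECT ts, forecast, lower_bound, upper_bound
FROM daily_sales_std_noreturn_forecasts
ORDER BY ts
LIMIT 10;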


Visualizing the Forecast

Combine historical data with predictions to create a visual representation of your forecast, including confidence intervals.


To visualize this model in Snowsight, run the following query to prepare your data for plotting the forecasts against the actual sales data:

>_ SQL

SELECT DATE, TOTAL_SALES AS actual, NULL AS forecast, NULL AS lower_bound, NULL AS upper_bound
    FROM daily_sales_std_noreturn_train
UNION ALL
SELECT ts as DATE, NULL AS actual, forecast, lower_bound, upper_bound
    FROM daily_sales_std_noreturn_forecasts
ORDER BY DATE DESC;

  • Switch to the "Chart" tab in Snowsight.

  • In the Data section, select the ACTUAL column and set Aggregation to "None."

  • Select the DATE (or TS) column and set Bucketing to "None."

  • Add the FORECAST column, choose "Use as Line", and set Aggregation to "None."

  • Add the LOWER_BOUND and UPPER_BOUND columns in the same way.

You should now see a line chart displaying your historical sales data (ACTUAL), the forecasted values (FORECAST), and the upper and lower bounds of the prediction interval. This visualization helps you understand the model's predictions over time.

Incorporating More Factors into the Forecast

Exogenous Variables: To make your forecast more comprehensive, you can incorporate additional attributes like shipping mode and return status. This involves creating new views that include these attributes and training a new model that leverages them. Note that passing the attributes via SERIES_COLNAME, as below, models each ship-mode/return-flag combination as a separate series; exogenous variables proper would be supplied as extra feature columns in the input data.

Create new training (daily_sales_train) and testing (daily_sales_test) views with these variables, then train a model (daily_sales_forecasts_model) that utilizes them:

>_ SQL

CREATE OR REPLACE VIEW daily_sales_train AS
SELECT to_timestamp_ntz(DATE) as DATE, [return_flag, ship_mode] as ship_return, total_sales
FROM daily_sales
WHERE date < (SELECT split_date FROM split_info)
GROUP BY date, ship_return, total_sales
ORDER BY date;

CREATE OR REPLACE VIEW daily_sales_test AS
SELECT to_timestamp_ntz(DATE) as DATE, [return_flag, ship_mode] as ship_return, total_sales
FROM daily_sales
WHERE date >= (SELECT split_date FROM split_info)
GROUP BY date, ship_return, total_sales
ORDER BY date;

CREATE OR REPLACE SNOWFLAKE.ML.FORECAST daily_sales_forecasts_model(
 INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'daily_sales_train'),
 SERIES_COLNAME => 'ship_return',  -- each ship mode / return flag combination becomes its own series
 TIMESTAMP_COLNAME => 'date',
 TARGET_COLNAME => 'total_sales'
);

-- ... (As before, proceed to Generate predictions and store in daily_sales_forecasts table)
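
For completeness, here's a minimal sketch of that elided step, mirroring the earlier single-series call; the daily_sales_forecasts table name comes from the comment above, and the 30-day horizon and 0.8 prediction interval are carried over as assumptions:

>_ SQL

CREATE OR REPLACE TABLE daily_sales_forecasts AS
SELECT * FROM TABLE(daily_sales_forecasts_model!FORECAST(
    FORECASTING_PERIODS => 30,                     -- assumed horizon, as before
    CONFIG_OBJECT => {'prediction_interval': 0.8}  -- assumed interval, as before
));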

Evaluating Model Performance:

Inspection and Analysis: Snowflake Cortex provides convenient functions to inspect the accuracy and internal workings of your model. Use SHOW_EVALUATION_METRICS to get performance metrics, EXPLAIN_FEATURE_IMPORTANCE to understand the impact of different variables, and SHOW_TRAINING_LOGS for detailed diagnostics.

Evaluate both models using Snowflake's built-in functions to gain insights into their accuracy, feature importance, and training logs:

>_ SQL

--Evaluate daily_sales_std_noreturn_model
CALL daily_sales_std_noreturn_model!SHOW_EVALUATION_METRICS();
CALL daily_sales_std_noreturn_model!EXPLAIN_FEATURE_IMPORTANCE();
CALL daily_sales_std_noreturn_model!SHOW_TRAINING_LOGS();

-- Similarly for daily_sales_forecasts_model

This allows you to compare the performance of the simple and enhanced models, understand which features contribute most to the predictions, and diagnose potential issues. 
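
Because these are procedure calls, one handy pattern for comparing models side by side is to capture each result set with RESULT_SCAN immediately after the call. A hedged sketch, assuming the metrics output exposes ERROR_METRIC and METRIC_VALUE columns (adjust to what your version returns):

>_ SQL

-- Pull a single headline metric from the immediately preceding CALL
CALL daily_sales_std_noreturn_model!SHOW_EVALUATION_METRICS();
SELECT *
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE ERROR_METRIC = 'MAPE';  -- assumed column/metric name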

Conclusion

By following these steps, you'll have a solid foundation for utilizing Snowflake Cortex to create accurate sales forecasts that can drive your business decisions. Remember to adapt and refine this process based on the specific characteristics of your data and business requirements.
