Gen AI/LLMs

Jun 1, 2024

Snowflake Document AI (DocAI)

In today's data-driven landscape, a staggering 80% of enterprise data is unstructured, residing in formats like PDFs, images, and emails. This vast reservoir of untapped information often sits idle, as traditional data processing tools struggle to extract meaningful insights from its unorganized nature.

From invoices and purchase orders in the retail sector to patient records in healthcare and equipment logs in manufacturing, organizations across industries grapple with the challenge of extracting value from unstructured data. The inability to easily search, analyze, and integrate this data with structured information creates a significant bottleneck, hindering decision-making and stifling innovation.


Consider, for example, the invoice depicted above. It contains critical information such as customer details, order items, and financial figures. However, without specialized tools, extracting this data would require time-consuming manual efforts, prone to errors and delays. This is a challenge faced by countless businesses across various sectors.

Setting the Stage

Before we dive into Document AI, let's set up our environment:

>_ SQL

CREATE OR

We'll work within a dedicated database called invoicesdb to keep things organized.

Access Control for Document AI

To enable Document AI features, we need to grant the necessary permissions to the role or user that will be working with it:

>_ SQL

USE ROLE accountadmin;  

CREATE OR REPLACE ROLE doc_ai_role;
GRANT DATABASE ROLE SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR TO ROLE doc_ai_role;

-- Grant permissions for the invoicesdb database and warehouse access
GRANT USAGE ON DATABASE invoicesdb TO ROLE doc_ai_role;
GRANT USAGE ON SCHEMA invoicesdb.public TO ROLE doc_ai_role;
GRANT USAGE, OPERATE ON WAREHOUSE demo_wh TO ROLE doc_ai_role;

-- Grant permissions for stages and models
GRANT CREATE STAGE ON SCHEMA invoicesdb.public TO ROLE doc_ai_role;
GRANT CREATE SNOWFLAKE.ML.DOCUMENT_INTELLIGENCE ON SCHEMA invoicesdb.public TO ROLE doc_ai_role;

-- Grant permissions for streams, tables, tasks, and views
GRANT CREATE STREAM, CREATE TABLE, CREATE TASK, CREATE VIEW ON SCHEMA invoicesdb.public TO ROLE doc_ai_role;
GRANT EXECUTE TASK ON ACCOUNT TO ROLE doc_ai_role;

GRANT ROLE doc_ai_role TO USER demo_user; -- Replace 'demo_user' with your Snowflake username

Now, the user demo_user (or your username) can harness the full power of Document AI.

Storing and Preparing Invoice Data

For this demonstration, we'll be using a set of sample invoices obtained from the GitHub repository https://github.com/femstac/Sample-Pdf-invoices. These invoices vary in layout and format, allowing us to explore how Snowflake Document AI can adapt to different document structures. (downloaded: Feb 7, 2024)



>_ SQL

-- Create External Stage
CREATE OR REPLACE STAGE DEMOHUB_S3_STG
   URL = 's3://demohubpublic/data/invoices/'
   DIRECTORY = (ENABLE = true)
   COMMENT = 'DemoHub S3 datasets';

-- Create a Snowflake stage (optional, but recommended for security)
CREATE OR REPLACE STAGE INVOICES_STG
   DIRECTORY = (ENABLE = true)
   ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE'); -- Server-Side Encryption for added security

-- Copy data from the S3 stage to the Snowflake stage
COPY FILES INTO @INVOICES_STG
FROM @DEMOHUB_S3_STG;

-- Refresh the Snowflake stage to ensure visibility
ALTER

  • This will create the external stage where your raw data will reside, and the internal stage where you will stage the data for processing.

Training Your Customer AI Models in Snowsight

While Snowflake Document AI offers a powerful pre-trained model, fine-tuning it to your specific document types and extraction needs is crucial for optimal performance. Snowsight, Snowflake's intuitive web interface, provides a streamlined way to train, version and publish your models to look the one illustrated below.


Steps for Training:



  1. Access Document AI: In Snowsight, navigate to AI & ML > Document AI.

  2. Select Your Model: Choose the model you want to train from the list of existing builds.

  3. Upload Documents: Upload a set of representative documents for training. Ensure these documents are labeled with the correct information you want to extract.

  4. Label Fields (Optional): If your model requires more specific training, manually label the relevant fields in the documents within Snowsight. This helps the model learn the patterns and context for accurate extraction.

  5. Start Training: Click the "Train Model" button. Snowflake will process your labeled documents and fine-tune the model's parameters.

  6. Monitor Progress: Monitor the training progress in Snowsight. This typically involves evaluating the model's performance on a separate set of validation documents.

  7. Iterate (Optional): If the model's accuracy isn't satisfactory, you can iterate on the training process by adding more labeled documents, refining labels, or adjusting model parameters.

Prompt Design for Document AI

Defining the right prompts is crucial for accurate and effective data extraction. You can use either:

  • Direct Prompts: Concise questions like "What is the bill to name?" or "What's the Invoice Number?"

  • Elaborate Prompts: More natural language-oriented questions like "Who is the bill being sent to?" or "What is the reference number for this invoice?"


Here's a table summarizing the prompts you can use to extract data from invoices:




Testing Document AI Predictions

With the data in place, you can test Document AI's predictions on a few sample invoices:

>_ SQL

SELECT INVOICESDB.PUBLIC.DAI!PREDICT(
 GET_PRESIGNED_URL(@INVOICES_STG, RELATIVE_PATH), 1)
FROM DIRECTORY(@INVOICES_STG) limit 2

  • This is just a test of the out of the box model. This query calls the pre-trained model (INVOICESDB.PUBLIC.DAI) on the first two invoices in your stage, returning the extracted fields in JSON format.

Storing and Processing Predictions

SQL

-- Create a table to store predictions
CREATE OR REPLACE TABLE invoices_fields
   AS
SELECT *, INVOICESDB.PUBLIC.DAI!PREDICT(
 GET_PRESIGNED_URL(@INVOICES_STG, RELATIVE_PATH), 3) as predictions
FROM DIRECTORY(@INVOICES_STG) limit 10;

-- Flatten the JSON results into a usable table
create or replace table invoices as
SELECT
    Predictions:ORDER_ID[0]:value::STRING AS ORDER_ID,
    Predictions:BILL_TO[0]:value::STRING AS BILL_TO,
    Predictions:SHIP_TO[0]:value::STRING AS SHIP_TO,
    Predictions:DATE[0]:value::STRING AS DATE,
    Predictions:INVOICE_NUMBER[0]:value::STRING AS INVOICE_NUMBER,
    Predictions:ITEM_NAME[0]:value::STRING AS ITEM_NAME,
    Predictions:QUANTITY[0]:value::STRING AS QUANTITY,
    Predictions:ITEM_RATE[0]:value::STRING AS ITEM_RATE,
    Predictions:SUBTOTAL[0]:value::STRING AS SUBTOTAL,
    Predictions:DISCOUNT_AMT[0]:value::STRING AS DISCOUNT_AMT,
    Predictions:ITEM_AMOUNT[0]:value::STRING AS ITEM_AMOUNT,
    Predictions:SHIPPING_AMT[0]:value::STRING AS SHIPPING_AMT,
    Predictions:TOTAL_AMOUNT[0]:value::STRING AS TOTAL_AMOUNT,
    Predictions:BALANCE_DUE[0]:value::STRING AS BALANCE_DUE,
    Predictions:SHIP_MODE[0]:value::STRING AS SHIP_MODE,
    Predictions:NOTES[0]:value::STRING AS NOTES,
    RELATIVE_PATH,
    FILE_URL
FROM

We first create a table to hold the raw predictions, then use another query to extract the key information (order ID, billing address, etc.) into a usable tabular format.

Creating Views for Analysis

Now, let's create a view to focus on high-value invoices shipped via "First Class":

SQL

CREATE OR REPLACE VIEW high_value_invoices AS
SELECT
    ORDER_ID,
    INVOICE_NUMBER,
    ITEM_NAME,
    TOTAL_AMOUNT,
    BALANCE_DUE,
    SHIP_MODE,
    QUANTITY,
    ITEM_RATE,
    SUBTOTAL,
    DISCOUNT_AMT,
    ITEM_AMOUNT,
    SHIPPING_AMT,
    BILL_TO,
    SHIP_TO,
    DATE,
    NOTES
FROM invoices
WHERE SHIP_MODE = 'First Class'
  AND CAST(REPLACE(TOTAL_AMOUNT, '$', '') AS DECIMAL) > 100

This view allows you to easily query and analyze these specific invoices.


Conclusion

Snowflake Document AI enables you to transform unstructured data into actionable insights. By leveraging zero-shot pre-trained models and custom classifiers, you can extract, organize, and analyze information from various document types, opening up a world of possibilities for your data-driven applications.

Resources

Join Our Community

Join our newsletter list for occasional updates, products and insights.

Join Our Community

Join our newsletter list for occasional updates, products and insights.

Join Our Community

Join our newsletter list for occasional updates, products and insights.

More from
Gen AI/LLMs

Snowflake Embeddings and Vector Search

Snowflake Embeddings and Vector Search

Snowflake Embeddings and Vector Search

Snowflake Co-Pilot

Snowflake Co-Pilot

Snowflake Co-Pilot

Snowflake Universal Search

Snowflake Universal Search

Snowflake Universal Search

Streamlit Chatbot in Under 60 Seconds Using Snowflake Cortex

Streamlit Chatbot in Under 60 Seconds Using Snowflake Cortex

Streamlit Chatbot in Under 60 Seconds Using Snowflake Cortex

Snowflake Document AI (DocAI)

Snowflake Document AI (DocAI)

Snowflake Document AI (DocAI)

Snowflake Cortex Finetuning LLMs

Snowflake Cortex Finetuning LLMs

Snowflake Cortex Finetuning LLMs

Copyright 2020-24 - All Right Reserved

Copyright 2020-24 - All Right Reserved

Copyright 2020-24 - All Right Reserved