Jun 1, 2024
Snowflake Document AI (DocAI)
In today's data-driven landscape, an estimated 80% of enterprise data is unstructured, residing in formats like PDFs, images, and emails. Much of this information sits idle because traditional data processing tools struggle to extract meaning from content that has no fixed schema.
From invoices and purchase orders in the retail sector to patient records in healthcare and equipment logs in manufacturing, organizations across industries grapple with the challenge of extracting value from unstructured data. The inability to easily search, analyze, and integrate this data with structured information creates a significant bottleneck, hindering decision-making and stifling innovation.
Consider, for example, the invoice depicted above. It contains critical information such as customer details, order items, and financial figures. Without specialized tools, extracting this data requires time-consuming manual effort that is prone to errors and delays. Countless businesses across sectors face this challenge.
Setting the Stage
Before we dive into Document AI, let's set up our environment:
>_ SQL
We'll work within a dedicated database called invoicesdb to keep things organized.
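A minimal sketch of this setup, assuming a small warehouse named docai_wh (a hypothetical name) and the default public schema; only the database name invoicesdb comes from the text above:

```sql
-- Hypothetical warehouse name; size and auto-suspend values are assumptions.
CREATE WAREHOUSE IF NOT EXISTS docai_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

-- Dedicated database for the invoice demo (name taken from the article).
CREATE DATABASE IF NOT EXISTS invoicesdb;

-- Set the session context so later statements can use unqualified names.
USE DATABASE invoicesdb;
USE SCHEMA public;
USE WAREHOUSE docai_wh;
```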
Access Control for Document AI
To enable Document AI features, we need to grant the necessary permissions to the role or user that will be working with it:
>_ SQL
Now, the user demo_user (or your username) can harness the full power of Document AI.
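The grants could look roughly like the following. This is a sketch based on the privileges described in Snowflake's Document AI setup guide as I understand them, so check it against the current documentation. The role name doc_ai_role and the warehouse docai_wh are hypothetical; invoicesdb and demo_user come from the text.

```sql
USE ROLE ACCOUNTADMIN;

-- Hypothetical custom role that will own the Document AI work.
CREATE ROLE IF NOT EXISTS doc_ai_role;

-- Database role that gates access to Document AI features.
GRANT DATABASE ROLE SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR TO ROLE doc_ai_role;

-- Compute and namespace access (docai_wh is an assumed warehouse name).
GRANT USAGE, OPERATE ON WAREHOUSE docai_wh TO ROLE doc_ai_role;
GRANT USAGE ON DATABASE invoicesdb TO ROLE doc_ai_role;
GRANT USAGE ON SCHEMA invoicesdb.public TO ROLE doc_ai_role;

-- Object creation privileges needed for stages, model builds, and results.
GRANT CREATE STAGE ON SCHEMA invoicesdb.public TO ROLE doc_ai_role;
GRANT CREATE SNOWFLAKE.ML.DOCUMENT_INTELLIGENCE ON SCHEMA invoicesdb.public TO ROLE doc_ai_role;
GRANT CREATE MODEL ON SCHEMA invoicesdb.public TO ROLE doc_ai_role;
GRANT CREATE TABLE, CREATE VIEW ON SCHEMA invoicesdb.public TO ROLE doc_ai_role;

-- Finally, hand the role to the user.
GRANT ROLE doc_ai_role TO USER demo_user;
```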
Storing and Preparing Invoice Data
For this demonstration, we'll be using a set of sample invoices obtained from the GitHub repository https://github.com/femstac/Sample-Pdf-invoices (downloaded Feb 7, 2024). These invoices vary in layout and format, allowing us to explore how Snowflake Document AI can adapt to different document structures.
>_ SQL
This will create the external stage where your raw data will reside, and the internal stage where you will stage the data for processing.
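One way the two stages might be defined is sketched below. The stage names invoices_ext and invoices_int, the S3 URL, and the storage integration my_s3_integration are all hypothetical placeholders; replace them with your own cloud location before running. The internal stage uses server-side encryption, which Document AI expects for internal stages.

```sql
-- External stage over the raw PDFs.
-- The bucket URL and integration name are placeholders, not real resources.
CREATE STAGE IF NOT EXISTS invoices_ext
  URL = 's3://my-example-bucket/invoices/'
  STORAGE_INTEGRATION = my_s3_integration
  DIRECTORY = (ENABLE = TRUE);

-- Internal stage that Document AI will read from.
CREATE STAGE IF NOT EXISTS invoices_int
  DIRECTORY = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- Copy only the PDF files from the external stage into the internal one.
COPY FILES INTO @invoices_int
  FROM @invoices_ext
  PATTERN = '.*[.]pdf';

-- Refresh the directory table so the new files become visible to queries.
ALTER STAGE invoices_int REFRESH;
```

If the invoices are already on your local machine, uploading them directly to the internal stage through Snowsight is an alternative to the external stage and the copy step.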
Training Your Custom AI Models in Snowsight
While Snowflake Document AI offers a powerful pre-trained model, fine-tuning it to your specific document types and extraction needs is crucial for optimal performance. Snowsight, Snowflake's web interface, provides a streamlined way to train, version, and publish your models, like the one illustrated below.
Steps for Training:
Access Document AI: In Snowsight, navigate to AI & ML > Document AI.
Select Your Model: Choose the model you want to train from the list of existing builds.
Upload Documents: Upload a set of representative documents for training. Ensure these documents are labeled with the correct information you want to extract.
Label Fields (Optional): If your model requires more specific training, manually label the relevant fields in the documents within Snowsight. This helps the model learn the patterns and context for accurate extraction.
Start Training: Click the "Train Model" button. Snowflake will process your labeled documents and fine-tune the model's parameters.
Monitor Progress: Monitor the training progress in Snowsight. This typically involves evaluating the model's performance on a separate set of validation documents.
Iterate (Optional): If the model's accuracy isn't satisfactory, you can iterate on the training process by adding more labeled documents, refining labels, or adjusting model parameters.
Prompt Design for Document AI
Defining the right prompts is crucial for accurate and effective data extraction. You can use either:
Direct Prompts: Concise questions like "What is the bill to name?" or "What's the Invoice Number?"
Elaborate Prompts: More natural language-oriented questions like "Who is the bill being sent to?" or "What is the reference number for this invoice?"
Here's a table summarizing the prompts you can use to extract data from invoices:
Testing Document AI Predictions
With the data in place, you can test Document AI's predictions on a few sample invoices:
>_ SQL
This is just a test of the out-of-the-box model. The query calls the pre-trained model (INVOICESDB.PUBLIC.DAI) on the first two invoices in your stage and returns the extracted fields in JSON format.
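A test query of this kind might look like the sketch below. The model name INVOICESDB.PUBLIC.DAI comes from the text; the stage name invoices_int and the version number 1 are assumptions, so substitute your own stage and the version shown for your model build in Snowsight.

```sql
-- Run the model build on two staged files.
-- GET_PRESIGNED_URL gives the model a temporary link to each document.
SELECT
  relative_path,
  invoicesdb.public.dai!PREDICT(
    GET_PRESIGNED_URL(@invoices_int, relative_path),
    1  -- assumed model build version
  ) AS prediction
FROM DIRECTORY(@invoices_int)
LIMIT 2;
```

Each prediction is a JSON object in which every value name you defined maps to an array of candidate answers, each with a value and a confidence score.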
Storing and Processing Predictions
>_ SQL
We first create a table to hold the raw predictions, then use another query to extract the key information (order ID, billing address, etc.) into a usable tabular format.
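The two steps can be sketched as follows. The table names, the stage invoices_int, the version number 1, and the value names order_id, bill_to, ship_mode, and total are all hypothetical; the value names must match whatever you defined in your own model build.

```sql
-- Step 1: keep the raw JSON predictions, one row per staged document.
CREATE OR REPLACE TABLE invoice_predictions AS
SELECT
  relative_path,
  invoicesdb.public.dai!PREDICT(
    GET_PRESIGNED_URL(@invoices_int, relative_path),
    1  -- assumed model build version
  ) AS prediction
FROM DIRECTORY(@invoices_int);

-- Step 2: flatten the top-ranked answer for each value into typed columns.
CREATE OR REPLACE TABLE invoices AS
SELECT
  relative_path,
  prediction:order_id[0]:value::STRING  AS order_id,
  prediction:order_id[0]:score::FLOAT   AS order_id_score,
  prediction:bill_to[0]:value::STRING   AS bill_to,
  prediction:ship_mode[0]:value::STRING AS ship_mode,
  -- Strip currency symbols and thousands separators before casting.
  TRY_TO_NUMBER(
    REPLACE(REPLACE(prediction:total[0]:value::STRING, '$', ''), ',', ''),
    12, 2
  ) AS total
FROM invoice_predictions;
```

Keeping the raw predictions in their own table means the parsing query can be revised later without calling the model again.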
Creating Views for Analysis
Now, let's create a view to focus on high-value invoices shipped via "First Class":
>_ SQL
This view allows you to easily query and analyze these specific invoices.
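A view along these lines is sketched below, assuming a parsed table named invoices with columns order_id, bill_to, ship_mode, and total (all hypothetical names). The 1000 threshold for "high-value" is also an assumption; pick whatever cutoff suits your data.

```sql
-- High-value invoices shipped via First Class.
CREATE OR REPLACE VIEW high_value_first_class_invoices AS
SELECT
  order_id,
  bill_to,
  ship_mode,
  total
FROM invoices
WHERE ship_mode = 'First Class'
  AND total > 1000;  -- assumed cutoff for "high-value"

-- Example usage: largest invoices first.
SELECT *
FROM high_value_first_class_invoices
ORDER BY total DESC;
```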
Conclusion
Snowflake Document AI enables you to transform unstructured documents into queryable data. By combining the zero-shot pre-trained model with fine-tuned models of your own, you can extract, organize, and analyze information from various document types and join it with the structured data you already hold in Snowflake.
Resources
Snowflake Document AI Documentation: https://docs.snowflake.com/en/user-guide/snowflake-cortex/document-ai/overview
Sample PDF Invoices: https://github.com/femstac/Sample-Pdf-invoices