Gen AI/LLMs
May 22, 2024
Snowflake Embeddings and Vector Search
Snowflake's advanced capabilities extend beyond traditional data warehousing and analytics. With the advent of Snowflake Cortex, powerful tools like embedding generation and vector search have become accessible for unlocking deeper insights from unstructured text data.
Understanding Embeddings
Embeddings are numerical representations of words or phrases in a high-dimensional space.
Take a list of everyday items, for example. The corresponding embeddings capture different aspects of each item's meaning, which lets us compare the semantic similarity between pieces of text. Snowflake's EMBED_TEXT_768 function, part of the Cortex suite, generates these embeddings as 768-dimensional vectors using a pre-trained language model.
Hands-On Example: Everyday Objects
Let's explore embeddings with a simple example:
>_ SQL
This script creates a table EverydayObjects and populates it with 23 common objects.
Next, let's see how traditional string matching falls short in capturing semantic relationships:
String matching only works for exact or partial keyword matches. It cannot understand that "Pen" is related to "Writing" or that "Beaver" is an "animal."
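To make the limitation concrete outside Snowflake, here is a minimal Python sketch of case-insensitive substring matching, roughly what an ILIKE '%term%' filter does. The object list and the keyword_match helper are hypothetical stand-ins for illustration, not the 23 objects from the original table.

```python
# Hypothetical sample list; not the original EverydayObjects data.
objects = ["Pen", "Notebook", "Beaver", "Oak Tree", "Coffee Mug"]


def keyword_match(items, term):
    """Return items whose name contains `term`, ignoring case."""
    needle = term.lower()
    return [item for item in items if needle in item.lower()]


# A literal substring is found...
print(keyword_match(objects, "tree"))
# ...but related concepts are missed because no name contains the word.
print(keyword_match(objects, "writing"))
print(keyword_match(objects, "animal"))
```

Only the first search finds anything, because "Oak Tree" literally contains "tree". The matcher has no notion that a pen is used for writing or that a beaver is an animal.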
Generating Embeddings
To address this, we'll use SNOWFLAKE.CORTEX.EMBED_TEXT_768 to create embeddings:
>_ SQL
The new table everydayobjects_embeddings contains the original objects along with their vector embeddings, stored in Snowflake's VECTOR data type.
Vector Similarity Search
Now, the magic happens. We can find semantically similar objects using VECTOR_COSINE_SIMILARITY:
>_ SQL
This query returns the five objects most semantically similar to the word "tree," ranked by meaning rather than by matching names.
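For intuition about what a cosine-similarity ranking computes, the following Python sketch scores a query vector against a handful of stored vectors and keeps the best matches. Cosine similarity is the dot product of two vectors divided by the product of their lengths. The 3-dimensional vectors, the object names, and the top_k helper are made up for illustration; real EMBED_TEXT_768 output has 768 dimensions, and this is not how Snowflake implements the function internally.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, embeddings, k=5):
    """Return the k (name, score) pairs with the highest similarity to query_vec."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up toy vectors (assumption): nearby directions stand for related meanings.
toy_embeddings = {
    "Oak Tree": [0.9, 0.1, 0.0],
    "Leaf": [0.8, 0.3, 0.1],
    "Beaver": [0.4, 0.8, 0.1],
    "Pen": [0.0, 0.1, 0.9],
}
tree_query = [1.0, 0.2, 0.0]  # hypothetical embedding of the word "tree"

for name, score in top_k(tree_query, toy_embeddings, k=3):
    print(f"{name}: {score:.3f}")
```

With these toy values the ranking is "Oak Tree", then "Leaf", then "Beaver", while "Pen" points in a nearly orthogonal direction and drops out of the top three.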
Putting It Together: Document Chunking and Summarization
This concept extends to larger text documents. Let's create a DocumentChunks table and embed the chunks:
>_ SQL
We can then find the chunk most similar to a query and summarize it:
>_ SQL
This query will search the document chunks for text related to "sales growth in Q1", then return the original chunk along with a summary generated by Snowflake Cortex.
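The walkthrough above assumes the document has already been split into chunks before it is loaded into DocumentChunks. One common way to do that, sketched here in Python under assumptions of our own, is to split on whitespace and emit fixed-size windows of words that overlap slightly so a sentence cut at a boundary still appears whole in one chunk. The chunk_words helper, the chunk size, and the overlap are illustrative choices, not part of the original script or of Snowflake Cortex.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to chunk_size words.

    Each chunk repeats the last `overlap` words of the previous chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Tiny synthetic document so the overlap is easy to see.
sample = " ".join(f"word{i}" for i in range(12))
for index, chunk in enumerate(chunk_words(sample, chunk_size=5, overlap=2)):
    print(index, chunk)
```

Each resulting string would become one row to embed. Smaller chunks give more precise matches but less context for the summary, so the sizes are worth tuning for your documents.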
Key Takeaways:
Embeddings unlock the ability to search and analyze text based on meaning.
Snowflake Cortex makes it easy to generate and work with embeddings.
Vector similarity search enables finding semantically related information.
Applications range from simple object comparisons to complex document analysis and RAG-based applications.
Reset Environment
To reset the environment for further experimentation or to clean up resources, you can drop the EmbeddingsDB database using the provided script.
>_ SQL
This will permanently delete all tables, data, and other objects created within this database. Remember, this action is irreversible, so use it with caution!
Snowflake Documentation & Resources:
Snowflake Cortex Overview: Large Language Model (LLM) Functions (Snowflake Cortex)
Snowflake Cortex Embeddings: Vector Embeddings | Snowflake Documentation
EMBED_TEXT_768 Function: EMBED_TEXT_768 (SNOWFLAKE.CORTEX)
VECTOR_COSINE_SIMILARITY Function: https://docs.snowflake.com/en/sql-reference/functions/vector_cosine_similarity
Other Resources:
Understanding Word Embeddings: The Building Blocks of NLP and GPTs
A Gentle Introduction to Vector Similarity Search: A Gentle Introduction to Vector Search - OpenDataScience.com