Unlocking Unstructured Data: Inside the Warehouse, Not Outside It

Unstructured data no longer has to live outside your analytics stack. This post shows how Snowflake’s Cortex AISQL and BigQuery’s AI.GENERATE/AI.GENERATE_TABLE bring generative and semantic functions directly into SQL—so you can filter, classify, summarize, transcribe, and join text (and images/audio) with your facts inside the warehouse. We walk through a practical support-tickets example—topic labeling, one-line summaries, and joining to transactions—to reveal product drivers and sentiment patterns without ETL sprawl. You’ll also get pragmatic guidance on prompt templates, validation and tests, gradual rollout, cost/perf tuning, and governance via your semantic layer. The result: richer insight, less plumbing, and AI you can trust in production.

Britton Stamper
October 12, 2025

For decades, analytics and BI operated on a simple premise: your data must be structured, clean, and neatly tabular. Data engineers built pipelines to normalize, join, and aggregate, and analysts consumed those models in dashboards and SQL tools. That paradigm delivered enormous value—but it also created blind spots.

Today, organizations are drowning in unstructured or semi-structured data: support tickets, product reviews, email threads, scanned invoices, design files, meeting transcripts, and more. Most of this lives next to your structured data—sometimes in the same warehouse or accompanying storage systems—but it’s rarely exploited to its full potential. Until now.

Thanks to the rise of “AISQL” and generative AI inference inside the warehouse, analytics teams no longer need to export, stage, or build complex pipelines just to surface insights from unstructured content. You can now query, filter, classify, transform, and join unstructured data with structured data—all in SQL. Below, I walk through what’s available today in Snowflake and BigQuery, how it’s already being used, and how analytics engineers can make it part of their toolkit.

AISQL in Action: Snowflake’s Cortex AISQL

Snowflake’s Cortex AISQL (currently in public preview) brings LLM-powered functions right into your SQL query engine. With these functions, you can operate over text, images, or (soon) audio—all without leaving Snowflake.

Some of the key operators include:

  • AI_FILTER — a boolean predicate over text or image columns (usable in WHERE or JOIN … ON) to filter rows semantically.
  • AI_CLASSIFY — assigns labels or categories to input text or images, including multi-label classification.
  • AI_COMPLETE — generates or “completes” text (or image + prompt) using an LLM; useful for summarization, expansion, and similar tasks.
  • AI_AGG / AI_SUMMARIZE_AGG — aggregate across rows of unstructured content, e.g. summarizing all reviews for a product in one sentence.
  • AI_TRANSCRIBE (preview) — transcribes audio files (e.g. call recordings) into text inside SQL, enabling speech-to-text workflows.

Because these functions are integrated into SQL, you can chain them with your existing queries—JOIN, GROUP BY, WINDOW functions, etc. That means you can combine insights from unstructured and structured data in one unified pipeline.
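
As a rough illustration of that chaining (the product_reviews table and its product_id and review_text columns are assumptions for the sketch, not part of the example later in this post), a grouped aggregation might look like:

SELECT
 product_id,
 -- AI_AGG takes a text column plus a task description and reduces many rows to one answer per group
 AI_AGG(review_text, 'Summarize the main complaints across these reviews in one sentence') AS complaint_summary
FROM product_reviews
GROUP BY product_id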

BigQuery: AI.GENERATE_TABLE, AI.GENERATE, and Generative SQL

Google is also embedding generative AI directly into BigQuery via its integration with Vertex AI.

Two standout functions are:

  • AI.GENERATE_TABLE — takes unstructured content as input plus a schema prompt and returns a structured table based on the model output. You can use it for classification tasks, summarization, or schema mapping of text content.
  • AI.GENERATE — functions more like a row-wise generative text operator. It lets you run generative tasks (summaries, classification, etc.) per row in a query.

BigQuery also supports ML.GENERATE_TEXT (and embedding generation) via remote models (e.g. Vertex Gemini or partner LLMs).

These capabilities let you transform free-text content into structured, SQL-queryable features within BigQuery, without exporting to external NLP pipelines.
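
For row-wise generation via a remote model, a sketch might look like the following (the model name and the support_tickets table are assumptions carried over from the example below, not verbatim from any one setup):

SELECT
 ticket_id,
 ml_generate_text_llm_result AS summary
FROM ML.GENERATE_TEXT(
 MODEL `myproject.mydataset.my_gemini_model`,
 (
   SELECT
     ticket_id,
     CONCAT('Summarize this support ticket in one sentence: ', ticket_text) AS prompt
   FROM support_tickets
 ),
 -- flatten_json_output exposes the generated text as the ml_generate_text_llm_result column
 STRUCT(256 AS max_output_tokens, TRUE AS flatten_json_output)
)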

Example Use Case: Enriching Customer Feedback

Let’s walk through a concrete scenario to bring this home.

Suppose you have a table of support tickets:

ticket_id | customer_id | ticket_text | created_at

You also have structured tables like transactions and users.

You want to:

  1. Classify each ticket into topics (e.g. “billing”, “quality issue”, “refund request”).
  2. Summarize the ticket text to a short executive summary.
  3. Join those classifications back into the transactions table to see if certain product lines trigger more “quality issue” tickets.

With Snowflake’s AISQL, you might write:

SELECT
 t.ticket_id,
 AI_CLASSIFY(t.ticket_text, ['billing','quality','refund']) AS issue_category,
 -- AI_COMPLETE takes the model name first; swap in whichever Cortex model is enabled in your account
 AI_COMPLETE('claude-3-5-sonnet',
   'Summarize this support request in one sentence: ' || t.ticket_text) AS summary
FROM support_tickets t
WHERE AI_FILTER(PROMPT('Does this ticket mention a refund or return? {0}', t.ticket_text))

Then you might persist that enriched output (say, as an enriched_tickets table or CTE) and join it to your structured tables:

SELECT
 u.user_name,
 tx.product_id,
 t2.issue_category,
 t2.summary
FROM transactions tx
JOIN users u ON tx.user_id = u.user_id
JOIN (
 SELECT ticket_id, issue_category, summary
 FROM enriched_tickets
) t2 ON tx.ticket_id = t2.ticket_id

In BigQuery, you could use AI.GENERATE_TABLE to produce a derived table from ticket_text:

SELECT *
FROM AI.GENERATE_TABLE(
 MODEL `myproject.mydataset.my_gemini_model`,
 (
   -- the input query supplies a column named prompt; other columns (ticket_id) are returned alongside the generated fields
   SELECT
     ticket_id,
     CONCAT(
       'Classify this support ticket as billing, quality, or refund, ',
       'and summarize it in one sentence: ', ticket_text
     ) AS prompt
   FROM support_tickets
 ),
 STRUCT(
   'issue_category STRING, summary STRING' AS output_schema,
   512 AS max_output_tokens
 )
) AS derived
Then you join that derived table into your analytical models.

By bridging unstructured context (ticket text) with structured tables (transactions, users), you unlock deeper insight: why certain product lines are seeing more support issues, or how sentiment correlates with product features.

Best Practices & Pitfalls

  • Watch prompt stability & consistency: Use template prompts and guardrails (max length, context windows) so that your AI outputs are consistent and testable.
  • Validate & test: Build tests against AI output (e.g. spot-check classifications, verify no nulls where unacceptable) before promoting into production; a simple check is sketched after this list.
  • Gradual rollout: Start with a narrow slice or product line before scaling across your entire warehouse.
  • Monitor performance & cost: AISQL functions (especially over many rows) can be expensive or slow—batch wisely, filter early, and optimize queries.
  • Govern through semantic layers: Surface AI-derived fields (labels, summaries) through your semantic layer so downstream tools trust them and don’t bypass definitions.
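
As a minimal sketch of that validation step (assuming the enriched output has been materialized as an enriched_tickets table, as in the example above), a pre-promotion check could look like:

SELECT
 COUNT(*) AS total_rows,
 -- rows the model failed to label
 SUM(CASE WHEN issue_category IS NULL THEN 1 ELSE 0 END) AS null_categories,
 -- labels outside the expected set
 SUM(CASE WHEN issue_category NOT IN ('billing', 'quality', 'refund') THEN 1 ELSE 0 END) AS unexpected_labels,
 -- empty or missing summaries
 SUM(CASE WHEN summary IS NULL OR LENGTH(summary) = 0 THEN 1 ELSE 0 END) AS empty_summaries
FROM enriched_tickets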

ABOUT THE AUTHOR
Britton Stamper

Britton is the CTO of Push.ai and oversees Product, Design, and Engineering. He's been a passionate builder, analyst and designer who loves all things data products and growth. You can find him reading books at a coffee shop or finding winning strategies in board games and board rooms.
