Unlocking Unstructured Data: Inside the Warehouse, Not Outside It

Unstructured data no longer has to live outside your analytics stack. This post shows how Snowflake’s Cortex AISQL and BigQuery’s AI.GENERATE/AI.GENERATE_TABLE bring generative and semantic functions directly into SQL—so you can filter, classify, summarize, transcribe, and join text (and images/audio) with your facts inside the warehouse. We walk through a practical support-tickets example—topic labeling, one-line summaries, and joining to transactions—to reveal product drivers and sentiment patterns without ETL sprawl. You’ll also get pragmatic guidance on prompt templates, validation and tests, gradual rollout, cost/perf tuning, and governance via your semantic layer. The result: richer insight, less plumbing, and AI you can trust in production.

Britton Stamper
October 12, 2025

For decades, analytics and BI operated on a simple premise: your data must be structured, clean, and neatly tabular. Data engineers built pipelines to normalize, join, and aggregate, and analysts consumed those models in dashboards and SQL tools. That paradigm delivered enormous value—but it also created blind spots.

Today, organizations are drowning in unstructured or semi-structured data: support tickets, product reviews, email threads, scanned invoices, design files, meeting transcripts, and more. Most of this lives next to your structured data—sometimes in the same warehouse or accompanying storage systems—but it’s rarely exploited to its full potential. Until now.

Thanks to the rise of “AISQL” and generative AI inference inside the warehouse, analytics teams no longer need to export, stage, or build complex pipelines just to surface insights from unstructured content. You can now query, filter, classify, transform, and join unstructured data with structured data—all in SQL. Below, I walk through what’s available today in Snowflake and BigQuery, how it’s already being used, and how analytics engineers can make it part of their toolkit.

AISQL in Action: Snowflake’s Cortex AISQL

Snowflake’s Cortex AISQL (currently in public preview) brings LLM-powered functions right into your SQL query engine. With these functions, you can operate over text, images, or (soon) audio—all without leaving Snowflake.

Some of the key operators include:

  • AI_FILTER — a boolean predicate over text or image columns (usable in WHERE or JOIN … ON) to filter rows semantically.
  • AI_CLASSIFY — assigns labels or categories to input text or images, including multi-label classification.
  • AI_COMPLETE — generates or “completes” text (or image + prompt) using an LLM; useful for summarization, expansion, and similar tasks.
  • AI_AGG / AI_SUMMARIZE_AGG — aggregate across rows of unstructured content, e.g. summarizing all reviews for a product in one sentence.
  • AI_TRANSCRIBE (preview) — transcribes audio files (e.g. call recordings) into text inside SQL, enabling speech-to-text workflows.

Because these functions are integrated into SQL, you can chain them with your existing queries—JOIN, GROUP BY, WINDOW functions, etc. That means you can combine insights from unstructured and structured data in one unified pipeline.
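
As a rough illustration of that chaining (the product_reviews table and its product_id and review_text columns are assumptions for the sketch, not part of the example later in this post), a grouped aggregation might look like:

SELECT
 product_id,
 -- AI_AGG takes a text column plus a task description and reduces many rows to one answer per group
 AI_AGG(review_text, 'Summarize the main complaints across these reviews in one sentence') AS complaint_summary
FROM product_reviews
GROUP BY product_id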

BigQuery: AI.GENERATE_TABLE, AI.GENERATE, and Generative SQL

Google is also embedding generative AI directly into BigQuery via its integration with Vertex AI.

Two standout functions are:

  • AI.GENERATE_TABLE — takes unstructured content as input plus a schema prompt and returns a structured table based on the model output. You can use it for classification tasks, summarization, or schema mapping of text content.
  • AI.GENERATE — functions more like a row-wise generative text operator. It lets you run generative tasks (summaries, classification, etc.) per row in a query.

BigQuery also supports ML.GENERATE_TEXT (and embedding generation) via remote models (e.g. Vertex Gemini or partner LLMs).

These capabilities let you transform free-text content into structured, SQL-queryable features within BigQuery, without exporting to external NLP pipelines.
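
For row-wise generation via a remote model, a sketch might look like the following (the model name and the support_tickets table are assumptions carried over from the example below, not verbatim from any one setup):

SELECT
 ticket_id,
 ml_generate_text_llm_result AS summary
FROM ML.GENERATE_TEXT(
 MODEL `myproject.mydataset.my_gemini_model`,
 (
   SELECT
     ticket_id,
     CONCAT('Summarize this support ticket in one sentence: ', ticket_text) AS prompt
   FROM support_tickets
 ),
 -- flatten_json_output exposes the generated text as the ml_generate_text_llm_result column
 STRUCT(256 AS max_output_tokens, TRUE AS flatten_json_output)
)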

Example Use Case: Enriching Customer Feedback

Let’s walk through a concrete scenario to bring this home.

Suppose you have a table of support tickets:

ticket_id | customer_id | ticket_text | created_at

You also have structured tables like transactions and users.

You want to:

  1. Classify each ticket into topics (e.g. “billing”, “quality issue”, “refund request”).
  2. Summarize the ticket text to a short executive summary.
  3. Join those classifications back into the transactions table to see if certain product lines trigger more “quality issue” tickets.

With Snowflake’s AISQL, you might write:

SELECT
 t.ticket_id,
 AI_CLASSIFY(t.ticket_text, ['billing','quality','refund']) AS issue_category,
 -- AI_COMPLETE takes the model name first; swap in whichever Cortex model is enabled in your account
 AI_COMPLETE('claude-3-5-sonnet',
   'Summarize this support request in one sentence: ' || t.ticket_text) AS summary
FROM support_tickets t
WHERE AI_FILTER(PROMPT('Does this ticket mention a refund or return? {0}', t.ticket_text))

Then you might persist that enriched output (say, as an enriched_tickets table or CTE) and join it to your structured tables:

SELECT
 u.user_name,
 tx.product_id,
 t2.issue_category,
 t2.summary
FROM transactions tx
JOIN users u ON tx.user_id = u.user_id
JOIN (
 SELECT ticket_id, issue_category, summary
 FROM enriched_tickets
) t2 ON tx.ticket_id = t2.ticket_id

In BigQuery, you could use AI.GENERATE_TABLE to produce a derived table from ticket_text:

SELECT *
FROM AI.GENERATE_TABLE(
 MODEL `myproject.mydataset.my_gemini_model`,
 (
   -- the input query supplies a column named prompt; other columns (ticket_id) are returned alongside the generated fields
   SELECT
     ticket_id,
     CONCAT(
       'Classify this support ticket as billing, quality, or refund, ',
       'and summarize it in one sentence: ', ticket_text
     ) AS prompt
   FROM support_tickets
 ),
 STRUCT(
   'issue_category STRING, summary STRING' AS output_schema,
   512 AS max_output_tokens
 )
) AS derived
Then you join that derived table into your analytical models.

By bridging unstructured context (ticket text) with structured tables (transactions, users), you unlock deeper insight: why certain product lines are seeing more support issues, or how sentiment correlates with product features.

Best Practices & Pitfalls

  • Watch prompt stability & consistency: Use template prompts and guardrails (max length, context windows) so that your AI outputs are consistent and testable.
  • Validate & test: Build tests against AI output (e.g. spot-check classifications, verify no nulls where unacceptable) before promoting into production; a simple check is sketched after this list.
  • Gradual rollout: Start with a narrow slice or product line before scaling across your entire warehouse.
  • Monitor performance & cost: AISQL functions (especially over many rows) can be expensive or slow—batch wisely, filter early, and optimize queries.
  • Govern through semantic layers: Surface AI-derived fields (labels, summaries) through your semantic layer so downstream tools trust them and don’t bypass definitions.
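
As a minimal sketch of that validation step (assuming the enriched output has been materialized as an enriched_tickets table, as in the example above), a pre-promotion check could look like:

SELECT
 COUNT(*) AS total_rows,
 -- rows the model failed to label
 SUM(CASE WHEN issue_category IS NULL THEN 1 ELSE 0 END) AS null_categories,
 -- labels outside the expected set
 SUM(CASE WHEN issue_category NOT IN ('billing', 'quality', 'refund') THEN 1 ELSE 0 END) AS unexpected_labels,
 -- empty or missing summaries
 SUM(CASE WHEN summary IS NULL OR LENGTH(summary) = 0 THEN 1 ELSE 0 END) AS empty_summaries
FROM enriched_tickets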

ABOUT THE AUTHOR
Britton Stamper

Britton is the CTO of Push.ai and oversees Product, Design, and Engineering. He's been a passionate builder, analyst and designer who loves all things data products and growth. You can find him reading books at a coffee shop or finding winning strategies in board games and board rooms.
