Introduction
Most organizations have plenty of data, but getting a reliable answer still takes too long because the path from a business question to a queryable definition is full of friction. Stakeholders ask things like "What changed after the campaign?" or "Which customers are at risk?" and the data team has to translate that intent into the right tables, joins, filters, time windows, and metric definitions—while also navigating inconsistent naming, shifting schemas, duplicated "sources of truth," and access controls. The result is bottlenecks, repeated back-and-forth, slow decision cycles, and brittle reporting where two teams can answer the "same" question differently. In practice, the business problem is not a lack of data—it's the cost, latency, and inconsistency of turning everyday questions into governed, auditable, and repeatable answers at scale.
We built an agent that needed to answer business questions from a Snowflake database. The obvious approach was prompt engineering—give the LLM the schema of the analytical DB plus some examples, and let it generate SQL. The first version worked surprisingly well on simple queries. Then a product manager asked: "What was the DAU over the last 3 months?" The agent confidently returned results. They were completely wrong—it had joined tables incorrectly and missed a critical filter. That wasn't the agent's fault: it had no way of knowing that, in this organization, "DAU" meant filtering users on a particular attribute and counting them as active only when certain conditions were met.
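The gap is easy to reproduce. Below is a minimal sketch of the failure mode, using a hypothetical schema (`users`, `events`, `status`, `is_internal` are illustrative assumptions, not the client's actual tables): a syntactically correct query and a definition-aware one return very different daily active user counts on the same data.

```python
import sqlite3

# Hypothetical schema for illustration -- the real definition of "active"
# lives in analysts' heads, not in the tables themselves.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (user_id INTEGER, status TEXT, is_internal INTEGER);
    CREATE TABLE events (user_id INTEGER, event_date TEXT);

    INSERT INTO users VALUES
        (1, 'active',  0),   -- real customer
        (2, 'churned', 0),   -- should not count as active
        (3, 'active',  1);   -- internal test account

    INSERT INTO events VALUES
        (1, '2024-06-01'), (2, '2024-06-01'), (3, '2024-06-01');
""")

# What a naive Text-to-SQL agent plausibly generates: syntactically
# correct, but it counts every user who produced an event.
naive = con.execute("""
    SELECT COUNT(DISTINCT user_id) FROM events
    WHERE event_date = '2024-06-01'
""").fetchone()[0]

# What the business actually means by DAU: only non-internal users
# whose account status qualifies them as active.
governed = con.execute("""
    SELECT COUNT(DISTINCT e.user_id)
    FROM events e
    JOIN users u ON u.user_id = e.user_id
    WHERE e.event_date = '2024-06-01'
      AND u.status = 'active'
      AND u.is_internal = 0
""").fetchone()[0]

print(naive, governed)  # 3 vs. 1 -- same question, different answers
```

Both queries parse and run; only one matches the organization's metric definition, and nothing in the schema alone tells the model which.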
This is the Text-to-SQL paradox: LLMs are remarkably good at generating syntactically correct SQL, but the gap between "correct syntax" and "correct answer" is where production systems fail. After a number of iterations for a client project, we found that the challenges go far deeper than prompt engineering. Real-world databases have cryptic column names that no LLM can interpret without help. Users expect consistent results but LLMs are inherently non-deterministic. Multi-tenant platforms serve customers who use identical schemas in completely different ways.
Interestingly, power users often catch instinctively that an answer is wrong. The dashboards of the big data era were built over multiple dev cycles between business users and data analysts, often over painstaking months if not years, but they had one thing going for them: their answers were correct and their data was consistent. Text-to-SQL has a high bar to meet.
What's inside this report:
- Why naive Text-to-SQL implementations fail at enterprise scale, with concrete examples
- The cryptic schema problem: when `attr_47` could mean anything
- Non-determinism and the trust crisis: why the same query produces different charts
- Multi-tenancy challenges: one schema, infinite interpretations
- How reasoning-based agents with memory solve these problems
- A practical framework for production Text-to-SQL deployment
We'll walk through what we learned building Text-to-SQL for real business users—the failures, the breakthroughs, and the patterns that emerged. By the end, you'll have a clearer mental model for when Text-to-SQL makes sense, how to implement it safely, and what accuracy levels are actually achievable in production.

