Introduction
Most organizations have plenty of data, but getting a reliable answer still takes too long because the path from a business question to a queryable definition is full of friction. Stakeholders ask things like "What changed after the campaign?" or "Which customers are at risk?" and the data team has to translate that intent into the right tables, joins, filters, time windows, and metric definitions—while also navigating inconsistent naming, shifting schemas, duplicated "sources of truth," and access controls. The result is bottlenecks, repeated back-and-forth, slow decision cycles, and brittle reporting where two teams can answer the "same" question differently. In practice, the business problem is not a lack of data—it's the cost, latency, and inconsistency of turning everyday questions into governed, auditable, and repeatable answers at scale.
We built an agent that needed to answer business questions from a Snowflake database. The obvious approach was prompt engineering—give the LLM the schema of the analytical DB plus some examples, and let it generate SQL. The first version worked surprisingly well on simple queries. Then a product manager asked: "What was the DAU over the last 3 months?" The agent confidently returned results. They were completely wrong—it had joined tables incorrectly and missed a critical filter. That wasn't the agent's fault: it had no way of knowing that, in this organization, "DAU" meant filtering users on a particular attribute and counting them as active only when certain conditions were met.
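The gap is easy to reproduce. Below is a minimal sketch of the failure mode, using a hypothetical schema (`users`, `events`, `status`, `is_internal` are illustrative assumptions, not the client's actual tables): a syntactically correct query and a definition-aware one return very different daily active user counts on the same data.

```python
import sqlite3

# Hypothetical schema for illustration -- the real definition of "active"
# lives in analysts' heads, not in the tables themselves.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (user_id INTEGER, status TEXT, is_internal INTEGER);
    CREATE TABLE events (user_id INTEGER, event_date TEXT);

    INSERT INTO users VALUES
        (1, 'active',  0),   -- real customer
        (2, 'churned', 0),   -- should not count as active
        (3, 'active',  1);   -- internal test account

    INSERT INTO events VALUES
        (1, '2024-06-01'), (2, '2024-06-01'), (3, '2024-06-01');
""")

# What a naive Text-to-SQL agent plausibly generates: syntactically
# correct, but it counts every user who produced an event.
naive = con.execute("""
    SELECT COUNT(DISTINCT user_id) FROM events
    WHERE event_date = '2024-06-01'
""").fetchone()[0]

# What the business actually means by DAU: only non-internal users
# whose account status qualifies them as active.
governed = con.execute("""
    SELECT COUNT(DISTINCT e.user_id)
    FROM events e
    JOIN users u ON u.user_id = e.user_id
    WHERE e.event_date = '2024-06-01'
      AND u.status = 'active'
      AND u.is_internal = 0
""").fetchone()[0]

print(naive, governed)  # 3 vs. 1 -- same question, different answers
```

Both queries parse and run; only one matches the organization's metric definition, and nothing in the schema alone tells the model which.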
This is the Text-to-SQL paradox: LLMs are remarkably good at generating syntactically correct SQL, but the gap between "correct syntax" and "correct answer" is where production systems fail. After a number of iterations for a client project, we found that the challenges go far deeper than prompt engineering. Real-world databases have cryptic column names that no LLM can interpret without help. Users expect consistent results but LLMs are inherently non-deterministic. Multi-tenant platforms serve customers who use identical schemas in completely different ways.
Interestingly, power users often catch instinctively that an answer is wrong. The dashboards of the big data era were built over multiple dev cycles between business users and data analysts, often over painstaking months if not years, but they had one thing going for them: their answers were correct and their data was consistent. Text-to-SQL has a high bar to meet.
What's inside this report:
- Why naive Text-to-SQL implementations fail at enterprise scale, with concrete examples
- The cryptic schema problem: when `attr_47` could mean anything
- Non-determinism and the trust crisis: why the same query produces different charts
- Multi-tenancy challenges: one schema, infinite interpretations
- How reasoning-based agents with memory solve these problems
- A practical framework for production Text-to-SQL deployment
We'll walk through what we learned building Text-to-SQL for real business users—the failures, the breakthroughs, and the patterns that emerged. By the end, you'll have a clearer mental model for when Text-to-SQL makes sense, how to implement it safely, and what accuracy levels are actually achievable in production.

