
NL-to-SQL in Production: What Goes Wrong and How to Fix It


Natural language interfaces to structured data are powerful and fragile. A practical look at common failure modes and the mitigation strategies that actually work.

Natural language to SQL is one of the most immediately useful applications of large language models for data teams. Analysts who previously needed a SQL engineer to retrieve data can now ask a question in plain English and get a result. The productivity gains are real.

So is the failure surface.

I have built and maintained NL-to-SQL pipelines in production environments. These are the failure modes I encounter most often, and what I have learned about addressing them.

Failure Mode 1: Schema Ambiguity

LLMs generate SQL based on their understanding of your schema. If your schema is ambiguous, your generated queries will be too.

This shows up as table names that do not communicate their contents, column names like status that appear in twelve tables with different meanings, and foreign key relationships that are not documented anywhere the model can access.

Fix: Curate a schema context document. This is not your full DDL. It is a structured description of your key tables, their intended use, the meaning of critical columns, and any join patterns that are non-obvious. Treat it as a first-class artifact. Keep it updated when the schema changes.
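
To make that concrete, here is a minimal sketch of what one entry in a schema context document might look like, expressed as Python data so it can be rendered into the model's context. The table, columns, and join hint are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class TableContext:
    """One entry in the schema context document. All names are hypothetical."""
    name: str
    purpose: str                         # intended use, in plain language
    key_columns: dict[str, str]          # column -> meaning, including enum values
    join_hints: list[str] = field(default_factory=list)

ORDERS = TableContext(
    name="orders",
    purpose="One row per customer order; excludes test and internal orders.",
    key_columns={
        "status": "One of 'placed', 'shipped', 'cancelled'. Not the same "
                  "values as shipments.status.",
        "created_at": "UTC timestamp of order creation.",
    },
    join_hints=["Join to customers via orders.customer_id = customers.id."],
)

def render_context(tables: list[TableContext]) -> str:
    """Render the curated entries into the text block handed to the model."""
    lines = []
    for t in tables:
        lines.append(f"Table {t.name}: {t.purpose}")
        for col, meaning in t.key_columns.items():
            lines.append(f"  {col}: {meaning}")
        lines.extend(f"  Join hint: {hint}" for hint in t.join_hints)
    return "\n".join(lines)
```

The point is that the rendered text carries intent and caveats, not just names and types.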

Failure Mode 2: Silent Incorrectness

The most dangerous NL-to-SQL outputs are not the ones that throw a syntax error. They are the ones that run, return results, and are wrong.

A query that returns 12,000 rows when the correct answer is 3 rows, and nobody notices.

This happens because:

- Analysts trust model outputs, especially when they look reasonable

- There is no automatic validation layer

- Edge cases in the data (nulls, duplicates, date ranges) produce subtly incorrect aggregations

Fix: Build an evaluation layer. For each query type your system handles, maintain a set of reference queries with known-correct outputs. Run new model-generated queries against the same inputs and compare. Flag discrepancies for human review. This is not a perfect solution, but it dramatically reduces the silent incorrectness problem.
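
A minimal sketch of such a harness, using sqlite3 against a fixture database; the reference case and the generate_sql callable are assumptions for illustration, not part of any specific library:

```python
import sqlite3

# Reference cases: a natural language question paired with known-correct SQL.
# Both the reference and the generated query run against the same fixture
# database, and their result sets are compared. Table names are hypothetical.
REFERENCE_CASES = [
    ("How many orders were cancelled?",
     "SELECT COUNT(*) FROM orders WHERE status = 'cancelled'"),
]

def evaluate(generate_sql, conn: sqlite3.Connection) -> list[dict]:
    """Run each reference case through the model and flag discrepancies."""
    flagged = []
    for question, reference_sql in REFERENCE_CASES:
        expected = conn.execute(reference_sql).fetchall()
        candidate_sql = generate_sql(question)  # your model call goes here
        try:
            actual = conn.execute(candidate_sql).fetchall()
        except sqlite3.Error as exc:
            flagged.append({"question": question, "error": str(exc)})
            continue
        # Order-insensitive comparison; real pipelines often need type
        # coercion and float tolerance as well.
        if sorted(expected) != sorted(actual):
            flagged.append({"question": question, "sql": candidate_sql,
                            "expected": expected, "actual": actual})
    return flagged  # discrepancies go to human review
```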

Failure Mode 3: Scope Creep Under Adversarial Prompting

Users will inevitably ask questions your system was not designed to answer. Some will do so intentionally (curiosity, testing limits). Some will do so inadvertently (they do not know what data is available).

The result is queries that try to access tables outside the intended scope, join across systems in ways that violate data governance rules, or retrieve PII that should not be exposed to the requesting user.

Fix: Implement schema-level access control as a hard constraint, not a model instruction. The model should only ever see the tables it is authorized to use. Use a schema filtering layer that restricts the context provided to the model based on the requesting user's role. Do not rely on prompt instructions alone to enforce access boundaries.
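
As a sketch, the filtering and enforcement layers might look like this. The role-to-table mapping is invented, and the token scan is a deliberately crude stand-in for a proper SQL parser:

```python
# Hypothetical role-to-table mapping. Enforced in code as a hard constraint,
# never expressed as a prompt instruction.
ROLE_TABLE_ACCESS: dict[str, set[str]] = {
    "analyst": {"orders", "customers"},
    "finance": {"orders", "customers", "invoices", "payments"},
}

def filter_schema_context(context_by_table: dict[str, str], role: str) -> dict[str, str]:
    """Return only the schema entries the requesting user's role may see.
    The model never receives context for tables outside this set."""
    allowed = ROLE_TABLE_ACCESS.get(role, set())
    return {name: desc for name, desc in context_by_table.items() if name in allowed}

def assert_query_in_scope(sql: str, role: str) -> None:
    """Defense in depth: reject generated SQL that mentions out-of-scope tables.
    This crude token scan is illustrative; a production check should extract
    table references with a real SQL parser (e.g. the sqlglot library)."""
    allowed = ROLE_TABLE_ACCESS.get(role, set())
    known_tables = set().union(*ROLE_TABLE_ACCESS.values())
    tokens = {tok.strip('",;()').lower() for tok in sql.split()}
    out_of_scope = (tokens & known_tables) - allowed
    if out_of_scope:
        raise PermissionError(f"Query references unauthorized tables: {out_of_scope}")
```

Running both layers means a prompt injection that talks the model into writing an out-of-scope query still fails at execution time.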

Failure Mode 4: Brittle Prompt Engineering

Early NL-to-SQL implementations often work well in development and degrade in production as the prompt becomes a long, fragile concatenation of examples, instructions, and schema information.

When you add a new table, you update the prompt. When a column changes, you update the prompt. Eventually the prompt is 4,000 tokens of carefully ordered text that breaks whenever anything changes.

Fix: Separate concerns. The schema context should be dynamically constructed from a managed schema registry, not embedded directly in the prompt. Examples should be stored in a retrieval system and selected based on semantic similarity to the incoming question, not hardcoded. Your prompt template should be minimal and stable.
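
A sketch of what that separation might look like. String similarity stands in here for the embedding-based retrieval a real system would use, and the stored examples are invented:

```python
import difflib

# The template itself stays minimal and stable; everything volatile
# (schema, examples) is injected at request time.
PROMPT_TEMPLATE = """You translate analyst questions into SQL for the tables below.

Schema:
{schema_context}

Similar solved examples:
{examples}

Question: {question}
SQL:"""

# Stored (question, SQL) pairs, retrieved per request rather than hardcoded
# into the prompt. Contents are hypothetical.
EXAMPLE_STORE = [
    ("How many orders were placed last week?",
     "SELECT COUNT(*) FROM orders WHERE created_at >= date('now', '-7 days')"),
    ("Which customers cancelled orders this month?",
     "SELECT DISTINCT c.name FROM customers c "
     "JOIN orders o ON o.customer_id = c.id WHERE o.status = 'cancelled'"),
]

def top_k_examples(question: str, k: int = 3) -> list[tuple[str, str]]:
    """Select the k most similar stored examples. SequenceMatcher is a
    stand-in for semantic similarity over embeddings."""
    scored = sorted(
        EXAMPLE_STORE,
        key=lambda ex: difflib.SequenceMatcher(None, question, ex[0]).ratio(),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str, schema_context: str) -> str:
    """Assemble the prompt at request time from registry output and retrieval."""
    examples = "\n\n".join(f"Q: {q}\nSQL: {sql}"
                           for q, sql in top_k_examples(question))
    return PROMPT_TEMPLATE.format(schema_context=schema_context,
                                  examples=examples, question=question)
```

With this split, adding a table touches the registry, adding an example touches the store, and the template itself rarely changes.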

Failure Mode 5: No Observability

You ship the pipeline. Analysts start using it. Six months later you have no idea which queries are being asked, which ones are failing, which ones are producing results that users find useful, or how usage has changed over time.

Without observability, you cannot improve the system. You cannot catch regressions. You cannot demonstrate value.

Fix: Log everything at query time: the natural language input, the generated SQL, the execution result (success or error), response latency, and a user feedback signal (even a simple thumbs up/down). Build a dashboard that surfaces usage trends, error rates, and feedback distribution. Review it weekly.
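
A minimal sketch of query-time logging as append-only JSONL; the field names, file path, and run_query callable are illustrative, not a standard:

```python
import json
import time
import uuid
from datetime import datetime, timezone

def log_query(nl_input, generated_sql, run_query, feedback=None,
              log_path="nl2sql_queries.jsonl"):
    """Execute a generated query and write one log record per invocation."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "nl_input": nl_input,
        "generated_sql": generated_sql,
        "feedback": feedback,  # e.g. "up", "down", or None if not yet given
    }
    start = time.monotonic()
    try:
        rows = run_query(generated_sql)  # assumed to return a list of rows
        record.update(status="success", row_count=len(rows))
    except Exception as exc:
        rows = None
        record.update(status="error", error=str(exc))
    record["latency_ms"] = round((time.monotonic() - start) * 1000)
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")  # append-only JSONL feeds the dashboard
    return rows
```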

The Common Thread

These failure modes share a root cause: NL-to-SQL was treated as a model problem when it is actually a systems problem. The model is one component. The schema context, access controls, evaluation layer, and observability infrastructure are the other components, and they matter just as much.

Get the model right and get everything else wrong, and you will have a system that is impressive in a demo and unreliable in production.

Get the systems right and the model becomes much easier to improve, swap out, and maintain over time.
