LLMs for Document Processing: What Actually Works (and What Breaks)

A practical look at using LLMs for document processing in production systems. What actually works, why tables destroy pipelines, and the hybrid architectures that survive.

Snehasish Konger

Founder & CEO

Technical Guide


Everyone wants to rip out their legacy OCR pipelines right now. The pitch across teams is almost always the same. You take the messy PDFs, throw them at an LLM, prompt it for JSON, and you're done.

It works perfectly on the demo. The first ten test documents parse flawlessly.

Then you run a real production batch of ten thousand files. This is where things usually go wrong. You start seeing the actual limits of what language models can do with spatial data.

Here is what is actually working in production systems right now, and what just breaks.

Where the models actually work

LLMs are incredibly good at fuzzy extraction from unstructured blocks of text.

If a contract has a termination clause buried on page 14 under an unpredictable heading, an LLM will almost always find it and pull out the effective date. Old rule-based systems break the second the legal team changes a comma or uses a synonym. LLMs don't care about rigid keyword matching.

Classification is also solid. Routing an inbound document to the right processing queue based on its overall contents works reliably.
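A minimal sketch of that routing step. The key design choice is constraining the model to a closed label set and sending anything off-menu to human review; `call_llm` here is a stand-in for whatever model client you actually use, and the queue names are illustrative:

```python
# Route a document to a processing queue using an LLM classifier.
# Anything outside the known label set falls back to human review.
QUEUES = {"invoice", "contract", "correspondence"}
DEFAULT_QUEUE = "manual_review"

def route_document(call_llm, text):
    """Classify into a closed label set. `call_llm` is a placeholder
    for your actual model client (takes a prompt, returns a string)."""
    prompt = (
        "Classify this document as exactly one of: "
        + ", ".join(sorted(QUEUES))
        + ". Reply with the label only.\n\n"
        + text[:2000]  # a truncated excerpt is usually enough for routing
    )
    label = call_llm(prompt).strip().lower()
    return label if label in QUEUES else DEFAULT_QUEUE
```

The fallback queue is what makes this production-safe: when the model invents a label, the document lands in front of a human instead of vanishing into the wrong pipeline.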

Summarization is mostly fine, assuming you don't need strict factual audits of the summary.

That’s pretty much it for the reliable parts.

Where the pipelines fall apart

Tables.

Tables are an absolute nightmare. Teams drastically underestimate this part.

An LLM processes tokens linearly. It doesn't "see" grid lines the way a human does. Even the newer vision-language models still hallucinate heavily on dense, multi-page financial tables. If a cell spans two columns, or a row wraps awkwardly to the next page, the LLM will scramble the alignment. You end up with dollar amounts assigned to the wrong line items.
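One cheap defense is to cross-check the extracted rows against a figure the document states about itself, such as the invoice total. A scrambled row almost never sums correctly, so the check catches most alignment errors without re-running the model. A sketch, assuming each extracted row carries an `amount` field (field names are illustrative):

```python
from decimal import Decimal

def validate_line_items(line_items, stated_total):
    """Cross-check extracted line items against the document's own
    stated total. Misaligned extractions rarely sum correctly, so
    this is a cheap sanity gate before the data goes downstream."""
    extracted_sum = sum(Decimal(item["amount"]) for item in line_items)
    return extracted_sum == Decimal(stated_total)

items = [{"amount": "100.00"}, {"amount": "250.50"}]
validate_line_items(items, "350.50")  # → True
validate_line_items(items, "351.50")  # → False: send to human review
```

Using `Decimal` instead of floats matters here; float rounding would produce false mismatches on perfectly good extractions.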

Then there's the parsing problem. You prompt the model to return a strict JSON object. No markdown, no intro text, just the data.

But every thousand requests, the model decides to be helpful and outputs "Here is the JSON you requested:" right before the payload. Your downstream parser crashes. You end up writing regex just to clean up the LLM's output before your system can even read it (this usually breaks when rules start overlapping or the model gets a silent backend update).
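That cleanup step is often as crude as locating the outermost braces before parsing. A defensive sketch, not a full solution; it assumes a single top-level JSON object in the output:

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Salvage a JSON object from model output that may include a
    conversational preamble or markdown fences. Assumes exactly one
    top-level object; nested braces inside it are fine."""
    # Strip markdown code fences if the model added them.
    cleaned = re.sub(r"```(?:json)?", "", raw)
    # Parse only the span between the first "{" and the last "}".
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(cleaned[start : end + 1])

raw = 'Here is the JSON you requested:\n{"invoice_id": "INV-204", "total": "350.50"}'
parse_llm_json(raw)  # → {'invoice_id': 'INV-204', 'total': '350.50'}
```

This is exactly the kind of regex scaffolding the article warns about: it works until the model's failure mode changes, so keep it small and log every time it fires.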

The context window trap is another massive friction point.

Say you have a 300-page technical manual. You can't just shove it all in the prompt. Even if the model supports a massive context window, it costs too much per query and the latency spikes to 40 seconds.

So teams chunk it. They split the document into pages. But now the LLM is looking at page 45 and doesn't know what acronym was defined on page 2. Cross-document references fail entirely. You lose the global context.
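A common mitigation is overlapping chunks, so each chunk carries a few pages of its predecessor's context. It softens the lost-definitions problem but does not solve it; an acronym defined on page 2 is still invisible from page 45. A sketch (assumes overlap is smaller than the chunk size):

```python
def chunk_with_overlap(pages, chunk_size=10, overlap=2):
    """Split a list of pages into chunks where each chunk repeats
    the last `overlap` pages of the previous one. Requires
    overlap < chunk_size."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(pages), step):
        chunks.append(pages[start : start + chunk_size])
        if start + chunk_size >= len(pages):
            break
    return chunks

chunks = chunk_with_overlap(list(range(20)))
# Adjacent chunks share their boundary pages:
chunks[0][-2:] == chunks[1][:2]  # → True
```

The tunable here is the overlap-versus-cost trade: every overlapping page is tokens you pay for twice.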

The architecture that survives

The patterns that actually survive in production aren't pure LLM pipelines. They end up looking like a messy hybrid of old and new tech.

You still use standard, deterministic OCR tools to get the text and the layout coordinates. You use regex or standard code to grab the highly predictable stuff—dates, standardized headers, explicit invoice numbers.

You only invoke the LLM for the messy, unstructured fields that regex can't handle.
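The deterministic first pass looks something like this. The patterns below are illustrative; real documents need patterns tuned to your own templates, and anything the regexes miss is what gets routed to the model:

```python
import re

# Grab the predictable fields deterministically before any model call.
# These patterns are examples, not production-grade extractors.
INVOICE_RE = re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*([A-Z]{2,4}-\d+)")
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")

def extract_predictable_fields(text):
    """Return whatever the deterministic pass can find. Missing keys
    signal which fields still need the (expensive) LLM pass."""
    fields = {}
    if m := INVOICE_RE.search(text):
        fields["invoice_number"] = m.group(1)
    if m := DATE_RE.search(text):
        fields["date"] = m.group(1)
    return fields

extract_predictable_fields("Invoice No: INV-2041 dated 2024-03-15")
# → {'invoice_number': 'INV-2041', 'date': '2024-03-15'}
```

The payoff is debuggability: when a regex misses, you can read the pattern and the text side by side and know exactly why.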

And you always wrap that LLM call in a retry loop.
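That retry loop is mostly shape validation: parse the output, check the keys you need are present, and re-ask on failure. A sketch, with `call_llm` again standing in for your real client:

```python
import json

def extract_with_retry(call_llm, prompt, required_keys, max_attempts=3):
    """Call the model, validate the JSON shape, retry on failure.
    `call_llm` is a placeholder: takes a prompt, returns a string."""
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output, ask again
        if all(key in data for key in required_keys):
            return data
        # parsed fine but missing fields: also worth a retry
    raise RuntimeError(f"extraction failed after {max_attempts} attempts")
```

Raising after the last attempt, rather than returning a partial result, is deliberate: a loud failure lands in a dead-letter queue; a quiet partial result corrupts downstream data.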

It’s not as clean as the demos make it look. But you can actually maintain it. Debugging a regex rule that missed a field is annoying. Debugging a non-deterministic black box that hallucinated an extra zero on an invoice—just because the previous word was "approximately"—is nearly impossible.

FAQ


01

Can't we just use a massive context window model for everything?

Latency and compute costs. Shoving a 300-page manual into an LLM might work technically, but it takes 40 seconds to return. The API bills pile up. It's just not practical for batch processing tens of thousands of files.

02

Do vision models fix the table extraction problem?
Not reliably. Vision-language models handle layout better than text-only models, but they still hallucinate on dense, multi-page financial tables. Cells that span columns and rows that wrap across pages still scramble the alignment, so you need validation on whatever they return.

03

What breaks first in production?
Usually output formatting. The model intermittently wraps its JSON in conversational text and your downstream parser crashes. Misaligned table extractions are the next most common failure, and they are worse because nothing crashes; the wrong numbers just flow through.

04

How are teams handling giant documents then?
Mostly by chunking: splitting the document into pages or sections and processing each piece separately. It works, but cross-document references fail and definitions from early pages get lost, so the chunks need to carry some shared context.

05

So is traditional OCR actually dead?
No. The pipelines that survive still use deterministic OCR for text and layout coordinates, and regex for the predictable fields. The LLM is reserved for the messy, unstructured parts those tools can't handle.