Measuring OCR accuracy for a 1994 service manual RAG


A torque figure with a missing digit is not a typo. It is a mechanic over-tightening a brake bolt. When you turn a scanned service manual into a RAG chatbot, OCR fidelity stops being a quality nicety and becomes a safety property.

The chatbot in question reads a 1994 Yamaha XV250 Virago service manual: 291 pages, scanned to an image-only PDF, no text layer at all. It is live here: virago.edestudio.us. Ask it about valve clearances or jet sizes and it answers from the OCR’d corpus, reading the retrieved chunk text verbatim. Whatever the OCR got wrong is what the rider is told.

Context: Chunking and embedding technical documentation for RAG · live manual chat: virago.edestudio.us


A spec manual is mostly numbers: torque settings, valve clearances, fluid capacities, jet sizes, tyre pressures. An OCR error in prose is annoying. An OCR error in a figure is dangerous, and it poisons RAG twice over:

  • Retrieval. Garbled text embeds into the wrong neighbourhood, so the right chunk never gets retrieved.
  • Generation. The model reads the retrieved chunk verbatim. There is no second source to cross-check.

So “good enough to read” is not the bar. The bar is “faithful enough to act on.” That is the thing I wanted to measure, not guess at.

That page is the whole problem in one image. It is a picture. Every digit on it has to be recovered by an OCR engine before the chatbot can read it back to you, and the engine choice is where the fidelity is won or lost.


I had three ways to turn that scan into text, and they are easy to confuse, so here they are plainly:

| Version | Pipeline | What it produces |
|---|---|---|
| v1 | Tesseract OCR (--oem 1 --psm 1) | A searchable PDF: the page image with an invisible Tesseract text layer overlaid |
| v2 | Docling running its own EasyOCR, plus table-structure | Markdown. This is the corpus currently indexed and live on the site |
| v3 | Docling with OCR turned off, reusing the v1 Tesseract layer | Markdown, but built from Tesseract text instead of EasyOCR |

v3 is the interesting one and it came out of writing this post. v1 has good characters but no structure (it is a flat PDF text layer). v2 has structure but EasyOCR damage. So the obvious question: what if Docling does the layout and table work but reads the text that Tesseract already put in the PDF, instead of re-running OCR with EasyOCR?

The change is one flag:

from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions

# v2 (live): Docling runs its own EasyOCR over the page images
ocr = EasyOcrOptions(force_full_page_ocr=True)
po = PdfPipelineOptions(do_ocr=True, ocr_options=ocr, do_table_structure=True)

# v3: Docling reuses the Tesseract text already embedded in the PDF
po = PdfPipelineOptions(do_ocr=False, do_table_structure=True)

No vibes. I built a ground-truth set by reading the scanned spec, maintenance and torque pages by eye and transcribing 100 known values across 14 categories: model codes, dimensions, weights, capacities, engine specs, valve and cam internals, clearances, ratios, pressures, carburetor jets, suspension, brakes, electrical bulbs and torque settings.

The metric is deliberately strict: exact match. For each value, does its correct rendering, digits and unit together, appear in that corpus? Pass or fail. 249 cm³ is not “95 percent right” when it comes out as 249 cm'. It is wrong. Edit-distance scoring would have flattered every engine and hidden exactly the damage that matters.

Matching normalises whitespace only. It never changes a digit or a unit glyph, so a value that is wrong in a corpus stays wrong in the score.
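A minimal sketch of what such a matcher could look like, assuming a plain substring test after collapsing whitespace. The names `normalise_ws` and `exact_match` are illustrative, not taken from the repo, and the real matcher may differ. The `difflib` ratio is there only to show how a similarity score would flatter the damaged value.

```python
import re
from difflib import SequenceMatcher


def normalise_ws(text: str) -> str:
    """Collapse every run of whitespace to one space; touch nothing else."""
    return re.sub(r"\s+", " ", text).strip()


def exact_match(value: str, corpus: str) -> bool:
    """True only if the value, digits and unit together, appears in the corpus."""
    return normalise_ws(value) in normalise_ws(corpus)


# Two toy corpus snippets: same digits, different unit glyph.
good = "Displacement:  249\ncm³"
bad = "Displacement:  249\ncm'"

print(exact_match("249 cm³", good))  # whitespace differences are forgiven
print(exact_match("249 cm³", bad))   # a wrong unit glyph is not

# A similarity score rates the damaged rendering as mostly right.
similarity = SequenceMatcher(None, "249 cm³", "249 cm'").ratio()
print(round(similarity, 2))
```

One caveat of the substring assumption: a short value can match inside a longer one (49 cm³ inside 249 cm³), so ground-truth values need to be specific enough not to collide.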

Here is the whole experiment in one frame: 100 ground-truth values down the rows, grouped into their 14 categories, and the three pipelines across the columns. Green is an exact match, red is a value that came out wrong or never appeared. Read a column top to bottom and you are reading that pipeline’s entire report card.

The eye does the aggregation before any number is quoted: the middle column (the live corpus) is visibly redder than its neighbours, and the bottom band (torque) is red in all three. The rest of this post is just putting numbers on what that picture already shows.


| Corpus | Exact-match fidelity |
|---|---|
| v1 (Tesseract PDF) | 83% |
| v2 (live EasyOCR Markdown) | 61% |
| v3 (Tesseract + Docling, no EasyOCR) | 82% |

The corpus currently serving the chatbot is the worst of the three. And v3 recovers almost all of the gap: reusing the Tesseract layer brings fidelity to within one point of the v1 ceiling (82 against 83 percent) while keeping the Markdown structure that retrieval wants. That is a 21-point swing from a single pipeline flag.

The swing is not uniform. This is the per-category difference between v3 and the live v2 corpus, in percentage points. Positive (green) is where dropping EasyOCR for the Tesseract layer recovers values; negative (red) is the small price paid for it.

Two categories go backwards. Dimensions loses 14 points because EasyOCR happened to read one ground-clearance figure that Tesseract fumbled, and torque loses 30 because v2’s three lucky torque matches do not survive into v3. Everything else is a recovery or a tie, and the model-code and maintenance-table gains are the bulk of the 21-point overall lift.
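The per-category swing is simple subtraction, and a short sketch makes it checkable. The percentages below are copied from the per-category table further down in this post; the variable names are illustrative only.

```python
# Exact-match fidelity per category in percent: (v2 live, v3).
# Figures copied from the per-category table in this post.
fidelity = {
    "model_code": (20.0, 100.0),
    "capacities": (40.0, 100.0),
    "clearances": (40.0, 100.0),
    "engine_internal": (52.6, 100.0),
    "brakes": (70.0, 100.0),
    "carburetor": (88.9, 100.0),
    "electrical": (80.0, 100.0),
    "chassis": (75.0, 87.5),
    "dimensions": (100.0, 85.7),
    "ratios": (100.0, 100.0),
    "engine": (50.0, 50.0),
    "pressures": (50.0, 50.0),
    "weights": (50.0, 50.0),
    "torque": (30.0, 0.0),
}

# Difference in percentage points, v3 minus v2.
deltas = {cat: round(v3 - v2, 1) for cat, (v2, v3) in fidelity.items()}

# Categories where dropping EasyOCR made things worse.
regressions = sorted(cat for cat, d in deltas.items() if d < 0)

for cat, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cat:16s} {d:+6.1f}")
print("regressions:", regressions)
```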


The overall number hides the interesting part. Broken down by category, the failures are not spread evenly. They cluster exactly where EasyOCR struggles: model codes, and dense multi-row maintenance tables.

The fastest way to see the structure is a heatmap. Rows are the 14 categories (with their value counts), columns are the three pipelines, and the colour is exact-match fidelity from red (0 percent) to green (100). The v1 and v3 columns are nearly identical walls of green; v2 is where the red and amber appear.

Two patterns jump out of the colour alone. The vertical amber-and-red stripe down the v2 column is the EasyOCR tax. The horizontal red band across the torque row, present in every column, is a different problem entirely, one no OCR engine touches. Hold that second pattern; it is the subject of the honesty section below. Here is the same data as a grouped bar chart, if you prefer reading heights to colours:

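| Category | Values | v1 | v2 (live) | v3 |
|---|---|---|---|---|
| model_code | 5 | 100% | 20% | 100% |
| capacities | 5 | 100% | 40% | 100% |
| clearances | 5 | 100% | 40% | 100% |
| engine_internal | 19 | 100% | 52.6% | 100% |
| brakes | 10 | 100% | 70% | 100% |
| carburetor | 9 | 100% | 88.9% | 100% |
| electrical | 5 | 100% | 80% | 100% |
| chassis | 8 | 87.5% | 75% | 87.5% |
| dimensions | 7 | 85.7% | 100% | 85.7% |
| ratios | 5 | 100% | 100% | 100% |
| engine | 4 | 50% | 50% | 50% |
| pressures | 4 | 50% | 50% | 50% |
| weights | 4 | 50% | 50% | 50% |
| torque | 10 | 10% | 30% | 0% |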
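As a sanity check, the per-category rows should add back up to the headline figures. The sketch below assumes pass counts back-derived from the percentages (for example, 52.6 percent of 19 values is 10 passes); those counts are my reconstruction, not numbers quoted from the test output.

```python
# (values, v1 passes, v2 passes, v3 passes) per category.
# Pass counts are back-derived from the table's percentages (an assumption).
counts = {
    "model_code": (5, 5, 1, 5),
    "capacities": (5, 5, 2, 5),
    "clearances": (5, 5, 2, 5),
    "engine_internal": (19, 19, 10, 19),
    "brakes": (10, 10, 7, 10),
    "carburetor": (9, 9, 8, 9),
    "electrical": (5, 5, 4, 5),
    "chassis": (8, 7, 6, 7),
    "dimensions": (7, 6, 7, 6),
    "ratios": (5, 5, 5, 5),
    "engine": (4, 2, 2, 2),
    "pressures": (4, 2, 2, 2),
    "weights": (4, 2, 2, 2),
    "torque": (10, 1, 3, 0),
}

total_values = sum(row[0] for row in counts.values())
passes = {
    name: sum(row[i] for row in counts.values())
    for i, name in enumerate(("v1", "v2", "v3"), start=1)
}
overall = {name: 100 * p / total_values for name, p in passes.items()}

print(total_values, overall)
```

Because the ground-truth set has exactly 100 values, each pipeline's pass count and its overall percentage are the same number.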

The 19-value engine_internal category is the clearest signal: valve, cam, cylinder and rocker dimensions are 100 percent in v1 and v3, and 52.6 percent in the live v2 corpus. Those are tightly packed maintenance tables, and the EasyOCR-into-Docling pipeline crams and mangles them. This is the kind of page those 19 values come off, a valve and valve-guide specification table with IN and EX columns, limits, and four-digit millimetre figures packed two and three to a row:

Every figure on that page is a four-digit value sitting next to another four-digit value, and the only thing that separates 6.975 ~ 6.990 from the EX column beside it is its horizontal position. That is exactly the layout EasyOCR-into-Docling collapses, and exactly why engine_internal nearly halves in the live corpus.

Here is what “EasyOCR damage” actually looks like at the character level, true value against what the live corpus produced:

| True value | v2 (live) reads | Failure |
|---|---|---|
| XV250U | XV2SOU | digit to letter: 5 to S, 0 to O |
| 249 cm³ | 249 cm' | superscript dropped |
| 11 kg/cm² | 11 kg/cm? | superscript to question mark |
| 302 lb | 302 Ib | l to capital I |
| 65W/60W | 65W/6oW | 0 to lowercase o |
| 1st 2nd 3rd 4th | Ist 2nd 3rd Ath | ordinals mangled |
| 0.6 ~ 0.7 mm | 0.6 0.7 mm | range separator dropped |

The colour coding in that figure is the whole diagnostic. The blue failures are character-level: the engine read the right cell but mapped a glyph wrong, and a better OCR engine fixes them. The red one is layout-level: the digits are all correct but Docling’s table model scattered them, and a better OCR engine does nothing for it. The model code one is not cosmetic either. A rider asking the chatbot to confirm their model gets told XV2SOU, which is not a real Yamaha code.
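The character-level signatures in the table above are regular enough to scan for. What follows is a hypothetical lint, not part of the pipeline described here: the patterns are hand-written from those few examples, and a hit only marks text as worth checking against the scan.

```python
import re

# Hypothetical rules, one per glyph confusion seen in the table above.
SUSPECT_PATTERNS = {
    "digit followed by a look-alike letter": re.compile(r"\d[SOo]"),
    "superscript read as apostrophe or ?": re.compile(r"\bcm['?]"),
    "lb read as Ib": re.compile(r"\bIb\b"),
    "mangled ordinal": re.compile(r"\b(?:Ist|Ath)\b"),
}


def suspect_reasons(text: str) -> list[str]:
    """Return the name of every rule that fires on the text."""
    return [name for name, pat in SUSPECT_PATTERNS.items() if pat.search(text)]


for sample in ["XV2SOU", "249 cm'", "11 kg/cm?", "302 Ib", "65W/6oW", "XV250U"]:
    print(f"{sample!r}: {suspect_reasons(sample)}")
```

A lint like this cannot see the dropped range separator (0.6 0.7 mm contains no bad glyph), and it says nothing about layout damage, so it complements the ground-truth test rather than replacing it.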


This is the part the headline number would let you skip. v1 and v3 are clearly better than the live corpus, but they are not clean. Three categories sit at 50 percent across all three pipelines, and torque sits near the floor for everyone.

  • Superscript units fail everywhere. cm³ and kg/cm² come out as cm' and kg/cm? in every pipeline, because Tesseract misreads them too. v3 inherits that ceiling.
  • lb becomes Ib everywhere. Both engines read the lowercase L as a capital I.
  • Torque tables are scrambled in all three. The usable unit of a torque spec is the triple (58 Nm, 5.8 m-kg, 42 ft-lb) tied to its bolt. Docling’s table model crams those columns, so the triple survives almost nowhere. v3 still inherits this, because the cramming is the table-structure model’s doing, not the OCR engine’s.

This is the page the torque values come from, and you can see why a table model struggles with it. Sixty-odd rows, each a part name and a thread size followed by the same value printed three ways (Nm, then m-kg, then ft-lb), with a “Remarks” column that is empty for most rows and the page itself split into two stacked sub-tables:

The information a mechanic needs (this bolt, this torque) lives in the horizontal adjacency of four cells. When the table model mis-maps a single column boundary, the triple that should read 58 / 5.8 / 42 next to “Front wheel axle” gets split across rows, and the value the chatbot retrieves is no longer tied to its bolt. That is a layout failure, not a reading failure, which is why every OCR engine here inherits it.
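That adjacency requirement can be written as a check. The sketch below assumes the corpus holds the table as pipe-delimited Markdown rows and that the rows are simplified to a part name plus three values; `triple_on_one_row` is a hypothetical helper, not the repo's matcher.

```python
def triple_on_one_row(markdown: str, part: str, triple: tuple[str, str, str]) -> bool:
    """True if one table row names the part and carries all three values in order."""
    for line in markdown.splitlines():
        # Split a pipe-delimited row into trimmed cell strings.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if part not in cells:
            continue
        # Keep only the cells that are one of the expected values, in row order.
        values = [c for c in cells if c in triple]
        if values == list(triple):
            return True
    return False


triple = ("58", "5.8", "42")  # Nm, m-kg, ft-lb

# Toy rows: the first keeps the triple beside its bolt, the second splits it.
intact = "| Front wheel axle | 58 | 5.8 | 42 |"
split = "| Front wheel axle | 58 |\n| 5.8 | 42 | |"

print(triple_on_one_row(intact, "Front wheel axle", triple))
print(triple_on_one_row(split, "Front wheel axle", triple))
```

In the split case every digit is still present in the corpus, which is why a per-value match can pass while the spec a mechanic actually needs is gone.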

So the win from v3 is specific and worth naming precisely: it fixes the character-level damage (model codes, dropped separators, glyph swaps) by using a better OCR engine’s text. It does not fix the layout-level damage (crammed torque tables), because that is Docling, not EasyOCR. Knowing which failure belongs to which stage is the whole point of testing instead of assuming.


The takeaway is small and practical: the OCR engine you skip matters more than the one you run. The live chatbot is serving the weakest corpus, and a better one already exists as a byproduct of the Tesseract pass that was done for a different reason. Promoting v3 (section-split the Markdown, re-index the Cloudflare AI Search instance) would lift the corpus from 61 to 82 percent fidelity on the values that matter, at the cost of one re-index.

The torque tables need a separate fix, probably a layout-aware pass or hand-correction of the tightening-torque pages, because no OCR swap will un-cram them. And until the superscript and torque problems are solved, the chatbot’s system prompt earns its keep: it is told to flag OCR-ambiguous, safety-critical values and advise checking against the PDF.

The test itself is in the repo: a ground-truth JSON, a strict matcher, and a script that scores all three corpora and draws these charts. When the corpus changes, the number moves, and I can see it rather than hope.
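For readers without the repo open, here is a rough sketch of how those three pieces could fit together. The JSON schema (a list of category/value objects), the category assignments and the two toy corpora are assumptions made for illustration; the real files will look different, and the chart drawing is left out.

```python
import json
import re
from collections import defaultdict

# Hypothetical ground-truth format: one object per hand-transcribed value.
GROUND_TRUTH_JSON = """
[
  {"category": "model_code", "value": "XV250U"},
  {"category": "engine", "value": "249 cm³"},
  {"category": "weights", "value": "302 lb"}
]
"""

# Toy stand-ins for the corpora; real ones would be read from disk.
CORPORA = {
    "clean": "Model code: XV250U\nDisplacement: 249 cm³\nWeight: 302 lb",
    "damaged": "Model code: XV2SOU\nDisplacement: 249 cm'\nWeight: 302 Ib",
}


def squash(text: str) -> str:
    """Normalise whitespace only; digits and unit glyphs are left alone."""
    return re.sub(r"\s+", " ", text).strip()


def score(ground_truth: list, corpora: dict) -> dict:
    """Return {corpus: {category: (passed, total)}} under strict matching."""
    results = {}
    for name, text in corpora.items():
        haystack = squash(text)
        tally = defaultdict(lambda: [0, 0])
        for item in ground_truth:
            tally[item["category"]][1] += 1
            if squash(item["value"]) in haystack:
                tally[item["category"]][0] += 1
        results[name] = {cat: tuple(pair) for cat, pair in tally.items()}
    return results


results = score(json.loads(GROUND_TRUTH_JSON), CORPORA)
for name, per_category in results.items():
    print(name, per_category)
```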

Go break it yourself: virago.edestudio.us. Ask for a torque setting and a model code, then check the answer against the scan. That gap, in one question, is what this whole post is about.
