Measuring OCR accuracy for a 1994 service manual RAG

A torque figure with a missing digit is not a typo. It is a mechanic over-tightening a brake bolt. When you turn a scanned service manual into a RAG chatbot, OCR fidelity stops being a quality nicety and becomes a safety property.

The chatbot in question reads a 1994 Yamaha XV250 Virago service manual: 291 pages, scanned to an image-only PDF, no text layer at all. It is live here: virago.edestudio.us. Ask it about valve clearances or jet sizes and it answers from the OCR’d corpus, reading the retrieved chunk text verbatim. Whatever the OCR got wrong is what the rider is told.

Context: Chunking and embedding technical documentation for RAG · live manual chat: virago.edestudio.us

Why the numbers are the product

A spec manual is mostly numbers: torque settings, valve clearances, fluid capacities, jet sizes, tyre pressures. An OCR error in prose is annoying. An OCR error in a figure is dangerous, and it poisons RAG twice over:

Retrieval. Garbled text embeds into the wrong neighbourhood, so the right chunk never gets retrieved.
Generation. The model reads the retrieved chunk verbatim. There is no second source to cross-check.

So “good enough to read” is not the bar. The bar is “faithful enough to act on.” That is the thing I wanted to measure, not guess at.

A scanned page from the 1994 Yamaha XV250 Virago service manual showing the General Specifications table for the XV250U and XV250UC models. Model codes, dimensions, weights, engine displacement, oil capacities and fuel tank figures are laid out in a two-column table. The page is a grayscale 1994 photocopy-quality scan with no selectable text.

That page is the whole problem in one image. It is a picture. Every digit on it has to be recovered by an OCR engine before the chatbot can read it back to you, and the engine choice is where the fidelity is won or lost.

Four local pipelines

The manual arrives as a pure image scan, so something has to read the pixels. These are local pipelines on purpose: no document leaves the machine, no hosted API, no per-page cost. I had four ways to turn the scan into text, and they are easy to confuse, so here they are plainly:

Pipeline	What it produces
Docling + EasyOCR	Markdown. Docling runs its own EasyOCR plus table-structure. This was the corpus first indexed and live on the site.
Tesseract PDF	A searchable PDF: the page image with an invisible Tesseract text layer overlaid (`--oem 1 --psm 1`).
Docling + Tesseract layer	Markdown, but with Docling’s OCR turned off, reusing the Tesseract text layer from the PDF instead of EasyOCR.
Docling + RapidOCR	Markdown. Docling’s layout and table model with RapidOCR reading the pixels. This is what ships now.

Two of these are about the same idea from different angles. “Docling + Tesseract layer” came out of asking: what if Docling does the layout and table work but reads text Tesseract already put in the PDF, instead of re-running OCR with EasyOCR? That recovers a lot, but it depends on a Tesseract pass having been run first. “Docling + RapidOCR” asks the more direct question: keep Docling doing genuine OCR, but swap the weak EasyOCR engine for a stronger one. No pre-baked text layer required — which matters, because the next manuals in the queue (a Honda VFR750F, a Honda Monkey) are the same kind of image-only scan, and a pipeline that depends on a separate Tesseract step is a pipeline with an extra thing to forget.

The OCR-engine swap is one option object:

from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
    RapidOcrOptions,
)

# live (worst): Docling runs its own EasyOCR over the page images
ocr = EasyOcrOptions(force_full_page_ocr=True)
po = PdfPipelineOptions(do_ocr=True, ocr_options=ocr, do_table_structure=True)

# shipped: Docling runs RapidOCR over the same page images
ocr = RapidOcrOptions(backend="torch")
po = PdfPipelineOptions(do_ocr=True, ocr_options=ocr, do_table_structure=True)

# the no-OCR variant: reuse the Tesseract layer already in the PDF
po = PdfPipelineOptions(do_ocr=False, do_table_structure=True)

How I tested it

No vibes. I built a ground-truth set by reading the scanned spec, maintenance and torque pages by eye and transcribing 100 known values across 14 categories: model codes, dimensions, weights, capacities, engine specs, valve and cam internals, clearances, ratios, pressures, carburetor jets, suspension, brakes, electrical bulbs and torque settings.

The metric is deliberately strict: exact match. For each value, does its correct rendering, digits and unit together, appear in that corpus? Pass or fail. 249 cm³ is not “95 percent right” when it comes out as 249 cm'. It is wrong. Edit-distance scoring would have flattered every engine and hidden exactly the damage that matters.

Matching normalises whitespace only. It never changes a digit or a unit glyph, so a value that is wrong in a corpus stays wrong in the score.
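A minimal sketch of what such a strict matcher could look like, assuming each corpus is loaded as one string. The function names (`normalise_ws`, `exact_match`, `fidelity`) are hypothetical and are not taken from the actual harness:

```python
import re


def normalise_ws(text: str) -> str:
    """Collapse every run of whitespace to one space; change nothing else."""
    return re.sub(r"\s+", " ", text).strip()


def exact_match(value: str, corpus: str) -> bool:
    """Pass only if the value, digits and unit together, appears verbatim."""
    return normalise_ws(value) in normalise_ws(corpus)


def fidelity(values: list[str], corpus: str) -> float:
    """Percentage of ground-truth values found by exact match."""
    hits = sum(exact_match(v, corpus) for v in values)
    return 100.0 * hits / len(values)


# Illustrative snippets only, built from failures quoted in this post.
good = "Displacement   249 cm³\nDry weight 302 lb"
bad = "Displacement 249 cm' Dry weight 302 Ib"

print(fidelity(["249 cm³", "302 lb"], good))
print(fidelity(["249 cm³", "302 lb"], bad))
```

One caveat of plain substring matching: a short value can match inside a longer one, so a real matcher may want token boundaries around each value. That would still leave every digit and unit glyph untouched.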

Here is the whole experiment in one frame: 100 ground-truth values down the rows, grouped into their 14 categories, and the four pipelines across the columns. Green is an exact match, red is a value that came out wrong or never appeared. Read a column top to bottom and you are reading that pipeline’s entire report card.

A tall pass/fail matrix. 100 rows, one per known value, banded into 14 labelled category groups (model codes, dimensions, weights, capacities, engine, engine internal, clearances, gear ratios, pressures, carburetor, chassis, brakes, electrical, torque). Four columns: EasyOCR, Tesseract, Tess+Docling, RapidOCR. Cells are green for an exact match and red for wrong or absent. The EasyOCR column has noticeably more red than the others, concentrated in model codes, capacities, clearances, engine internal and brakes. All four columns are mostly red in the torque band at the bottom, though the RapidOCR column shows a few more green cells there.

The eye does the aggregation before any number is quoted: the EasyOCR column is visibly redder than its neighbours, and the bottom band (torque) is red across all four. The rest of this post is just putting numbers on what that picture already shows.

The result

Bar chart titled OCR fidelity, 100 known values, local pipelines. Four bars: Docling+EasyOCR (live) at 61 percent, Tesseract PDF at 83 percent, Docling+Tesseract layer at 82 percent, and Docling+RapidOCR at 85 percent, the tallest bar.

Pipeline	Exact-match fidelity
Docling + EasyOCR (live)	61%
Tesseract PDF	83%
Docling + Tesseract layer	82%
Docling + RapidOCR	85%

The corpus that was first serving the chatbot is the worst of the four. Both fixes recover most of the gap, but Docling + RapidOCR is the most faithful overall at 85 percent, and it gets there while doing real OCR on the pixels rather than leaning on a separate Tesseract pass. That is the one that shipped.

The lift is not uniform. This is the per-category difference between RapidOCR and the live EasyOCR corpus, in percentage points. Every category recovers or holds; none go backwards.

Diverging horizontal bar chart titled Where swapping EasyOCR for RapidOCR recovers the live corpus. Bars show fidelity change in percentage points, RapidOCR minus EasyOCR, per category. Model codes plus 80, capacities plus 60, clearances plus 60, pressures plus 50, brakes plus 30, engine internal plus 26, engine plus 25, torque plus 20, chassis plus 13. Carburetor, dimensions, electrical, gear ratios and weights are zero. No category is negative.

The biggest wins are exactly where EasyOCR fell down: model codes (+80), and the dense maintenance tables — capacities and clearances (+60 each), pressures (+50), engine internal (+26). Even torque, the hardest category, moves up 20 points. Nothing regresses.
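The per-category lift is plain arithmetic on pass counts. Here is a small sketch using two categories whose counts can be derived from figures quoted in this post (engine internal: 19 values at 52.6 and 78.9 percent; torque: 10 values at 30 and 50 percent). The `results` layout and the helper names are assumptions for illustration, not the harness's real format:

```python
def pct(passed: int, total: int) -> float:
    """Exact-match fidelity as a percentage, rounded to one decimal."""
    return round(100.0 * passed / total, 1)


# (passed, total) per pipeline. Counts are back-derived from the quoted
# percentages; they are not copied from the harness output.
results = {
    "engine_internal": {"easyocr": (10, 19), "rapidocr": (15, 19)},
    "torque": {"easyocr": (3, 10), "rapidocr": (5, 10)},
}


def lift(category: dict, baseline: str = "easyocr", candidate: str = "rapidocr") -> float:
    """Fidelity change in percentage points, candidate minus baseline."""
    return round(pct(*category[candidate]) - pct(*category[baseline]), 1)


for name, category in results.items():
    print(name, lift(category))
```

A negative lift in any category would be a regression, which is the case the chart rules out.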

Where the damage hides

The overall number hides the interesting part. Broken down by category, the failures are not spread evenly. They cluster exactly where EasyOCR struggles: model codes, and dense multi-row maintenance tables.

The fastest way to see the structure is a heatmap. Rows are the 14 categories (with their value counts), columns are the four pipelines, and the colour is exact-match fidelity from red (0 percent) to green (100). The EasyOCR column is where the red and amber live; RapidOCR is the greenest column overall.

Heatmap titled fidelity heatmap, 14 categories by 4 local pipelines. Rows are categories with sample sizes, columns are Docling+EasyOCR, Tesseract PDF, Docling+Tesseract layer, Docling+RapidOCR. Cells coloured red to green by fidelity and annotated with the percentage. The EasyOCR column shows red at model codes (20) and torque (30), amber at capacities, clearances and engine internal, and lighter green at brakes (70) and electrical (80). The other three columns are mostly dark green; RapidOCR uniquely reaches 100 at pressures and is highest on torque at 50. The torque row is the reddest band across all four columns. Several rows, weights especially, sit at 50 across every column.

Two patterns jump out of the colour alone. The vertical amber-and-red stripe down the EasyOCR column is the EasyOCR tax. The horizontal red band across the torque row, present in every column, is a different problem entirely, one no OCR engine fully fixes. Hold that second pattern; it is the subject of the honesty section below. Here is the same data as a grouped bar chart, if you prefer reading heights to colours:

Grouped bar chart of OCR fidelity by value category, with Docling+EasyOCR, Tesseract PDF, Docling+Tesseract layer and Docling+RapidOCR bars for each of fourteen categories. EasyOCR sits far below the others on brakes, capacities, clearances, engine_internal and model_code, and is roughly level on engine, pressures, weights and ratios. RapidOCR is at or near the top of most categories and is the clear leader on pressures and torque.

The 19-value engine_internal category is a clear signal: valve, cam, cylinder and rocker dimensions are 100 percent in the Tesseract routes, 52.6 percent in the live EasyOCR corpus, and 78.9 percent in RapidOCR. Those are tightly packed maintenance tables, and EasyOCR plus Docling cram and mangle them. This is the kind of page those 19 values come off, a valve and valve-guide specification table with IN and EX columns, limits, and four-decimal millimetre figures packed two and three to a row:

A grayscale scan of the valve specification page from the manual. A dense maintenance table headed XV250U/UC lists valve head width, face width, seat width, stem diameter, guide inside diameter, stem-to-guide clearance and valve-spring free and set lengths, each with separate IN and EX columns and four-decimal millimetre values plus bracketed inch conversions. A small valve-spring diagram sits at the lower left.

Every figure on that page is four significant digits sitting next to another four-digit figure, and the only thing that tells 6.975 ~ 6.990 apart from the EX column beside it is its horizontal position. That is exactly the layout EasyOCR-into-Docling collapses, and exactly why engine_internal halves in the live corpus.

Here is what “EasyOCR damage” actually looks like at the character level, true value against what the live corpus produced:

True value	EasyOCR (live) reads	Failure
`XV250U`	`XV2SOU`	digit to letter: 5 to S, 0 to O
`249 cm³`	`249 cm'`	superscript dropped
`11 kg/cm²`	`11 kg/cm?`	superscript to question mark
`302 lb`	`302 Ib`	l to capital I
`65W/60W`	`65W/6oW`	0 to lowercase o
`1st 2nd 3rd 4th`	`Ist 2nd 3rd Ath`	ordinals mangled
`0.6 ~ 0.7 mm`	`0.6 0.7 mm`	range separator dropped

Every failure in that table is character-level: the engine read the right cell but mapped a glyph wrong. A better OCR engine fixes these, which is precisely what RapidOCR does, recovering the model codes, the dropped tildes and the glyph swaps. The other kind of failure is layout-level: the digits are all correct but Docling’s table model scattered them, and a better OCR engine does much less for it.
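Character-level damage of this kind can be located mechanically by diffing the true value against what the corpus produced. This is a sketch using Python's standard `difflib`; the helper name `glyph_diffs` is hypothetical, and the pairs are taken from the table of true values and EasyOCR readings in this section:

```python
from difflib import SequenceMatcher


def glyph_diffs(true_value: str, ocr_value: str) -> list[tuple[str, str]]:
    """Return (expected, got) for every span where the OCR text departs."""
    matcher = SequenceMatcher(None, true_value, ocr_value, autojunk=False)
    return [
        (true_value[i1:i2], ocr_value[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


pairs = [
    ("XV250U", "XV2SOU"),
    ("302 lb", "302 Ib"),
    ("65W/60W", "65W/6oW"),
    ("0.6 ~ 0.7 mm", "0.6 0.7 mm"),
]

for true_value, ocr_value in pairs:
    print(true_value, "->", glyph_diffs(true_value, ocr_value))
```

A diff like this only explains a failure after the strict matcher has flagged it. It does not soften the score, and it says nothing about layout-level failures, where every glyph is right but sits in the wrong cell.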

Honesty: RapidOCR is the best one, not a safe one

This is the part the headline number would let you skip. RapidOCR is clearly the most faithful of the four, but 85 percent is not 100. Two stubborn problems remain.

A handful of categories sit at 50 percent across every engine. Weights, for instance: 302 lb and 304 lb come out as 302 Ib/304 Ib no matter which engine reads them, because the failure is a lowercase-L-to-capital-I confusion that every pipeline shares. Superscript units (cm³, kg/cm²) are similar — fragile everywhere.
Torque tables are only half-recovered. The usable unit of a torque spec is the triple (58 Nm, 5.8 m-kg, 42 ft-lb) tied to its bolt. Docling’s table model crams those columns, so the triple survives in only some rows. RapidOCR gets torque to 50 percent (5 of 10), well above the Tesseract routes’ near-zero, but the other half is still scattered — because the cramming is the table-structure model’s doing, not the OCR engine’s.

This is the page the torque values come from, and you can see why a table model struggles with it. Sixty-odd rows, each a part name and a thread size followed by the same value printed three ways (Nm, then m-kg, then ft-lb), with a “Remarks” column that is empty for most rows and the page itself split into two stacked sub-tables:

A grayscale scan of the tightening-torque maintenance page. A long table headed Tightening Torque lists about sixty fasteners, each row giving the part to be tightened, the thread size, and three torque columns headed Nm, m-kg and ft-lb, with a remarks column. Rows include front wheel axle, front axle bolt, steering stem and inner tube, front and rear brake calipers, engine stays, cylinder-head bolts, clutch boss and rotor. The values are densely packed and the three-unit triple for each bolt runs across three narrow adjacent columns.

The information a mechanic needs (this bolt, this torque) lives in the horizontal adjacency of four cells. When the table model mis-maps a single column boundary, the triple that should read 58 / 5.8 / 42 next to “Front wheel axle” gets split across rows, and the value the chatbot retrieves is no longer tied to its bolt. That is a layout failure, not a reading failure, which is why even the best OCR engine only partly fixes it.
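One way to test for that layout failure, as opposed to a reading failure, is to require that the bolt name and its full triple survive in the same table row. The sketch below assumes the corpus tables are Markdown pipe tables. The two sample tables are invented for illustration (only the 58 / 5.8 / 42 triple and the "Front wheel axle" label come from this post), and `row_has_triple` is a hypothetical helper:

```python
def row_has_triple(rows: list[str], part: str, triple: tuple[str, str, str]) -> bool:
    """True if a single row names the part and holds all three values as cells."""
    for row in rows:
        cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
        names_part = any(part.lower() in cell.lower() for cell in cells)
        if names_part and all(value in cells for value in triple):
            return True
    return False


# Invented rows: one intact, one with the triple split across two rows.
intact = ["| Front wheel axle | 58 | 5.8 | 42 |"]
scattered = [
    "| Front wheel axle | 58 |",
    "| 5.8 | 42 |",
]

triple = ("58", "5.8", "42")
print(row_has_triple(intact, "Front wheel axle", triple))
print(row_has_triple(scattered, "Front wheel axle", triple))
```

In the scattered case every digit is still present in the corpus, so a per-value exact match could pass each number on its own while the value is no longer tied to its bolt. Comparing whole cells rather than substrings also keeps `5.8` from matching inside `58`-like neighbours.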

So the win from RapidOCR is specific and worth naming precisely: it fixes the character-level damage (model codes, dropped separators, glyph swaps) by reading the pixels better, and it lifts torque from the floor without needing a separate Tesseract pass. It does not fully fix the layout-level damage, because that is Docling’s table model, not the OCR engine. Knowing which failure belongs to which stage is the whole point of testing instead of assuming.

What this means for the live corpus

The takeaway is small and practical: the OCR engine you skip matters more than the one you run. The chatbot was serving the weakest corpus, and a 24-point lift was sitting one option object away. Re-OCRing with RapidOCR (then section-splitting the markdown and re-indexing the Cloudflare AI Search instance) lifts the corpus from 61 to 85 percent fidelity on the values that matter, and does it with a fully local pipeline that the next two manuals can reuse unchanged.

The torque tables still need a separate fix, probably a layout-aware pass or hand-correction of the tightening-torque pages, because no OCR swap will fully un-cram them. And until the superscript and torque problems are solved, the chatbot’s system prompt earns its keep: it is told to flag OCR-ambiguous, safety-critical values and advise checking against the PDF, which sits in a slide-over panel next to every answer.

The test itself is in the document-extract harness: a ground-truth JSON, a strict matcher, and a script that scores every corpus and draws these charts. When the corpus changes, the number moves, and I can see it rather than hope.

Go break it yourself: virago.edestudio.us. Ask for a torque setting and a model code, then check the answer against the scan. That gap, in one question, is what this whole post is about.