Row of Digits OCR: OpenCV CNN versus LLMs
When extracting information from a scanned form, one regularly encounters tabular rows where each table cell contains a single character. For example, scanned quizzes from a university class might contain a row of cells where the student writes their university ID. Here is an example of what that could look like:

In a standard OCR pipeline one would
- first apply a detection pipeline to find the location of each digit,
- preprocess each digit,
- and finally apply a recognition model to extract class probabilities for each digit.
In the Instructor Pilot GitHub project, this is exactly the approach we follow to digitize student quiz submissions in order to grade them and upload them to Canvas LMS. All the details of how this is implemented using OpenCV and a version of MNIST can be found here. With all the optimizations, this pipeline processes dozens of documents in about 1 second, which allows for a user-friendly experience.
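The full implementation lives in the Instructor Pilot repository; purely as an illustration of the three steps above (and not the project's actual code), a minimal detect-preprocess-recognize loop with OpenCV and a generic MNIST-style Keras classifier might look like the sketch below. All file and model names are placeholders.

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model  # any 28x28 MNIST-style classifier works

# Placeholder names: the real pipeline in Instructor Pilot differs in its details.
model = load_model("mnist_cnn.h5")          # maps a 28x28 crop to 10 class probabilities
page = cv2.imread("quiz_page.png", cv2.IMREAD_GRAYSCALE)

# 1. Detection: threshold the page and find candidate digit boxes in the ID row.
_, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
boxes = sorted((cv2.boundingRect(c) for c in contours), key=lambda b: b[0])  # left to right

digit_probs = []
for x, y, w, h in boxes:
    if w * h < 100:                         # crude filter for specks of noise
        continue
    # 2. Preprocessing: crop, pad to a square canvas, resize to the CNN input size.
    crop = binary[y:y + h, x:x + w]
    side = max(w, h)
    canvas = np.zeros((side, side), dtype=np.uint8)
    canvas[(side - h) // 2:(side - h) // 2 + h, (side - w) // 2:(side - w) // 2 + w] = crop
    digit = cv2.resize(canvas, (28, 28)).astype(np.float32) / 255.0

    # 3. Recognition: per-digit class probabilities from the CNN.
    probs = model.predict(digit[None, ..., None], verbose=0)[0]
    digit_probs.append(probs)

student_id = "".join(str(int(np.argmax(p))) for p in digit_probs)
```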
On the plus side, this method can be easily adapted for an arbitrary number of digits, or to include alphanumeric handwritten characters. However, it does not generalize easily to arbitrary table formatting (e.g. arbitrary cell borders), or to non-tabular information extraction.
With the advent of capable vision LLMs, it is now possible to perform similar information extraction from scanned documents without the painstaking OpenCV preprocessing and without a recognition CNN trained specifically for the task at hand. See more in the Motivations section of the README.
Below are the results when we compare the top-k accuracy at a digit level (256 10-way multi-class inferences of digits 0-9) for the 32 pages containing IDs. Additionally, we show the percentage of documents confidently detected based on the detection heuristics defined above.
| | OpenCV+CNN | outlines + Vision LLM | outlines + Vision LLM | Vision LLM + prob. estimation |
|---|---|---|---|---|
| LLM model | N/A | outlines + Qwen2-VL-2B-Instruct | outlines + SmolVLM | Llama3.2-Vision 11B |
| # samples | 1 | 1 | 1 | 150 |
| logits | Yes | No | No | No (but inferred from frequency) |
| digit_top1 | 85.16% | 98.44% | 68.35% | 93.36% |
| digit_top2 | 90.62% | N/A | N/A | 96.48% |
| digit_top3 | 94.14% | N/A | N/A | 98.44% |
| 8-digit id_top1 | N/A | 90.63% | 53.13% | ? |
| lastname_top1 | N/A | 100% | 93.75% | ? |
| Detect Type | ID (1) | LastName (2) + ID (1) | LastName (2) + ID (1) | ID (1) |
| ID avg. Levenshtein dist. | ? | 0.1250 | 2.3750 | ? |
| Last name avg. Levenshtein dist. | N/A | 0.0000 | 0.0938 | ? |
| Docs detected | 90.62% (29/32) | 100.00% (32/32) | 68.75% (22/32) | 100.00% (32/32) |
| Runtime | ~ 1 second | ~ 2 minutes (RTX 3060 Ti 8GB) | ~ 2 minutes (RTX 3060 Ti 8GB) | ~ 5 hours (M2 16GB) |
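For reference, the digit_topk rows count a digit as correct when the true class is among the k most probable classes. With per-digit probability vectors in hand, the metric is a few lines of NumPy; the arrays below are hypothetical stand-ins, not the actual evaluation data.

```python
import numpy as np

def digit_topk_accuracy(probs: np.ndarray, labels: np.ndarray, k: int) -> float:
    """probs: (n_digits, 10) class probabilities, labels: (n_digits,) true digits."""
    topk = np.argsort(probs, axis=1)[:, -k:]           # indices of the k largest probabilities
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Hypothetical example: 256 ten-way digit inferences.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=256)
labels = rng.integers(0, 10, size=256)
print(digit_topk_accuracy(probs, labels, k=1), digit_topk_accuracy(probs, labels, k=3))
```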
We use the outlines library, as it provides a powerful API to enforce a particular JSON schema on the LLM response, and we use the transformers library as the Vision LLM backend, since the Qwen2-VL-2B model page on Hugging Face has extended instructions for usage within this framework. This pipeline works by generating a single response with the Name and University ID that it identifies on the page. To compare LLM-generated IDs with the true student IDs, we calculate the edit distance (specifically the Levenshtein distance). Using Monte Carlo simulations we verify that, for randomly generated 8-digit IDs, the probability that the Levenshtein distance is exactly zero (a perfect match) is effectively zero (less than one in a billion). We classify an extracted ID as a confident match when its Levenshtein distance to a roster ID is small enough that a random coincidence is essentially ruled out.
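As a rough sketch of the schema-constrained step, and assuming an outlines release that exposes `models.transformers_vision` and `generate.json` (the constrained-generation API has changed between outlines versions), the calls could look like the following. Treat this as illustrative, not the project's exact code; the prompt text and image file are placeholders.

```python
from PIL import Image
from pydantic import BaseModel, Field
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
import outlines

# JSON schema enforced on the LLM output.
class StudentInfo(BaseModel):
    last_name: str
    university_id: str = Field(pattern=r"^[0-9]{8}$")  # exactly 8 digits

model_name = "Qwen/Qwen2-VL-2B-Instruct"
# Assumption: outlines provides a transformers-backed vision model wrapper.
model = outlines.models.transformers_vision(
    model_name, model_class=Qwen2VLForConditionalGeneration
)
generator = outlines.generate.json(model, StudentInfo)

# Build a chat prompt with an image placeholder via the processor's chat template.
processor = AutoProcessor.from_pretrained(model_name)
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract the last name and the 8-digit university ID."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

page_image = Image.open("quiz_page3.png")
result = generator(prompt, [page_image])
print(result.last_name, result.university_id)
```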
We run our pipeline and we get 100% identification success. In order to verify that our heuristic detection rule is based on a sound probabilistic analysis, we plot histograms of the Levenshtein distance between the LLM-extracted IDs and all test IDs, and we do the same for the last names. With 32 documents, each with two pages of interest (page 1 contains handwritten names and page 3 contains both handwritten names and handwritten IDs), and 32 students, we get 2048 (= 32 × 2 × 32) distances for IDs and 2048 distances for last names.
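The Monte Carlo check mentioned above needs nothing more than an edit-distance routine and a random-ID generator. Here is a self-contained sketch that produces the kind of distance distribution shown below:

```python
import random

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def random_id(length: int = 8) -> str:
    return "".join(random.choice("0123456789") for _ in range(length))

# Empirical distribution of distances between unrelated 8-digit IDs.
n_trials = 200_000
counts = {}
for _ in range(n_trials):
    d = levenshtein(random_id(), random_id())
    counts[d] = counts.get(d, 0) + 1

for d in sorted(counts):
    print(f"distance {d}: {counts[d] / n_trials:.6f}")
# A distance of exactly 0 is vanishingly rare for unrelated IDs,
# which is why a (near-)zero distance is treated as a confident match.
```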
Here is the Levenshtein distance probability density for strings of length 8 and a vocabulary size of 10:
And below are the counts we get between each LLM ID and each test ID.
> **Note:** The expected number of confident (near-zero distance) pairings is approximately 32, since the handwritten IDs appear only on page 3 of each of the 32 documents.
Overall the distribution matches our expectations. The correct document-student pairs all ended up having a Levenshtein distance of less than two. In practice, for large collections of documents and handwritten IDs this will not always be the case, due to human error or omissions. Such cases should be marked for manual review or resolved using additional attributes of the document.
A Monte Carlo simulation of last-name distances would be more complicated, as it would require a corpus of last names. Nevertheless, we plot a histogram of the Levenshtein distances between the LLM-extracted names and the test data. Again we see a clear separation between the bulk of the distribution and the correct identifications.
> **Note:** The last names appear both in page 1 and in page 3, so the expected number of confident pairings is approximately 2 × 32 = 64. However, the last names in our dataset are those of the first 32 US presidents, among whom there are 3 pairs of presidents who are relatives and share a last name. This introduces approximately 3 × 2 × 2 = 12 spurious small distances. In a typical dataset the last-name coincidence rate will probably be much lower than 10%. So we should expect the confident pairings to be about 64 + 12 = 76. We get slightly fewer than that because some names are not written in the standard form `First Last`.
> **Note:** This pipeline is painfully slow, but it allows one to generate probabilities for each digit. The calibration of these probabilities is discussed below.
Here we use the Ollama API, as it is one of the few transformer LLM implementations that supports the Llama-3.2 11B Vision model (as of November 2024), one of the most capable open-source models. We use a lower-than-default `top_k` parameter (10 instead of 40) in order to reduce creativity. We keep the temperature at 1 in order to do multi-shot inference, although this choice should be tested further. We enforce a JSON format in the output, but we don't yet enforce a specific JSON schema. We base our pipeline on the hypothesis that the frequency of appearance of a response is a good proxy for its accuracy. We perform multi-shot inference with 150 sampled outputs per image and use these to construct the digit frequencies at each position. The system message explains the format of the table and the user message contains only the image. Here we define a confident detection as one where `log10(Prob[ID == correct_ID]) > 3 - ID_length`, i.e. the combined inferred probability of the `ID_length`-digit ID is at least `10**3` times greater than the random probability of `10**(-ID_length)`.
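A condensed sketch of this multi-shot frequency estimation, using the Ollama Python client: the model tag, file name, messages, and the `"id"` JSON key are placeholders (no schema is enforced), and the real pipeline also discards the malformed outputs discussed further below.

```python
import json
import math
from collections import Counter

import ollama  # Python client for the local Ollama server

N_SAMPLES = 150
ID_LENGTH = 8
SYSTEM_MSG = "The image contains a table row where each cell holds one digit of an 8-digit ID."

position_counts = [Counter() for _ in range(ID_LENGTH)]
valid = 0
for _ in range(N_SAMPLES):
    response = ollama.chat(
        model="llama3.2-vision:11b",
        messages=[
            {"role": "system", "content": SYSTEM_MSG},
            # The user message contains only the image.
            {"role": "user", "content": "", "images": ["quiz_page3.png"]},
        ],
        format="json",
        options={"temperature": 1.0, "top_k": 10},
    )
    try:
        candidate = json.loads(response["message"]["content"]).get("id", "")
    except (json.JSONDecodeError, AttributeError):
        continue
    if len(candidate) != ID_LENGTH or not candidate.isdigit():
        continue                                  # discard empty / non-numeric outputs
    valid += 1
    for pos, digit in enumerate(candidate):
        position_counts[pos][digit] += 1

# Per-position digit probabilities inferred from frequencies.
probs = [{d: c / max(valid, 1) for d, c in counter.items()} for counter in position_counts]

# Detection heuristic from the text, applied to the highest-frequency candidate ID.
best_id = "".join(max(p, key=p.get) for p in probs)
log10_prob = sum(math.log10(p[d]) for p, d in zip(probs, best_id))
confident = log10_prob > 3 - ID_LENGTH
```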
A good practical introduction to probability calibration can be found in the scikit-learn documentation, here.
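The reliability diagrams below follow the usual recipe from that page: bin the predictions by their predicted probability and compare the mean predicted probability in each bin with the observed accuracy. A minimal sketch with placeholder (toy) data:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Placeholder data: for every digit inference, whether the top prediction was correct
# and the probability the pipeline assigned to that prediction.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0.1, 1.0, size=256)
y_true = (rng.uniform(size=256) < y_prob).astype(int)   # a perfectly calibrated toy pipeline

# Observed accuracy vs. mean predicted probability in each bin;
# a well-calibrated pipeline stays close to the x = y diagonal.
frac_correct, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(np.c_[mean_pred, frac_correct])
```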
Due to the small number of examples in the dataset, there is significant variance in each histogram point of the calibration curve. We see that the OpenCV+CNN pipeline has better calibration, i.e. it is closer to the `x = y` diagonal compared to the Llama-3.2 pipeline. In the following histograms we also see that the OpenCV+CNN pipeline makes more confident predictions, i.e. the peak near `prob = 1` is stronger.
The main issue with the Llama-3.2 LLM pipeline is currently that it is many orders of magnitude slower than the OpenCV+CNN pipeline. While extracting the probabilities with OpenCV+CNN takes approximately 0.01-0.1 seconds per document on an M2 MacBook with 16GB of memory, performing multi-shot inference with N=150 takes approximately 10 minutes per document (about 4 seconds per sampled output, plus about a 2-minute warm-up to process the image). By enforcing a JSON schema it might be possible to extract the inferred probabilities for each digit position from a single output, and avoid multi-shot inference altogether. It might also be possible to reduce the cost of inference by cropping the image or reducing its resolution (currently set to 300 dpi). Additionally, without an enforced response format, a non-negligible percentage of the outputs has to be discarded, e.g. because they are an empty string or non-numeric.
It is also a bit unclear how exactly Ollama stores, reuses, and caches the KV cache. Since the chat history is not added to the chat context, one would naively expect the inference cost to be identical for all sampled outputs. Instead, there is a noticeable warm-up time for each image, which implies that the processed prompt and image remain in the KV cache until the end of the multi-shot inference for that image.
It would probably also be beneficial to switch from the Ollama Python API to the Ollama OpenAI-compatible API, so that we can compare our results with more capable closed-source models.
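Ollama serves an OpenAI-compatible endpoint under `/v1`, so the switch would look roughly like the sketch below, with the image passed as a base64 data URL following the OpenAI chat-completions convention; the model tag and file name are placeholders. Pointing `base_url` and `model` at a hosted provider would then allow a like-for-like comparison with closed-source models.

```python
import base64
from openai import OpenAI

# Point the OpenAI client at the local Ollama server (any non-empty api_key works).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

with open("quiz_page3.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="llama3.2-vision:11b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the handwritten 8-digit university ID as JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    temperature=1.0,
)
print(response.choices[0].message.content)
```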