Row of Digits OCR: OpenCV CNN versus LLMs

Task: Extract handwritten digits from a table row

When extracting information from a scanned form, one regularly encounters tabular rows where each table cell contains a single character. For example, scanned quizzes from a university class might contain a row of cells where the student writes their university ID. Here is an example of what that could look like:

[Example image: quiz_jeff]

In a standard OCR pipeline, one would:

  • first apply a detection pipeline to find the location of each digit,
  • preprocess each digit,
  • and finally apply a recognition model to extract per-digit probabilities (a sketch of these steps follows below).
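
The sketch below illustrates these three steps with OpenCV in Python. It is a generic illustration under simplifying assumptions, not the actual Instructor Pilot implementation; in particular, `digit_cnn` stands in for any MNIST-style classifier, and the contour filtering and preprocessing are deliberately minimal.

```python
import cv2
import numpy as np

def extract_digit_probabilities(image_path, digit_cnn, num_digits=8):
    """Detect digit cells, preprocess each crop, and run a recognition CNN.

    `digit_cnn` is assumed to expose a `predict(batch) -> (N, 10)` interface
    returning class probabilities for digits 0-9.
    """
    # 1. Detection: binarize the page and find candidate digit boxes.
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    # Keep the largest boxes as digit candidates and order them left to right.
    boxes = sorted(boxes, key=lambda b: b[2] * b[3], reverse=True)[:num_digits]
    boxes = sorted(boxes, key=lambda b: b[0])

    # 2. Preprocessing: crop, resize to 28x28, and scale to [0, 1]
    #    (a real pipeline would also pad and center each digit).
    crops = []
    for x, y, w, h in boxes:
        digit = cv2.resize(binary[y:y + h, x:x + w], (28, 28),
                           interpolation=cv2.INTER_AREA)
        crops.append(digit.astype(np.float32) / 255.0)

    # 3. Recognition: one forward pass yields per-digit class probabilities.
    return digit_cnn.predict(np.stack(crops)[..., np.newaxis])
```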

In the Instructor Pilot GitHub project, this is exactly the approach we follow to digitize student quiz submissions in order to grade them and upload them to Canvas LMS. All the details of how this is implemented using OpenCV and a CNN trained on a version of MNIST can be found here. With all the optimizations, this pipeline processes dozens of documents in about 1 second, which allows for a user-friendly experience.

On the plus side, this method can be easily adapted for an arbitrary number of digits, or to include alphanumeric handwritten characters. However, it does not generalize easily to arbitrary table formatting (e.g. arbitrary cell borders), or to non-tabular information extraction.

With the advent of capable vision LLMs, it is now possible to perform similar information extraction from scanned documents without the painstaking OpenCV preprocessing and without a recognition CNN specifically trained on the task at hand. See more in the Motivations section of the README.

LLM pipelines

LLM Pipeline 1: Outlines + Qwen2-VL-2B (AWQ)

We use the outlines library, as it provides a powerful API to enforce a particular JSON schema on the LLM response, and we use the transformers library as the vision LLM backbone, since the Qwen2-VL-2B model page on HuggingFace has detailed instructions for usage within this framework. This pipeline works by generating a single response containing the Name and University ID that the model identifies on the page.

To compare LLM-generated IDs with the true student IDs, we calculate the edit distance (specifically the Levenshtein distance). Using Monte Carlo simulations, we verify that for randomly generated 8-digit IDs the probability that the Levenshtein distance is exactly zero (a perfect match) is effectively zero (less than one in a billion), so we classify $d_\mathrm{Lev} = 0$ as a confident detection. Additionally, the probability that the edit distance is at most 2 is still very small (about 1 in 50,000). Therefore, when the identified last name exactly matches a student's last name and simultaneously the 8-digit IDs differ by $d_\mathrm{Lev} \le 2$, we also classify this as a confident detection. If the name or the UFID appears on multiple pages, we take the minimum Levenshtein distances, $d_\mathrm{ID,min}$ and $d_\mathrm{name,min}$, across all images for each student. Overall, we define:

$$ \mathrm{confident\ detection} = (d_\mathrm{ID,min}=0)\ \mathrm{or}\ (d_\mathrm{ID,min} \le 2\ \mathrm{and}\ d_\mathrm{name,min} = 0) $$
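
A minimal sketch of this decision rule is shown below. The Pydantic model only illustrates the kind of JSON schema that could be enforced with outlines; the field names and function signatures are illustrative assumptions, not the project's actual code.

```python
from pydantic import BaseModel

class StudentInfo(BaseModel):
    """Example response schema (field names are illustrative)."""
    last_name: str
    university_id: str

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def confident_detection(d_id_min: int, d_name_min: int) -> bool:
    """Exact ID match, or near-ID match combined with an exact last-name match."""
    return d_id_min == 0 or (d_id_min <= 2 and d_name_min == 0)
```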

LLM Pipeline 2: Ollama + Llama-3.2 11B Vision

Note

This pipeline is painfully slow, but it allows one to generate probabilities for each digit. The calibration of these probabilities is discussed below.

We used the Ollama API, as it is one of the few transformer LLM implementations that supports the Llama-3.2 11B Vision model (as of November 2024), one of the most capable open-source models. We used a lower-than-default top_k parameter (10 instead of 40) in order to reduce creativity. We keep the temperature at 1 in order to do multi-shot inference, although this choice should be tested further. We enforce a JSON format in the output, but we don't yet enforce a specific JSON schema. We base our pipeline on the hypothesis that the frequency of appearance of a response should be a good proxy for its accuracy. We perform multi-shot inference with 150 sampled outputs per image, and we use these to construct the digit frequencies at each position. The system message explains the format of the table and the user message contains only the image. Here we define a confident detection as one where $\log_{10} \mathrm{Prob}[\mathrm{ID} = \mathrm{correct\ ID}] \ge 3 - \mathrm{ID\_length}$, i.e. the combined inferred probability of the ID_length-digit ID is at least $10^{3}$ times greater than the random-guess probability of $10^{-\mathrm{ID\_length}}$.
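
A sketch of this multi-shot loop with the Ollama Python API is shown below. The model tag, the prompt strings, and the assumed JSON key ("university_id") are placeholders, and outputs that fail to parse are simply discarded, as described above.

```python
import json
from collections import Counter

import ollama

N_SHOTS = 150
ID_LENGTH = 8

def digit_frequencies(image_path: str, system_message: str) -> list[Counter]:
    """Sample the model N_SHOTS times and tally the digit seen at each position."""
    counts = [Counter() for _ in range(ID_LENGTH)]
    for _ in range(N_SHOTS):
        response = ollama.chat(
            model="llama3.2-vision:11b",            # assumed model tag
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": "", "images": [image_path]},
            ],
            format="json",                          # JSON output, no schema enforced
            options={"temperature": 1, "top_k": 10},
        )
        try:
            parsed = json.loads(response["message"]["content"])
            candidate = str(parsed.get("university_id", ""))  # assumed key
        except (json.JSONDecodeError, TypeError, AttributeError):
            continue  # discard non-JSON outputs
        if len(candidate) != ID_LENGTH or not candidate.isdigit():
            continue  # discard empty, non-numeric, or wrong-length IDs
        for pos, digit in enumerate(candidate):
            counts[pos][digit] += 1
    return counts
```

Normalizing each counter gives the per-position digit probabilities, and the product across positions gives the combined ID probability used in the confidence criterion above.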

Summary

Below are the results when we compare the top-k accuracy at the digit level (256 ten-way multi-class inferences of digits 0-9, i.e. 32 pages × 8 digits) for the 32 pages containing IDs. Additionally, we show the percentage of IDs confidently detected.

|               | OpenCV+CNN      | outlines + Qwen2-VL-2B        | Llama3.2-Vision 11B |
|---------------|-----------------|-------------------------------|---------------------|
| digit_top1    | 85.16%          | ?                             | 93.36%              |
| digit_top2    | 90.62%          | N/A                           | 96.48%              |
| digit_top3    | 94.14%          | N/A                           | 98.44%              |
| Detect Type   | ID (1)          | LastName (2) + ID (1)         | ID (1)              |
| Docs detected | 90.62% (29/32)  | 100.00% (32/32)               | 100.00% (32/32)     |
| Runtime       | ~ 1 second      | ~ 2 minutes (RTX 3060 Ti 8GB) | ~ 5 hours (M2 16GB) |
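
For reference, the digit-level top-k accuracy reported in the table can be computed from an array of per-digit probability vectors roughly as follows (a NumPy sketch; the array names are assumptions).

```python
import numpy as np

def top_k_accuracy(probs: np.ndarray, labels: np.ndarray, k: int) -> float:
    """probs: (N, 10) class probabilities for N digit inferences (here N = 256);
    labels: (N,) true digits. Returns the fraction of inferences where the
    true digit is among the k highest-probability classes."""
    top_k = np.argsort(probs, axis=1)[:, -k:]  # indices of the k largest probabilities
    return float(np.mean([labels[i] in top_k[i] for i in range(len(labels))]))
```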

Comparison of Probability Calibration

A good practical introduction to probability calibration can be found in the scikit-learn documentation, here.

Due to the small number of examples in the dataset, there is significant variance in each point of the calibration curve. We see that the OpenCV+CNN pipeline has better calibration, i.e. it lies closer to the x=y diagonal than the Llama-3.2 pipeline. In the following histograms we also see that the OpenCV+CNN pipeline makes more confident predictions, i.e. its peak near prob=1 is stronger.
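
Such a reliability curve can be produced with scikit-learn's `calibration_curve`, roughly as in the sketch below; the binary target is whether each predicted digit was correct, and the toy arrays are stand-ins for the actual pipeline outputs.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Toy stand-ins: probability assigned to the predicted digit, and whether it was correct.
rng = np.random.default_rng(0)
predicted_probs = rng.uniform(0.5, 1.0, size=256)
is_correct = (rng.uniform(size=256) < predicted_probs).astype(int)

# prob_true: empirical accuracy per bin; prob_pred: mean predicted probability per bin.
prob_true, prob_pred = calibration_curve(is_correct, predicted_probs, n_bins=10)
# Perfect calibration places every (prob_pred, prob_true) point on the x = y diagonal.
```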

Calibration: Issues and possible improvements

The main issue with the Llama-3.2 LLM pipeline is currently that it is many orders of magnitude slower than the OpenCV+CNN pipeline. While extracting the probabilities with OpenCV+CNN takes approximately 0.01-0.1 seconds per document on an M2 MacBook with 16GB of memory, performing multi-shot inference with N=150 takes approximately 10 minutes per document (about 4 seconds per sampled output, plus roughly a 2-minute warm-up to process the image). By enforcing a JSON schema, it might be possible to extract the inferred probabilities for each digit position from a single output and avoid multi-shot inference altogether. It might also be possible to reduce the cost of inference by cropping the image or reducing its resolution (currently set to 300 dpi). Additionally, without an enforced response format, a non-negligible percentage of the outputs have to be discarded, e.g. because they are an empty string or non-numeric.

It is also somewhat unclear how exactly Ollama stores, reuses, and caches the KV cache. Since the chat history is not added to the chat context, one would naively expect the inference cost to be identical for all sampled outputs. Instead, there is a noticeable warm-up time for each image, which implies that the processed prompt+image remains in the KV cache until the end of the multi-shot inference for each image.

It would probably also be beneficial to switch from the Ollama Python API to the Ollama OpenAI-compatible API, so that we can compare our results with more capable closed-source models.
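
A sketch of such a switch is shown below: the OpenAI Python client pointed at Ollama's OpenAI-compatible endpoint, so the same code could target a hosted closed-source vision model by changing only `base_url` and `model`. The model tag, the prompt, and the JSON-mode flag are assumptions; the image is passed as a base64 data URL.

```python
import base64
from openai import OpenAI

# Ollama exposes an OpenAI-compatible endpoint at /v1; the api_key is ignored locally.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

with open("quiz_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="llama3.2-vision:11b",  # swap model and base_url for a closed-source backend
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the handwritten 8-digit university ID as JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    response_format={"type": "json_object"},  # JSON mode, if supported by the backend
)
print(response.choices[0].message.content)
```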
