Row of Digits OCR: OpenCV CNN versus LLMs
When extracting information from a scanned form, one regularly encounters tabular rows where each table cell contains a single character. For example, scanned quizzes from a university class might contain a row of cells where the student writes their university ID. Here is an example of what that could look like:

In a standard OCR pipeline one would
- first apply a detection pipeline to find the location of each digit,
- preprocess each digit,
- and finally apply a recognition model to extract class probabilities for each digit.
In the Instructor Pilot GitHub project, this is exactly the approach we follow to digitize student quiz submissions in order to grade them and upload them to Canvas LMS. All the details of how this is implemented using OpenCV and a version of MNIST can be found here. With all the optimizations, this pipeline processes dozens of documents in about 1 second, which allows for a user-friendly experience.
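The full implementation lives in the Instructor Pilot repository; purely as an illustration of the three steps above (and not the project's actual code), a minimal detect-preprocess-recognize loop with OpenCV and a generic MNIST-style Keras classifier might look like the sketch below. All file and model names are placeholders.

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model  # any 28x28 MNIST-style classifier works

# Placeholder names: the real pipeline in Instructor Pilot differs in its details.
model = load_model("mnist_cnn.h5")          # maps a 28x28 crop to 10 class probabilities
page = cv2.imread("quiz_page.png", cv2.IMREAD_GRAYSCALE)

# 1. Detection: threshold the page and find candidate digit boxes in the ID row.
_, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
boxes = sorted((cv2.boundingRect(c) for c in contours), key=lambda b: b[0])  # left to right

digit_probs = []
for x, y, w, h in boxes:
    if w * h < 100:                         # crude filter for specks of noise
        continue
    # 2. Preprocessing: crop, pad to a square canvas, resize to the CNN input size.
    crop = binary[y:y + h, x:x + w]
    side = max(w, h)
    canvas = np.zeros((side, side), dtype=np.uint8)
    canvas[(side - h) // 2:(side - h) // 2 + h, (side - w) // 2:(side - w) // 2 + w] = crop
    digit = cv2.resize(canvas, (28, 28)).astype(np.float32) / 255.0

    # 3. Recognition: per-digit class probabilities from the CNN.
    probs = model.predict(digit[None, ..., None], verbose=0)[0]
    digit_probs.append(probs)

student_id = "".join(str(int(np.argmax(p))) for p in digit_probs)
```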
On the plus side, this method can be easily adapted for an arbitrary number of digits, or to include alphanumeric handwritten characters. However, it does not generalize easily to arbitrary table formatting (e.g. arbitrary cell borders), or to non-tabular information extraction.
With the advent of capable vision LLMs, it is now possible to perform similar information extraction from scanned documents without the painstaking OpenCV preprocessing and without a recognition CNN trained specifically for the task at hand. See more in the Motivations section of the README.
Below are the results when we compare the top-k accuracy at a digit level (256 10-way multi-class inferences of digits 0-9) for the 32 pages containing IDs. Additionally, we show the percentage of documents confidently detected based on the detection heuristics defined above.
| | OpenCV+CNN | outlines + Vision LLM | outlines + Vision LLM | Vision LLM + prob. estimation |
|---|---|---|---|---|
| LLM model | N/A | outlines + Qwen2-VL-2B-Instruct | outlines + SmolVLM | Llama3.2-Vision 11B |
| # samples | 1 | 1 | 1 | 150 |
| logits | Yes | No | No | No (but inferred from frequency) |
| digit_top1 | 85.16% | 98.44% | 68.35% | 93.36% |
| digit_top2 | 90.62% | N/A | N/A | 96.48% |
| digit_top3 | 94.14% | N/A | N/A | 98.44% |
| 8-digit id_top1 | N/A | 90.63% | 53.13% | ? |
| lastname_top1 | N/A | 100% | 93.75% | ? |
| Detect Type | ID (1) | LastName (2) + ID (1) | LastName (2) + ID (1) | ID (1) |
| ID avg. Levenshtein dist. | ? | 0.1250 | 2.3750 | ? |
| Last name avg. Levenshtein dist. | N/A | 0.0000 | 0.0938 | ? |
| Docs detected | 90.62% (29/32) | 100.00% (32/32) | 68.75% (22/32) | 100.00% (32/32) |
| Runtime | ~ 1 second | ~ 2 minutes (RTX 3060 Ti 8GB) | ~ 2 minutes (RTX 3060 Ti 8GB) | ~ 5 hours (M2 16GB) |
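For reference, the digit_topk rows count a digit as correct when the true class is among the k most probable classes. With per-digit probability vectors in hand, the metric is a few lines of NumPy; the arrays below are hypothetical stand-ins, not the actual evaluation data.

```python
import numpy as np

def digit_topk_accuracy(probs: np.ndarray, labels: np.ndarray, k: int) -> float:
    """probs: (n_digits, 10) class probabilities, labels: (n_digits,) true digits."""
    topk = np.argsort(probs, axis=1)[:, -k:]           # indices of the k largest probabilities
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Hypothetical example: 256 ten-way digit inferences.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=256)
labels = rng.integers(0, 10, size=256)
print(digit_topk_accuracy(probs, labels, k=1), digit_topk_accuracy(probs, labels, k=3))
```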
We use the outlines library, as it provides a powerful API to enforce a particular JSON schema on the LLM response, and we use the transformers library as the Vision LLM backend, since the Qwen2-VL-2B model page on Hugging Face has extended instructions for usage within this framework. This pipeline works by generating a single response with the Name and University ID that it identifies on the page. To compare LLM-generated IDs with the true student IDs, we calculate the edit distance (specifically the Levenshtein distance). Using Monte Carlo simulations we verify that, for randomly generated 8-digit IDs, the probability that the Levenshtein distance is exactly zero (a perfect match) is effectively zero (less than one in a billion). We classify an extracted ID as a confident match when its Levenshtein distance to a roster ID is small enough that a random coincidence is essentially ruled out.
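As a rough sketch of the schema-constrained step, and assuming an outlines release that exposes `models.transformers_vision` and `generate.json` (the constrained-generation API has changed between outlines versions), the calls could look like the following. Treat this as illustrative, not the project's exact code; the prompt text and image file are placeholders.

```python
from PIL import Image
from pydantic import BaseModel, Field
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
import outlines

# JSON schema enforced on the LLM output.
class StudentInfo(BaseModel):
    last_name: str
    university_id: str = Field(pattern=r"^[0-9]{8}$")  # exactly 8 digits

model_name = "Qwen/Qwen2-VL-2B-Instruct"
# Assumption: outlines provides a transformers-backed vision model wrapper.
model = outlines.models.transformers_vision(
    model_name, model_class=Qwen2VLForConditionalGeneration
)
generator = outlines.generate.json(model, StudentInfo)

# Build a chat prompt with an image placeholder via the processor's chat template.
processor = AutoProcessor.from_pretrained(model_name)
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract the last name and the 8-digit university ID."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

page_image = Image.open("quiz_page3.png")
result = generator(prompt, [page_image])
print(result.last_name, result.university_id)
```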
We run our pipeline and we get 100% identification success. In order to verify that our heuristic detection rule is based on a sound probabilistic analysis, we plot histograms of the Levenshtein distance between the LLM-extracted IDs and all test IDs, and we do the same for the last names. With 32 documents, each with two pages of interest (page 1 contains handwritten names and page 3 contains both handwritten names and handwritten IDs), and 32 students, we get 2048 (= 32 × 2 × 32) distances for IDs and 2048 distances for last names.
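The Monte Carlo check mentioned above needs nothing more than an edit-distance routine and a random-ID generator. Here is a self-contained sketch that produces the kind of distance distribution shown below:

```python
import random

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def random_id(length: int = 8) -> str:
    return "".join(random.choice("0123456789") for _ in range(length))

# Empirical distribution of distances between unrelated 8-digit IDs.
n_trials = 200_000
counts = {}
for _ in range(n_trials):
    d = levenshtein(random_id(), random_id())
    counts[d] = counts.get(d, 0) + 1

for d in sorted(counts):
    print(f"distance {d}: {counts[d] / n_trials:.6f}")
# A distance of exactly 0 is vanishingly rare for unrelated IDs,
# which is why a (near-)zero distance is treated as a confident match.
```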
Here is the Levenshtein distance probability density for strings of length 8 and a vocabulary size of 10:
And below are the counts we get between each LLM ID and each test ID.
> **Note:** The expected number of confident (near-zero distance) pairings is approximately 32, since the handwritten IDs appear only on page 3 of each of the 32 documents.
Overall the distribution matches our expectations. The correct document-student pairs all ended up having a Levenshtein distance of less than two. In practice, for large collections of documents and handwritten IDs this will not always be the case, due to human error or omissions. Such cases should be marked for manual review or resolved using additional attributes of the document.
A Monte Carlo simulation of last-name distances would be more complicated, as it would require a corpus of last names. Nevertheless, we plot a histogram of the Levenshtein distances between the LLM-extracted names and the test data. Again we see a clear separation between the bulk of the distribution and the correct identifications.
> **Note:** The last names appear both in page 1 and in page 3, so the expected number of confident pairings is approximately 2 × 32 = 64. However, the last names in our dataset are those of the first 32 US presidents, among whom there are 3 pairs of presidents who are relatives and share a last name. This introduces approximately 3 × 2 × 2 = 12 spurious small distances. In a typical dataset the last-name coincidence rate will probably be much lower than 10%. So we should expect the confident pairings to be about 64 + 12 = 76. We get slightly fewer than that because some names are not written in the standard form `First Last`.
> **Note:** This pipeline is painfully slow, but it allows one to generate probabilities for each digit. The calibration of these probabilities is discussed below.
Here we use the Ollama API, as it is one of the few transformer LLM implementations that supports the Llama-3.2 11B Vision model (as of November 2024), one of the most capable open-source models. We use a lower-than-default `top_k` parameter (10 instead of 40) in order to reduce creativity. We keep the temperature at 1 in order to do multi-shot inference, although this choice should be tested further. We enforce a JSON format in the output, but we don't yet enforce a specific JSON schema. We base our pipeline on the hypothesis that the frequency of appearance of a response is a good proxy for its accuracy. We perform multi-shot inference with 150 sampled outputs per image and use these to construct the digit frequencies at each position. The system message explains the format of the table and the user message contains only the image. Here we define a confident detection as one where `log10(Prob[ID == correct_ID]) > 3 - ID_length`, i.e. the combined inferred probability of the `ID_length`-digit ID is at least `10**3` times greater than the random probability of `10**(-ID_length)`.
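A condensed sketch of this multi-shot frequency estimation, using the Ollama Python client: the model tag, file name, messages, and the `"id"` JSON key are placeholders (no schema is enforced), and the real pipeline also discards the malformed outputs discussed further below.

```python
import json
import math
from collections import Counter

import ollama  # Python client for the local Ollama server

N_SAMPLES = 150
ID_LENGTH = 8
SYSTEM_MSG = "The image contains a table row where each cell holds one digit of an 8-digit ID."

position_counts = [Counter() for _ in range(ID_LENGTH)]
valid = 0
for _ in range(N_SAMPLES):
    response = ollama.chat(
        model="llama3.2-vision:11b",
        messages=[
            {"role": "system", "content": SYSTEM_MSG},
            # The user message contains only the image.
            {"role": "user", "content": "", "images": ["quiz_page3.png"]},
        ],
        format="json",
        options={"temperature": 1.0, "top_k": 10},
    )
    try:
        candidate = json.loads(response["message"]["content"]).get("id", "")
    except (json.JSONDecodeError, AttributeError):
        continue
    if len(candidate) != ID_LENGTH or not candidate.isdigit():
        continue                                  # discard empty / non-numeric outputs
    valid += 1
    for pos, digit in enumerate(candidate):
        position_counts[pos][digit] += 1

# Per-position digit probabilities inferred from frequencies.
probs = [{d: c / max(valid, 1) for d, c in counter.items()} for counter in position_counts]

# Detection heuristic from the text, applied to the highest-frequency candidate ID.
best_id = "".join(max(p, key=p.get) for p in probs)
log10_prob = sum(math.log10(p[d]) for p, d in zip(probs, best_id))
confident = log10_prob > 3 - ID_LENGTH
```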
A good practical introduction to probability calibration can be found in the scikit-learn documentation, here.
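The reliability diagrams below follow the usual recipe from that page: bin the predictions by their predicted probability and compare the mean predicted probability in each bin with the observed accuracy. A minimal sketch with placeholder (toy) data:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Placeholder data: for every digit inference, whether the top prediction was correct
# and the probability the pipeline assigned to that prediction.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0.1, 1.0, size=256)
y_true = (rng.uniform(size=256) < y_prob).astype(int)   # a perfectly calibrated toy pipeline

# Observed accuracy vs. mean predicted probability in each bin;
# a well-calibrated pipeline stays close to the x = y diagonal.
frac_correct, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(np.c_[mean_pred, frac_correct])
```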
Due to the small number of examples in the dataset, there is significant variance in each histogram point of the calibration curve. We see that the OpenCV+CNN pipeline has better calibration, i.e. it is closer to the `x = y` diagonal compared to the Llama-3.2 pipeline. In the following histograms we also see that the OpenCV+CNN pipeline makes more confident predictions, i.e. the peak near `prob = 1` is stronger.
The main issue with the Llama-3.2 LLM pipeline is currently that it is many orders of magnitude slower than the OpenCV+CNN pipeline. While extracting the probabilities with OpenCV+CNN takes approximately 0.01-0.1 seconds per document on an M2 MacBook with 16GB of memory, performing multi-shot inference with N=150 takes approximately 10 minutes per document (about 4 seconds per sampled output, plus about a 2-minute warm-up to process the image). By enforcing a JSON schema it might be possible to extract the inferred probabilities for each digit position from a single output, and avoid multi-shot inference altogether. It might also be possible to reduce the cost of inference by cropping the image or reducing its resolution (currently set to 300 dpi). Additionally, without an enforced response format, a non-negligible percentage of the outputs has to be discarded, e.g. because they are an empty string or non-numeric.
It is also a bit unclear how exactly Ollama stores, reuses, and caches the KV cache. Since the chat history is not added to the chat context, one would naively expect the inference cost to be identical for all sampled outputs. Instead, there is a noticeable warm-up time for each image, which implies that the processed prompt and image remain in the KV cache until the end of the multi-shot inference for that image.
It would probably also be beneficial to switch from the Ollama Python API to the Ollama OpenAI-compatible API, so that we can compare our results with more capable closed-source models.
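Ollama serves an OpenAI-compatible endpoint under `/v1`, so the switch would look roughly like the sketch below, with the image passed as a base64 data URL following the OpenAI chat-completions convention; the model tag and file name are placeholders. Pointing `base_url` and `model` at a hosted provider would then allow a like-for-like comparison with closed-source models.

```python
import base64
from openai import OpenAI

# Point the OpenAI client at the local Ollama server (any non-empty api_key works).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

with open("quiz_page3.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="llama3.2-vision:11b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the handwritten 8-digit university ID as JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    temperature=1.0,
)
print(response.choices[0].message.content)
```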