
Document OCR: Problem solved?

DocOwl 1.5 from the Alibaba Group takes a big step towards general document analysis. But are we there yet? - written by Dr. Peter Kettmann

Reliable automated document recognition in the sense of Visual Document Understanding (VDU) is a problem that has occupied researchers and developers for decades, so far without satisfactory results. The term VDU refers to understanding the content and semantics of structured data, such as in forms (vehicle registration certificates, identity cards, etc.) or in scientific evaluations with diagrams and tables. The difficulty is that it is by no means sufficient to read out the text shown in the image (a standard OCR already does this). That would only give you a more or less large pile of text in which you would have to search manually for the information you need. What is desired instead is structured data on which a reliable digital process can be built.

What is Visual Document Understanding? An example.

An example: Let's assume we are interested in the emission class of cars (e.g. because we are a GHG analyst) and have to read out hundreds of vehicle registration certificates every day. Doing this manually would be very time-consuming, so we try OCR software. In the case of our company car, for example, the result is something like this:

MS KI261E \n Kaitos GmbH \n 61.2 Vort \n 02.03. 29 \n M1 \n 81480 AA2000257 \n AA \n Tesla \n 0.03 \n E1LR. \n B.Gb 1s5N \n 120001 1288 \n 4694 \n 1443 \n 11225 \n 18.50 \n 1700-1713 \n 1110 \n 11103 \n 2014 \n 1257 \n 1257 \n 2014 \n • Model 3 \n TESLA (USA) \n Fz.z.Pers. bef b. 8 spi \n Limousine \n 51. 5 \n 235/15R18 98Y \n 2351503 9. 08Y \n 715/2007*2018/1832AX \n BLAU \n 2017/1151.WLTP.reine Elekt-69ze1*2007/46*1293 15. \n 115/ \n 0004.031 \n 22.0121 \n LIK GA611190. \n • Battentekapazität 55 KWh \n Dorotheenstraße 26 A \n 48145 Münster \n 02.03.2021

Figure 1: Vehicle registration document of our company car: What is the emission class?

Apart from the usual OCR reading errors: What is the emission class here? To be able to evaluate this automatically and reliably, the result of the document recognition should look more like this:

{
  "license_plate_number": "MS KI261E",
  "owner": "Kaitos GmbH",
  "address": "Dorotheenstraße 26A, 48145 Münster",
  "emission_class": "Elektro"
}

This data can be easily processed automatically and an efficient handling process can be set up.
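To make the difference tangible, here is a minimal Python sketch. The records and field names are hypothetical (the first one loosely mirrors the example above, the second is made up); the point is only that a field lookup replaces the manual search through a pile of text.

```python
from collections import Counter

# Hypothetical structured results, one dict per registration certificate.
# The first record loosely mirrors the example above; the second is invented.
records = [
    {"owner": "Kaitos GmbH", "emission_class": "Elektro"},
    {"owner": "Example AG", "emission_class": "Euro 6"},
]


def count_emission_classes(records):
    """Count how often each emission class occurs in the structured records."""
    return Counter(r["emission_class"] for r in records)


counts = count_emission_classes(records)
print(dict(counts))
```

With the raw OCR string, the same question would require guessing which of the many tokens is the emission class in the first place.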

There are many use cases like this and correspondingly many methods and providers that deal with generating this type of structured data, either directly from an image or on the basis of a raw OCR result. So far, however, none has had resounding success, at least if we are talking about a general solution, in other words a solution that can handle any document, even one it has never explicitly seen during training. A group of researchers from the Alibaba Group is working on just such a solution, and with their new release DocOwl 1.5 they seem to have taken a big step in this direction. Here is a rough outline of their approach.

What is DocOwl 1.5?

DocOwl 1.5 is a large foundation model that essentially consists of a ready-to-use large language model (the LLM Llama-7B) and a so-called Vision Transformer (ViT). The division of tasks between these two components is roughly as follows:

  • The Vision Transformer (a popular AI model for processing images) has the task of processing an image of a document and turning it into a representation with a low number of dimensions (here a vector with 1024 numbers). The latter is also called a feature vector and, after training, should contain information about the words contained in the image.

  • The LLM receives the feature vector from the ViT, extracts the words contained in the image and "understands" their context. It is therefore responsible for assigning a meaning to the data.

In other (very simplified) words: the ViT reads and the LLM understands. That is, the ViT takes over the task of OCR and generates raw text, while the LLM turns the loose sequence of words into a structured data record by recognizing connections within the raw "text heap". This rough structure of DocOwl 1.5 is nothing new in itself and has already been implemented several times in one form or another.
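A small note on the "vector with 1024 numbers": in practice, a ViT typically does not produce one vector for the whole page but one feature vector per image patch. The following sketch only counts how much data that is. The image size and patch size are assumptions chosen for illustration and are not taken from this article; only the feature length of 1024 comes from the description above.

```python
def count_patches(width, height, patch_size):
    """Number of non-overlapping square patches a ViT cuts an image into."""
    if width % patch_size or height % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return (width // patch_size) * (height // patch_size)


# Assumed values for illustration only:
IMAGE_SIZE = 448     # pixels per side of one input image (assumption)
PATCH_SIZE = 14      # pixels per side of one patch (assumption)
FEATURE_DIM = 1024   # length of each feature vector (the "1024 numbers" above)

n_patches = count_patches(IMAGE_SIZE, IMAGE_SIZE, PATCH_SIZE)
n_values = n_patches * FEATURE_DIM
print(n_patches, n_values)
```

Under these assumptions, the LLM does not receive a single vector but a whole sequence of them, which is one reason why high-resolution document images are expensive to process.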

However, what sets DocOwl 1.5 apart from other so-called Multimodal Large Language Models (MLLMs) is, on the one hand, a set of clever architectural decisions in the structure of the individual components and in how they are connected to each other.

On the other hand, it is the way in which it is trained, i.e. with which data and for which tasks (details on the implementation and the training data will follow in a later article). Together, these lead to DocOwl 1.5 developing a kind of spatial understanding, which is essential for understanding structured data in documents. The LLM of DocOwl 1.5 therefore sees not only a raw "heap of text" but also where the individual texts are located on the document. From this, it can deduce how the individual text blocks relate to each other beyond the pure text content. (This is, of course, a highly simplified picture. It is difficult to say exactly where the boundary in the division of tasks between ViT and LLM lies; details will follow in the later article mentioned above.)

What is DocOwl 1.5 capable of?

...localize texts

The localization capabilities are demonstrated very impressively by some examples from the DocOwl 1.5 publication. Figure 2 uses two of them to illustrate the ability of DocOwl 1.5 to localize texts and to read texts at a given location.

Figure 2: Examples of the localization capability of DocOwl 1.5. source:

The first example clearly shows how strong DocOwl 1.5's "spatial understanding" is. In a fairly large document with a lot of different content and a sophisticated layout, it can find the desired sentence and localize it very precisely. The latter in particular is very impressive, as LLMs (and it is the LLM that gets asked the question) are not actually made to "think" spatially, but only have a linguistic understanding of the world. This shows that the interaction between the ViT as the visual component and the LLM as the linguistic component works very well.

The second example in Fig. 2 shows the opposite direction. The model receives the coordinates of the desired text and is then supposed to read it out. This also works very well, i.e. the text is obviously localized well, but the system makes minor errors when reading (the characters printed in red in the model's output). For example, it reads "2/14/20" instead of "2/11/20", which looks like a simple reading error. The second error, "1:45" instead of "1:53", looks more like a localization error, as the system seems to have slipped into the wrong line. The same applies to the error "Mo" instead of "Tu", where it obviously looked one line too far up. The incorrect marking in the text "1:45" shows that similar mistakes happen to humans all the time. The 5 should have been red there too... :)
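The two directions (text to position, position to text) can be illustrated with a toy example. This is not how DocOwl 1.5 works internally; it is a plain lookup over hypothetical words with made-up pixel boxes, meant only to show what the two tasks ask for and how a region that is off by one line yields exactly the kind of error described above.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    box: tuple  # (x1, y1, x2, y2) in pixels


# Hypothetical words with positions: two lines of a small schedule.
words = [
    Word("Mo", (10, 10, 30, 20)), Word("1:45", (40, 10, 70, 20)),
    Word("Tu", (10, 30, 30, 40)), Word("1:53", (40, 30, 70, 40)),
]


def locate(words, text):
    """Text -> box: return the box of the first word matching `text`."""
    for w in words:
        if w.text == text:
            return w.box
    return None


def read_region(words, region):
    """Box -> text: join the words whose box centre lies inside `region`."""
    x1, y1, x2, y2 = region
    hits = []
    for w in words:
        cx = (w.box[0] + w.box[2]) / 2
        cy = (w.box[1] + w.box[3]) / 2
        if x1 <= cx <= x2 and y1 <= cy <= y2:
            hits.append(w.text)
    return " ".join(hits)


print(locate(words, "1:53"))
print(read_region(words, (0, 25, 100, 45)))  # region covering the second line
print(read_region(words, (0, 5, 100, 25)))   # same region shifted one line up
```

The shifted region returns the first line instead of the second, which is the toy equivalent of reading "Mo" and "1:45" where "Tu" and "1:53" were requested.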

The two examples shown, as well as the others that can be found in the paper, indicate to me that the problem of general document recognition has been solved conceptually. In its current implementation, however, the system still has weaknesses (more on the implications in terms of usability below).

...parse tables

Figure 2 has clearly shown that the localization of individual text modules in documents works well. Another example from the DocOwl 1.5 publication shows that the system can also do this on a larger scale. In the upper part of Fig. 3, DocOwl is given the task of reading a table in a document and converting it into a structured data record in Markdown format ("parsing" in technical jargon).

Figure 3: Examples of reading out a table and a bar chart. Source:

You can see that this also works very well. The table seems to me to be read perfectly (despite the poor resolution; or is this just a display error in the publication?) and the values contained in the image would thus be programmatically available for automated further processing. In principle, this would achieve the goal that I described at the beginning using the example of the registration document for our company car. Very impressive!
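What "programmatically available" means can be sketched in a few lines of Python. The Markdown table below is an invented stand-in for a model output (the values are not from the publication), and the parser only handles simple tables without escaped pipe characters.

```python
def split_row(line):
    """Split one Markdown table row into its stripped cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_table(markdown):
    """Turn a simple Markdown table into a list of dicts (one per data row)."""
    lines = [ln for ln in markdown.strip().splitlines() if ln.strip()]
    header = split_row(lines[0])
    # lines[1] is the |---|---| separator row and carries no data.
    return [dict(zip(header, split_row(ln))) for ln in lines[2:]]


# Hypothetical model output; the values are invented for this example.
model_output = """
| Year | Revenue |
| --- | --- |
| 2019 | 120 |
| 2020 | 135 |
"""

rows = parse_markdown_table(model_output)
total = sum(int(r["Revenue"]) for r in rows)
print(rows, total)
```

Once the table is a list of records, sums, filters or database inserts are ordinary code rather than manual work.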

The second example goes a little further. Here, DocOwl's job is harder: instead of just transferring the spatial structure of a table one-to-one, it has to read out a graphically more complex bar chart. The semantically related values are not neatly lined up below or next to each other, but are arranged in a more complex way. Even this does not seem to cause the system any problems, and it again transfers the data into a table in Markdown format.

...analyze documents

As we have seen, DocOwl 1.5 has obviously mastered the most important basics for evaluating documents quite well: the exact localization of texts including the semantic assignment of the individual components to each other, as well as the recognition of the words themselves. The last two examples that I would like to show here illustrate that DocOwl 1.5 not only manages to solve rather "stupid" parsing tasks: It can also extract complex semantic relationships from the data it reads (beyond a basic mapping between them). In other words, it can make full use of the capabilities of a well-trained LLM to understand a document.

Figure 4: Examples from the field of Visual Question Answering (VQA). Source:

The two examples in Figure 4 show the capabilities of DocOwl 1.5 in the area of Visual Question Answering (VQA). Here, the model is asked a question in natural language and is supposed to formulate an appropriate answer in natural language (instead of just reading or parsing the content of the relevant area). To ensure this, the developers have fine-tuned DocOwl 1.5 using a self-compiled "instruction tuning" dataset, similar to how OpenAI created ChatGPT on the basis of GPT-3.5. The result is DocOwl 1.5-Chat, which can formulate detailed overall contexts in natural language.

The first example in Fig. 4 demonstrates this very well. On the one hand, the model returns the requested number, but also embeds it in the context of the entire document. This is shown even more impressively in the second example. The system not only answers the question bluntly, but also explains where the statement comes from. Marked in red, however, it also shows a major and all too familiar weakness of the system: although the last part of the answer sounds extremely plausible, it is made up. Like all LLMs, DocOwl 1.5 is prone to hallucinations, which is to be expected as the language processing is based on a standard LLM.
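One crude way to catch at least some made-up statements is to check whether the numbers in an answer occur anywhere in the text read from the document. The sketch below is such a heuristic; it is not part of DocOwl 1.5, the example strings are invented, and it will miss any hallucination that does not involve a number.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def unsupported_numbers(answer, source_text):
    """Return numbers that appear in the answer but nowhere in the source text."""
    in_source = set(NUMBER.findall(source_text))
    return [n for n in NUMBER.findall(answer) if n not in in_source]


# Invented example: the last part of the answer is not backed by the source.
source_text = "Revenue rose to 135 million in 2020."
answer = "Revenue was 135 million in 2020, up 12.5 percent from 2019."

suspicious = unsupported_numbers(answer, source_text)
print(suspicious)
```

Anything this check flags would go to a human for review; an empty result does not prove that the answer is correct.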

And what can't DocOwl 1.5 actually do?

From the above descriptions, you can see how impressive the capabilities of DocOwl 1.5 are in my view. It can handle large documents with complex structures, localize and read out texts very precisely, and parse structured data such as tables and diagrams into a programmatically accessible format. In the examples shown, however, you could also see that the system still has significant weaknesses. DocOwl 1.5-Chat shows the usual hallucinations, so you have to treat its statements with caution. DocOwl 1.5 itself localizes and reads impressively well for a general approach. My first impression, however, is that it will not come close to specialized solutions. As an upstream step in a semi-automated solution it should be quite helpful, but the reading results and the localization of individual words do not yet appear to be sufficient for fully automated data processing without significant human monitoring.

Another task area that will hardly be accessible for DocOwl 1.5 is the reading of documents that require prior knowledge to understand. One example is German vehicle registration certificates (see Fig. 1): While the fields on the first page (license plate number, address, etc.) can be clearly assigned by their designation on the certificate, this is hardly possible on the other two pages. Which of the fields describes the tire diameter, for example? In principle, you could work here with the printed field numbering, but this is not legible in most of the images I have to deal with in my work.

For such cases, it should be possible to condition the model with contextual knowledge (keyword "one/few-shot learning"), for example with a sample certificate on which the individual fields are annotated with a description so that the model knows the meaning of each field. This is not possible with DocOwl 1.5. Alternatively, you would have to fine-tune the model on a specially created dataset, which would be very time-consuming due to the size of the model, and the advantages over a smaller specialist model would be questionable.

One last problem I see with the productive use of DocOwl 1.5 is its performance on real, "unclean" data. As far as it is described in the paper and recognizable in the examples, the system was mostly trained on images from PDFs, websites or similar, i.e. on perfect, clean images that were artificially distorted at best. Experience shows that there is always a generalization gap: a system that has seen no or too few "real" images such as noisy scans or cell phone photos during training specializes in perfect images and performs worse on real ones. The performance on "real" images therefore remains to be seen and can only be evaluated once the trained models are available ("Coming soon", it says in the official repository).

In any case, I am eagerly awaiting the models and the associated training data. Then we will see what DocOwl 1.5 can and cannot offer under realistic conditions.

Problem solved?

After the explanations in the previous section, it should be clear that my answer to this question is mixed. From my point of view, the problem of general document recognition has been solved conceptually, and it is only a matter of time before the recognition performance is fully ready for the market. In any case, this applies to the pure parsing of documents as with DocOwl 1.5. The chat version has the usual problems with hallucinations, and it is questionable when and how these can be solved. In other words, as far as parsing structured data such as tables is concerned, my answer is "Conceptually, yes".

However, if you look at the usability of such models from an economic point of view, things look somewhat different. The problems known from popular LLMs also occur here. Even more than pure LLMs, MLLMs such as DocOwl 1.5 are extremely resource-hungry. This is particularly true of MLLMs for documents, as high resolutions are required for the input images. How large the model will be in the end can only be answered once the trained weights have been published. My expectation, however, is that it can hardly be run on a standard GPU, let alone trained further. Given the low margins available on the document digitization market, direct commercial exploitation of such MLLMs therefore seems to me to be unprofitable for the time being. Here too, it remains to be seen what future developments will bring and how much the model sizes can be reduced in the future.
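A back-of-the-envelope estimate shows why the hardware question matters. The sketch assumes roughly 7 billion parameters for the language model alone (the "7B" in its name) and counts only the memory needed to hold the weights; the vision component, activations and any training state come on top, and the actual size of the released model may differ.

```python
def weights_memory_gib(n_params, bytes_per_param):
    """Memory needed just to hold the model weights, in GiB."""
    return n_params * bytes_per_param / 1024**3


# Assumption: about 7 billion parameters for the language model alone.
N_PARAMS = 7_000_000_000

for name, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: {weights_memory_gib(N_PARAMS, nbytes):.1f} GiB")
```

Training needs considerably more than inference, since gradients and optimizer state have to be stored in addition to the weights.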

My impression is therefore that general document recognition will not be able to hold its own against specialized models any time soon, both in terms of the quality of information extraction and in terms of cost-effectiveness. Specialized models can be trained to extremely high levels of accuracy and designed to be very lean, so that runtimes in the range of tenths of a second are possible even on consumer-grade CPUs. In my view, the field of fully automated, large-scale document processing still belongs to the small, specialized models. Where the general MLLMs can certainly score points is in the partial automation of processes on a smaller scale. For example, if you only process a document from time to time, or many different types of documents in small quantities, it is not worth the effort of training a specialist. In this case, it may be worthwhile to accept high costs per document and, in case of doubt, to check the model's output manually.
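The trade-off can be written down as a simple break-even calculation. All numbers below are purely hypothetical and only serve to make the arithmetic visible; amounts are in euro cents so that the calculation stays in whole numbers.

```python
def break_even_documents(specialist_setup, specialist_per_doc, general_per_doc):
    """Smallest document count at which the specialist is cheaper overall.

    All amounts are integers in cents. Returns None if the general model
    is never more expensive per document than the specialist.
    """
    saving_per_doc = general_per_doc - specialist_per_doc
    if saving_per_doc <= 0:
        return None
    return specialist_setup // saving_per_doc + 1


# Purely hypothetical numbers, in cents:
n = break_even_documents(
    specialist_setup=2_000_000,  # one-off: data labelling and training
    specialist_per_doc=1,        # lean model on a CPU
    general_per_doc=21,          # GPU inference plus manual checks
)
print(n)
```

Below that document count the general model is the cheaper choice despite its higher cost per document; above it, the one-off effort for the specialist pays off.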

My answer to the question "Problem solved?" is therefore a resounding "Yes! ...and no.".
