
What is DocOwl 1.5 really capable of?

A practical test of document recognition from the Alibaba Group - by Dr. Peter Kettmann


The previous article was about the AI model DocOwl 1.5, the Alibaba Group's latest development in the field of general document recognition. The associated scientific publication received a lot of attention and had a lot to offer: using a number of examples, it demonstrated reliable reading and localization capabilities for text in complex documents, alongside the extraction of tables, the analysis of diagrams, and the evaluation of key information. All in one model, released as open source together with the associated training data. In the corresponding benchmarks, the model achieved outstanding results compared to the competition.


As already described in my previous article, however, some questions arose regarding usability under everyday conditions and from an economic point of view: How does the model perform on "real" data such as dirty scans or even smartphone photos? What hardware does the model require, and what are the runtimes? I would like to answer these questions in this article. To do this, I downloaded the model from the official Huggingface space, prepared some photos, scans and PDF documents, and tested the model on various standard tasks. The results of the tests and an analysis of the hardware requirements follow below. But first, a small disclaimer.


Disclaimer


As you will see in the evaluation of the tests, the results are surprisingly poor in several cases. This is partly because some of the tests are edge cases that I chose to explore the limits of the model. However, the performance of the model is in fact significantly worse than I expected after reading the publication, even on data that I believe exactly covers the training domain.


Leaving aside the possibility of a problematic selection of showcase examples in the original publication (such as a train-test leak in the data), I cannot find a satisfactory explanation for this. Perhaps I used the model incorrectly, but I stuck to the examples from the original repository and tried many variations. In some cases I used the code from official answers to GitHub issues one-to-one, which also did not change anything. In addition, the same prompts work for some images but not for others.


I therefore think it is unlikely that the problem is a usage error on my part. However, perhaps someone would like to try it out for themselves: the online demo is freely available and easy to use. The format of the prompts can be seen in the examples shown below. Perhaps someone will find a better way of getting good results out of the model.


1. Reading localized texts


The first use case that I would like to examine here should be part of the basic repertoire of any good general document recognition system: the targeted reading of specific text areas in a document. To do this, you first localize a relevant text area (e.g., with an upstream text detector) and then instruct the model to read that text. In the following, I test the capabilities of DocOwl 1.5 on this task for various document types, with the format of the queries following the convention from the DocOwl 1.5 publication.
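For readers who want to reproduce this: the queries in this article can be generated roughly as in the following sketch. The normalization of the box coordinates to a 0-999 grid is my assumption based on the values visible in the queries below, and infer() is a placeholder for whatever inference wrapper is used (for example the one from the official mPLUG-DocOwl repository).

from PIL import Image

def bbox_query(image_path, box_px):
    """Build a DocOwl-style reading prompt for a pixel-space box (x1, y1, x2, y2).

    The 0-999 coordinate normalization is an assumption; adjust it if the model
    expects a different convention.
    """
    image = Image.open(image_path)
    w, h = image.size
    x1, y1, x2, y2 = box_px
    norm = [round(999 * v / s) for v, s in ((x1, w), (y1, h), (x2, w), (y2, h))]
    query = ("Identify the text within the bounding box "
             f"<bbox>{norm[0]}, {norm[1]}, {norm[2]}, {norm[3]}</bbox>")
    return image, query

# Usage (infer() is a placeholder for the actual model call):
# image, query = bbox_query("car_registration.jpg", (1200, 1350, 1450, 1420))
# answer = infer(image, query)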


The first case, the middle page of a scanned car registration in good quality, is shown in Fig. 1.


Figure 1: Localization and reading of various fields on a scanned car registration. The red text boxes were added later; the model did not see them.


The text was obviously localized well in all three cases shown, as the reading results match what the model was supposed to read out. The model's spatial understanding therefore seems to work in principle. The model also shows solid performance when reading out the texts: in the second case it makes several errors (although the "l" in "Spl." admittedly looks more like a one), but in the other two cases it reads without errors. I find that somewhat impressive when you consider that the model was applied to an unknown document class without any fine-tuning. However, the first weaknesses in reading out the texts are already apparent here.


So far, so good: scanned documents are actually included in the training set of DocOwl 1.5. What happens when you move significantly outside the training domain? A photo of a car registration can be seen in Fig. 2.


Figure 2: Photo of a car registration for evaluating the model's performance. The red text boxes were added later; the model did not see them.


The first request tested in this photo is for the vehicle manufacturer (="TESLA (USA)"):


Query: „Identify the text within the bounding box <bbox>422, 448, 510, 473</bbox>“


DocOwl 1.5: „2017/1151:WLTP“

This text is actually four lines below, but other than that it is read correctly. This looks like a weakness in localization.

The same thing happens when you try to read the license plate or the tire specifications (middle right): DocOwl 1.5 sticks with the answer "2017/1151:WLTP". When you prompt for the vehicle model ("Model 3"), the answer is "(5", which looks more like a hallucination.


It is difficult to say exactly what happens in the model when reading this photo. It seems as if the model is simply overwhelmed. Obviously it cannot feel overwhelmed, but it is a common observation that models produce erratic, incomprehensible results when confronted with data far outside their training domain, much like people who are completely overwhelmed by a new task for which they received no instructions. In this sense, the behavior shown here is to be expected.


Now that we have pushed the model to its limits, here is another example that should be well covered by the model's training set: a "perfect" invoice, converted directly from a PDF document into an image file. There are no artifacts here as with scans or photos, the alignment of the document is perfect, and the text is absolutely clean. Similar documents are included in large quantities in the training data of DocOwl 1.5, so the reading result should be correspondingly good.
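As a side note, such an image can be produced from a PDF with a few lines of Python, for example using the pdf2image package (which wraps the poppler utilities); the file names here are of course placeholders.

from pdf2image import convert_from_path

# Render the first page of the PDF invoice to an image at 200 dpi.
pages = convert_from_path("invoice.pdf", dpi=200)
pages[0].save("invoice_page1.png", "PNG")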



Figure 3: Localization and reading of various fields on a German invoice (direct PDF export). The text boxes were added later; the model did not see them.


However, if you take a look at the results, they are actually rather disappointing. In the example in Fig. 3, the billing period in the top right of the image was read out perfectly. However, the entry in the table header in the bottom right already shows a first error: the "(EUR)" at the beginning of the entry was hallucinated. This shouldn't happen.


If you now try to read the VAT ID of the invoice issuer in the top left, you get the answer "Invoice date 2024/01/13", which is one line above on the right-hand side, so it has obviously been localized incorrectly.

It is difficult to understand why this happens. One idea is that it is because this is a German document while the model was trained on an English dataset. In my opinion, however, that would lead to reading difficulties and thus to individual spelling errors (or something similar); it is unlikely that it would cause the localization to become grossly confused. In fact, the texts themselves are read without errors; parts were merely hallucinated or incorrectly localized.


To rule out a language problem, you can perform the same experiment with an English document, for example an English sample invoice from canva.com (see Fig. 4). The result is comparable: one field was read correctly, two were incorrectly localized.


Figure 4: Localization and reading of various fields on an English invoice (direct PDF export). The text boxes were added later; the model did not see them. (Source: canva.com)


Something is obviously going wrong here.


2. Reading and localizing tables


The second use case that I would like to look at here is the reading of tables, i.e., parsing them into Markdown format. Markdown is a common format for storing structured data that is very simple but comparatively expressive.


To do this, I again follow the convention of the DocOwl publication and use prompts of the form "Convert the table into Markdown format" or "Parse the table into Markdown syntax". Both produced roughly the same results, which is why I only show the first variant.


First of all, a practical example: An attempt to read tables directly from the two invoice documents shown (blue boxes in Fig. 3 and Fig. 4). For the upper table of the German invoice in Fig. 3, the result is:


Query: „Convert the table within the bounding box <bbox>63, 433, 935, 508</bbox> into Markdown format.“
DocOwl 1.5: „| Rechnung | Preis/Einheit (EUR) | Anzahl | Steuern (%) | Total (EUR) | 
| --- | --- | --- | --- | --- | 
| Beschreibung | 8.39 | 1 | 19 | 8.39 | 
| Premium- | 8.39 | 1 | 19 | 8.39 | 
| Konnektivität | 8.39 | 1 | 19 | 8.39 | 
| Teilsumme | 8.39 | 1 | 19 | 8.39 | 
| Gesamtbetrag (EUR) | 9.99 | 1 | 19 | 9.99 | 
| Code | Gesamtbetrag (EUR) | Steuern (%) | Gesamtbetrag (EUR) | Gesamtbetrag (EUR) | 
| DESR | 8.39 | 19.00 | 1.60 | 1.60 | 
| Tesla Germany GmbH Ludwig-Prandtl-Straße 27 - 29, 12568 Berlin, Deutschland Amtgericht Charlottenburg, HRB 235281 B Geschäftsführer: Stephan Werkman, Simon Zwahlen, Joseph Gruber“

And for the bottom one:


Query: „Convert the table within the bounding box <bbox>63, 433, 935, 508</bbox> into Markdown format.“
DocOwl 1.5: „Tesla Germany GmbH 
Ludwig-Prandtl-Straße 27 - 29
12526 Berlin DE
UST-IdNr: DE265761887
[...]
Code Gesamtbetrag (EUR) Steuer (%) Gesamtsumme Steuern (EUR) 
DESR 8.39 19.00 1.60“

The first attempt at least produces a Markdown table, but with rows and columns mixed up and both tables merged into one. With the second table, DocOwl 1.5 only reads out parts of the document text. The same thing happens when trying to read the table in the English invoice. Simultaneous localization and extraction of tables obviously does not work.


However, in the publication for DocOwl 1.5 it looks as if table extraction was not trained in combination with localization. So here is another attempt: reading the isolated tables.


3. Reading isolated tables


To read the isolated tables, I cropped them out of the invoice documents shown above and instructed the model as usual, following the convention from the DocOwl 1.5 release.




Figure 5: Reading the isolated tables from Fig.3.


The first table of the German invoice results in the output of part of the table header, followed by thousands of spaces. Here, the language model responsible for the text output apparently only generates the stop token very late. Perhaps this is because the column separators are missing in the table header?
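Whatever the cause, in practice you would want to protect yourself against this failure mode. Assuming the wrapped language model exposes the standard Hugging Face transformers generate() interface, a hard cap on the number of generated tokens is a simple guard; the model and inputs objects in this sketch are placeholders for whatever the DocOwl wrapper uses internally.

def generate_capped(model, inputs, max_new_tokens=512):
    """Run generation with a hard upper bound on the answer length.

    Guards against the runaway output described above, where the stop token
    only appears after thousands of filler tokens.
    """
    return model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # hard cap on the output length
        do_sample=False,                # greedy decoding for reading tasks
    )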


The second table works perfectly. The text is read without errors and the output is valid Markdown. The same applies to the table from the English sample invoice. Very good! It looks like reading isolated tables from clean PDF exports works well in principle, with some instabilities that would have to be fixed for productive use.
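For productive use, one would also want to validate the returned Markdown before passing it on. A crude sanity check of my own (not part of DocOwl) could look like this:

def is_wellformed_markdown_table(text):
    """Check that every row has the same number of columns and that the second
    row is a header separator. A crude sanity check for model output, not a
    full Markdown parser."""
    rows = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if len(rows) < 2 or not all(r.startswith("|") for r in rows):
        return False
    if any(r.count("|") != rows[0].count("|") for r in rows):
        return False
    separator_cells = rows[1].strip("|").split("|")
    return all(cell.strip() and set(cell.strip()) <= set("-:") for cell in separator_cells)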


Figure 6: Reading the isolated table from Fig.4.


4. Extraction of key information


To test the extraction of key information, I focused on examining the two invoices, as the field identifiers in the images of the vehicle registration shown above are not readable. A general document recognition system that has not been explicitly trained on the latter documents therefore has no chance of assigning a semantic meaning to the printed text.


The picture is mixed for the two invoice documents. When asked about the invoice issuer and the total amount of the German invoice, the model answers correctly. When asked about the invoice recipient, however, it answers "Tesla Germany GmbH", i.e., the invoice issuer. It should be noted again, though, that the model was not trained on German (even if the underlying LLM is likely to be largely language-agnostic).


Figure 7: Extraction of various key information from a German invoice (direct PDF export). The text boxes were added later; the model did not see them.


So what about the English sample invoice? The result is similar: in this case, the invoice recipient and total amount are correct, but the invoice issuer is incorrect. However, if you ask the model "Who is the sender?", the correct result is obtained.


Figure 8: Extraction of various key information from an English invoice (direct PDF export). The text boxes were added later; the model did not see them.


Thus, the quality of the result in this case is mixed, even though most of the key information was read correctly.


Hardware requirements and runtime


All experiments described above were carried out on an NVIDIA GeForce RTX 4090 with 24 GB of VRAM, currently the most powerful consumer GPU available. For comparison: a Tesla T4, the standard cloud GPU for inference tasks (i.e. for model execution only, not for training), has 16 GB of VRAM, achieves about 1/8 of the computing speed of an RTX 4090, and costs around €300 per month per GPU instance (possibly less if runtime discounts apply).


During inference, approximately 19 GB of VRAM was consumed on the GeForce RTX 4090, which exceeds the capacity of a Tesla T4, while the runtime was approximately 0.7-3.6 seconds per query, depending on the length of the model output (the failed table-extraction attempt, which produced thousands of spaces, is excluded from this figure).
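For anyone who wants to reproduce these measurements, peak VRAM usage and wall-clock runtime per query can be recorded with PyTorch's built-in CUDA statistics, roughly as in the following sketch (the infer call is again a placeholder for the actual model invocation):

import time
import torch

def measure(call, *args, **kwargs):
    """Return (result, runtime in seconds, peak VRAM in GB) for one model call.

    Note that max_memory_allocated() only counts memory managed by PyTorch,
    so nvidia-smi will typically report a slightly higher number.
    """
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    result = call(*args, **kwargs)
    torch.cuda.synchronize()  # wait for all GPU work to finish before stopping the clock
    runtime_s = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return result, runtime_s, peak_gb

# Usage: answer, runtime_s, peak_gb = measure(infer, image, query)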


Both values, memory usage and runtime, are quite high for a single AI model. If you don't have your own data center and want to host DocOwl 1.5 in the cloud, the cheapest option would therefore be a machine with two Tesla T4s, at a total cost of around €600 per month per model instance (less any discounts, if applicable), and that with an expected runtime of around 6-24 seconds for common queries. To achieve reasonable runtimes, you would have to switch to more powerful GPU models such as the A100, which, however, costs €1000-2000 per month depending on long-term discounts and the required hardware peripherals.


Conclusion


After examining several use cases on different document types, a mixed and rather disappointing picture of the performance of DocOwl 1.5 emerges. In the case of a good scan and perfect PDF exports, the quality of the reading results is quite acceptable, even if there were isolated reading errors and multiple localization errors. Reading isolated tables also worked well in some cases, but in one case it led to an instability of the model. Key information was extracted well in many cases, but the result is not very reliable, since the sender and recipient were confused in both cases examined.


Based on the results shown in the publication on DocOwl 1.5, I would have expected that perfect PDF documents would work flawlessly. Overall, DocOwl 1.5 seems more like a first proof of concept that shows the basic feasibility of general document recognition, but still needs some work before it can be used productively. After an initial look at the training data from DocOwl 1.5, I suspect that this work will have to be done primarily on the data side (diversification and quality assurance).


I would also take a critical view of the economic viability of DocOwl 1.5. In order to build an API that offers reasonable runtimes for customers (leaving aside the question of whether customers would be satisfied with the reading performance at this point), you would have to resort to expensive GPU models, which can quickly lead to several thousand euros in running costs per month for a few hosted model instances. In my opinion, such amounts would be difficult to pass on to customers.
