How to apply vision language models to long documents
Vision language models (VLMs) are powerful models that take images as input, rather than only text like traditional LLMs. This opens up a lot of possibilities, since we can process the contents of a document directly, rather than using OCR to extract the text and then feeding that text into an LLM.
In this article, I will discuss how you can apply vision language models (VLMs) to long-context document understanding tasks. This means applying VLMs either to very long documents of more than 100 pages or very dense documents that contain a lot of information, such as graphics. I will discuss what to consider when implementing VLMs, and what kind of tasks you can perform with them.

Why do we need VLMs?
I’ve discussed VLMs a lot in my previous articles, and explained why they’re important for understanding the contents of some documents. The main reason VLMs are needed is that a lot of information in documents requires visual input to understand.
An alternative to VLMs is to use OCR and then feed the extracted text to an LLM. The problem here is that you are only extracting the text from the document, and not including the visual information, such as:
- Where different text is placed in relation to other text
- Non-textual information (basically anything that is not a character, such as symbols or graphics)
- Where text is placed in relation to non-textual information
This information is often essential to really understanding the document, so it’s often better to use VLMs directly, feeding in the page images so the model can also interpret the visual information.
For long documents, using VLMs is a challenge, since you need a lot of tokens to represent visual information, so processing hundreds of pages quickly adds up. However, with recent developments in VLM technology, models are getting better at compressing visual information into a reasonable number of tokens, making it practical to apply VLMs to long documents for document understanding tasks.
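To make this concrete, here is a minimal sketch of what feeding page images directly to a VLM can look like, assuming an OpenAI-compatible chat API and the pdf2image library; the model name, file name and prompt are placeholders:

```python
import base64
import io

from openai import OpenAI
from pdf2image import convert_from_path  # requires poppler to be installed

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key

def page_to_data_url(page_image) -> str:
    # Encode a rendered page (a PIL image) as a base64 data URL
    buffer = io.BytesIO()
    page_image.save(buffer, format="PNG")
    encoded = base64.b64encode(buffer.getvalue()).decode()
    return f"data:image/png;base64,{encoded}"

# Render the first few pages of the PDF as images
pages = convert_from_path("document.pdf", dpi=200)[:5]

response = client.chat.completions.create(
    model="gpt-5",  # placeholder; any vision-capable model works
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize this document."},
            *[
                {"type": "image_url", "image_url": {"url": page_to_data_url(p)}}
                for p in pages
            ],
        ],
    }],
)
print(response.choices[0].message.content)
```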

Optical character recognition using VLMs
One good option for processing long documents, while still including visual information, is to use VLMs to perform optical character recognition. Traditional OCR engines such as Tesseract extract only the text and its bounding boxes. VLMs, however, are also trained to perform OCR, and can do more advanced text extraction, such as:
- Extracting Markdown
- Explaining purely visual information (e.g. if there is a drawing, describing the drawing with text)
- Adding missing information (e.g. if there is a box labeled "Date" with an empty field after it, you can have the model output the "Date" field and mark it as empty)
Recently, Deepseek released a powerful VLM-based OCR model, which has received a lot of attention and traction, making VLMs for OCR even more popular.
Markdown
Markdown is very powerful, because you are extracting rich text. This allows the model to:
- Provide headers and subheadings
- Represent tables accurately
- Make text bold
This allows the model to extract more representative text that more accurately depicts the contents of the document. If you now apply LLMs to this text, they will perform much better than if you apply them to plain text extracted using traditional OCR.
LLM performs better on formatted text such as Markdown, compared to pure text extracted using traditional OCR.
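As a rough sketch of what this looks like in practice, you can simply prompt a vision-capable model to transcribe a page into Markdown. This reuses the `client` and `page_to_data_url` helpers from the earlier sketch; the prompt and model name are illustrative:

```python
# Reuses `client` and `page_to_data_url` from the previous sketch.
OCR_PROMPT = (
    "Transcribe this page into Markdown. Preserve headings, bold text and "
    "tables, and describe any figures or drawings in one sentence."
)

def ocr_page_to_markdown(page_image) -> str:
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder; any vision-capable model works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": OCR_PROMPT},
                {"type": "image_url", "image_url": {"url": page_to_data_url(page_image)}},
            ],
        }],
    )
    return response.choices[0].message.content
```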
Explanation of visual information
Another thing you can use VLM OCR for is annotating visual information. For example, if you have a drawing that does not contain text, traditional OCR will not extract any information, because it is only trained to extract text characters. However, you can use VLMs to explain the visual contents of an image.
Imagine that you have the following document:

This is the introduction text of the document
[An image of the Eiffel Tower]
This is the conclusion of the document

If you run traditional OCR such as Tesseract on it, you will get the following output:

This is the introduction text of the document
This is the conclusion of the document

This is obviously a problem, since you are not including any information about the image showing the Eiffel Tower. Instead, you should use a VLM, which will output something like:

This is the introduction text of the document
This image depicts the Eiffel Tower during the day
This is the conclusion of the document

If you run an LLM on the first text, it naturally has no way of knowing that the document contains an image of the Eiffel Tower. However, if you run the LLM on the second text, extracted using a VLM, it will be much better at answering questions about the document.
Add missing information
You can also ask VLMs to explicitly flag missing information. To understand this concept, look at the image below:

[A form with an Address field filled in with "Road 1", an empty Date field, and a Company field filled in with "Google"]
If you apply traditional OCR to this image, you get:
Address Road 1
Date
Company Google

However, it would be more representative to use a VLM, which, if asked, can output something like:

Address: Road 1
Date: [empty]
Company: Google

This is more useful, because we now know that the Date field in the form was left empty. If we don’t provide this information, it’s impossible to tell whether the date is actually missing, whether OCR simply failed to extract it, or whether something else went wrong.
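A hedged sketch of how you might prompt for this kind of extraction, returning the fields as JSON with explicit nulls for empty values (the field names and schema are illustrative, and the helpers come from the earlier sketches):

```python
import json

# Reuses `client` and `page_to_data_url` from the earlier sketches.
FORM_PROMPT = (
    "Extract every labeled field on this form as JSON. "
    "If a field label is present but its value is empty, return null for it. "
    'Example: {"address": "Road 1", "date": null, "company": "Google"}'
)

def extract_form_fields(page_image) -> dict:
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": FORM_PROMPT},
                {"type": "image_url", "image_url": {"url": page_to_data_url(page_image)}},
            ],
        }],
    )
    # A production version should validate the output or use structured
    # output features instead of trusting raw json.loads.
    return json.loads(response.choices[0].message.content)
```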
However, OCR using VLMs still suffers from some of the problems of traditional OCR, because the downstream model does not process the visual information directly. You may have heard the saying "A picture is worth a thousand words", which often applies to processing visual information in documents. Yes, you can produce a textual description of a drawing using VLM-based OCR, but that text will never be as descriptive as the drawing itself. Therefore, I believe that in many cases it is better to process documents directly with VLMs, as I will cover in the following sections.
Open source models versus closed source models
There are a lot of VLMs available. I follow the HuggingFace VLM Leaderboard to keep track of new high-performing models. According to this leaderboard, you should choose Gemini 2.5 Pro or GPT-5 if you want to use closed source models through an API. In my experience, these options are great and work well for understanding both long and complex documents.
However, you may also want to use open source models, for privacy, cost, or to have more control over your own application. In this case, SenseNova-V6-5-Pro tops the leaderboard. I haven’t tried this model personally, but I’ve used Qwen 3 VL a lot and have had a good experience with it. Qwen also released a specific cookbook for understanding long documents.
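If you want to experiment with an open source VLM locally, a minimal sketch using the transformers image-text-to-text pipeline could look roughly like this. The checkpoint id and image path are assumptions; check the model card of the model you pick for its recommended loading code:

```python
from transformers import pipeline

# Checkpoint id is an assumption; substitute the VLM you actually want to run.
pipe = pipeline(
    "image-text-to-text",
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    device_map="auto",
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "page_1.png"},  # placeholder path to a page image
        {"type": "text", "text": "Transcribe this page into Markdown."},
    ],
}]

outputs = pipe(text=messages, max_new_tokens=1024, return_full_text=False)
print(outputs[0]["generated_text"])
```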
VLMs on long documents
In this section I’ll talk about applying VLMs to long documents, and the considerations you should make when doing so.
Addressing compute considerations
If you’re running an open source model, one of your main considerations is how big a model you can run, and how long inference takes. You’ll typically need access to a large GPU, at least an A100 in most cases. Fortunately, this is widely available and relatively cheap (usually $1.5-$2 per hour from many cloud providers now). However, you should also consider what latency you can accept. Running VLMs requires a lot of processing, and you should consider the following factors:
- What is an acceptable amount of time to spend processing one request?
- What image resolution do you need?
- How many pages do you need to process?
If you have a live chat for example, you need fast processing, but if you are processing in the background, you can allow for longer processing times.
Image resolution is also an important consideration. If you want the model to be able to read text in documents, you need high-resolution images, usually greater than 2048 x 2048, although this naturally depends on the document. Detailed graphics containing small text, for example, will require a higher resolution. Increasing resolution significantly increases processing time, so it is an important trade-off. You should aim for the lowest resolution that still lets you perform all the tasks you need. The number of pages involves a similar trade-off: adding more pages is often necessary to access all the information in a document, but the most important information is often found early in the document, so you can sometimes get away with only processing the first ten pages, for example.
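Here is a small sketch of capping both the page count and the image resolution before sending anything to the model; the 2048-pixel cap and ten-page limit are simply the example numbers from this section:

```python
from pdf2image import convert_from_path  # requires poppler to be installed
from PIL import Image

def prepare_pages(
    pdf_path: str, max_pages: int = 10, max_side: int = 2048
) -> list[Image.Image]:
    # Render at most `max_pages` pages and cap the longest side at `max_side` pixels
    pages = convert_from_path(pdf_path, dpi=200)[:max_pages]
    resized = []
    for page in pages:
        scale = min(1.0, max_side / max(page.size))
        new_size = (int(page.width * scale), int(page.height * scale))
        resized.append(page.resize(new_size, Image.LANCZOS))
    return resized
```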
Answer-dependent processing
Something you can try to reduce the processing power required is to start simple, and only progress to heavier processing if you don’t get the answers you need.
For example, you could start by only looking at the first ten pages and see whether you can solve the task at hand, such as extracting a specific piece of information from the document. Only if you cannot extract the information do you start looking at more pages. You can apply the same concept to image resolution, starting with the lowest resolution and moving up to the highest resolution required.
This hierarchical processing reduces the processing power required, since most tasks can be solved by looking at only the first ten pages, or by using lower resolution images. Only when necessary do you move on to processing more pages or higher resolution images.
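A sketch of this escalation loop, where `ask_vlm` and `answer_found` are placeholders for your own VLM call and answer-validation logic, and `prepare_pages` comes from the earlier sketch:

```python
# Escalation plan: cheap settings first, heavier processing only if needed.
ATTEMPTS = [
    {"max_pages": 10, "max_side": 1024},   # cheap first pass
    {"max_pages": 10, "max_side": 2048},   # same pages, higher resolution
    {"max_pages": 100, "max_side": 2048},  # full document as a last resort
]

def answer_question(pdf_path: str, question: str) -> str | None:
    for settings in ATTEMPTS:
        pages = prepare_pages(pdf_path, **settings)  # from the earlier sketch
        answer = ask_vlm(pages, question)            # placeholder VLM call
        if answer_found(answer):                     # placeholder validity check
            return answer
    return None  # give up, or fall back to manual review
```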
Cost
Cost is an important consideration when using VLMs. I process a lot of documents, and I usually see around a 10x increase in the number of input tokens when using images (VLMs) instead of text (LLMs). Since input tokens are often the cost driver in long document jobs, using VLMs usually increases the cost significantly. Note that for OCR, input tokens are not the whole story: OCR also produces a lot of output tokens, since it has to write out all the text contained in the images.
Therefore, when using VLMs, it is extremely important to maximize your use of cached tokens, a topic I discussed in my last article about optimizing LLMs for cost and response time.
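One practical consequence is to keep the long, stable part of the prompt (the document pages) byte-identical across requests and put the part that changes (the question) last, so providers that support prefix caching can reuse the document tokens. A hedged sketch, reusing the earlier helpers:

```python
# Reuses `client`, `page_to_data_url` and `prepare_pages` from earlier sketches.
pages = prepare_pages("document.pdf")
document_content = [
    {"type": "image_url", "image_url": {"url": page_to_data_url(p)}} for p in pages
]

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder
        messages=[
            # Stable, expensive prefix: identical across calls, so it can be cached
            {"role": "user", "content": document_content},
            # Only the question changes between calls
            {"role": "user", "content": [{"type": "text", "text": question}]},
        ],
    )
    return response.choices[0].message.content
```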
Conclusion
In this article I discussed how you can apply vision language models (VLMs) to long documents to handle complex document understanding tasks. I covered why VLMs are important, and techniques for using them on longer documents. You can, for example, use VLMs for more advanced OCR, or apply them directly to long documents, while keeping an eye on the required processing power, cost and response time. I think VLMs are becoming increasingly important, as highlighted by the recent release of Deepseek OCR, so VLMs for document understanding is a topic worth keeping up with, and learning how to use them in document processing applications is well worth your time.