How to Extract Metadata from Complex Documents

Documents often contain large amounts of valuable information. However, this information is, in most cases, buried deep in the content of the documents, which makes it difficult to use in downstream tasks. In this article, I'll discuss how to automatically extract metadata from your documents, covering the main methods of metadata extraction and the challenges you'll face along the way.
The article is a high-level overview of extracting metadata from documents, highlighting the different considerations you should make when setting up metadata extraction.
Why extract document metadata
First, it is important to clarify why we need to extract metadata from documents. After all, if the information is already in the documents, can't we just find the information using RAG or other similar methods?
In most cases, RAG would be able to find specific data points, but pre-extracted metadata simplifies many downstream tasks. Using metadata, you can, for example, sort your documents based on data points such as:
- Document type
- Addresses
- Dates
In addition, if you have a RAG system in place, it will, in most cases, benefit from the addition of metadata. This is because it presents additional information (the metadata) explicitly to the LLM. For example, let's say you're asking a question related to dates. In that case, it's easier to provide pre-extracted document dates to the model, instead of having the model extract dates at query time. This saves on both cost and latency, and is likely to improve the quality of your RAG responses.
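As a minimal illustration, here is one way pre-extracted metadata could be passed explicitly to the model alongside retrieved chunks in a RAG prompt; the field names and document structure are just examples, not a prescribed schema:

```python
def build_rag_prompt(question: str, chunks: list[str], metadata: dict) -> str:
    # Present metadata explicitly so the LLM does not have to re-extract it
    # from the retrieved text at query time.
    metadata_lines = "\n".join(f"- {key}: {value}" for key, value in metadata.items())
    context = "\n\n".join(chunks)
    return (
        f"Document metadata:\n{metadata_lines}\n\n"
        f"Retrieved context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    question="When was the lease signed?",
    chunks=["...retrieved passages from the lease..."],
    metadata={"document_type": "lease", "date": "2024-10-22"},
)
```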
How to Extract Metadata
I highlight three main methods of extracting metadata, going from simple to complex:
- Regex
- OCR + LLM
- Vision LLMs

Regex
Regex is a simple and consistent way to extract metadata. It works best if you know the exact format of the data in advance. For example, if you are processing leases, and you know that the date is always written as DD.MM.YYYY right after the words “Date:”, regex is the way to go.
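As a minimal sketch of this lease example, a regex could look like the following (the exact pattern and sample text are just for illustration):

```python
import re

# Extract a date written as DD.MM.YYYY immediately after "Date:".
# This only works because the format is known and consistent in advance.
DATE_PATTERN = re.compile(r"Date:\s*(\d{2}\.\d{2}\.\d{4})")

text = "Lease agreement\nDate: 22.10.2024\nTenant: ..."
match = DATE_PATTERN.search(text)
if match:
    print(match.group(1))  # -> 22.10.2024
```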
Unfortunately, most document processing is more complicated than this. You will have to deal with inconsistent documents, and challenges such as:
- Dates are written in various places in the document
- The text has lost some characters due to poor OCR
- Dates are written in different formats (e.g. DD.MM.YYYY, October 22, December 22, etc.)
Because of this, we often have to go to more complex methods, such as OCR + LLM, which I will explain in the next section.
OCR + LLM
A more powerful way to extract metadata is to use OCR + LLM. This process first applies OCR to the document to extract its text content. Then you take the OCR'd text and prompt an LLM to extract the date from the document. This is often very effective, because LLMs are good at understanding context (which dates are relevant and which are not), and they can handle dates written in all sorts of different formats. LLMs will, in most cases, also understand both the European date standard (DD.MM.YYYY) and the American date standard (MM.DD.YYYY).
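As a minimal sketch of this pipeline, the snippet below uses pytesseract for the OCR step and the OpenAI Python client for the extraction step; the model name and prompt wording are assumptions for illustration, not a fixed recipe:

```python
import pytesseract
from PIL import Image
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def extract_date(image_path: str) -> str:
    # Step 1: OCR the document page into plain text.
    text = pytesseract.image_to_string(Image.open(image_path))

    # Step 2: prompt the LLM to pick out the document date from the OCR'd text.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable text model works here
        messages=[{
            "role": "user",
            "content": (
                "Extract the document date from the text below. "
                "Answer with the date only, in YYYY-MM-DD format.\n\n" + text
            ),
        }],
    )
    return response.choices[0].message.content.strip()
```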

However, in some cases, the metadata you want to extract requires visual information. In these cases, you need to use the most advanced approach: Vision LLMs.
Vision LLMs
Using Vision LLMs is the most complex method, with the highest latency and cost. In most cases, running a Vision LLM will cost more than running a text-based LLM.
When running Vision LLMs, you usually have to ensure that the images are of high resolution, so the model can read the text in the documents. This in turn requires a lot of vision tokens, making processing more expensive. However, Vision LLMs with high-resolution images will often be able to extract complex details that OCR + LLM cannot, for example checkboxes, which I'll return to below.

Vision LLMs also work well on handwritten text, where traditional OCR often struggles.
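As a minimal sketch, the snippet below sends a base64-encoded page image to a vision-capable model via the OpenAI chat API; the model name, resolution setting, and prompt are assumptions for illustration:

```python
import base64
from openai import OpenAI

client = OpenAI()

def extract_with_vision(image_path: str, field_description: str) -> str:
    with open(image_path, "rb") as f:
        b64_image = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: a vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Extract the following from this page: {field_description}"},
                # High detail costs more vision tokens, but lets the model read
                # small text, checkboxes, and handwriting.
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{b64_image}",
                    "detail": "high",
                }},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```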
Challenges When Extracting Metadata
As I pointed out earlier, documents are complex and come in a variety of forms. So there are many issues to deal with when extracting metadata from documents. I will highlight three main challenges:
- When to use Vision LLMs vs OCR + LLM
- Dealing with handwritten text
- Dealing with Long Texts
When to Use Vision LLMs vs OCR + LLM
Ideally, we would use Vision LLMs for all metadata extraction tasks. However, this is often infeasible due to the cost of running Vision LLMs. So, we have to decide when to use Vision LLMs and when to use OCR + LLM.
One thing you can do is decide whether the metadata point you want to extract requires visual information or not. If it's a date, OCR + LLM will work well in almost all cases. However, if you know that you are dealing with checkboxes, like the example task I mentioned above, you need to use Vision LLMs, as in the routing sketch below.
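As a small illustration of this routing decision, a sketch could look like the following (the field names are hypothetical):

```python
# Fields that depend on visual layout (e.g. checkboxes, signatures) go to the
# Vision LLM; plain-text fields (e.g. dates) can use the cheaper OCR + LLM path.
FIELDS_REQUIRING_VISION = {"checked_options", "signature_present"}

def choose_method(field_name: str) -> str:
    if field_name in FIELDS_REQUIRING_VISION:
        return "vision_llm"    # visual cues needed
    return "ocr_plus_llm"      # plain text is sufficient

print(choose_method("date"))             # -> ocr_plus_llm
print(choose_method("checked_options"))  # -> vision_llm
```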
Dealing with handwritten text
One problem with the methods mentioned above is that some documents may contain handwritten text, which traditional OCR is not particularly good at extracting. If your OCR output is bad, the LLM extracting the metadata will also perform poorly. So, if you know you're dealing with handwritten text, I recommend using Vision LLMs, as they are far better at dealing with handwriting, in my experience. It is also important to be aware that many documents contain both digital and handwritten text.
Dealing with Long Texts
In many cases, you will also have to deal with very long documents. If so, you should consider how far into the document the metadata point is likely to be.
The reason to consider this is cost: if you process very long documents, you have to feed more input tokens to your LLM, which is more expensive. In most cases, an important piece of information (a date, for example) will be at the beginning of the document, so you won't need many input tokens. In some cases, however, the relevant piece of information may be on page 94, where you need far more input tokens.
The issue, of course, is that you don't know in advance which page the metadata is on. Therefore, you have to make a trade-off, for example only looking at the first 100 pages of a given document, on the assumption that the metadata is available within the first 100 pages for almost all documents. You will miss a data point on the rare occasion when the data is on page 101 or later, but you will save a lot of cost, as in the sketch below.
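As a minimal sketch, capping the input could look like this (the page threshold and document representation are assumptions):

```python
# Only send the first N pages of a long document to the LLM. Metadata on later
# pages will be missed, but input token cost stays bounded.
MAX_PAGES = 100

def pages_to_prompt_text(pages: list[str], max_pages: int = MAX_PAGES) -> str:
    return "\n\n".join(pages[:max_pages])

# Example: a 500-page document reduced to its first 100 pages before prompting.
prompt_text = pages_to_prompt_text([f"Page {i} text..." for i in range(1, 501)])
```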
Conclusion
In this article, I discussed how to automatically extract metadata from your documents. This metadata often comes in handy for downstream tasks, such as sorting your documents based on data points. In addition, I discussed the three main methods of extracting metadata: regex, OCR + LLM, and Vision LLMs, and covered some of the challenges you will face when extracting metadata. I think metadata extraction is often a task that does not require a lot of effort, but that can add a lot of value to downstream applications. I therefore believe that metadata extraction will remain important in the coming years, although I expect we will see more metadata extraction move toward Vision LLMs instead of OCR + LLM.
👉 Find me in the community:
🧑💻 Get in touch
📩 Subscribe to my newsletter
🔗 LinkedIn
🐦 X / Twitter
✍️ Medium