How To Extract Data From Invoices With Azati OCR

We live in a world full of new technologies, yet many large corporations still need digital transformation. During digital transformation, vendors integrate the latest technologies into all areas of a business, fundamentally changing how the company operates and delivers value to customers.

Many companies still store valuable data on paper, so extracting data from documents is a widespread problem, especially when there are millions of them. That’s why our engineers recently built a custom OCR engine.

The OCR engine was designed to solve one particular business problem for the Oil & Gas industry: extracting data from complex documents such as pipeline layouts, industrial plans, manufacturing schemes, and maps received from third-party vendors. As the general-purpose OCR software available online cannot process these documents, we had no choice but to build a more sophisticated engine.

After our team solved the problem, we found that the engine can process not only complex documents but also invoices, tax forms, loans, shipping orders, price lists, and other documents with high accuracy.

Today we are happy to announce Azati OCR, an Optical Character Recognition (OCR) engine powered by machine learning. We already have a few successful integration case studies with companies ranked in the Fortune Global 500.

Let’s have a closer look at the main features:

  1. Hand-crafted machine learning techniques ensure intelligent text recognition;
  2. OCR uses a flexible system of automatically recognized templates;
  3. We can apply the engine to any on-paper documents including technical documents: industrial plans, various diagrams, graphs, and charts;
  4. High accuracy: up to 97% when recognizing objects of high complexity, and up to 98.8% when recognizing plain text;
  5. The recognition rate grows as the number of processed documents increases.

How Azati OCR works:

Unfortunately, a considerable number of so-called OCR systems are human-powered. This means that behind the machine learning label is a group of data entry specialists who extract all the required data manually. There is a small chance that your confidential data becomes available to a third party. This is especially critical for companies located in Europe, where the General Data Protection Regulation (GDPR) applies and a governmental institution may fine the customer.

Azati OCR is different. We rely on cloud computing to process your documents. Our engineers can deploy the OCR engine in any country, or even in a private cloud with no Internet access. At Azati, we respect user privacy and data security.

Now let us briefly explain how Azati OCR works.

Stage 1: Our engineers train a machine learning model to recognize the documents and automatically divide them into several categories. Our specialists look through these categories to determine groups that are later used to create templates. For example, one group includes all invoices without forms, while another group consists of all invoices with hand-written signatures.

Stage 2: We create a template for each group, and after that, this template is used to process all documents in this group. To achieve maximum accuracy, our specialists manually map the areas of a document.
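To make the idea of a mapped template more concrete, here is a minimal sketch in Python. It is not the actual Azati OCR implementation: the Region and Word classes, the INVOICE_TEMPLATE dictionary, the coordinates, and the apply_template function are all hypothetical names and values chosen for illustration. The sketch assumes that a template is a set of named rectangular areas and that recognized words arrive with the position of their center on the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A rectangular page area in normalized coordinates (0.0 to 1.0)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Word:
    """One recognized word with the center of its bounding box."""
    text: str
    x: float
    y: float


# Hypothetical template for one invoice group: field name -> mapped area.
INVOICE_TEMPLATE = {
    "sender": Region(0.05, 0.05, 0.45, 0.20),
    "recipient": Region(0.55, 0.05, 0.95, 0.20),
    "total": Region(0.60, 0.80, 0.95, 0.90),
}


def apply_template(template: dict[str, Region], words: list[Word]) -> dict[str, str]:
    """Collect the words that fall inside each mapped area, in reading order."""
    fields = {}
    for name, region in template.items():
        inside = [w for w in words if region.contains(w.x, w.y)]
        # Sort top to bottom, then left to right.
        inside.sort(key=lambda w: (round(w.y, 2), w.x))
        fields[name] = " ".join(w.text for w in inside)
    return fields


# Made-up OCR output for a single page.
words = [
    Word("Acme", 0.10, 0.10), Word("Ltd", 0.18, 0.10),
    Word("Globex", 0.60, 0.10),
    Word("1,250.00", 0.70, 0.85), Word("USD", 0.85, 0.85),
    Word("Thank", 0.10, 0.95),  # outside every mapped area, so it is ignored
]
result = apply_template(INVOICE_TEMPLATE, words)
print(result)
```

Because the areas are stored as data rather than code, re-mapping a template in this sketch only means editing the coordinates.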

Impressive feature: As an alternative to manual mapping, we created automatic layout detection. This technology works with canned pieces of documents. It looks for similarities across different documents and processes these parts separately. Finally, the OCR engine joins all the pieces found in a single document into one entity.

This method is usually applied when digitizing complex documents such as charts. First, the abbreviations are marked manually, and then the engine searches for these objects in all documents.

Stage 3: Azati OCR processes each document multiple times to provide maximum quality and accuracy. As a result, the system exports structured or semi-structured data in XML, CSV or JSON formats.
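As an illustration of what such an export could look like, the following Python sketch serializes one extracted invoice record into the three formats using only the standard library. The record contents, the field names, and the "invoice" root element are assumptions made for this example; the real output schema of Azati OCR is not described here.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical extraction result for a single invoice.
record = {
    "sender": "Acme Ltd",
    "recipient": "Globex",
    "total": "1250.00",
    "currency": "USD",
}


def to_json(rec: dict[str, str]) -> str:
    return json.dumps(rec, indent=2)


def to_csv(rec: dict[str, str]) -> str:
    # One header row followed by one data row.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rec), lineterminator="\n")
    writer.writeheader()
    writer.writerow(rec)
    return buffer.getvalue()


def to_xml(rec: dict[str, str]) -> str:
    # Each field becomes a child element of a single <invoice> root.
    root = ET.Element("invoice")
    for name, value in rec.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


print(to_json(record))
print(to_csv(record))
print(to_xml(record))
```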

Quality Control: Our specialists select a certain number of documents as a control sample. These documents are examined manually to determine the accuracy rate. The minimum accuracy rate is 97%. If the required standard is not reached, our specialists re-map the templates and repeat the processing until it is.
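The check described above can be sketched in a few lines of Python. The article does not say how the accuracy rate is calculated, so this example assumes one simple definition: the share of fields whose extracted value exactly matches the manually verified value. The function names and the sample data are hypothetical.

```python
MIN_ACCURACY = 0.97  # the minimum accuracy rate mentioned above


def field_accuracy(extracted: list[dict[str, str]], verified: list[dict[str, str]]) -> float:
    """Share of verified fields that were extracted with exactly the same value."""
    total = 0
    correct = 0
    for doc_extracted, doc_verified in zip(extracted, verified):
        for field, value in doc_verified.items():
            total += 1
            if doc_extracted.get(field, "").strip() == value.strip():
                correct += 1
    return correct / total if total else 0.0


def needs_remapping(extracted, verified, threshold: float = MIN_ACCURACY) -> bool:
    """True when the sample falls below the required accuracy."""
    return field_accuracy(extracted, verified) < threshold


# Made-up control sample: two documents, two fields each, one wrong total.
verified = [
    {"sender": "Acme Ltd", "total": "1250.00"},
    {"sender": "Globex", "total": "99.90"},
]
extracted = [
    {"sender": "Acme Ltd", "total": "1250.00"},
    {"sender": "Globex", "total": "99.00"},
]
accuracy = field_accuracy(extracted, verified)
print(accuracy, needs_remapping(extracted, verified))
```

In this made-up sample three of four fields match, so the accuracy is 0.75 and the templates would be re-mapped.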

How Azati OCR treats Invoices:

The majority of invoices contain similar fields. That is why our engineers created several predefined templates that can be applied to any document that looks like an invoice. If none of the templates matches, there are two possible solutions: manual mapping and automatic mapping.

We apply manual mapping when a company wants to extract data from custom invoices and Azati OCR requires human help.

As for automatic mapping, the machine learning model tries to retrieve all possible information from a document based on all the fragments it can recognize. It then expects a user to determine which information is useful and which is not.
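The decision between a predefined template and a mapping fallback can be illustrated with a short Python sketch. This is a deliberately simplified stand-in, not the way Azati OCR matches templates: it scores each template by the share of its expected labels found in the recognized text. The template names, the keyword sets, and the 0.75 threshold are all assumptions made for this example.

```python
from typing import Optional

# Hypothetical predefined templates: name -> labels expected on the page.
TEMPLATES = {
    "standard_invoice": {"invoice", "bill to", "quantity", "total"},
    "proforma_invoice": {"proforma", "delivery terms", "total"},
}


def match_template(text: str, templates=TEMPLATES, min_score: float = 0.75) -> Optional[str]:
    """Return the best matching template name, or None if nothing matches well."""
    lowered = text.lower()
    best_name, best_score = None, 0.0
    for name, keywords in templates.items():
        # Share of the template's expected labels present in the text.
        score = sum(keyword in lowered for keyword in keywords) / len(keywords)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_score else None


sample = "INVOICE No. 17\nBill to: Globex\nQuantity 3\nTotal 1,250.00 USD"
template = match_template(sample)
if template is None:
    print("No template matched: fall back to manual or automatic mapping")
else:
    print("Using template:", template)
```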

Our system looks for the following fields in invoices:

  • Sender
  • Sender address
  • Recipient
  • Recipient address
  • Invoice name and date
  • Product description
  • Quantity of goods
  • Price of goods, total, currency of payment
  • Delivery terms
  • Terms and procedures of payment
  • Form data

[Image: how Azati OCR treats a regular invoice according to a predefined template]

How much does it cost?

Azati OCR is affordable for both large companies and startups. As there are not that many high-quality OCR engines on the market, we believe our pricing is flexible enough to satisfy the majority of customers.

There are two basic pricing options:

– Pay-per-Document – you pay for each processed document, depending on its complexity. This option is ideal for large quantities of varied documents. Our engineers tune the system continuously, and the recognition quality improves over time.

– Self-hosted Solution – we deploy our engine in your environment at a fixed price and sign an on-premise maintenance contract. This option is more appropriate for small amounts of well-standardized documents regardless of their complexity.

Unfortunately, we cannot disclose the actual numbers, as many factors affect the final cost: processing volumes, document complexity, legal limitations, data transfer, etc.

The best way to learn how much it costs to extract data from your documents is to contact us and provide data samples. After our specialists analyze the data, we will send you a rough estimate. There are no hidden costs: the price you see is the top price, and you won’t pay extra.


The more documents our system processes, the higher the accuracy rate. This makes it ideal for extracting data from hundreds of thousands, or even millions, of documents. If you have any questions, drop us a line, and we will schedule a free personalized demo.

How our team runs a demo:

1. You send us a few samples for OCR training.

2. You send us another group of documents, and we show you how the system processes these documents in real time.

3. We tune the engine to decrease the number of errors and run processing once again.

4. Our specialists send you the final results, report, and comments concerning your samples.

Many companies spend millions per year to get rid of on-paper documents, but it seems that this process can take decades. So if your company suffers from issues related to document digitizing – drop us a line and we’ll have a chat on how Azati OCR can help.
