Azati OCR: How To Extract Data From Passports And ID Cards

Both commercial and non-profit institutions need fast and accurate identity document processing, for example in access control systems, ticket sales, travel visa and credit card issuance, and online identity verification.

The document scanning software allows businesses to solve several problems:

  • Reduce processing time. Everyone is familiar with the tedious wait at the front desk while an employee leisurely copies passport data into a shabby notebook, or manually fills in several online forms by copying and pasting the data. A passport scanner performs this operation in less than a second.
  • Reduce the number of input errors. A mistake during ticket issuance creates problems that can be quite expensive and can significantly reduce customer satisfaction. More importantly, document processing software can be integrated with third-party fraud detection applications to detect fraudulent activity on the fly.
  • Reduce staff qualification requirements. A passport scanner partially automates document verification, that is, checking the authenticity or validity of a document. There is no need for additional staff training.

Data extraction from identity documents is relevant in any field where ID data must be entered quickly and with a minimum of errors.

With the help of appropriate solutions, you can accurately find and recognize the series and number of a passport or an ID card, the full name, and any other fields of an identity document.

In everyday activities, quite often, it is necessary to draw up the same type of documents. Of course, this process does not take much time, but there is a high probability of errors due to “manual” data extraction and entry. It may lead to critical consequences when it comes to passport data, where each character plays an important role.

To save time and reduce the number of errors, we are happy to introduce you to the Azati OCR (Optical Character Recognition) engine, powered by machine learning.

Let’s have a closer look at the main features:

  1. Hand-crafted machine learning techniques ensure intelligent text recognition;
  2. OCR uses a flexible system of automatically recognized templates;
  3. We can apply the engine to any paper documents, including technical documents: industrial plans, various diagrams, graphs, and charts;
  4. High accuracy: up to 97% when recognizing objects of high complexity, and up to 98.8% when recognizing plain text;
  5. The recognition rate grows as the number of processed documents increases.

How Azati OCR works:

Typical existing OCR solutions, in most cases, work as follows.

The first step in the optical recognition process is to use a scanner to process the physical form of the document. After copying all the pages, OCR converts the document into a two-color or, in other words, a black and white version. The scanned bitmap is analyzed for light and dark areas. In this case, dark areas are identified as symbols that need to be recognized and light areas as a background. After that, dark areas are processed to search for letters or numbers.
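The two-color conversion and the search for dark areas described above can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not the code of any particular OCR product: the function names, the fixed threshold of 128, and the column-based splitting are all assumptions made for the example.

```python
from typing import List, Tuple


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Convert a grayscale image (rows of 0-255 values) to two colors.

    1 marks a dark pixel (candidate symbol), 0 marks light background.
    The threshold value is an assumption chosen for illustration.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def segment_columns(binary: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (first, last) column indexes of each run of columns with dark pixels.

    Each run is a candidate character to pass on to recognition.
    """
    width = len(binary[0]) if binary else 0
    spans: List[Tuple[int, int]] = []
    start = None
    for x in range(width):
        has_ink = any(row[x] for row in binary)
        if has_ink and start is None:
            start = x  # a dark run begins
        elif not has_ink and start is not None:
            spans.append((start, x - 1))  # the dark run ended at the previous column
            start = None
    if start is not None:
        spans.append((start, width - 1))
    return spans


page = [
    [255, 0, 0, 255, 30, 255],
    [255, 10, 255, 255, 40, 255],
]
binary_page = binarize(page)
candidates = segment_columns(binary_page)
```

Real scans usually need an adaptive threshold and two-dimensional segmentation, but the principle is the same: separate dark from light first, then look for symbols only in the dark areas.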

Existing recognition programs may use different processing methods, but as a rule, all of them target one character, word, or block of text at a time. The recognized text is processed using examples of various fonts and text formats.

Recognition is based on the use of feature detection rules regarding the characteristics of a specific letter or number (Intelligent Character Recognition). Software evaluates the document data following the rules on how a letter or number is formed. For example, the capital letter “A” can be stored as two diagonal lines intersecting with a horizontal line in the middle.
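A rule of this kind can be pictured as a lookup from a set of strokes to a character. The sketch below is purely illustrative: the rule table, the stroke names, and the `classify` function are hypothetical, and a real engine would first have to detect the strokes in the image.

```python
from collections import Counter
from typing import List, Optional

# Hypothetical rule table: each character is described by the strokes it is made of.
RULES = {
    "A": Counter({"diagonal": 2, "horizontal": 1}),
    "H": Counter({"vertical": 2, "horizontal": 1}),
    "V": Counter({"diagonal": 2}),
    "L": Counter({"vertical": 1, "horizontal": 1}),
}


def classify(strokes: List[str]) -> Optional[str]:
    """Return the character whose rule matches the detected strokes, or None."""
    detected = Counter(strokes)
    for char, rule in RULES.items():
        if detected == rule:
            return char
    return None


letter = classify(["diagonal", "horizontal", "diagonal"])
```

Rules like these are easy to read but brittle: a stroke that is missed or detected twice means no rule matches at all.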

Azati OCR is different. While processing your documents, we rely on machine learning techniques and cloud computing.

Now let us briefly explain how Azati OCR works and how it differs.

Step #1: During the first stage, our engineers train the machine learning models. We need these models to recognize documents and sort them into categories, for example, to separate passports from identity cards.

Each category contains specific repeating fields. Once we know the type of a document, we can create a template for it.

Step #2: For each group of documents that we identified during the first step, we create a template. Using this template, it becomes easy to process all similar documents (or documents related to this group). To achieve maximum accuracy, our data entry specialists manually map areas of the document.
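One way to picture a template is as a mapping from field names to rectangular areas of the page. The sketch below assumes that representation; the `Region` type, the field names, and the coordinates are hypothetical and do not describe the internal format Azati OCR uses.

```python
from dataclasses import dataclass
from typing import Dict, List

Image = List[List[int]]  # rows of pixel values


@dataclass(frozen=True)
class Region:
    """A rectangular area of the page, in pixels (hypothetical representation)."""
    left: int
    top: int
    width: int
    height: int


def crop(image: Image, region: Region) -> Image:
    """Cut the given region out of the image."""
    rows = image[region.top:region.top + region.height]
    return [row[region.left:region.left + region.width] for row in rows]


def extract_regions(image: Image, template: Dict[str, Region]) -> Dict[str, Image]:
    """Apply a template: return one cropped sub-image per mapped field."""
    return {name: crop(image, region) for name, region in template.items()}


# A tiny 4x5 "page" and a template with two manually mapped areas.
page = [[r * 10 + c for c in range(5)] for r in range(4)]
id_card_template = {
    "document_number": Region(left=1, top=0, width=2, height=1),
    "surname": Region(left=0, top=2, width=3, height=2),
}
fields = extract_regions(page, id_card_template)
```

Each cropped area can then be sent to recognition on its own, which is why one template is enough for every document of the same group.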

As an alternative, our engineers have implemented an impressive feature: automatic layout detection. The technology searches for similarities across different documents, processing each of them separately. Finally, the OCR combines all the fragments it found into a single template.

We often apply this method to complex documents that contain various graphs or charts. All abbreviations are marked manually in a sample group, and the engine then looks for similar ones in all other documents.

Step #3: To achieve maximum accuracy, Azati OCR processes each document several times. After that, the system exports all the extracted data (in structured or semi-structured form) to the format you need, for example XML, CSV, JSON, or plain text.
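To show what such an export might look like, here is a minimal sketch that serializes one extracted record with the Python standard library. The record contents, field names, and function names are assumptions for the example, not the actual output schema of Azati OCR.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from typing import Dict


def to_json(record: Dict[str, str]) -> str:
    """Serialize the record as a JSON object."""
    return json.dumps(record, ensure_ascii=False, indent=2)


def to_csv(record: Dict[str, str]) -> str:
    """Serialize the record as a header row plus one data row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(record), lineterminator="\n")
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()


def to_xml(record: Dict[str, str], root_tag: str = "document") -> str:
    """Serialize the record as one XML element per field."""
    root = ET.Element(root_tag)
    for name, value in record.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical extracted data for illustration only.
record = {"surname": "DOE", "document_number": "X123"}
as_json = to_json(record)
as_csv = to_csv(record)
as_xml = to_xml(record)
```

Keeping the extracted data in one plain dictionary and converting it at the very end means a new output format only needs one more small function.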

Quality Control: Our specialists select a certain number of documents as a focus group. The team examines these documents manually to determine the accuracy rate. The minimum accuracy rate is 97%. If the required standard is not reached, our specialists re-map the templates and run the processing again.
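The check itself is simple arithmetic. The sketch below assumes that accuracy is counted per field, as the share of fields in the focus group whose extracted value equals the manually verified value; the text does not say how the rate is measured, so both this definition and the function names are assumptions.

```python
from typing import Dict

Record = Dict[str, str]


def field_accuracy(extracted: Dict[str, Record], verified: Dict[str, Record]) -> float:
    """Share of manually verified fields that the engine extracted correctly."""
    total = 0
    correct = 0
    for doc_id, truth in verified.items():
        predicted = extracted.get(doc_id, {})
        for field, value in truth.items():
            total += 1
            if predicted.get(field) == value:
                correct += 1
    return correct / total if total else 0.0


def needs_remapping(accuracy: float, minimum: float = 0.97) -> bool:
    """True when the focus group falls below the required accuracy rate."""
    return accuracy < minimum


# Hypothetical focus group of two documents with two fields each.
verified = {
    "doc-1": {"surname": "DOE", "document_number": "X123"},
    "doc-2": {"surname": "ROE", "document_number": "Y456"},
}
extracted = {
    "doc-1": {"surname": "DOE", "document_number": "X123"},
    "doc-2": {"surname": "R0E", "document_number": "Y456"},  # one misread character
}
accuracy = field_accuracy(extracted, verified)
remap = needs_remapping(accuracy)
```

In this example one field out of four is wrong, so the rate is 75% and the templates would be re-mapped.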

Our engineers can deploy the OCR engine in any country, or even in your own private cloud with no Internet access. At Azati, we respect user privacy and data security.

How Azati OCR treats Passports and ID cards:

Any identity document contains similar fields: first name, last name, date of birth, and so on. Therefore, our engineers have created pre-built templates for similar documents or documents that look like an ID card.

If no template fits, there are two possible scenarios: manual matching or automatic matching.

  • The team applies manual matching when Azati OCR requires human help.
  • Automatic matching is applied when the trained model tries to extract all possible information from the document, based on all the fragments it can recognize automatically. The user then decides which information is useful and which is not.

Our system looks for the following fields in Passports and ID cards:

  • Document number
  • Surname
  • Given names
  • Sex
  • Nationality
  • Date of birth
  • Signature
  • Date of issue
  • Picture (Photo)
  • Date of expiry
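The fields above map naturally onto a single record type. The following sketch shows one possible shape for it; the class, its attribute names, and the helper methods are hypothetical and are not the schema Azati OCR returns.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional


@dataclass
class IdentityDocument:
    """Hypothetical container for the fields extracted from a passport or ID card."""
    document_number: Optional[str] = None
    surname: Optional[str] = None
    given_names: Optional[str] = None
    sex: Optional[str] = None
    nationality: Optional[str] = None
    date_of_birth: Optional[date] = None
    date_of_issue: Optional[date] = None
    date_of_expiry: Optional[date] = None
    signature: Optional[bytes] = None  # cropped image of the signature
    photo: Optional[bytes] = None      # cropped image of the holder

    def missing_fields(self) -> List[str]:
        """Names of the fields that were not found in the document."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def is_expired(self, today: date) -> bool:
        """True when an expiry date was found and it lies before the given day."""
        return self.date_of_expiry is not None and self.date_of_expiry < today


doc = IdentityDocument(
    document_number="X123",
    surname="DOE",
    date_of_expiry=date(2020, 1, 1),
)
missing = doc.missing_fields()
expired = doc.is_expired(date(2021, 1, 1))
```

Making every field optional reflects the fact that not every document carries every field, and a list of missing fields is a convenient signal for manual review.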

How Azati OCR treats a regular identity card according to a predefined template:

[Image: Identity Card]

How much does it cost?

Azati OCR is suitable for large and small companies as well as startups. Today, there are not many high-quality technologies for optical text recognition on the market. Even so, our prices are flexible enough to satisfy most customers.

We offer two main ways of calculating the approximate cost:

– Pay-per-Document – you pay for each processed document, depending on the complexity of the document – ideal for many different documents. Our engineers continuously improve the system, and recognition quality increases over time.

– An independent version – we install our engine in your environment at a fixed price and sign a maintenance contract. This option is best for small amounts of well-standardized documents, regardless of complexity.

Unfortunately, we cannot estimate the exact cost, since various factors influence it: the volume of documents processed, their complexity, legal restrictions, and so on.

If you want to calculate the approximate cost specifically for your documents, contact us. You can provide us with several sample documents for the calculation, and we will send you an estimate as soon as possible. There will be no need to pay extra: the estimate we prepare is the maximum cost, taking into account all possible factors.

Summary:

Before OCR, the only method of digitizing paper was manual retyping. This process took a lot of time and often led to typing errors. Using OCR saves time, helps eliminate errors, and minimizes effort. The technology also allows you to perform actions that are not possible with physical copies, such as searching and editing the text.

If there are any questions – drop us a line, and we will schedule a free personalized demo.

How our team makes a demo:

  1. You send us a few samples for OCR training.
  2. You send us another group of documents, and we show you how the system processes these documents in real-time.
  3. We tune the engine to decrease the number of errors and run processing for a large set of documents.
  4. Our specialists send you the results, reports, and comments concerning your samples.

If your company wants to digitize a ton of documents but does not know how to do it as efficiently as possible – write to us, and we will speak about it.

Drop us a line

If you are interested in the development of a custom solution, send us a message and we will schedule a talk about it.