Cloud System for Document Digitization

 

A custom AI-powered system for digitizing engineering drawings that extracts data from paper maps, schemes, and other technical documents.

CUSTOMER:

Together with our strategic partner DIGATEX, we combined our software and data science skills with their domain knowledge of engineering data management to create DI-analytics, a unique solution for digitizing vast amounts of advanced documents for customers who own and operate complex assets such as oil refineries and offshore production facilities.

One of the first customers for this solution is a Southeast Asian corporation engaged in exploration and the manufacture of petrochemical products. The company is ranked in the Fortune Global 500 and has business interests spanning 35 countries.

Due to specific business demands, the customer regularly has to digitize vast amounts of advanced documents. The service was provided as an outsourced process comprising document processing, data extraction, and collation.

Two main issues required an immediate solution:

1. Speed up document processing. The number of documents was increasing faster than the vendor could digitize them.
2. Decrease the cost of processing a single document. Extracting such data accurately using conventional methods is very costly.

OBJECTIVE:

The objective was to build a solution for digitizing a large number of complex documents in the shortest possible time. The majority of documents were pipeline layouts, industrial plans, manufacturing schemes, and maps obtained from third-party vendors and partners.

CHALLENGE #1:

 

All documents from a single vendor or partner can be divided into several groups, and each group has multiple document templates. With hundreds of vendors, this adds up to thousands of templates.

It is very challenging for a human not only to remember all the templates but also to pick the template that fits a specific document. Our first concern was therefore to determine the document template, because the template defines what kind of data to extract.
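The case study does not describe how the template is determined. Purely as an illustration of the idea, the sketch below matches a document to the closest known template by cosine similarity between feature vectors and rejects weak matches. The template names, the four-number feature vectors, and the `min_score` threshold are all hypothetical; a production system would derive features from the scanned image with a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_template(doc_features, templates, min_score=0.8):
    """Return (template name, score), or (None, score) if no template is close enough."""
    best_name, best_score = None, 0.0
    for name, features in templates.items():
        score = cosine_similarity(doc_features, features)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_score:
        return None, best_score
    return best_name, best_score


# Hypothetical template library: one feature vector per known template.
TEMPLATES = {
    "vendor_a/pipeline_layout": [0.9, 0.1, 0.0, 0.3],
    "vendor_a/site_map": [0.1, 0.8, 0.4, 0.0],
    "vendor_b/process_scheme": [0.0, 0.2, 0.9, 0.5],
}

name, score = match_template([0.85, 0.15, 0.05, 0.25], TEMPLATES)
print(name, round(score, 3))
```

Returning `None` for weak matches matters when there are thousands of templates: a document that fits none of them well is better routed to manual review than extracted with the wrong field set.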

CHALLENGE #2:

 

Another challenge was to extract the data from the technical documents. Every template has its own unique set of fields, custom abbreviations, and unique symbols, in addition to a flexible structure.

Our goal was to have the application determine the zones and fields of a document automatically, without manual mapping. Given the number of different templates, this is a very challenging task.

CHALLENGE #3:

 

The majority of schemes and plans were autogenerated by other software applications in multiple steps. It was common for the information we needed to extract to lie beneath another element or abbreviation. Some schemes are difficult even for a human to read.

Our engineers decided to train machine learning models to recognize complicated elements based on their previous experience and on already extracted data.

PROCESS:

After initial research, we found that no existing technology could help us overcome the customer's challenges. Several companies provide similar services, but their products are entirely unsuitable for documents with flexible structure and for industrial maps.

Our engineers decided to build a custom Optical Character Recognition (OCR) engine powered by artificial intelligence.

AI was a good option: it approaches a document much as a human does, searching it for data patterns in a similar way.

A solid scientific background helped our engineers build an MVP in less than two weeks. We immediately requested the first documents from the customer and obtained the expected result, which impressed the customer.

We processed about 10,000 documents in less than 8 hours with an average accuracy of 84%.

Since then, we have been tuning the algorithms and improving the performance of the system.

Now the accuracy of extracted data is close to 97%.
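To put these figures in perspective, the short calculation below derives the throughput implied by the first run and the drop in error rate between the two accuracy figures. It assumes that 8 hours is an upper bound on the run time (so the throughput is a lower bound) and that the error rate is simply 100% minus the reported accuracy; neither assumption is stated in the figures themselves.

```python
def throughput(docs, hours):
    """Return (documents per hour, seconds per document) for a batch run."""
    return docs / hours, hours * 3600 / docs


def error_reduction(initial_accuracy_pct, current_accuracy_pct):
    """Factor by which the error rate shrinks, taking error = 100 - accuracy."""
    return (100 - initial_accuracy_pct) / (100 - current_accuracy_pct)


# About 10,000 documents in less than 8 hours: at least this fast.
docs_per_hour, seconds_per_doc = throughput(10_000, 8)

# Accuracy improved from 84% to close to 97%.
reduction = error_reduction(84, 97)

print(f"at least {docs_per_hour:.0f} documents per hour")
print(f"at most {seconds_per_doc:.2f} seconds per document")
print(f"error rate reduced roughly {reduction:.1f}x")
```

Under these assumptions, the first run handled at least 1,250 documents per hour, and moving from 84% to 97% accuracy cuts the share of incorrectly extracted data from 16% to 3%.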

SOLUTION:

The final application is a fully modular system hosted in a secure enterprise cloud. All ongoing tuning and maintenance are performed remotely, which helps the customer avoid on-site personnel training and cut maintenance costs.

The system contains four modules:

 
Document digitization module
Data extraction module
Error detection module
Performance monitoring module

We are proud to say that a small group of neural networks powers every single module, and that together the modules form a unique artificial intelligence that takes a document as input and provides accurately extracted data as output.
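The four-module flow can be pictured as a simple chain of stages. The sketch below is an illustration only: each stage is a trivial stand-in for what is, in the real system, a group of neural networks, and the `Job` structure, the function names, and the sample input are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """State passed from one module to the next (hypothetical structure)."""
    raw: str
    text: str = ""
    fields: dict = field(default_factory=dict)
    errors: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)


def digitize(job):
    # Stand-in for OCR: the "scan" is already text, so only tidy the lines.
    lines = (line.strip() for line in job.raw.splitlines())
    job.text = "\n".join(line for line in lines if line)
    return job


def extract(job):
    # Stand-in for data extraction: read "name: value" pairs.
    for line in job.text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            job.fields[key.strip().lower()] = value.strip()
    return job


def detect_errors(job):
    # Stand-in for error detection: flag fields that came out empty.
    job.errors = [name for name, value in job.fields.items() if not value]
    return job


def monitor(job):
    # Stand-in for performance monitoring: record simple counters.
    job.metrics = {"fields": len(job.fields), "errors": len(job.errors)}
    return job


PIPELINE = [digitize, extract, detect_errors, monitor]


def process(raw):
    """Run one document through every module in order."""
    job = Job(raw=raw)
    for stage in PIPELINE:
        job = stage(job)
    return job


result = process("  Drawing No: P-101 \n\n Revision:  \n Scale: 1:50 ")
print(result.fields, result.errors, result.metrics)
```

Keeping each module behind the same call signature is what makes such a system modular: a stage can be retrained or replaced without touching the others.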

As the artificial intelligence is hosted in the cloud, it can be easily managed from anywhere. If the customer wants to process a considerable number of documents in the shortest possible time, we can enable the required resources within minutes and handle any number of documents.
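How many resources to enable for a given batch is a matter of simple arithmetic. The sketch below estimates the number of parallel workers needed to meet a deadline. The per-worker rate of 1,250 documents per hour is a hypothetical figure chosen for illustration, and the function assumes that throughput scales linearly with the number of workers, which real systems only approximate.

```python
def workers_needed(total_docs, deadline_hours, docs_per_worker_hour):
    """Smallest number of workers that finishes total_docs within the deadline."""
    if total_docs <= 0 or deadline_hours <= 0 or docs_per_worker_hour <= 0:
        raise ValueError("all arguments must be positive")
    capacity_per_worker = deadline_hours * docs_per_worker_hour
    # Integer ceiling division: round up to a whole worker.
    return -(-total_docs // capacity_per_worker)


# Hypothetical example: 120,000 documents, 24 hours, 1,250 documents per worker-hour.
print(workers_needed(120_000, 24, 1250))
```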

RESULTS:

We were the first to successfully apply computer vision (CV) technologies to power a compact artificial intelligence that processes large numbers of documents with flexible structure and custom abbreviations.

Our engineers developed a custom artificial intelligence that determines the template of a document, accurately extracts the data, and links and maps it into complete datasets.

AI significantly improved the workflow:

5 times lower processing cost per document. This also freed 30 people from routine work.

120K documents processed in less than 24 hours. We made it possible to extract data from documents with flexible structure.

4 times shorter data extraction time. The project took 6 weeks, whereas the customer had planned for 6 months.

NOW:

DIGATEX and Azati currently maintain this system as a service. The primary focus is on tuning the data extraction algorithms and training the neural networks to extract data accurately and quickly.