Cloud System for Document Digitization

 

Custom system for digitizing engineering drawings, powered by artificial intelligence, that extracts data from paper maps, schemes, and other technical documents.

CUSTOMER:

Together with our strategic partner DIGATEX, we combined our software and data science skills with their domain knowledge of engineering data management to create DI-analytics, a unique solution for digitising vast amounts of advanced documents for customers who own and operate complex assets such as oil refineries and offshore production facilities.

One of the first customers for this solution is a Southeast Asian corporation that explores and manufactures petrochemical products. The company is ranked in the Fortune Global 500 list of the world's largest corporations, with business interests spanning 35 countries.

Due to specific business demands, the customer regularly has to digitise vast amounts of advanced documents. The service was provided as an outsourced process comprising document processing, data extraction, and collation.

Two main issues required an immediate solution:

Speed up the document processing.

The number of documents was increasing faster than the vendor could digitize them.

Decrease costs for single document processing.

Extracting such data accurately using conventional methods is very costly.

OBJECTIVE:

The objective was to build a solution for digitizing a large number of complex documents in the shortest possible time. The majority of documents were pipeline layouts, industrial plans, manufacturing schemes, and maps obtained from third-party vendors and partners.

CHALLENGE #1:

 

All documents from a single vendor or partner can be divided into several groups, and each group has multiple document templates. With hundreds of vendors, this leads to thousands of templates.

It is very challenging for a human not only to remember all the templates but also to determine which template suits a specific document. The first concern we faced was therefore to determine the document template, in order to know what kind of data to extract.
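To make the template-identification problem concrete, here is a minimal sketch in Python. It is not the method used in the project, which relies on neural networks. It swaps in a much simpler technique (Jaccard similarity between keyword sets) purely for illustration. The template names, keyword lists, and the `min_score` threshold are all hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical template signatures: keywords expected in each template's title block.
TEMPLATES: Dict[str, Set[str]] = {
    "vendor_a_pipeline_layout": {"pipeline", "layout", "line", "no", "rev", "scale"},
    "vendor_a_site_map": {"site", "map", "north", "scale", "grid"},
    "vendor_b_process_scheme": {"process", "scheme", "unit", "rev", "drawn", "checked"},
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def identify_template(tokens: Set[str], min_score: float = 0.3) -> Tuple[str, float]:
    """Return the best-matching template and its score, or "unknown" if the score is too low."""
    best_name, best_score = max(
        ((name, jaccard(tokens, signature)) for name, signature in TEMPLATES.items()),
        key=lambda pair: pair[1],
    )
    if best_score < min_score:
        return "unknown", best_score
    return best_name, best_score


# Tokens as they might come out of a text-recognition step (assumed input).
name, score = identify_template({"pipeline", "layout", "rev", "scale", "line"})
print(name, round(score, 2))
```

With thousands of templates and noisy scans, a keyword lookup like this would quickly run out of steam, which is one reason a learned classifier is a more natural fit for the problem described here.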

CHALLENGE #2:

 

Another challenge was to extract the data from the technical documents. Every template has its own unique set of fields, custom abbreviations, and unique symbols, in addition to a flexible structure.

Our goal was to have the application determine the zones and fields of the document automatically, without manual mapping. This is a very challenging process given the number of different templates.

CHALLENGE #3:

 

The majority of schemes and plans were autogenerated by other software applications in multiple steps. It was common for the information we needed to extract to lie under another element or abbreviation. Some schemes are challenging even for a human to read.

Our engineers decided to train machine learning models to recognize complicated elements based on previous experience and already extracted data.

PROCESS:

After initial research, we concluded that no existing technology could help us overcome the customer's challenges. Several companies provide similar services, but their products are entirely unsuitable for documents with flexible structure and for industrial maps.

Our engineers decided to build a custom Optical Character Recognition (OCR) engine powered by artificial intelligence.

AI was a good option: it approaches the task much as a human does, searching the document for data patterns in a similar way.

A solid scientific background helped our engineers build an MVP in less than two weeks. We immediately requested the first documents from the customer and obtained the result we expected, which impressed the customer.

We processed about 10,000 documents in less than 8 hours with an average accuracy of 84%.

Since that moment, we have been tuning algorithms and improving the performance of the system.

Now the accuracy of extracted data is close to 97%.
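The figures above can be put into perspective with some quick arithmetic. The sketch below uses only the numbers quoted in the text (about 10,000 documents, less than 8 hours, 84% and roughly 97% accuracy). Because the run took less than 8 hours, the throughput is a lower bound and the time per document an upper bound. The function names are illustrative.

```python
def throughput_per_hour(documents: int, hours: float) -> float:
    """Documents processed per hour."""
    return documents / hours


def seconds_per_document(documents: int, hours: float) -> float:
    """Average wall-clock seconds spent per document."""
    return hours * 3600 / documents


def error_reduction_factor(accuracy_before: float, accuracy_after: float) -> float:
    """How many times smaller the error rate became."""
    return (1 - accuracy_before) / (1 - accuracy_after)


print(throughput_per_hour(10_000, 8))                   # 1250.0 (at least)
print(seconds_per_document(10_000, 8))                  # 2.88 (at most)
print(round(error_reduction_factor(0.84, 0.97), 1))     # 5.3
```

In other words, moving from 84% to about 97% accuracy cuts the share of incorrectly extracted data from roughly 16% to roughly 3%, about a fivefold reduction in errors.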

SOLUTION:

The final application is an entirely modular system hosted in a secure enterprise cloud. All ongoing tuning and maintenance are performed remotely, which helps the customer avoid on-site personnel training and cut down maintenance costs.

The system contains four modules:

 

Document digitization module

Data extraction module

Error detection module

Performance monitoring module

We are proud to say that a small group of neural networks powers every single module, and together the modules form a unique artificial intelligence that takes a document as input and provides accurately extracted data as output.
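The way the four modules fit together can be sketched as a simple pipeline. This is an illustrative outline only, not the production system: each stage below is a trivial stand-in for a module that, in the real system, is powered by neural networks. All names (`Document`, `digitize`, `extract`, `detect_errors`, `process`) and the "KEY: value" input format are assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    name: str
    raw_text: str = ""                                      # set by digitization
    fields: Dict[str, str] = field(default_factory=dict)    # set by extraction
    errors: List[str] = field(default_factory=list)         # set by error detection


def digitize(name: str, scanned: str) -> Document:
    """Stand-in for the digitization module: the 'scan' is already plain text here."""
    return Document(name=name, raw_text=scanned.strip())


def extract(doc: Document) -> Document:
    """Stand-in for the extraction module: read 'KEY: value' lines into fields."""
    for line in doc.raw_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            doc.fields[key.strip().lower()] = value.strip()
    return doc


def detect_errors(doc: Document, required: List[str]) -> Document:
    """Stand-in for the error detection module: flag missing or empty required fields."""
    for field_name in required:
        if not doc.fields.get(field_name):
            doc.errors.append(f"missing field: {field_name}")
    return doc


def process(name: str, scanned: str, required: List[str],
            metrics: List[Dict[str, object]]) -> Document:
    """Run all stages and record timing, standing in for performance monitoring."""
    start = time.perf_counter()
    doc = detect_errors(extract(digitize(name, scanned)), required)
    metrics.append({"document": name,
                    "seconds": time.perf_counter() - start,
                    "errors": len(doc.errors)})
    return doc


metrics: List[Dict[str, object]] = []
doc = process("P-101", "Line No: 12\nRev: B\nScale:", ["line no", "rev", "scale"], metrics)
print(doc.fields, doc.errors)
```

Keeping each stage behind its own function boundary mirrors the modular design described above: any one stage can be retrained or replaced without touching the others.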

Because the artificial intelligence is hosted in the cloud, it can be easily managed from anywhere. If the customer wants to process a considerable number of documents in the shortest possible time, we can provision the required resources within minutes and scale to handle the volume.

RESULTS:

We are the first to have successfully applied computer vision (CV) technologies to power a small artificial intelligence that processes a large number of documents with flexible structure and custom abbreviations.

Our engineers developed a custom artificial intelligence that determines the template of the document, accurately extracts data, and links and maps the data into complete datasets.

AI significantly improved the workflow:

Single-document processing costs decreased 5 times

This also freed 30 people from doing routine work.

120K documents were processed in less than 24 hours

We made it possible to extract data from documents with flexible structure.

Data extraction time decreased 4 times

The project took 6 weeks, whereas the customer had planned for 6 months.

NOW:

DIGATEX and Azati currently maintain this system as a service. The primary focus is tuning the data extraction algorithms and training the neural networks to extract data accurately and quickly.