Homepage > Portfolio > ETL Process Enhancement

ETL Process Enhancement

Identifying and eliminating the probable issues of the Extract, Transform, Load (ETL) process for reference data that comes from several operational systems.

CUSTOMER

The healthcare sector is one of the most important domains that impacts the entire global population and is closely linked to the development of any country. It also plays a crucial role in how a country is perceived in maintaining economic stability. As such, healthcare systems form a key part of government strategies across the world, and the level of industry expenditure is projected to increase at an annual rate of 4.1 per cent globally between 2017-2021.

The US national leader in customized insurance, claims and patient safety & risk solutions for physicians, surgeons, dentists and other healthcare professionals, as well as hospitals, senior care and other healthcare facilities.

Objective

The ETL process receives data from multiple sources. Some of them provide more detailed and complete information than the others. In certain cases, the attribute values received from more detailed sources could be replaced by empty values of the same attributes received from less detailed sources. Therefore, the incomplete data might be loaded into the data warehouse.

The primary task was to prevent the ETL process from the issues mentioned above, so that it provides the most complete, accurate and consistent information at the time.

Solution

We addressed the issue by adding a transformation step to the ETL process that clearly defines whether to keep or to rewrite an attribute value.

First, we analyzed all the attributes from every source and set priorities for each of them. Then we developed a Survivorship Matrix, which contains attributes and priority relationships. Its main purpose is to define whether to keep or rewrite an attribute value judging by the priority of the source. The Survivorship Matrix logic is incorporated within one SQL logic, allowing to escape any pre-processing.

By implementing the Survivorship Matrix, we’ve achieved data deduplication, data completeness and data reliability. Such an approach eliminates redundant overwriting and ensures consistency for any period of time. Additionally, we’ve increased flexibility of the system, since the priority values can be easily changed if needed. The solution became easily scalable, as the new attribute priorities are added outside the ETL process. Moreover, the performance requirement of the total running time to be less than 30 minutes was successfully met, as it was reduced to less than 5 minutes.

Technologies

 
 

SCREENSHOTS

Drop us a line

If you are interested in the development of a custom solution — send us the message and we'll schedule a talk about it.