ETL Process Enhancement
2016 till now | Oracle, Survivorship Matrix, Kalido, Dedicated Team
Identifying and eliminating the probable issues of the Extract, Transform, Load (ETL) process for reference data that comes from several operational systems.
The US national leader in customized insurance, claims and patient safety & risk solutions for physicians, surgeons, dentists and other healthcare professionals, as well as hospitals, senior care and other healthcare facilities.
The ETL process receives data from multiple sources. Some of them provide more detailed and complete information than the others. In certain cases, the attribute values received from more detailed sources could be replaced by empty values of the same attributes received from less detailed sources. Therefore, the incomplete data might be loaded into the data warehouse.
The primary task was to prevent the ETL process from the issues mentioned above, so that it provides the most complete, accurate and consistent information at the time.
We addressed the issue by adding a transformation step to the ETL process that clearly defines whether to keep or to rewrite an attribute value.
First, we analyzed all the attributes from every source and set priorities for each of them. Then we developed a Survivorship Matrix, which contains attributes and priority relationships. Its main purpose is to define whether to keep or rewrite an attribute value judging by the priority of the source. The Survivorship Matrix logic is incorporated within one SQL logic, allowing to escape any pre-processing.
By implementing the Survivorship Matrix, we’ve achieved data deduplication, data completeness and data reliability. Such an approach eliminates redundant overwriting and ensures consistency for any period of time. Additionally, we’ve increased flexibility of the system, since the priority values can be easily changed if needed. The solution became easily scalable, as the new attribute priorities are added outside the ETL process. Moreover, the performance requirement of the total running time to be less than 30 minutes was successfully met, as it was reduced to less than 5 minutes.
OracleOracle SQL Trace