ETL Process Enhancement

Azati enhanced the Extract, Transform, Load (ETL) process for a healthcare customer by identifying and eliminating issues related to incomplete or inconsistent reference data from multiple operational systems. The customer is a US national leader in customized insurance, claims, and patient safety & risk solutions for healthcare professionals and facilities.

Discuss an idea

All Technologies Used

Oracle
Oracle
Oracle SQL
Oracle SQL

Motivation

The ETL process receives data from multiple sources, some of which provide incomplete or inconsistent information. The goal was to ensure that the ETL process delivers the most complete, accurate, and consistent data without introducing empty or incorrect values from less detailed sources.

Main Challenges

Challenge 1
Data Inconsistency

Data from different sources varied in terms of completeness and detail. Incomplete data was being loaded into the data warehouse, potentially affecting data quality and reliability.

Challenge 2
Duplicate and Redundant Data

In certain cases, more detailed attribute values were overwritten by empty or less detailed values, leading to duplication and redundancy in the data warehouse.

Key Features

  • Survivorship Matrix: A clear set of rules for deciding whether to keep or overwrite data attributes based on source priority, which improved data quality and consistency.
  • Data Deduplication: Eliminated redundant data overwriting, ensuring that only the most reliable and complete data was loaded into the data warehouse.
  • Scalability: The solution is easily scalable, as new attribute priorities can be added without requiring changes to the core ETL process.
  • Performance Optimization: Achieved the goal of reducing the ETL process running time to under 5 minutes, significantly improving operational efficiency.

Our Approach

Analysis and Prioritization of Attributes
We began by analyzing all the attributes from every source and assigning priority to each attribute based on the source's reliability and completeness.
Survivorship Matrix
We developed a Survivorship Matrix to clearly define whether to keep or overwrite an attribute value based on the source’s priority. This logic was incorporated directly into the ETL process, reducing the need for pre-processing.
Data Deduplication and Consistency
By implementing the Survivorship Matrix, we ensured data completeness, consistency, and reliability, while eliminating unnecessary overwriting of values. The approach also increased flexibility, allowing priority values to be easily modified if needed.
Performance Optimization
We ensured the total running time of the ETL process met performance requirements by reducing the time from over 30 minutes to less than 5 minutes, achieving a significant performance boost.

Project Impact

Improved Data Quality: The solution ensured that only the most complete and accurate data was loaded into the data warehouse, improving the overall quality and reliability of the information.

Enhanced Flexibility and Scalability: The Survivorship Matrix made it easy to adjust priority rules and scale the system without disrupting the ETL process.

Optimized Performance: The reduction in ETL processing time from over 30 minutes to less than 5 minutes led to increased operational efficiency and responsiveness.

Ready To Get Started

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.