This entry is the first installment of a series in which we plan to share the ETL framework, standards, reusable utilities and general guidelines that Just Analytics has developed over multiple years of executing small to large BI/DW projects. The framework and approach also factor in the consulting experience our team gathered before joining Just Analytics.
Data integration (or ETL) has always been a core component of any analytics project. Whatever the size of the solution, some element of data integration is involved. The problem may range from simple one-to-one cloning of data from upstream sources, with an optional layer of aggregates or pre-computed data sets, to loading a full-blown data warehouse underpinned by a 3NF foundation layer and downstream data marts that generally follow a star schema modelling approach.
It has always been a challenge to keep the ETL design and approach streamlined across projects; more often than not, one ends up taking a series of ETL design decisions without factoring in past learnings. This series is an attempt to extract and present the best of the design practices from our project execution experience. At a high level, the key aspects that go into ETL design and build are as follows:
- Hand-coded ETL scripts versus ETL/ELT tools for the core ETL logic: This is a topic in itself; a few years back I wrote an entry on it, and most of the points there are still relevant. An additional decision that goes along with the tool-based approach is whether to opt for ELT tools (Oracle Data Integrator (ODI), Oracle Warehouse Builder (OWB)) or ETL tools (Informatica, DataStage). The framework and utilities described in this series have been deployed alongside both hand-coded ETL programs written in PL/SQL and those built using ETL tools.
- Data flow: We have seen different approaches here. One is to follow Oracle's reference architecture, where data flows from source to staging, then to a foundation layer, and finally to data marts (the access and performance layer). There are also implementations where data takes just one hop, straight from the source to the target data model.
- Change Data Capture: This deals with identifying the delta that needs to be loaded into the analytics system (which can be the staging area of a full-blown enterprise data warehouse, or a departmental data mart loading data directly from the underlying sources). Changes on the source can be identified and loaded into the target warehouse using one of the following (a timestamp-based extract is sketched after this list):
- Timestamps, audit columns, or status/version indicators stamped on the records in the source system
- Triggers or events hooked in before transactions are committed in the OLTP system
- Log scanners: these read the underlying database's redo/archive logs (or the equivalent logs maintained by other databases). This is the least invasive option for the source database and typically preserves transaction integrity
- Design decisions around the actual ETL logic: These are specific, point design decisions pertaining to the ETL logic itself. A few items that form part of this are: naming conventions for ETL artefacts, reusable components and transformations, flexibility around making the ETL logic configurable or parameterized, and specific design patterns to be followed, such as partial commits, the ability to rerun jobs with or without manual cleanup of data or control tables, and the approach for merging data (an example MERGE appears after this list).
- Runtime Audits and Control Tables: An important aspect of ETL design is to maintain detailed runtime statistics of ETL jobs, including load duration (for the overall batch, each job, and individual steps), record counts (successful and failed), job status and error messages. Control tables are typically used to track batch status, the last successful business date/time for which data was loaded, and other control information used by subsequent runs (a sample audit table is sketched after this list).
- Error Handling: Though to some degree this forms part of the runtime audit, it stands on its own. Error handling of ETL jobs may range from simply capturing the first error message in the runtime audit to keeping detailed error and validation information for every row and column (see the DML error logging sketch after this list).
- Data Quality and Technical Reconciliation: Any ETL design should include an assurance framework or technical reconciliation that ensures the data flow works correctly on a daily basis and that no data is lost as it moves from one stage to the next. The technical recon may be a simple count check of the target versus the source table, or a detailed column-by-column comparison of the source and target data sets after applying all the relevant business rules to the raw source data (a count-check query is sketched after this list).
- Scheduling and Triggering of jobs: This deals with scheduling ETL jobs using the available scheduling options, as well as with orchestration, control flow and handling of failures. Jobs may be scheduled using an enterprise-wide scheduling tool (for example, Control-M), the scheduling capabilities of the ETL/ELT tool, Unix-level scheduling (crontab), or database-level scheduling (assuming the ETL jobs are built as procedures or database-level scripts); a DBMS_SCHEDULER example appears after this list.
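To make a few of these items concrete, the sketches below use plain Oracle SQL; every object name in them is illustrative rather than taken from our actual framework. First, the Change Data Capture item: a minimal timestamp-based incremental extract filters source rows against a watermark kept in a control table, assuming hypothetical STG_ORDERS, ETL_CONTROL and ORDERS@SRC_DB objects.

```sql
-- Hypothetical timestamp-based CDC extract into a staging table.
INSERT INTO stg_orders (order_id, customer_id, order_amt, last_update_dt)
SELECT o.order_id,
       o.customer_id,
       o.order_amt,
       o.last_update_dt
FROM   orders@src_db o
WHERE  o.last_update_dt > (SELECT c.last_extract_dt   -- watermark from the last successful run
                           FROM   etl_control c
                           WHERE  c.job_name = 'LOAD_STG_ORDERS');

-- After a successful load, move the watermark forward for the next run.
UPDATE etl_control
SET    last_extract_dt = (SELECT MAX(last_update_dt) FROM stg_orders)
WHERE  job_name = 'LOAD_STG_ORDERS';
```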
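The "approach for merging the data" mentioned under the ETL logic item usually boils down to an upsert, which on Oracle is a single MERGE statement. The dimension and staging table names here are again made up for illustration.

```sql
-- Hypothetical upsert of a customer dimension from its staging table.
MERGE INTO dim_customer tgt
USING stg_customer src
ON (tgt.customer_id = src.customer_id)
WHEN MATCHED THEN
  UPDATE SET tgt.customer_name = src.customer_name,
             tgt.updated_dt    = SYSDATE
WHEN NOT MATCHED THEN
  INSERT (customer_id, customer_name, created_dt, updated_dt)
  VALUES (src.customer_id, src.customer_name, SYSDATE, SYSDATE);
```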
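For the runtime audits item, the table below shows the kind of columns a runtime audit table typically carries. This is only an indicative structure, not the actual JBS data model that a later entry will describe.

```sql
-- Indicative runtime audit table; not the actual JBS schema.
CREATE TABLE etl_job_run (
  run_id        NUMBER        PRIMARY KEY,
  batch_id      NUMBER,                      -- ties individual jobs to an overall batch
  job_name      VARCHAR2(100),
  step_name     VARCHAR2(100),
  start_time    TIMESTAMP,
  end_time      TIMESTAMP,
  rows_read     NUMBER,
  rows_loaded   NUMBER,
  rows_rejected NUMBER,
  status        VARCHAR2(20),                -- e.g. RUNNING / SUCCESS / FAILED
  error_message VARCHAR2(4000)
);
```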
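For row-level error handling, one Oracle option (not necessarily the one our framework uses) is DML error logging, which diverts failing rows into an error table instead of aborting the whole statement. SALES_FACT and STG_SALES are hypothetical tables.

```sql
-- One-time setup: generates the ERR$_SALES_FACT error table.
BEGIN
  DBMS_ERRLOG.create_error_log(dml_table_name => 'SALES_FACT');
END;
/

-- Bad rows land in ERR$_SALES_FACT with the Oracle error text, tagged by load name.
INSERT INTO sales_fact (sale_id, product_id, sale_amt, sale_dt)
SELECT sale_id, product_id, sale_amt, sale_dt
FROM   stg_sales
LOG ERRORS INTO err$_sales_fact ('DAILY_LOAD') REJECT LIMIT UNLIMITED;
```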
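A minimal form of the technical reconciliation item is a count check of source versus target for a given business date, again with illustrative table names.

```sql
-- Simple daily count reconciliation between a source table and its target fact.
SELECT src.cnt           AS source_rows,
       tgt.cnt           AS target_rows,
       src.cnt - tgt.cnt AS difference
FROM  (SELECT COUNT(*) AS cnt
       FROM   orders@src_db
       WHERE  TRUNC(order_dt) = TRUNC(SYSDATE) - 1) src,
      (SELECT COUNT(*) AS cnt
       FROM   sales_fact
       WHERE  TRUNC(order_dt) = TRUNC(SYSDATE) - 1) tgt;
```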
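Finally, for database-level scheduling, a nightly batch driven by a stored procedure can be registered with DBMS_SCHEDULER along these lines; the driver procedure name is hypothetical.

```sql
-- Schedule a hypothetical ETL driver procedure to run daily at 01:00.
BEGIN
  DBMS_SCHEDULER.create_job(
    job_name        => 'NIGHTLY_ETL_BATCH',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'ETL_PKG.RUN_NIGHTLY_BATCH',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=DAILY; BYHOUR=1; BYMINUTE=0',
    enabled         => TRUE,
    comments        => 'Nightly data warehouse load');
END;
/
```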
About this series:
This series of blog entries is meant to provide details on some of the design items listed above. We will focus on ETL design guidelines for loading data into relational databases (though most of the design principles are relevant across technologies and databases).
The ETL framework and utilities that Just Analytics has built are on the Oracle database; however, they can easily be ported to any relational database.
Next few entries: The next few entries in this series will cover the following:
- JBS (Job Management System): This is a runtime audit framework that captures detailed runtime information. It falls under the design item “Runtime Audits and Control Tables”.
- Oracle-to-Oracle data loading utility: This is a generic framework for cloning data from another Oracle data source. The utility is well suited to populating staging tables, rather than having an individual ETL program for each and every staging table. It falls under the design item “Design decisions around the actual ETL logic”.
- Data Quality and Reconciliation Framework: This is a comprehensive data reconciliation framework that does both summary-level and detail-level comparison of source and target data. It is part of “Data Quality and Technical Reconciliation”.
- Post Load Processing (PLP) Engine: A flexible framework for populating aggregate or post-processed tables, which typically need to be rebuilt after each load. It falls under the design item “Design decisions around the actual ETL logic” (a minimal aggregate-refresh sketch follows below).
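As a flavour of what a PLP step does, a post-load aggregate refresh is often just a truncate-and-reload of a summary table once the base fact has been loaded. The object names below are illustrative and are not part of the actual engine described in the upcoming entry.

```sql
-- Illustrative post-load aggregate rebuild.
TRUNCATE TABLE sales_monthly_agg;

INSERT INTO sales_monthly_agg (sale_month, product_id, total_amt, txn_count)
SELECT TRUNC(sale_dt, 'MM') AS sale_month,
       product_id,
       SUM(sale_amt)        AS total_amt,
       COUNT(*)             AS txn_count
FROM   sales_fact
GROUP  BY TRUNC(sale_dt, 'MM'), product_id;
```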