BODS – ETL Framework [Part I]

Requirements

  • Each execution step in a job should be logged in a logging table
  • Provide a delta and init mechanism
      ‣ Extraction from and to date need to be calculated automatically (moving window)
      ‣ Assign own extraction from and to date
      ‣ Use sysdate and subtract a certain number of days (overlap)
  • Job cannot continue if the same job is already running
  • Restart job automatically a number of times after an error
  • Job should be able to start from point of failure
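
To illustrate the delta/init requirement, the moving extraction window could be sketched like this in Python (hypothetical helper names; in BODS this logic lives in script and workflow objects):

```python
from datetime import date, timedelta

def extraction_window(last_to_date, overlap_days=0, today=None):
    """Compute the moving extraction window for a delta load.

    On the first run (no previous to-date) everything since 1900/01/01
    is extracted (init); otherwise the previous run's to-date, minus an
    optional overlap in days, becomes the new from-date and sysdate
    (today) becomes the new to-date.
    """
    today = today or date.today()
    if last_to_date is None:
        return date(1900, 1, 1), today          # init: load all data
    return last_to_date - timedelta(days=overlap_days), today
```

For example, with a previous to-date of 2024/01/10 and a 2-day overlap, a run on 2024/01/15 extracts from 2024/01/08 up to 2024/01/15.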

High Level Overview

[Image: high-level overview]

Detailed Overview

[Image: detailed overview]

Building Blocks

To fulfill all requirements, the functionality is split into separate workflows in BODS. Depending on the scenario, a job can use some of these workflows/building blocks or all of them.

  • Set the necessary parameters
  • Use the different workflows/building blocks
  • Create your custom ETL flow
  • Run your job

A template job was created that covers all possible scenarios and shows how the building blocks can be combined to meet all initial requirements for data processing. Depending on the scenario, certain global variables at job level should be set in combination with the corresponding workflows/building blocks. Alternatively, the building blocks/workflows can be left out of the job design and re-added later when needed.

The possibility to add custom logic per job remains open: when a job turns out to have very specific requirements, custom logic can be added later to meet them.

The main goals of this approach:

  • Monitoring
  • Faster development
  • Standardize your ETL developments
  • Extract data in a unified way

JB_BODS_TEMPLATE

This example job shows how the building blocks can be used while meeting all initial requirements for data processing.

[Image: JB_BODS_TEMPLATE job design]

The screenshot below gives an overview of the global variables and their possible values.

[Image: global variables and their possible values]

Pre Job Checks

Logic is defined in workflow ETL_PRE_JOB_WF.

  • System configuration cannot be null => stop the job and throw an error
  • Print the global variables as defined by user input
  • Check the value of certain global variables
      ‣ When a value differs from the allowed values => stop the job and throw an error
  • Check whether the job is still running by checking the repository table
      ‣ Still running => stop the job and throw an error
           ° If you are sure the job is in idle state on the server, or the repository status is false => go to the Data Services Management Console and abort job processing
      ‣ Not running:
           ° Update the job status in the logging table to error where the job still has ‘busy’ processes
           ° Continue processing
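
The running-job check above can be sketched as follows (a minimal Python illustration; the logging table and the server check are stand-ins, not the actual BODS repository API):

```python
def pre_job_check(job_name, log_rows, running_on_server):
    """Sketch of the 'job already running' pre-job check.

    log_rows stands in for the job-execution logging table as a list of
    dicts (hypothetical shape: {'job': ..., 'status': ...});
    running_on_server is the answer from the repository/server check.
    """
    busy = [r for r in log_rows if r['job'] == job_name and r['status'] == 'BUSY']
    if busy and running_on_server:
        # Still running => stop the job and throw an error.
        raise RuntimeError(f"Job {job_name} is already running")
    for r in busy:
        # Not running: flip stale 'busy' entries to error,
        # then continue processing.
        r['status'] = 'ERROR'
```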

[Image: ETL_PRE_JOB_WF]

Date Calculation

First retrieve the status of the previous job execution. This value will be stored in global variable $G_Last_Job_Status. Logic is defined in workflow ETL_CALC_JOB_EXTRACTION_DATE_WF.

Last Job Status Conditions

  • (1) $G_Last_Job_Status is NULL
      ‣ No entries for this job could be found in the BODS job execution logging table
      ‣ We can imply this job will be run for the first time
      ‣ Overwrite certain variables in order to load all data
           ° $G_Load_Mode => INIT
           ° $G_Last_Job_Status => READY
           ° $G_Extraction_From_DT => 1900/01/01
           ° $G_Extraction_To_DT => sysdate()
  • (2) ($G_Last_Job_Status = ‘READY’ or $G_Recovery = ‘N’) and $G_Load_Mode = ‘DELTA’
      ‣ The previous job run was successful, so the job can be run from the very beginning
      ‣ OR you always want to start the job from the very beginning ($G_Recovery = ‘N’)
      ‣ And when loading in delta ($G_Load_Mode = ‘DELTA’) you only want a subset of the data, extracted based on the calculated extraction from/to date
      ‣ $G_Extraction_From_DT and/or $G_Extraction_To_DT are not set
           ° Dates will be retrieved from the BODS job execution logging table
           ° Take the from date from the last successful job execution
           ° Take the current date/sysdate as to date
      ‣ $G_Extraction_From_DT and/or $G_Extraction_To_DT are filled in
           ° Use these values during data extraction
      ‣ $G_Delta_Days is filled in
           ° Use this value to calculate the extraction from date
           ° The delta days value is subtracted from the retrieved or already filled-in extraction from date
           ° The newly calculated extraction from date is used during data extraction
  • (3) $G_Last_Job_Status = ‘ERROR’ and $G_Recovery = ‘Y’
      ‣ The previous job run failed and you want to restart the job from the point of failure
      ‣ Determine the job step where the job failed
      ‣ Retrieve the extraction from and to date used during the faulty job execution
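
The three conditions can be summarized in a small Python sketch (the argument names mirror the $G_* globals; the logging-table lookups are replaced by plain arguments, so this is an illustration rather than the actual workflow logic):

```python
from datetime import date, timedelta

def calc_extraction_dates(last_status, recovery, load_mode,
                          from_dt=None, to_dt=None, delta_days=0,
                          last_success_from=None, failed_run_dates=None,
                          today=None):
    """Sketch of the three branches of ETL_CALC_JOB_EXTRACTION_DATE_WF."""
    today = today or date.today()
    if last_status is None:
        # (1) first run ever: force an init load over all data
        return 'INIT', date(1900, 1, 1), today
    if last_status == 'ERROR' and recovery == 'Y':
        # (3) restart from the point of failure: reuse the faulty run's dates
        return load_mode, failed_run_dates[0], failed_run_dates[1]
    # (2) previous run OK, or recovery disabled: calculate the window
    start = from_dt or last_success_from     # from the logging table if not set
    end = to_dt or today
    if load_mode == 'DELTA' and delta_days:
        start -= timedelta(days=delta_days)  # subtract the delta-days overlap
    return load_mode, start, end
```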

[Image: ETL_CALC_JOB_EXTRACTION_DATE_WF]

ETL Loading

Workflows are used to organize and structure a job. A workflow can contain loops, conditionals, other workflows, scripts and dataflows.


ETL_RECOVERY_POINT_WF

[Image: ETL_RECOVERY_POINT_WF]

This workflow is used when the previous job run failed and you want to restart the job from the point of failure. The entire job step, with all job sub steps and subsequent flows, will be executed.

$G_Last_Job_Status = ERROR and $G_Recovery = Y

  • Point of failure will be retrieved
      ‣ This is already done in workflow ETL_CALC_JOB_EXTRACTION_DATE_WF
      ‣ Retrieve the job step where the error occurred
      ‣ Retrieve the extraction from/to date used during the erroneous job execution
  • Check the point of failure with the job step to be executed
      ‣ If they match => reprocess flow and subsequent flows from here
      ‣ No match => skip this workflow
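
The matching logic can be sketched as follows (hypothetical step names; in BODS this is a conditional wrapped around each job step):

```python
def steps_to_execute(all_steps, failure_step, recovery):
    """Sketch of the recovery-point check: with $G_Recovery = 'Y', skip
    every job step before the point of failure and re-run the failed
    step plus all subsequent steps; otherwise run everything."""
    if recovery != 'Y' or failure_step not in all_steps:
        return list(all_steps)
    return list(all_steps[all_steps.index(failure_step):])
```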


ETL_JOB_SUB_STEP_BUSY_WF

  • Provide the workflow with the right parameters
  • Insert entry in the logging table ETL_BODS_JOB_EXECUTION_LOG

[Image: ETL_JOB_SUB_STEP_BUSY_WF]
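
The insert into the logging table can be sketched as follows (the column names are illustrative, not the actual ETL_BODS_JOB_EXECUTION_LOG definition, and a plain list stands in for the database table):

```python
from datetime import datetime

def log_sub_step_busy(log_table, job, step, sub_step,
                      extraction_from, extraction_to):
    """Sketch of ETL_JOB_SUB_STEP_BUSY_WF: insert a 'BUSY' entry into
    the job-execution logging table before a sub step runs."""
    row = {
        'job': job,
        'step': step,
        'sub_step': sub_step,
        'status': 'BUSY',
        'extraction_from': extraction_from,
        'extraction_to': extraction_to,
        'start_time': datetime.now(),
    }
    log_table.append(row)
    return row
```

When the sub step finishes (or fails), the same entry would be updated to its final status, which is what the pre-job checks and the recovery point rely on.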

Next week, you can read the second part of this blog.

Keep an eye on the website!

Blog by Robby De Backer