BODS Data Extraction

14 September 2018

Loading data into a DWH can sometimes be a hassle, and so can defining your delta. So let's keep things simple. There are numerous ways to define a delta and load data into a DWH, but let's highlight some simple strategies. Eventually we will combine a delta and a full extraction into one single flow by using variables and a conditional where clause.

Capture Changes

  • Source based
  • Target based

Source based (Delta Load)

Evaluate the source tables and only extract rows that have changed. Your source must provide some indication of the change: changes can be indicated by time stamps or tracked in change logs.

Various mechanisms can be used to extract data from source tables, but let's discuss the following two options:

  • Overlap
  • Start and End

Overlap

A filter is applied to the source table on its change and/or creation date, minus the overlap period.

Let's get practical: how can this be implemented in BODS?

On job level we create two global variables: $G_Delta_Date and $G_Delta_Days.

Assign the following values to these variables:


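The original screenshot with the values is not available; a plausible assignment, matching the 7-day overlap used later in this article (the variable names are from the article, the values are an assumption):

```
$G_Delta_Days = 7;     # overlap period in days
$G_Delta_Date = NULL;  # derived from $G_Delta_Days in a script at runtime
```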
These two variables can be modified at runtime as you like. If you want to extract data for the last 30 days, you can do so simply by modifying the variables accordingly.

Now use a script to assign the variables and construct the starting date for extracting source data.

Data for the last 7 days will now be extracted ($G_Delta_Date = sysdate() - 7).
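The screenshot of the script is not available; in BODS scripting the assignment could be sketched as follows:

```
# derive the extraction start date from the overlap period
$G_Delta_Date = sysdate() - $G_Delta_Days;
```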

The where clause in the extracting dataflow will look like this:

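The original screenshot of the where clause is not available; assuming a hypothetical change-date column LAST_UPD_DATE on the source table, it could look like this:

```
SOURCE_TABLE.LAST_UPD_DATE >= $G_Delta_Date
```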

We have discussed how to extract data with a variable overlap period of your choice. The biggest disadvantage of this extraction method is that it extracts too much source data and loads too much data into your target, which puts both your database and your ETL engine to work.

The next technique partially solves this issue. Too much data will still be extracted from the source, but we can prevent too much data from being processed by the DWH target database and the ETL server. In BODS this is done by using the target table also as a source table and comparing non-key fields from source and target in the where clause of the query transform. Now only the real changes in the data are selected before being transformed and loaded into the DWH.
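Such a comparison in the query transform's where clause could be sketched like this (the SRC/TGT aliases and column names are hypothetical):

```
# TGT is the target table joined in as a second source;
# only rows whose non-key fields differ pass through
SRC.KEY_ID = TGT.KEY_ID
and ( SRC.AMOUNT <> TGT.AMOUNT
   or SRC.STATUS <> TGT.STATUS )
```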

Start and End

This approach is a little more complicated, since we also need a status table to keep track of previous runs. The data will be extracted without overlap. This limits the number of extracted rows, because we take earlier successful extractions into account when determining the starting point for extraction.

A status table will be maintained to keep track of the extractions. It is used as a logging table: it logs the start and end datetime of each executed job and is updated accordingly with a status.

When starting the ETL job, a lookup is performed to obtain the start date and time of the last successful job.

A basic status table could look like this:

RUN_ID  JOB_NAME      START_DATETIME       END_DATETIME         STATUS
1       EXTRACT_DATA  01/09/2018 04:00:00  01/09/2018 04:05:00  READY
2       EXTRACT_DATA  01/09/2018 06:00:00  01/09/2018 06:05:00  READY
3       EXTRACT_DATA  01/09/2018 08:00:00  NULL                 ERROR
4       EXTRACT_DATA  01/09/2018 10:00:00  NULL                 BUSY

In this example, job EXTRACT_DATA will begin extracting data from the source again starting from 01/09/2018 06:00:00.

In BODS the lookup is done using a script in which the lookup value is assigned to a global variable.
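The original script is not shown; a sketch using the BODS sql() function, assuming the status table is called JOB_STATUS and lives in a datastore DS_STATUS (both names are hypothetical):

```
# start extracting from the start datetime of the last successful run
$G_Delta_Date = sql('DS_STATUS',
    'SELECT MAX(START_DATETIME) FROM JOB_STATUS
      WHERE JOB_NAME = \'EXTRACT_DATA\' AND STATUS = \'READY\'');
```

With the sample table above, this returns 01/09/2018 06:00:00, the start of run 2.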

The global variable $G_Delta_Date is then used in the where clause when the dataflow is executed, just like in the overlap example.

Please note that there are numerous other ways to define a delta.

Target based (Full/Initial Load)

Extract all data and compare source and target table to determine which rows have changed and only load new and changed rows.

When extracting ALL data, NO where clause is used in the BODS flows; essentially, a full/initial load is performed every time.

There are multiple ways in which BODS flows can be designed to process data:

  • Full pushdown
    • All operations which can be pushed down between source and target are pushed down to the database
    • Auto correct load (same as table comparison)
  • Partial pushdown
    • A SELECT statement to the database is generated
    • All data is being transformed first by the BODS server and loaded into the database afterwards
    • Table comparison compares 2 sets of data and outputs the difference and marks rows as INSERT, UPDATE or DELETE

Combining Delta and Full/Initial Load in 1 dataflow

At first sight you would think you need to develop two separate flows to extract data from source tables, either in delta or in full/initial mode. But this can actually be done in one and the same flow.

How is this done in BODS?

Just by creating one extra global variable: $G_Load_Mode.

This load mode can have the values DELTA or INITIAL for example. Depending on the load mode ALL data or a SUBSET of data will be extracted from the source.

This variable can be changed at runtime.


The where clause will also look quite different; it can best be described as a kind of conditional where clause.
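The conditional where clause can be built by letting the load mode short-circuit the delta filter; assuming the same hypothetical LAST_UPD_DATE column as before:

```
( $G_Load_Mode = 'INITIAL' )
or ( SOURCE_TABLE.LAST_UPD_DATE >= $G_Delta_Date )
```

When $G_Load_Mode is 'INITIAL' the first condition is always true and all rows are extracted; when it is 'DELTA', only the rows changed since $G_Delta_Date pass the filter.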

Extra tip: the three global variables mentioned in this article can also be defined as substitution variables; in that case they don't have to be created in every single job.
