BODS Data Extraction
Loading data into a DWH can be a hassle, and so can defining your delta. So let's keep things simple. There are numerous ways to define a delta and load data into a DWH, but let's highlight a few simple strategies. In the end we will combine a delta and a full extraction into one single flow by using variables and a conditional where clause.
Capture Changes
- Source based
- Target based
Source based (Delta Load)
Evaluate the source tables and only extract rows that have changed. Your source must provide some indication of the change. Changes can be indicated by time stamps or can be tracked in change logs.
Various mechanisms can be used to extract data from source tables, but let's discuss the following two options:
- Overlap
- Start and End
Overlap
A filter is applied to the source table on its change and/or creation date, using the current date minus the overlap period as the starting point. Successive runs therefore deliberately extract overlapping data.
Let's get practical: how can this be implemented in BODS?
At job level we create two global variables: $G_Delta_Date and $G_Delta_Days.
Assign the following values to these variables:
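A possible setup (the datatypes and default values are an assumption; the 7-day overlap matches the example below):
- $G_Delta_Days : int, default value 7 (the overlap period in days)
- $G_Delta_Date : datetime, no default value (it is calculated in a script at runtime)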
These two variables can be modified at runtime as you like. If you want to extract data for the last 30 days, you can do so simply by adjusting the variables accordingly.
Now use a script to assign the variables and construct the starting date for the source extraction.
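A minimal sketch of such a script, using the two global variables defined above:

```
# Starting date for the extraction: today minus the overlap period in days
$G_Delta_Date = sysdate() - $G_Delta_Days;
```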
Data for the last 7 days will now be extracted. ($G_Delta_Date = sysdate() – 7)
The where clause in the extracting dataflow will look like this:
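For example (LAST_UPDATE_DATE and CREATION_DATE are illustrative column names; use whatever change indicator your source actually provides):

```
SRC_TABLE.LAST_UPDATE_DATE >= $G_Delta_Date
or SRC_TABLE.CREATION_DATE >= $G_Delta_Date
```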
So far we have discussed how to extract data with an overlap period of your own choice. The main disadvantage of this extraction method is that too much data is extracted from the source and loaded into the target, which puts your database and ETL engine to work unnecessarily.
Applying the next technique partially solves this issue. Too much data will still be extracted from the source, but we can prevent too much data from being processed by the DWH target database and the ETL server. In BODS this is done by using the target table as an additional source table and comparing non-key fields from source and target in the where clause of the Query transform. Now only the real changes in the data are selected before being transformed and loaded into the DWH.
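A sketch of such a where clause, assuming the target table DWH_TABLE has been added as a second source in the dataflow and is joined to the source on its key (all column names are illustrative):

```
SRC_TABLE.ID = DWH_TABLE.ID
and ( SRC_TABLE.AMOUNT <> DWH_TABLE.AMOUNT
   or SRC_TABLE.STATUS <> DWH_TABLE.STATUS )
```

Only rows whose non-key columns differ from what is already in the target pass the filter. To let brand-new rows through as well, the join is typically defined as an outer join and rows with a NULL target key are kept too.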
Start and End
This approach is a little more complicated since we also need a status table to keep track of previous runs. The data will be extracted without overlap, which limits the number of extracted rows because earlier successful extractions are taken into account when determining the starting point for the extraction.
A status table is maintained to keep track of the extractions. It acts as a logging table: for every executed job the start datetime and end datetime are logged, and the row is updated with a status.
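As a sketch, this table could be maintained with two small scripts around the dataflow, one at the start and one at the end of the job, using the columns of the status table shown below (the datastore name DS_DWH, the table name ETL_JOB_STATUS and the Oracle-style SQL are assumptions; RUN_ID generation is left out for brevity):

```
# Script at job start: register a new run with status BUSY
sql('DS_DWH', 'INSERT INTO ETL_JOB_STATUS (JOB_NAME, START_DATETIME, STATUS) '
           || 'VALUES (\'EXTRACT_DATA\', SYSDATE, \'BUSY\')');

# Script at job end: close the run with status READY
sql('DS_DWH', 'UPDATE ETL_JOB_STATUS SET END_DATETIME = SYSDATE, STATUS = \'READY\' '
           || 'WHERE JOB_NAME = \'EXTRACT_DATA\' AND STATUS = \'BUSY\'');
```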
When the ETL job starts, a lookup is performed to obtain the start date and time of the last successful run.
A basic status table could look like this:
RUN_ID | JOB_NAME | START_DATETIME | END_DATETIME | STATUS |
1 | EXTRACT_DATA | 01/09/2018 04:00:00 | 01/09/2018 04:05:00 | READY |
2 | EXTRACT_DATA | 01/09/2018 06:00:00 | 01/09/2018 06:05:00 | READY |
3 | EXTRACT_DATA | 01/09/2018 08:00:00 | NULL | ERROR |
4 | EXTRACT_DATA | 01/09/2018 10:00:00 | NULL | BUSY |
In this example, job EXTRACT_DATA will begin extracting data from the source again starting from 01/09/2018 06:00:00: run 2 is the last run with status READY, since run 3 failed and run 4 is still busy.
In BODS the lookup is done in a script, where the looked-up value is assigned to a global variable.
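A sketch of such a script, using the sql() function against the status table (the datastore name DS_DWH and the table name ETL_JOB_STATUS are illustrative; a lookup_ext() against the same table would work just as well):

```
# Starting point for the extraction:
# the start datetime of the last successful (STATUS = 'READY') run of this job.
# Depending on your database, the returned varchar may need a to_date() conversion.
$G_Delta_Date = sql('DS_DWH', 'SELECT MAX(START_DATETIME) FROM ETL_JOB_STATUS '
                           || 'WHERE JOB_NAME = \'EXTRACT_DATA\' AND STATUS = \'READY\'');
```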
The global variable – $G_Delta_Date – is used in the where clause when the dataflow is being executed like in the overlap example.
Please note that there are numerous other ways to define a delta.
Target based (Full/Initial Load)
Extract all data and compare source and target table to determine which rows have changed and only load new and changed rows.
When extracting ALL data, NO where clause is used in the BODS flows; essentially a full/initial load is performed every time.
There are multiple ways in which BODS flows can be designed to process this data (the SQL generated in each case is sketched after the list):
- Full pushdown
  - All operations that can be pushed down between source and target are pushed down to the database
  - Auto correct load (same as a table comparison)
- Partial pushdown
  - A SELECT statement to the database is generated
  - All data is first transformed by the BODS server and loaded into the database afterwards
  - Table comparison compares two sets of data, outputs the differences and marks rows as INSERT, UPDATE or DELETE
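Schematically, and purely as an illustration, the SQL generated in the two pushdown scenarios differs as follows (a full pushdown requires source and target to be in the same database or in linked datastores; table and column names are illustrative):

```
-- Full pushdown: a single statement, executed entirely by the database
INSERT INTO DWH_TARGET (ID, AMOUNT, STATUS)
SELECT ID, AMOUNT, STATUS
FROM   SRC_TABLE;

-- Partial pushdown: only the SELECT is pushed down; the rows travel
-- through the BODS job server before being loaded into the target
SELECT ID, AMOUNT, STATUS
FROM   SRC_TABLE;
```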
Combining Delta and Full/Initial Load in 1 dataflow
At first sight you would think you need to develop two separate flows to extract data from the source tables, one for the delta and one for the full/initial load. But this can actually be done in one and the same flow.
How is this done in BODS?
Just by creating 1 extra global variable: $G_Load_Mode.
This load mode can have, for example, the values DELTA or INITIAL. Depending on the load mode, ALL data or only a SUBSET of the data will be extracted from the source.
This variable can be changed at runtime.
The where clause will also look quite different; it can best be described as a kind of conditional where clause.
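A minimal sketch of such a conditional where clause (LAST_UPDATE_DATE is again an illustrative column name):

```
( $G_Load_Mode = 'INITIAL' )
or ( $G_Load_Mode = 'DELTA' and SRC_TABLE.LAST_UPDATE_DATE >= $G_Delta_Date )
```

When $G_Load_Mode is INITIAL the first condition is always true, so no date filter is applied and the complete table is extracted; when it is DELTA, only the rows changed since $G_Delta_Date pass through.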
Extra tip: the three global variables mentioned in this article can also be defined as substitution parameters; in that case they don't need to be created in every single job.