Take on the fight with Python

Take on the fight with Python

Introduction

If there is one thing common in everyone’s career than it is that at some point everyone said: “Can’t this be automated or made simple?”. On a given moment at my customer I was asked to do a data-compare of two data providers with over 50 characteristics each and containing millions of records in SAP HANA. At first I started to run queries against the SAP HANA-tables to check the data but soon enough I thought…

“Is it possible to automate this or made more simple?”

I had the privilege of already been introduced to Python and a number of libraries which made it possible for me to investigate this further and I will tell you what I did to get to a solution that helped my colleague in fixing the inconsistencies between the two data providers. He mentioned to the team when he got asked if my efforts in Python helped him with a clear answer, being…

“Yes, it certainly helped because, with working this way, everything can be checked and not just some random samples.”

Technical aspects

Before explaining what I did to help my colleague I will first tell something more about the technical aspects of the tools used.

  • SAP BW 7.4
  • SAP HANA Studio 2.3.25
  • SAP HANA 1 SP 12 database
  • Python 3.6 with following libraries:
    • csv
    • sqlalchemy (generic connection library)
    • pyhdb (dedicated connection library to HANA DB)
    • pandas
    • pyhdb
  • Eclipse Jee Oxygen

I will not be going deep into the technical setup of the solution and the tools but for my purpose I created a dedicated class with the actions I wanted to perform for my check and a class that handles the I/O of the solution. I also created a table that will track the comparison across separate runs with a top 3 result of all the characteristics and key figures checked.

Building the solution

To make my life even easier I searched for possibilities to do everything automatically from searching which columns should be checked, apart from the keys I wanted to use to check to checking the data itself based on some logic that I will explain later to the logging of the top-3 data differences.

Step 1: Getting the columns needed to know what needs to be checked

This is easily done by creating a union-query between the columns-table and the measures-table while providing the catalog-names and the table-names.

SELECT

*

FROM(SELECT

“T1″.”DIMENSION_NAME” AS “OBJECT”,

’DIMENSION’ AS “CLASSIFICATION”

FROM “_SYS_BI”.”BIMC_DIMENSIONS” AS “T1”

INNER JOIN “_SYS_BI”.”BIMC_DIMENSIONS” AS “T2” ON “T1″.”DIMENSION_NAME” = “T2″.”DIMENSION_NAME”

AND “T2″.”CUBE_NAME” = ‘{}’

AND “T2″.”CATALOG_NAME” = ‘{}’

AND “T2″.”DIMENSION_NAME” != ’Measures’

WHERE “T1″.”CUBE_NAME” = ‘{}’

AND “T1″.”CATALOG_NAME” = ‘{}’

AND “T1″.”DIMENSION_NAME” != ’Measures’

UNION SELECT

“T1″.”MEASURE_NAME” AS “OBJECT”,

’MEASURE’ AS “CLASSIFICATION”

FROM “_SYS_BI”.”BIMC_MEASURES” AS “T1”

INNER JOIN “_SYS_BI”.”BIMC_MEASURES” AS “T2” ON “T1″.”MEASURE_NAME” = “T2″.”MEASURE_NAME”

AND “T2″.”CUBE_NAME” = ‘{}’

AND “T2″.”CATALOG_NAME” = ‘{}’

AND “T2″.”MEASURE_NAME” != ’1ROWCOUNT’

WHERE “T1″.”CUBE_NAME” = ‘{}’

AND “T1″.”CATALOG_NAME” = ‘{}’

AND “T1″.”MEASURE_NAME” != ’1ROWCOUNT’)

ORDER BY “OBJECT”‘.format(pTable2,

pCatalog2,

pTable1,

pCatalog1,

pTable2,

pCatalog2,

pTable1,

pCatalog1)

The best part at this query and the way I am working is that I can provide my function ANY catalogs and ANY tables to get the mutual columns and start checking with these. This query was read with the read_sql_query-function of pandas to get a list of all the columns that needed to be checked.

As said before, this list contained more than 50 columns that needed to be cross-checked between different data providers.

Step 2: Prepare yourself to write new lines in your own defined snapshot-table

This query was not made dynamic as you know where your snapshot-table is located and what the column-name is of the column that stores your “RUN_ID”. After getting the max-number of “RUN_ID” it was easy to get the next value to be used for tracking the comparisons.

vLatestRunId = vLatestRunId.get_value(col=‘run_id’,index=0)

if vLatestRunId == 0:

vLatestRunId = “{}”.format(1)

else:

vLatestRunId = vLatestRunId + 1

Basically above tells us to get the value for column ‘run_id’ and if this value = 0 then the ‘run_id’ is equal to 0 and else it is equal to the current maximum ‘run_id’ + 1. This value will be used later on to write the comparison-results to the database-table.

  1. Perform the data check based on a certain logic

I mentioned before that logic had to be implemented, being the one below where I need to clarify that we had three keys. Maintaining the secrecy of the project I will provide three different key-fieldnames, being: Customer, Location, Address

  • If “Customer” is null then “Location” and “Address” are the keys that are used to match the two tables
  • If “Customer” is not null then only “Customer” is used to match the two tables

After experimenting and fine-tuning the dynamic SQL-queries used I was able to translate this to a Python way of working to be able to automatize this for every column in the previously made list. This means that by using a for-loop over the list I was able to push a number of queries to the database and write away the results into a new table. I combined my query to have immediately the differences written away into the snapshot-table.

Because of the difference in logic I had to provide two SQL-scripts that probably could be combined in one SQL-script as well but for readability splitting it up was much more convenient. One of the SQL scripts performed the following steps:

  1. Check if “Customer” is null and if so join on the two other key-columns
  2. Check the data based on the two other key-columns and the current column in the loop
  3. Insert a random top 3 of the results in a snapshot-table if there were differences

The combined SQL was saved into a variable and was pushed into the HANA database which made me most likely the biggest user of the database with let’s say 100 queries in 5 minutes. If I look back at this I can’t keep myself from thinking…

“I cannot imagine how I would have done this without the help of Python and its libraries!”

  1. Provide my colleague the results

The comparison was performed, the snapshot-table was updated with new records so now I was able to create a SAP HANA calculation view to show my colleague the results and based on a count, I did not mention before, he could make a list of the columns that needed most attention. It is clear that a column with 75% of differences has a much more importance than a column with 2% of differences. This calculation view was by the way not complex as shown below.

 width=

  1. Investigate the difference

My colleague was able to investigate the most important data differences based on the queries he ran on above view and he could start developing a solution for the data differences. After doing a first round of analysing and solving my colleague was able to put the data back into the test-environment and the data-comparison could start over. You can’t imagine how happy I was to have written this automatic way of data comparison because now I just had to do one click on the execute button any simply lay back and wait on the results to be finished.

Conclusion

It is only fair to say that it took me some days to get the entire Python-script up-and-running but I am absolutely sure that if I did not invest this time I simply was not able to do the data comparison myself in the same amount of time. At this point I can simply click one button and get a whole new analysis of my two tables or even ANY two tables with matching column-names. I think it is fair to say that everyone at our customer thought the same thing…

“I cannot imagine how I would have done this without the help of Python and its libraries!”

Future improvements

Without showing the back of my tongue there are thoughts to even improve the above by creating a GUI where two tables can be chosen and where a mapping-table can be provided for non-mutual columns to perform the data comparison. The other aspects stay pretty much the same with writing away into a snapshot-table to improve reporting on the data comparison.