Skip to main content

SQL Server ETL for Data Lineage

What is data lineage?
Ok let us suppose you built a wonderful database with loads of data coming from source files from your vendors, your own AS400, and different departments of your own organization. All of this data has been imported into 200 tables in this database producing a plethera of information that is used for reporting purposes. One day a department head comes to you and says, "I think this number here on this report is wrong. Where did you get it from?" Do you escort that person to the computer room and show him/her your server? No, I didn't think so. So how do you come up with where the information came from? Likely you find a data load expert in your IT department and have him/her spend the next several minutes/hours/days rummaging through stored procedures, ssis packages, dts packages, custom applications, etc trying to find this information. However, if you had this:

you could simply pull up the history of how that data came to be loaded and point. How useful is this? Well it would save that poor data expert of yours minutes/hours/days per senseless inquiry where he/she could be cranking out the next great package. Basically data lineage is invaluable.
How hard is data lineage to come by? Well unfortunately right now you don't have an easy solution in SQL Server. There are plenty of examples on how to record your source information to the destination in separate tables in sql server. Does this give you user friendly access to your data lineage no. You can spend a fortune on products like informatica which admittedly do a wonderful job of graphically representing your data lineage but what if you are on a budget? What if you are in a small shop and the owner keeps coming to you several times a day/week/month asking questions about where you came up with numbers on his/her report? Sadly, I have led you to this climatic point only to tell you there is no answer that I would hang my hat on. There is open source data lineage that is graphical in nature but the last time I looked at it I would liken it to heiroglyphics drawn by our ancestors on the walls of caves. Pathetic and lacking. I have asked microsoft to include data lineage in future releases of SQL Server. I haven't heard a word back on whether this will happen or not. My guess is not anytime soon but we can still be hopeful.

Comments

Popular posts from this blog

Why fragmentation occurs and how to avoid/fix it.

Let's suppose you have this table: We have made last name the primary key in this table Now lets import some rows into the table and check the fragmentation Notice the fragmentation at 96.48% and all we did was one import of 16426 records. So now we have several questions: How bad is this? Fragmentation causes SQL Server to skip around to read data from your tables. For a one time read on a small table this is virtually meaningless. In a job that may need to read millions of records, this could tear apart your performance. Ok this is bad. How did it happen? The primary key on a table is clustered. This means the data will be stored according to the primary key. In our case, last name. When data doesn't come in the same way you store it (i.e. Customer names are not coming in alphabetically, but rather randomly) SQL Server must constantly split pages to store the data correctly. This causes data to be "fragmented" into multiple areas of the disk instead of one continuou...

DBA 101 - Connecting to an unresponsive SQL Server

I will attempt, over the course of many blogs, to tackle troubleshooting for a beginning to intermediate DBA. Troubleshooting is like an octopus with a hundred arms. There is no silver bullet but at least I can give you some tools for your belt to help determine the next steps in troubleshooting many common problems that you will see. So where do we begin? I don't know. Let's dive in and see where we end up. Problem - Nobody can connect to the SQL Server and it is not responding to any requests. Wow this seems like an impossible problem and is in reality two problems. Lets address the most critical problem which is you cannot even address why the SQL Server isn't responding to requests because nobody can connect to it to see what is going on. A weak solution - Often times an inexperienced DBA or what is often called an "accidental" DBA would pull the plug on the server, wait ten seconds, and then power it back on. Now this isn't the worst possible solution.....