Ads 468x60px

Tuesday, September 27, 2011

Learn SSIS : SCD - Slowly Changing Dimensions transformation




Slowly changing dimension transformation was one of the mystery transformation for me till now, as i never had chance to actually implement it. So today i block my time to implement this transformation by taking a simple example.


The concept of slowly changing dimension is explained very well and in detail on wikipedia, i would request readers to have a look at below link if you are not aware of SCD concept.

http://en.wikipedia.org/wiki/Slowly_changing_dimension

Before starting on our example, let me give you some glance of the SCD




 transformation provided in SSIS 2008. This transformation is used to implement Type 1 and Type 2 SCD, for other types we need to add some custom logic to our ETL. (Type 1: Identify the changes and update the record, no history will be recorded, Type 2: Any change identified we expire the old record and create new record with new values, here we save history information in old record)

OK.. Lets take simple Employee dimension example... in this example i am getting EmployeeFeed.xls file as input for my DimEmployee table (which is my dimension table) and i am using SCD transformation to identify any changes and implement DimEmployee as Type 2.

DimEmployee :

Create  table DimEmployee



EmpKey int identity(1,1),

EmpId int,

Name varchar(100),

Designation varchar(100),

City varchar(100),

Phone varchar(10),

StartDate datetime,

EndDate datetime

)

So before we start implementing any SCD, we need to first identify attribute in the dimension table for which we need to track changes. In this example i want to track changes for Designation, City and Phone attributes. I am expecting no change in Name attribute or column.

You might have noticed that there are two columns EmpId and EmpKey why these columns are needed in dimension table??

Ans:
EmpId : This is a Business Key, which uniquely identifies a employee in entire data warehouse system.

EmpKey : Is a Surrogate key, which uniquely identifies record in dimension table, and also its a key to identify historical records.

We also have two more columns StartDate and EndDate, these two columns are used to track time of changes, if EndDate is null it means the record is most recent record in dimension table.

Steps to Implement SCD in a data flow.

1. After we add Source (which is excel in our case EmployeeFeed.xls), we need to add Data conversion transformation, to correct if there are any Data type conflicts.
2. Then we add SCD transformation to Data flow, and this will open SCD wizard, Click next on welcome screen.
 3. On Select a Dimensoin Table And Keys page, select your dimension table in this case its DimEmployee, Map all the columns from source excel to destination DimEmployee table. One important thing here we do is identify Business Key, which in our case is EmpID. Then click Next
 4. On Slowly changing dimension columns page, we need to select appropriate change type of the Dimension Columns and here we have three types :
Fixed Attribute--> No change expected
Changing Attribute --> Changes are expected, but no need to record history, same record will be updated.
Historical Attribute--> If this attribute is changed, old record will be expired (by setting EndDate as current date) and new record will be inserted with new attribute value

In our example, we don't expect any change for Name Attribute hence we selected this as Fixed Attribute, and rest all (Phone,Designation and City ) will be selected as Historical Attribute. Once we are done Click Next

5. On Historical Attribute Options Screen, we have two option, we can use any flag column to show which record is expired and which is most recent and other option is to use StartDate and EndDate, in this example we are using second option, and also selected StartDate and EndDate column appropriately.
6. For all other screens in this wizard just select next, and on last screen select Finish.

That's it.. we implemented SCD transformation...your data flow should look like as shown below.

 If you have noticed, we have two outputs from SCD transformation, New Output and Historical Attribute Output. So if there are any new records which are not present in dimension table those records will be redirected to New Output, and all existing records with some changing attributes will be redirected to Historical Attribute output.

Running the data flow..

I have  9 records in my sample EmployeeFeed.xls file..

So when i run my data flow for first time, all these 9 records will be redirected to New Output  and will be inserted to DimEmployee Table.

Next, I did some changes in EmployeeFeed.xls, Changes are marked in yellow... so there are 4 records which are changed and 2 new records added.



If you can see the data flow, two records are redirected through New Output pipeline and 4 moved through Historical Attribute output, so what happens to those 4 records is we update the EndDate to latest date, then again insert them with new changed attrribute keeping EndDate as null. as shown below.


So that's it... hope you have learned something out of this ...

Thanks...

15 comments:

  1. In scd2 implementation why we use 3 expression transformations

    ReplyDelete
    Replies
    1. There are no expression transformations in SSIS, can you please elaborate your question?

      Delete
  2. Start date , End date and new start dates are same, Can look into this ?

    ReplyDelete
  3. Hello,

    Thanks pointing out the error, the reason why the dates are same is on Historical Attribute option page, I have selected variable to set date as System::CreationDate rather i should have selected System::StartTime.

    I will correct it soon, thanks again for your reply. Please also let me know if you want me to cover few more topics.

    Regards,
    Kutub

    ReplyDelete
  4. Hi
    I don't understand the last part, how does the SSIS know you changed those fields?
    Where does it compare the new file and the old one?
    I am trying this with a sql table, I changed the value with an sql statement, ran the SSIS again and it did not realize I made a difference..
    Your help is highly appreciated
    Thanks!

    ReplyDelete
  5. SSIS compares the input (in my case its excel file) with the existing Dimension table (DimEmployee in my case) based on BusinessKey ( in my case its Emp_Id). Please make sure the business key exist in your dimension table to see differences between input and output of SCD transformation.

    I hope this helps..

    Regards,
    Kutub

    ReplyDelete
  6. where this Historical data will be stored

    ReplyDelete
  7. in destination table, with more rows with same bussines key
    see last excel screenshot.

    correct me if i'm wrong

    ReplyDelete
  8. Yes you are correct, the Historical data will be sotred in the same table but with new record with EndDate, and the fresh record will have End Date as NULL.

    I hope it helps..

    ReplyDelete
  9. is it possible to save changes in different table?

    Thank you for your answer

    ReplyDelete
  10. Yes, but in that case you don't have to use SCD transformation, you can do it by using Lookup transformation, and build your custom logic.

    In this article I have shown and ideal approach to implement SCD type 2 using SCD transformation.

    Thanks

    ReplyDelete
  11. getting The below error
    Cannot map columns of different types.
    Column 'Copy of ID' is of type 'System.Double' and column 'EmpId' is of type 'System.Int32'.
    Column 'Copy of phone' is of type 'System.Double' and column 'Phone' is of type 'System.String'.

    ReplyDelete
  12. getting the below error
    Cannot map columns of different types.
    Column 'Copy of ID' is of type 'System.String' and column 'EmpId' is of type 'System.Int32'.

    ReplyDelete
    Replies
    1. The error simply says you need to convert the datatypes of those two input columns.

      Delete
  13. I think SSIS is one of the best way to solve tricky and complex IT operations.
    SSIS Postgresql Write

    ReplyDelete