Data warehouse (DW) testing is a far cry from functional testing. As testers, we need to let the team know if the DW dimension, fact, and bridge tables are getting the right data from all the source databases, storing it in such a way as to allow users to build reports, and keeping it current.
We are using automated tests as one of our test approaches. Contrary to popular belief, we found DW testing to be an excellent place to build and reuse automated tests. We came up with a simple automated test template that is allowing testers who have never done test automation to build up a pretty sweet library of automated tests.
Here's how it works:
Our automation engineer built us a test with two main parameters we pass in; a SQL statement for source data and a SQL statement for target data. The test compares the two record sets, including all data values and asserts they are equal. As the programmers build the various DW dimension, fact, and bridge tables, the testers copy and paste said automated test, and swap out the SQL with their SQL for the current test subject.
One of our most important tests is to rebuild the programmer's SQL ourselves, which is a manual process. If the tester's datasets don't match the DW, there may be a problem. This is where most of the defects are found. As we manually build our SQL, we plug it into an automated test that actually compares every record and column data value on millions of rows, and tells us where the differences are. The byproduct is an automated regression test.
Sure, rebuilding much of what the programmers did is challenging. We've had to learn case statements, cross DB joins, data type casting, and keep our heads straight while working through queries that include 30 or more joins. But normally it's fun and easy to get help from Google or our programmers.
We've been getting one or more automated tests completed for each dimension, fact, or bridge table; sometimes as early as in the development environment. When the DW is deployed to various environments, we normally just execute our automated tests and interpret the results. It's quite beautiful.
I’ll blog about some of the other tests we execute. There appears to be a shortage of pragmatic DW test ideas out there.

18 comments:

  1. Anonymous said...

    I've been dying to hear ideas on how others test their DW components.

    My experience varies to yours. No matter how much I try to get automation in, its rarely (read never) worked. It would be fair to say, there hasnt been a project on which we've had the luxury of time.

    I agree with your fundamental approach of checking data at source and DWH, but at times we are delivered ETL *and* front end components such as cubes, drill through reports, dashboards and reports at the same time. I then tend to test it as follows:
    1. validate all dimensions against each measure (at least once) against source.
    2. validate multiple dimensions against a measure to allow maximum overlap and therefore 100% coverage
    3. in case of mismatch, trace it back from presentation to staging to source and therefore help out dev team to fix the issues
    4. in case of ETL, I normally try to validate that all functional requirements are fulfilled - i.e. all facts, dimension, bridge tables have enough information that allows the reports to be generated.

    The fact that we (the company I work for) churn out data warehouses fast, regrettably, little time gets allocated to regression. I've had to take an "exploratory" approach on this to allow individual tester to do their own analysis as a customer would do.

    I'd very much like to exchange ideas on how effective the tests can be...

  2. Mike Girenko said...

    I found this article to be interesting, "Where are the Articles on Data Warehouse Testing and Validation Strategy?", found here: http://www.information-management.com/infodirect/20021004/5855-1.html

    Mike Girenko

  3. Andrei Contan said...

    DW testing is more than just SQLs.
    A datawarehouse involves both Oracle and MSSQL skills, ETL and also programming skills
    I'm working with DW testing for more than a year, and our biggest challenge is Automation. Currently, our only solution is using UTPLSQL framework for building tests for our PLSQL code. We very often run into replication issues which are distinct problems that cannot be tested using UT frameworks.
    At this point, I'm working on my Phd project, it's about High VOlume Automation TEsting in DW environment. I would love to have some knowledge exchange in this particular area, cause as you said, there's not too much work published or studies made on this field.

    Cheers,
    Andrei

    I can be reached on my email address: andrei at qaheaven dot com

  4. Bill H said...

    Eric , that is exactly how we do it. I agree that it is much, much different from functional testing. But we couldn't find a test tool that adequately handled our needs regarding firing off SQL and performing data compares so we built a tool (http://www.rttsweb.com/services/products/querysurge/). Let me know any feedback you may have.

  5. Anonymous said...

    Hey Bill you product looks good. But how is the pricing ?

  6. Bill Hayduk said...

    Roger: ping me personally and we can provide you with pricing.

  7. Bill said...

    Great post, I found it very interesting. Still learning about the warehouse automation industry and the products. Thanks.

  8. D- said...

    Hi Eric:

    I wasn't able to attend the Fall conference in Dallas or Spring conference in New Orleans for STP. I am extremely interested in your track for "Automating Data Warehouse Tests". Currently our team writes re-usable scripts from source to target and perform data type/column compares and counts. We also perform ETL testing using Informatica. We reuse these scripts by changing our parameters based on subject and ETL mappings. The blog already describes exactly what we do. Is there additional information regarding the automation side or other tips or is the reusable scripts considered automation?

    Also I am interested in attending your next conference or webinar if you have one coming up for 2012-2013, please let me know and/or supply venue,dates, and times? If you have additional reading for DW testing, I am also interested. I'm looking forward to hearing from you!

  9. Eric Jacobson said...

    D, Yes...probably. When the inspiration hits me, I'll do more DW posts. Feel free to contact me via the email link on my blog and I can send you my STPCon presentation or discuss more details in the meantime.

  10. Anonymous said...

    Hi Eric , nice to meet you,sorry i could no find your mail id

    am jagan an etl tester,could you please tell me how effectivelyand quickly we can test the etl reports when compared to any database which both has huge millions of records other than the etl testing validation

    my friends are suggesting to learn automation or by using macro tool to compare the databut i dont have any knowledge about it ,i will try to learn but is there any other way to test effectively betwwen the repaorts and database

  11. Eric Jacobson said...

    Thanks for the question, Jagan.

    You can read my 4 data warehouse blog posts. That may give you ideas.

    The typical approach, which does not use automation, is to build a user report then work with users to check the report content.

    If you read this post, you commented on, carefully, you'll notice I am explaining a way to help manual testers use test automation.

  12. Cathie said...

    Hi
    I've created several automated SQL scripts which can be used on any 'sensible' solution. Sometimes they require tiny tweaks, but in the main it is a case of changing the Table_Catalog name only. They run across all tables within a layer and produce a single report with a pass/fail status for each table for the item being tested e.g. Referential Integrity checks for Facts to Dimensions, check for duplicates, checks Type 1 and Type 2 table changes, comparing one table with another to ensure everything has loaded as per expectations. As I say as long as the warehouse is consistent in naming conventions they work. I've used them in several companies on many different projects, so they're definitely robust.

  13. Anonymous said...

    I got a question here, normally in data warehouse we have different databases as source and target systems. So where we should run this stored procedure passing two param as two SQL script which compare two different database tables. Because in that case database tool should have the capability of connecting to both databases

  14. Eric Jacobson said...

    Anonymous, you should probably use a config file in your test project. The config file can have the database connection strings for the source and target. In our case, we used a data reader to compare the source and target record sets row by row on the server side.

  15. François said...

    By now most serious dw/bi testing environments probably have some sort of source-target detailed test tool going, as it is so easy to build. With a reasonably performant dbms it is easy to provide 100% data and business rule coverage. One wants a setup whereby testers need to add the mininum: source and target SQL and DB connections (allowing for cross-db and server tests) and some descriptive info. We have a YAML file format for each test case. This can then go into a version-controlled environment enabling other testers to modify and for a suite of related tests to run for regression testing. We have had this going for quite a while now and it is pretty much indispensable in finding lots of bugs business analysts and developers would never find by manual testing either against aggregated only or a few detail comparisons. Not quite sure why anyone would want to pay money for a commercial tool...

  16. Mathew said...

    Hi Cathie,

    Would love to learn more about how you build your automation framework using Table_Catalog. Could you share some scripts if possible. my contact - dmathewk@gmail.com

  17. Test Data Architects, LLC said...

    Hi Eric,
    Great approach, I did something similar in 2008/9. The queries can be troublesome to manage, so I built a spreadsheet that concatenated (with conditionals) the source and target SQLs from mappings. You can also add an aggregate check layer. A developer helped me to build a simple exe (well simple for the developer) to execute the paired queries as a feed in list and show me the differences only. If you have some kind of internal messaging queue to leverage, you can use it to feed your change conditions so you're not always checking the same transactional data cases.

    Spectacular for always hitting regression cases.

  18. Ratna said...

    Hi Cathie,
    Your solution seems to be interesting. Can you please share some sample scripts ?
    ratnasriv@gmail.com is my id



Copyright 2006| Blogger Templates by GeckoandFly modified and converted to Blogger Beta by Blogcrowds.
No part of the content or the blog may be reproduced without prior written permission.