Friday, April 13, 2007

On Metadata-Driven Analysis Management Objects (Part 1)

Part 1: Background

Over the last few months we have been hard at work on a metadata-driven ETL pipeline for an 880Gb SQL Server 2005 BI solution. Powered by SSIS, metadata has allowed us to create a single SSIS package for a generic task, re-using it many times based entirely on the information we supply to it.

For instance, our solution has a single "Import Flat File" package which uses metadata describing the source file, the system from which it originated (for traceability, among other reasons), the target "Import" table as well as a destination "Archive" folder for the file itself, once the package completes.

We take the same approach for performing surrogate-key generation, referential-integrity checks, data transformations, loads, archives and some maintennance operations in the solution.
At a higher level of abstraction, we use process codes as both an entry-point to describe all metadata associated to a given process as well as the foundation for a point-in-time instance of execution for said process (the latter is important for logging process events and errors). Metadata allows us to semantically "chain" together streams of Extract (the "Import" package described above), Transform and Load processes (as well as other supporting processes). We are able to govern the execution of process schedules through an external application, which only needs to execute dtexec (a command-line utility for executing SSIS packages) with the appropriate SSIS package and process code argument. The external process governor allows us to execute discrete processes either sequentially or in parallel, in-line with precedence-constraints for loading target tables in an Inmon-style relational data warehouse. Our relational fact tables are all partitioned, supporting high-volume daily loads, and, the generation of partitions is also metadata-driven.

Further along the value-chain, we have an SSAS 2005 database which houses a single cube (UDM) supporting a number of measure groups. Each of the measure groups is front-loaded by day-level partitions, allowing our users to continue working with their desired perspective while the new data is being processed in. The measure groups are aligned to corresponding relational fact table partitions by (you guessed it) metadata.

Altogether, the SSIS facet of the solution operates on just 8 small packages. Essentially, if there are any errors raised by the process- governing schedule by a problematic process, the problem is 99% of the time metadata-related. More to the point, we continue to add value to the system, by only implementing new structures (either in the relational or multi-dimensional areas of the solution) and metadata to describe it along established standards.

In keeping with our approach of metadata-driven reusability, we created a single package to process a new partition for any measure group, driven by a process code, and group of related metadata describing:
  • ServerName (the target SSAS 2005 server instance)
  • DatabaseName (the target SSAS 2005 database within the server instance)
  • CubeName (the target UDM within the SSAS 2005 database)
  • MeasureGroupName (the name of the measure group, which has metadata mappings to a corresponding relational fact table)
  • PartitionName (stored as a partition prefix in the format "MeasureGroupName_" which is concatenated to a DateKey to describe the "slice" of data within a given partition)
  • DateKey (an 8 digit integer, also known as a "smart key" in the format YYYYMMDD, and used as the basis for relational fact table partitioning)
  • QueryBinding (the query string used to define the dimension attribute keys and measure members for a measuregroup, sliced by the DateKey)
  • ConnectionString (relevant for dynamically connecting to a given SSAS 2005 database at run-time, providing future-proofed flexibility if another SSAS 2005 database needs to be implemented)
  • PartitionKeyCounts (an XML document which describes the association between a dimension key column in the relational fact table and the corresponding cube dimension attribute in the UDM.)
  • OptmisationLimit (reflects the "Performance Gain Reaches" property in the Aggregation Design Wizard)

This collection of metadata allows us precise control over which measure group is to be front-loaded with a new partition, and provides a high degree of flexibility for multi-server, multi-database scenarios as they may emerge.

The high volume of data the system is subject to, and, the demand for high query performance warrants precise counts of partition fact keys. Thus, in order to dynamically build and process our partitions, in addition to building and processing our aggregations at run time, we looked to AMO for simplicity in handling metadata. A generic SPROC, called by the SSIS package in an Execute SQL Task, with the appropriate process code would retrieve the "OLAP Partition Processing" metadata described above as a single row result-set, mapped to package variables, and supplied to an SSIS Script Task. We also chose to look beyond the stock control-flow tasks (Analysis Services Execute DDL Task, Analysis Services Processing Task) and data-flow tasks (Partition Processing Task) in order to ease supportability for administrators since they already understood VB.NET over ASSL (XML based Analysis Services Scripting Language).

In the second installment of this post I will show some of the code samples used to leverage AMO, and the metadata supplied to it, in an SSIS Script task.

- Adrian Downes

1 comment:

Vidas Matelis said...

Andrian,

Any chance you will also share your code for single "Import Flat File" package ? I tried to do something like that on DTS in SQL 2000, but it was just too much coding and I decided it its not worth. If you do not want to share actual code, just high level approach would be also very helpful.