Over the last few months we have been hard at work on a metadata-driven ETL pipeline for an 880Gb SQL Server 2005 BI solution. Powered by SSIS, metadata has allowed us to create a single SSIS package for a generic task, re-using it many times based entirely on the information we supply to it.
For instance, our solution has a single "Import Flat File" package which uses metadata describing the source file, the system from which it originated (for traceability, among other reasons), the target "Import" table as well as a destination "Archive" folder for the file itself, once the package completes.
We take the same approach for performing surrogate-key generation, referential-integrity checks, data transformations, loads, archives and some maintennance operations in the solution.
Further along the value-chain, we have an SSAS 2005 database which houses a single cube (UDM) supporting a number of measure groups. Each of the measure groups is front-loaded by day-level partitions, allowing our users to continue working with their desired perspective while the new data is being processed in. The measure groups are aligned to corresponding relational fact table partitions by (you guessed it) metadata.
Altogether, the SSIS facet of the solution operates on just 8 small packages. Essentially, if there are any errors raised by the process- governing schedule by a problematic process, the problem is 99% of the time metadata-related. More to the point, we continue to add value to the system, by only implementing new structures (either in the relational or multi-dimensional areas of the solution) and metadata to describe it along established standards.
In keeping with our approach of metadata-driven reusability, we created a single package to process a new partition for any measure group, driven by a process code, and group of related metadata describing:
- ServerName (the target SSAS 2005 server instance)
- DatabaseName (the target SSAS 2005 database within the server instance)
- CubeName (the target UDM within the SSAS 2005 database)
- MeasureGroupName (the name of the measure group, which has metadata mappings to a corresponding relational fact table)
- PartitionName (stored as a partition prefix in the format "MeasureGroupName_" which is concatenated to a DateKey to describe the "slice" of data within a given partition)
- DateKey (an 8 digit integer, also known as a "smart key" in the format YYYYMMDD, and used as the basis for relational fact table partitioning)
- QueryBinding (the query string used to define the dimension attribute keys and measure members for a measuregroup, sliced by the DateKey)
- ConnectionString (relevant for dynamically connecting to a given SSAS 2005 database at run-time, providing future-proofed flexibility if another SSAS 2005 database needs to be implemented)
- PartitionKeyCounts (an XML document which describes the association between a dimension key column in the relational fact table and the corresponding cube dimension attribute in the UDM.)
- OptmisationLimit (reflects the "Performance Gain Reaches" property in the Aggregation Design Wizard)
This collection of metadata allows us precise control over which measure group is to be front-loaded with a new partition, and provides a high degree of flexibility for multi-server, multi-database scenarios as they may emerge.
The high volume of data the system is subject to, and, the demand for high query performance warrants precise counts of partition fact keys. Thus, in order to dynamically build and process our partitions, in addition to building and processing our aggregations at run time, we looked to AMO for simplicity in handling metadata. A generic SPROC, called by the SSIS package in an Execute SQL Task, with the appropriate process code would retrieve the "OLAP Partition Processing" metadata described above as a single row result-set, mapped to package variables, and supplied to an SSIS Script Task. We also chose to look beyond the stock control-flow tasks (Analysis Services Execute DDL Task, Analysis Services Processing Task) and data-flow tasks (Partition Processing Task) in order to ease supportability for administrators since they already understood VB.NET over ASSL (XML based Analysis Services Scripting Language).
In the second installment of this post I will show some of the code samples used to leverage AMO, and the metadata supplied to it, in an SSIS Script task.
- Adrian Downes