Wednesday, May 14, 2008
On Metadata-Driven Analysis Management Objects (New Link)
Cheers to Windows Live SkyDrive. Jeers to Ripway.
- Adrian Downes
Monday, June 25, 2007
On Metadata-Driven Analysis Management Objects (Part 3)
Enhancements to the CreateNewAggregationDesign Function
Most BI practitioners will be taking full advantage of multiple-measure-group support in SSAS 2005 cubes (UDMs) for a variety of reasons, including support for one of the many-to-many dimension design patterns eloquently positioned by Marco Russo in his work "The Many to Many Revolution". I have two successful implementations which build on his great ideas, and it was during this phase of development that I came across an opportunity to enhance the code base for our modest Metadata-Driven-AMO “quiver of arrows” (to quote a good friend of mine).
In order to take better advantage of multiple-measure-group support in SSAS 2005 cubes, using the code listings from Part 2 of the Metadata-Driven AMO posts, you will find that it is far more appropriate to leverage the MeasureGroupDimension class instead of the CubeDimension class when adding AggregationDesignDimension objects to the .Dimensions collection of the AggregationDesign object. As you may recall, we use the AggregationDesign object to define which aggregations we will store in our cube, based on counts of dimension key attributes found in the incoming data for our new partition. The reason for this change is simple: as you iterate through a collection of CubeDimension objects, you may come across a perfectly valid CubeDimension which is not associated with the measure group whose partition we are generating. By using the MeasureGroupDimension class, you ensure that you are only creating aggregations relevant to your measure group partition. Moreover, the MeasureGroupDimension and AggregationDesignDimension classes share the same inheritance lineage, so we are able to take advantage of the .CubeDimensionID property without any fuss. In Listing 10, the minor changes can be found in Section 4.
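Listing 10 itself now lives behind the link in the update below; as a rough sketch of the change (the variable names here are my own shorthand, not the exact listing):

' Section 4 (sketch): iterate the measure group's own dimensions rather than
' every CubeDimension in the cube, so the AggregationDesign only covers
' dimensions that actually participate in this measure group's partitions.
For Each mgDim As MeasureGroupDimension In targetMeasureGroup.Dimensions
    ' The shared inheritance lineage means .CubeDimensionID carries straight across.
    Dim aggDim As AggregationDesignDimension = _
        aggDesign.Dimensions.Add(mgDim.CubeDimensionID)
Next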
Update (14-MAY-2008):
All Listings described above may now be accessed at the link below...
- Adrian Downes
Tuesday, April 17, 2007
On Metadata-Driven Analysis Management Objects (Part 2)
In order to take advantage of AMO we naturally need to make use of the classes available in Microsoft.AnalysisServices.dll. Once we have defined the variables (discussed in Part 1) within the SSIS package, and wired them up to the single-row output from our metadata store using an Execute SQL Task, we can reliably retrieve and use them within the package Script Task. Our Script Task uses three files:
- ScriptMain: serves as an entry point, connecting variable values with PartitionGeneratorController
- PartitionGeneratorController: class file which governs execution of a PartitionGenerator object
- PartitionGenerator: class containing all the properties and methods for creating and processing partitions and aggregations
Listing 1 is a snippet within ScriptMain, showing how the variable values are captured from the incoming Dts.Variables collection.
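With the listings now at the link in the update below, here is a minimal sketch of that capture (OLAPServerName, OLAPConnectionString and OLAPPartitionKeyCounts are the names mentioned in these posts; the remaining variable names, and the controller's Run signature, are assumptions following the same convention):

Imports Microsoft.SqlServer.Dts.Runtime

Public Class ScriptMain
    Public Sub Main()
        ' Capture the single-row metadata values wired up by the Execute SQL Task.
        Dim serverName As String = Dts.Variables("OLAPServerName").Value.ToString()
        Dim connectionString As String = Dts.Variables("OLAPConnectionString").Value.ToString()
        Dim databaseName As String = Dts.Variables("OLAPDatabaseName").Value.ToString()         ' assumed name
        Dim cubeName As String = Dts.Variables("OLAPCubeName").Value.ToString()                 ' assumed name
        Dim measureGroupName As String = Dts.Variables("OLAPMeasureGroupName").Value.ToString() ' assumed name
        Dim partitionName As String = Dts.Variables("OLAPPartitionName").Value.ToString()       ' prefix + DateKey
        Dim queryBinding As String = Dts.Variables("OLAPQueryBinding").Value.ToString()         ' assumed name
        Dim keyCounts As String = Dts.Variables("OLAPPartitionKeyCounts").Value.ToString()
        Dim optimisationLimit As Double = CDbl(Dts.Variables("OLAPOptimisationLimit").Value)    ' assumed name

        ' Hand everything to the controller, which governs a PartitionGenerator object.
        Dim controller As New PartitionGeneratorController()
        If controller.Run(serverName, connectionString, databaseName, cubeName, _
                measureGroupName, partitionName, queryBinding, keyCounts, optimisationLimit) Then
            Dts.TaskResult = Dts.Results.Success
        Else
            Dts.TaskResult = Dts.Results.Failure
        End If
    End Sub
End Class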
Next, we use the controller class to field the incoming arguments, and supply them to our PartitionGenerator object, which then creates a connection to the server (supplied by OLAPServerName and OLAPConnectionString) and iterates through collections of databases, cubes, measure groups and finally partitions. We use the metadata supplied to help us mine our way from collection to collection. Essentially, at the leaf partition level, the incoming partition name will always be new, since we concatenate a new DateKey to the PartitionName prefix. As mentioned in Part 1, this follows a "new partition each day" approach for both loading new data into the relational fact table and loading the target cube measure group. Listing 2 shows part of the nested-loop logic.
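Condensed, that walk looks something like this, continuing the sketch above (the listing iterates the collections with nested For Each loops; FindByName is my shorthand here, and the .CreateNewPartition signature is an assumption):

Dim svr As New Server()
svr.Connect(connectionString)  ' built from OLAPServerName / OLAPConnectionString

' Mine our way down through the collections using the supplied metadata.
Dim db As Database = svr.Databases.FindByName(databaseName)
Dim cube As Cube = db.Cubes.FindByName(cubeName)
Dim mg As MeasureGroup = cube.MeasureGroups.FindByName(measureGroupName)

' PartitionName prefix + DateKey means the incoming name should always be new.
If mg.Partitions.FindByName(partitionName) Is Nothing Then
    Dim generator As New PartitionGenerator()
    Dim newPart As Partition = generator.CreateNewPartition(db, cube, mg, partitionName, queryBinding)
End If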
Once we have determined that the incoming partition does not exist, we create a new partition with PartitionGenerator's .CreateNewPartition function. In Listing 3, arguments for the function are supplied as the current SSAS 2005 objects we identified along the way through the nested For Each loops.
PartitionGenerator uses the QueryBindingString metadata value as an argument to define the QueryBinding property for the new partition. An object argument for the measure group is supplied since it contains a reliable .ID property value, used to define the new partition within the measure group's existing partition collection. We also call the Cube.Update method to ensure that the new partition is saved to the target cube measure group. Listing 4 provides the complete code for the function .CreateNewPartition. Back in the controller, provided .CreateNewPartition returns a new partition object safe and sound, we can go ahead and process it (shown in Listing 5, this follows on from the controller code shown in Listing 3).
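A condensed sketch of what .CreateNewPartition boils down to (the signature and the choice of the first data source are assumptions; Listing 4 has the real thing):

Public Function CreateNewPartition(ByVal db As Database, ByVal cube As Cube, _
        ByVal mg As MeasureGroup, ByVal partitionName As String, _
        ByVal queryBindingString As String) As Partition

    ' Adding via the measure group object keeps the new partition scoped to
    ' the right collection; the ID is derived from the name.
    Dim part As Partition = mg.Partitions.Add(partitionName)
    part.StorageMode = StorageMode.Molap

    ' Bind the partition to the DateKey-sliced query from the metadata store.
    part.Source = New QueryBinding(db.DataSources(0).ID, queryBindingString)

    ' Save the new partition to the target cube measure group.
    cube.Update(UpdateOptions.ExpandFull)
    Return part
End Function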
Using the method .ProcessNewPartition, the PartitionGenerator object accepts the newly created partition, along with the measure group and cube objects. This method uses the ProcessType.ProcessData enumeration option to optimise "fast-loading" of the new partition. Using ProcessData means that we only process the fact data for the partition, leaving any dependent objects untouched. A separate step (not discussed in this series of posts) handles dimensions. Listing 6 provides the complete code for .ProcessNewPartition. In our testing, we found that we needed to separate partition and aggregation processing logic into discrete operations, delineated by Cube.Update operations, in order for the SSIS Script Task to work effectively. The beneficial by-product of using ProcessData to load our partition is that we can quickly process aggregations afterwards with the ProcessType.ProcessIndexes option.
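As a sketch, .ProcessNewPartition amounts to something like this (signature assumed):

Public Sub ProcessNewPartition(ByVal cube As Cube, ByVal mg As MeasureGroup, _
        ByVal part As Partition)
    ' ProcessData loads only the fact data; no dependent objects are touched.
    part.Process(ProcessType.ProcessData)

    ' A discrete Cube.Update between partition and aggregation operations was
    ' what kept the SSIS Script Task working effectively in our testing.
    cube.Update(UpdateOptions.ExpandFull)
End Sub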
In order to create and process aggregations, PartitionGenerator exposes another function, .CreateNewAggregationDesign, which uses a dynamically-generated partition count document stored in the variable OLAPPartitionKeyCounts. This document describes the alignment between a fact table key column and its corresponding cube dimension attribute found in the partition. The PartitionCount attribute holds the distinct count of dimension key members from the newly-generated relational fact table partition (this is also handled by a separate process, and is not covered here). Listing 7 shows what an example PartitionCount document looks like.
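For flavour, a PartitionCount document along these lines (the element and attribute names here are purely illustrative, not the actual schema from Listing 7):

<PartitionKeyCounts EstimatedRowCount="1483902">
  <KeyCount FactKeyColumn="CustomerKey" CubeDimensionID="Customer"
            AttributeID="Customer Key" PartitionCount="18344" />
  <KeyCount FactKeyColumn="ProductKey" CubeDimensionID="Product"
            AttributeID="Product Key" PartitionCount="412" />
</PartitionKeyCounts>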
Further on in the controller, provided the new partition is created and "fast-loaded" successfully, .CreateNewAggregationDesign is called (see Listing 8), returning a new AggregationDesign object driven by the PartitionCount document. The AggregationDesign is resident within the measure group, and is applied to the new partition; the ProcessType.ProcessIndexes enumeration is then used to build the aggregation data and bitmap indexes for the new partition.
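The controller side of that reduces to a few lines, continuing the earlier sketches (names assumed):

Dim aggDesign As AggregationDesign = _
    generator.CreateNewAggregationDesign(cube, mg, keyCounts, optimisationLimit)

' Apply the design to the new partition, save, then build the aggregation
' data and bitmap indexes.
newPart.AggregationDesignID = aggDesign.ID
cube.Update(UpdateOptions.ExpandFull)
newPart.Process(ProcessType.ProcessIndexes)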
Inside .CreateNewAggregationDesign, the following steps are taken (a condensed sketch follows the list):
1. Determine whether an AggregationDesign already exists within the measure group (if it does, this represents a condition where the process itself is being re-run, thus the existing AggregationDesign object is overwritten by a new one with the same name)
2. Specify the EstimatedPerformanceGain property value for the AggregationDesign object (used for annotation purposes, this is accomplished via the OptimisationLimit metadata value)
3. Specify the EstimatedRows property value from the PartitionCount document, giving the total number of rows for the partition
4. Add AggregationDesignDimension objects to the .Dimensions collection of the AggregationDesign object
5. From the PartitionCount XML document, set the EstimatedCount property value for each specified AggregationDesignDimension attribute
6. Design Aggregations for the AggregationDesign object, using the OptimisationLimit value, and return the object.
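Pulled together, a condensed sketch of the function (the signature, design name and XML shape are assumptions, and it presumes Imports System.Xml alongside Microsoft.AnalysisServices; the Listings at the link below have the complete code):

Public Function CreateNewAggregationDesign(ByVal cube As Cube, ByVal mg As MeasureGroup, _
        ByVal keyCountsXml As String, ByVal optimisationLimit As Double) As AggregationDesign

    Dim designName As String = mg.Name & "_AggDesign"  ' assumed naming convention

    ' 1. On a re-run, overwrite any existing design of the same name.
    Dim existing As AggregationDesign = mg.AggregationDesigns.FindByName(designName)
    If Not existing Is Nothing Then mg.AggregationDesigns.Remove(existing)
    Dim aggDesign As AggregationDesign = mg.AggregationDesigns.Add(designName)

    Dim doc As New XmlDocument()
    doc.LoadXml(keyCountsXml)

    ' 2. / 3. Annotate with the optimisation limit and the partition row total.
    aggDesign.EstimatedPerformanceGain = CInt(optimisationLimit)
    aggDesign.EstimatedRows = CLng(doc.DocumentElement.GetAttribute("EstimatedRowCount"))

    ' 4. / 5. Add AggregationDesignDimension objects and set each attribute's
    ' EstimatedCount from the PartitionCount document. (Part 3, above, refines
    ' this loop to use MeasureGroupDimension objects instead.)
    For Each cubeDim As CubeDimension In cube.Dimensions
        Dim aggDim As AggregationDesignDimension = aggDesign.Dimensions.Add(cubeDim.ID)
        For Each node As XmlElement In doc.SelectNodes( _
                "//KeyCount[@CubeDimensionID='" & cubeDim.ID & "']")
            Dim aggAttr As AggregationDesignAttribute = aggDim.Attributes.Add(node.GetAttribute("AttributeID"))
            aggAttr.EstimatedCount = CLng(node.GetAttribute("PartitionCount"))
        Next
    Next

    ' 6. Design aggregations iteratively until the optimisation limit is reached.
    Dim optimisation As Double = 0, storage As Double = 0
    Dim aggCount As Long = 0, finished As Boolean = False
    Do
        aggDesign.DesignAggregations(optimisation, storage, aggCount, finished)
    Loop Until finished Or optimisation >= optimisationLimit

    Return aggDesign
End Function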
That's pretty much all there is to it. We now have a package in place which, with the right metadata, can process any partition on any cube measure group. On examining the completed partition (either via SSMS, or directly in XML), all the annotations, aggregation designs (visible at the measure group level) and aggregations are present. It should come as no surprise that our client queries ran significantly faster with the aggregations in place than with no aggregations at all.
For those of you eager to rip the Listings to bits, bear in mind that a few alternatives were considered before this final approach. One alternative to applying the .NET code in an SSIS Script Task would be to create a stored procedure that calls a PartitionGenerator assembly (wrapping both the core class and its controller). Both Microsoft.AnalysisServices.dll and PartitionGenerator.dll would need to be registered, and you would need to assign the permission set UNSAFE. On paper this is fine, but when you try to register Microsoft.AnalysisServices in the Database Engine, you may receive the following message in Management Studio.
Warning: The SQL Server client assembly 'microsoft.analysisservices, version=9.0.242.0, culture=neutral, publickeytoken=89845dcd8080cc91, processorarchitecture=msil.' you are registering is not fully tested in SQL Server hosted environment.
The risk conveyed by the warning was sufficient for us to explore alternatives. By the way, trying to register version 9.0.3042.0 of the .dll from SQL Server 2005 Service Pack 2 returned the same message in our environment.
Serialisation issues also surfaced as we tried to iterate through collections of AggregationDesign objects. Such issues may be addressed by decorating one or more of the classes and methods in PartitionGenerator with the Serializable() attribute, although, since AMO is leveraged behind the scenes, its classes would likely need to be serialisable as well. Unfortunately, we did not have the time to research this any deeper, but I would like to hear if anyone else has found success taking this approach.
Other solutions, such as creating an SSAS assembly, or passing ASSL (XML/A) statements (either via an assembly or through a script task), did not satisfy our 'ease of support' requirements as mentioned in Part 1. I'm quite certain, however, that dynamic creation and execution of ASSL would work nicely.
While this approach does not profess to be a magic-bullet solution, I hope it proves helpful to anyone coming to grips with AMO. As always, I welcome your feedback; if you have found a better or more efficient way around some of the issues raised, then please feel free to share!
Update (14-MAY-2008):
All Listings described above may now be accessed at the link below...
- Adrian Downes
Friday, April 13, 2007
On Metadata-Driven Analysis Management Objects (Part 1)
Over the last few months we have been hard at work on a metadata-driven ETL pipeline for an 880 GB SQL Server 2005 BI solution. Powered by SSIS and driven by metadata, the pipeline allows us to create a single SSIS package for a generic task and re-use it many times, based entirely on the information we supply to it.
For instance, our solution has a single "Import Flat File" package which uses metadata describing the source file, the system from which it originated (for traceability, among other reasons), the target "Import" table as well as a destination "Archive" folder for the file itself, once the package completes.
We take the same approach for performing surrogate-key generation, referential-integrity checks, data transformations, loads, archives and some maintenance operations in the solution.
Further along the value-chain, we have an SSAS 2005 database which houses a single cube (UDM) supporting a number of measure groups. Each of the measure groups is front-loaded with day-level partitions, allowing our users to continue working with their desired perspective while the new data is being processed in. The measure groups are aligned to corresponding relational fact table partitions by (you guessed it) metadata.
Altogether, the SSIS facet of the solution operates on just 8 small packages. Essentially, if the process-governing schedule raises any errors for a problematic process, the problem is metadata-related 99% of the time. More to the point, we continue to add value to the system by implementing only new structures (either in the relational or multi-dimensional areas of the solution) and the metadata to describe them, along established standards.
In keeping with our approach of metadata-driven reusability, we created a single package to process a new partition for any measure group, driven by a process code, and group of related metadata describing:
- ServerName (the target SSAS 2005 server instance)
- DatabaseName (the target SSAS 2005 database within the server instance)
- CubeName (the target UDM within the SSAS 2005 database)
- MeasureGroupName (the name of the measure group, which has metadata mappings to a corresponding relational fact table)
- PartitionName (stored as a partition prefix in the format "MeasureGroupName_" which is concatenated to a DateKey to describe the "slice" of data within a given partition)
- DateKey (an 8 digit integer, also known as a "smart key" in the format YYYYMMDD, and used as the basis for relational fact table partitioning)
- QueryBinding (the query string used to define the dimension attribute keys and measure members for a measure group, sliced by the DateKey)
- ConnectionString (relevant for dynamically connecting to a given SSAS 2005 database at run-time, providing future-proofed flexibility if another SSAS 2005 database needs to be implemented)
- PartitionKeyCounts (an XML document which describes the association between a dimension key column in the relational fact table and the corresponding cube dimension attribute in the UDM.)
- OptimisationLimit (reflects the "Performance Gain Reaches" property in the Aggregation Design Wizard)
This collection of metadata allows us precise control over which measure group is to be front-loaded with a new partition, and provides a high degree of flexibility for multi-server, multi-database scenarios as they may emerge.
The high volume of data the system is subject to, and the demand for high query performance, warrant precise counts of partition fact keys. Thus, in order to dynamically build and process our partitions, in addition to building and processing our aggregations at run time, we looked to AMO for simplicity in handling metadata. A generic SPROC, called by the SSIS package in an Execute SQL Task with the appropriate process code, retrieves the "OLAP Partition Processing" metadata described above as a single-row result set, which is mapped to package variables and supplied to an SSIS Script Task. We also chose to look beyond the stock control-flow tasks (Analysis Services Execute DDL Task, Analysis Services Processing Task) and data-flow tasks (Partition Processing Task) in order to ease supportability for our administrators, since they already understood VB.NET better than ASSL (the XML-based Analysis Services Scripting Language).
In the second installment of this post I will show some of the code samples used to leverage AMO, and the metadata supplied to it, in an SSIS Script Task.
- Adrian Downes