I recently spotted a question about the aggregation task in SSIS, asking if there was an easy way to determine the maximum or minimum value of a group. This had me confused… I distinctly recall having seen these options in the standard aggregate transformation. I quickly setup a dataflow and found exactly that:
So, on further investigation I found out the full question was if this was possibly on string fields. This time the question made sense, for string fields that list is reduced to Group By, Count, and Count Distinct.
So what to do? The wonderful thing about SSIS is that it is extendable. What is the simplest way to extend SSIS? Use the built in script tasks.
This example was built in SSIS 2005 and as such is done in VB.Net. SSIS 2008 and newer supports C#.
Now for this to work we need to setup a few things. I am using the Northwind database for this example. I create a dataflow in SSIS with an OLE DB Source connecting to the Northwind Database. The query retrieves the OrderId, the Discount and the ProductId (this field I converted to varchar for this example)
After adding a Script Transformation to the dataflow, I select these columns to be used in the script task. Then go to the Inputs and Outputs page, select Output 0. In the properties pane on the right, there is a value called SynchronousInputID. Set this value to 0. This is very important, without this option being set; the script task will create an output row for each input row.
Now add the output columns you want from this aggregation (remember to set the correct types). Note that when the script task is set to run asynchronous, the input columns will not be added to the output.
Now, open the script page and click design script. This will open up the Microsoft Visual Studio for Applications editor. In here you will have to write the code that will do the aggregations for you. A sample of what this would look like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 | Imports System Imports System.Data Imports System.Math Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper Imports Microsoft.SqlServer.Dts.Runtime.Wrapper Public Class ScriptMain Inherits UserComponent Dim records As Records Public Overrides Sub PreExecute() records = New Records() MyBase.PreExecute() End Sub Public Overrides Sub FinishOutputs() For Each rec As Record In records Output0Buffer.AddRow() Output0Buffer.OrderId = rec.OrderId Output0Buffer.ProductId = rec.ProductId Output0Buffer.Discount = rec.Discount Next MyBase.FinishOutputs() End Sub Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer) Dim rec As Record If records.Contains(Row.OrderID) = True Then 'Already in list rec = records.Item(records.IndexOf(Row.OrderID)) Dim prdId As String Dim Discount As Single prdId = rec.ProductId If Row.ProdIdStr > prdId Then prdId = Row.ProdIdStr End If Discount = rec.Discount + Row.Discount records.Item(records.IndexOf(Row.OrderID)).ProductId = prdId records.Item(records.IndexOf(Row.OrderID)).Discount = Discount Else 'New Item rec = New Record() rec.OrderId = Row.OrderID rec.ProductId = Row.ProdIdStr rec.Discount = Row.Discount records.Add(rec) End If End Sub End Class Public Class Records Inherits Collections.CollectionBase Public Sub Add(ByVal arecord As Record) List.Add(arecord) End Sub Public Function IndexOf(ByVal orderid As Integer) As Integer Dim index As Integer = 0 For Each rec As Record In List If rec.OrderId <> orderid Then index = index + 1 Else : Exit For End If Next Return index End Function Public Function Item(ByVal index As Integer) As Record Return CType(List.Item(index), Record) End Function Public Function Contains(ByVal orderid As Integer) As Boolean Dim found As Boolean = False For Each rec As Record In List If rec.OrderId = orderid Then found = True End If Next Return found End Function End Class Public Class Record Private _OrderId As Integer Private _ProductId As String Private _Discount As Single Public Property OrderId() As Integer Get Return _OrderId End Get Set(ByVal value As Integer) _OrderId = value End Set End Property Public Property ProductId() As String Get Return _ProductId End Get Set(ByVal value As String) _ProductId = value End Set End Property Public Property Discount() As Single Get Return _Discount End Get Set(ByVal value As Single) _Discount = value End Set End Property End Class |
Important bits to note from this example are the PreExecute Sub Override, and the FinishOutputs Sub Override. These will perform the actions of setting up the task before processing rows and sending output rows once all input rows have been processed. Also note that because output rows are only created once all input rows have been handled, this will be a fully blocking transformation so make sure when using this that you have assigned adequate memory to the dataflow task.
Starting with this, you can do all kinds of custom aggregations, including concatenating, statistical, and more


