Mar
14

SSIS and Aggregations

I recently spotted a question about the aggregation task in SSIS, asking if there was an easy way to determine the maximum or minimum value of a group.  This had me confused… I distinctly recall having seen these options in the standard aggregate transformation.  I quickly setup a dataflow and found exactly that:

So, on further investigation I found out the full question was if this was possibly on string fields.  This time the question made sense, for string fields that list is reduced to Group By, Count, and Count Distinct.

So what to do?  The wonderful thing about SSIS is that it is extendable.  What is the simplest way to extend SSIS?  Use the built in script tasks.

This example was built in SSIS 2005 and as such is done in VB.Net.  SSIS 2008 and newer supports C#.

Now for this to work we need to setup a few things.  I am using the Northwind database for this example. I create a dataflow in SSIS with an OLE DB Source connecting to the Northwind Database.  The query retrieves the OrderId, the Discount and the ProductId (this field I converted to varchar for this example)

 

After adding a Script Transformation to the dataflow, I select these columns to be used in the script task.  Then go to the Inputs and Outputs page, select Output 0.  In the properties pane on the right, there is a value called SynchronousInputID.  Set this value to 0.  This is very important, without this option being set; the script task will create an output row for each input row.

Now add the output columns you want from this aggregation (remember to set the correct types).  Note that when the script task is set to run asynchronous, the input columns will not be added to the output.

Now, open the script page and click design script.  This will open up the Microsoft Visual Studio for Applications editor. In here you will have to write the code that will do the aggregations for you.  A sample of what this would look like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

Public Class ScriptMain
    Inherits UserComponent
    Dim records As Records

    Public Overrides Sub PreExecute()
        records = New Records()
        MyBase.PreExecute()
    End Sub

    Public Overrides Sub FinishOutputs()
        For Each rec As Record In records
            Output0Buffer.AddRow()
            Output0Buffer.OrderId = rec.OrderId
            Output0Buffer.ProductId = rec.ProductId
            Output0Buffer.Discount = rec.Discount
        Next
        MyBase.FinishOutputs()
    End Sub

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Dim rec As Record

        If records.Contains(Row.OrderID) = True Then
            'Already in list
            rec = records.Item(records.IndexOf(Row.OrderID))
            Dim prdId As String
            Dim Discount As Single

            prdId = rec.ProductId
            If Row.ProdIdStr > prdId Then
                prdId = Row.ProdIdStr
            End If
            Discount = rec.Discount + Row.Discount

            records.Item(records.IndexOf(Row.OrderID)).ProductId = prdId
            records.Item(records.IndexOf(Row.OrderID)).Discount = Discount
        Else
            'New Item
            rec = New Record()
            rec.OrderId = Row.OrderID
            rec.ProductId = Row.ProdIdStr
            rec.Discount = Row.Discount
            records.Add(rec)
        End If
    End Sub
End Class

Public Class Records
    Inherits Collections.CollectionBase

    Public Sub Add(ByVal arecord As Record)
        List.Add(arecord)
    End Sub

    Public Function IndexOf(ByVal orderid As Integer) As Integer
        Dim index As Integer = 0
        For Each rec As Record In List
            If rec.OrderId <> orderid Then
                index = index + 1
            Else : Exit For
            End If
        Next
        Return index
    End Function

    Public Function Item(ByVal index As Integer) As Record
        Return CType(List.Item(index), Record)
    End Function

    Public Function Contains(ByVal orderid As Integer) As Boolean
        Dim found As Boolean = False
        For Each rec As Record In List
            If rec.OrderId = orderid Then
                found = True
            End If
        Next
        Return found
    End Function

End Class

Public Class Record
    Private _OrderId As Integer
    Private _ProductId As String
    Private _Discount As Single

    Public Property OrderId() As Integer
        Get
            Return _OrderId
        End Get
        Set(ByVal value As Integer)
            _OrderId = value
        End Set
    End Property

    Public Property ProductId() As String
        Get
            Return _ProductId
        End Get
        Set(ByVal value As String)
            _ProductId = value
        End Set
    End Property

    Public Property Discount() As Single
        Get
            Return _Discount
        End Get
        Set(ByVal value As Single)
            _Discount = value
        End Set
    End Property
End Class

Important bits to note from this example are the PreExecute Sub Override, and the FinishOutputs Sub Override.  These will perform the actions of setting up the task before processing rows and sending output rows once all input rows have been processed.  Also note that because output rows are only created once all input rows have been handled, this will be a fully blocking transformation so make sure when using this that you have assigned adequate memory to the dataflow task.

Starting with this, you can do all kinds of custom aggregations, including concatenating, statistical, and more

Jul
06

A good day at the office

In May this year, I was asked to take over as tech-lead on a project at the company I’m contracted out to. The current tech-lead resigned and decided to move on to greener pastures.

What I discovered when I took over the reigns was a little unnerving…

The solution is more than 5 years old, and it has been through many technology overhauls, from classic ASP, to ASP.NET 1.0, 1.1 and is now running in 2.0

However, this hasn’t been a clean transition, with a lot of the code still being in classic ASP, as well as ASP.NET 1.1 (syntactically, using deprecated methods etc.).

There are no clean separations of tiers. All of the business logic either resides either in the front-end (GUI), or in SQL stored procedures and ad-hoc queries. The sub-projects that do exist (for database calls) have namespace and class names like db, c, m and wc… (I’m still trying to figure it out myself)

But, it’s not all doom and gloom. The technology may be a hinderance, but what makes this all bearable is the team. The guys are a tightly-knit team that never fail to deliver, even at the cost of overtime and working weekends. They care about the solution, even though they couldn’t get it to compile.

That’s where I come in.

A little work had to be done on the team dynamic, but mostly the code needed a lot of attention. Over the last few weeks, I’ve been stepping through the code, page by page, trying to get it to compile and reach some sort of stabilisation point. Once I had reached that point, I put the solution into a newly created TFS branch which will be the base for all development going forward.

Then, I put it into CruiseControl.net, so we can get red/green light notifications on how healthy our new branch is, upon every check-in.

This has been a lesson in software development evolution for the team, as they were silo’d in the way they approached software development before. They’re very excited about this, as they now have a computer to tell them when something is broken, instead of patching software as they were doing before, hoping for the best, and discovering bugs way down the software delivery line.

The tasks following this primary but essential step will be applying unit tests retrospectively to the existing code base, and upgrading the legacy code in the classic ASP pages and the 1.x ASP.NET pages to at least ASP.NET 2.0, if not 3.5.

So far, it has been an interesting challenge, and I’m confident only good can come from this. The team are new to proper source control procedures, continuous integration and unit testing. Showing them the why’s and how’s, along with separating business logic out of presentation tier into their own contained, testable tiers is something I’m really looking forward to.

There’s no better reward than seeing the light bulb above their heads switch on as they suddenly realise that software development can not only be easier, but fun too.

Jul
05

A new initiative

So, at the company that I’m currently contracted / outsourced to, we’ve decided to setup a guidance forum among a few of the senior developers here, some of them from the company of employ, some of them from external vendors (like myself).

The purpose of this forum is to expose us to patterns, practices and all the other cool stuff that’s out there on the web which will enable us to enhance the way we work with our set of development tools and our team members. Riaan Hanekom is one of the members, and he presented on Dependency Injection and Inversion of Control a couple of weeks back. The week after was my turn on presenting Aspect Oriented Programming, using a product called PostSharp.

I will post the sample project I put together in a separate post, so you can see just how cool PostSharp is, as well as how easy it is to implement.

Watch this space.

Mar
26

Achieving 100% code coverage when using NUnit and NCover

Something that has always annoyed me when trying to get 100% code coverage when testing a method that throws an expected exception, is that it exits the method before getting to the end of it (obviously).

An illustration will aid what I’m talking about…

With the above setup, calling repository.GetProjects() throws an InvalidOperationException, as projects are scoped to a company. This is the desired behaviour of the repository class.

However, as you can see, the last brace of the NUnit unit test never gets run, giving you less than 100% code coverage (according to NCover).

This skews your stats of your test suite, making it look like you have untested code. As you can see, that is not the case.

However, today I found a new way to test for expected exceptions in your code AND get 100% code coverage!

NUnit comes with an awesome little method called Assert.Throws().

This allows you to to test a method call which you expect to throw an exception, and carry on with the rest of your unit test.

Here’s the same unit test, rewritten using Assert.Throws():

As you can see, just remove the ExpectedException attribute at the top of the unit test, and let Assert.Throws() handle the rest.

Perfect! 100% code coverage on expected exceptions, no more skewed unit test results.

I hope this helps you with your testing stats too.

Nov
18

“Good” SSIS

What is “good” SSIS? What is best practice for SSIS packages?  How do I make my SSIS better?  These are all questions I hear a lot.  The answer is never simple but usually include some mention of naming conventions (I have seen too many packages where tasks, sources, destinations and connections are using default names), the correct use of sequence containers, the use of fast load parameters to setup your dataflow tasks for optimal performance in your environment, and in quite a few cases, breaking a large, complicated SSIS package into smaller packages.

So why break a package up into smaller packages?  It simplifies debugging and maintenance, and in many cases aids performance.  Ever find yourself waiting a long time for the package to start doing actual work?  Consider smaller packages.  There are a few limitations and issues with doing this.  The obvious issue is configuration, you are no longer setting up connections in just one package – you now have to set them up in several packages.  Add this to a good infrastructure of a development, testing and production environment and suddenly it becomes a big problem.  Now surely this can be overcome by using configuration files?  This in itself poses a few problems.  Do you create a configuration file for each package? Do you have all your connections in every package, regardless of whether you use them or not?  Do you configure the connections in the job steps?

So far the problems in breaking up your SSIS package can be overcome.  But what other limitations are there in doing this?  I often have information that needs to be passed between tasks.  This means I now have to split my packages based on dependencies rather than functionality.  Breaking up the package based on functionality does make much more sense though, mainly because this would make the small packages reusable for other projects.  The standard way for calling these smaller packages is where the limitation lies.  The standard execute package task in SSIS allows you to pass variable values into a package by using the configuration settings of the package.  It does not allow you to pass values back.

By being able to pass values into a package and returning values, you will be able to refactor  your SSIS, making your SSIS packages reusable.

So back to the original question – what is “good” SSIS?  “Good” SSIS is like good code.  Good SSIS should be maintainable, reusable and optimal.

With these requirements in mind I decided to do my bit for SSIS.  I created an enhanced execute package task that allows you to pass variables into the package, return variables from as well as pass database connection information.  For more information on this go to http://ssisexec.codeplex.com

-EDIT- I have rebuilt the enhanced execute package task for SSIS 2008 R2 :-)

Oct
21

Eclipse and Aptana make a formidable pair

Around half way through this year, for a couple of reasons, I challenged myself to learn a new programming language or two.

Namely, Python and Ruby made it to the top of the list and now I was on the hunt for a good IDE that would assist me while I was getting my Python and Ruby legs. After searching the net for a few days, Eclipse with the Aptana plug-in seemed to be recommended more and more by fellow developers out there. And I can see why.

Eclipse, primarily a Java IDE (written in Java), has been around for quite a few years, has a rich history and is arguably the preferred IDE of many open-source developers. It’s ability to load plug-ins allows it to be extended to work with other programming languages like Python, Ruby, Perl, PHP, Rails, JavaScript, jQuery, Microsoft ASP.NET Ajax Controls and lots more. It’s like a developers dream come true.

Where Aptana comes in, is by providing a single plug-in base that gives you access to pretty much all of the languages I’ve just mentioned above (except Perl, I think). It has a fantastic library which gives you an awesome platform from which to work. You can also run Aptana as a stand-alone IDE without the need for Eclipse, but that’s a completely separate download.

Being able to write programs in Python, Ruby and jQuery with auto-completion has made development easier and fun again.

Best of all, this is all for free. Aptana offers a cloud hosting solution for which you can pay, and Eclipse has a donation-ware approach, but all the basics are absolutely free.

Oct
11

Subversion, free source control that works

One thing that still amazes me about the majority of developers I meet, is that they either don’t use source control or (even worse) are still using Microsoft Visual SourceSafe on their projects! (Read about why SourceSafe is a BAD source control system here and here.)

Whether it be on private work or projects at their place of employ, they are of the idea that source code corruption or even total loss will never happen to them. Backups? Who needs backups?

On one of the recent projects I’ve been working on, we were originally going to use Microsoft Team Foundation Server (TFS) as our source control system. TFS is an excellent source control system from Microsoft, but comes with a hefty price tag, making it only accessible to corporates with deep pockets. Of course, TFS offers much more than just source control, it also offers work item tracking, bug tracking and project methodology templates, such as the Microsoft Solution Framework (MSF) for Agile Software Development, MSF for CMMI Process Improvement and more.

Our only reason for not using TFS on this project was red tape. To get the company to open up their firewall and port forward all traffic for TFS over the web would involve weeks, if not months of bureaucracy to get what we needed. You may have picked up by now that we were working off-site from the client’s offices, yet they wanted to be in control of the source code.

After looking at a couple of options, we found a technology called Subversion (SVN), an open source version control system. It offers versioning of files and directories, branching and merging and atomic commits, so no half check-ins. Oh, and it’s free.

As with most version control systems, it’s based on a client-server architecture. You need a source control server which will manage user access & repositories, and source control clients pushing and pulling content to and from the server.

VisualSVN Server is a free Subversion server for Windows and it’s totally self-contained. It comes with a light version of Apache which allows it to accept connections over standard HTTP(S). It’s as simple as downloading it, installing it, setting up a repository and users, and you’re good to go.

VisualSVN also offer a Visual Studio plug-in that allows you to interface with the VisualSVN server directly in your IDE, but this plug-in is not free. However, CollabNet Corporation, offer AnkhSVN, a free Visual Studio plug-in that can interface with any SVN server. It has all the features you’ve grown accustomed to by SourceSafe and TFS, such as Get Latest Version, Check-In and most of all the other features you’ve come to use in your every day development.

You can also use a program called TortoiseSVN which integrates directly into Windows Explorer, allowing you to get latest version and such from outside the IDE. TortoiseSVN offers some great tools in its feature set, like file comparison and merging tools, essential for when you’re working in a large team and changes have occurred to files while you were working on them.

Personally, I run both TortoiseSVN and AnkhSVN as sometimes I need to inspect the repository to see who commited what changes, and I don’t necessarily need to load my Visual Studio IDE in order to do that.

If you don’t have a server to install VisualSVN server on, or you’re apprehensive to running a server application on your desktop or laptop, you can always use one of the very many free source code hosting solutions out on the web. www.xp-dev.com and www.projectlocker.com are just 2 examples that provide excellent service, for free or for a small fee, the latter obviously giving you more of an offering.

I hope this post will provide some motivation for you to start using a good source control system. Good code is a valuable commodity, don’t risk losing it.

Oct
10

Welcome to the NETDEV blogs!

So, we’ve finally moved to our own dedicated server, with none of the restrictions of a shared hosting provider.

Expect some exceptional blogs from some of the best software development and infrastructure specialists in South Africa. I’ll chip in a blog post or two as well.

I’m really looking forward to providing some useful content to the software development community in South Africa, as the platforms that currently exist in this country are far from any level of maturity conducive to helping and educating others in the IT field.

Cheers for now, and wish us luck!

Gareth.