Don’t use PIVOT for non-aggregate-able data

I ran into a rather displeasurable experience where a separate team of people built some very important tools for us and they did a lackluster job and that’s being nice. They decided that instead of using what SSIS provides for parsing files, they would use tSQL to parse row data from flat files. This was baffling to say the least to me. I unfortunately had to debug a problem that their stellar code was producing. It was removing duplicate rows from input data. No one asked for this requirement, but it was happening. Upon further inspection I realize it is because they are using PIVOT to take the parsed row data and translate it into columns.

This is not what PIVOT is for and further more since PIVOT requires an aggregate these geniuses decided that they would just use the parsed token values as the aggregate column because “Eh it works” and they can claim they finished what they were asked to do. They obviously didn’t test anything because they would have realized that the data was being de-duplicated which isn’t what we asked for.

How to pivot data without an aggregation column

The most important thing to take away from using a PIVOT is that it requires a column to aggregate. If you don’t have that, it won’t work. Even if you force it to work, you can run into a situation where duplicate rows will be removed regardless of using MIN or MAX.

Credit where credit is due

I got the idea I am presenting below from this stack overflow answer, from the user Charles Bretana. I never would have thought this up myself.

Disclamer

I would not recommend using this approach for large data sets because what I am doing here is expensive. I found myself between a rock and a hard place requiring a solution for this problem yesterday, so this is what I came up with, but I know I have to replace it with something better later.

The basic approach

If you have data that doesn’t have any kind of natural grouping to it, such as a primary key or some kind of identifier, then you have to add one yourself.

The following sections below go through each part of the whole example I have at the bottom of the page.

This can be done using the following inefficient methodology:

Example table and data

The following PIVOT won’t work because it takes the duplicate rows and gets the minimum value of the Token. There is nothing unique/distinct about each group of data. In other words there is no way to group the data properly so that the aggregate function will not just group all like data together. PIVOT by nature requires a column to aggregate on and
this data does not contain that.

In order to group the data some columns have to added to the target table. An EntryId in order to access each column separately and a grouping column to group each set of data together logically.

Each row will be iterated over and given a group number.

Finally the actual pivoting of the data happens without using the PIVOT function, instead a GROUP BY is used in conjunction with a broken CASE statement. Please be aware that this CASE only has a WHEN THEN, but it is purposely leaving out the ELSE condition which is bad practice.

Whole code example

So you don’t have to copy/paste each section, here is the whole script with comments in it so you can run it all in one shot. There is a drop statement commented out at the end.

Conclusion

I hope this helps. I plan on adding my version of the answer on Stack Overflow soon.

Leave a Reply

Your email address will not be published. Required fields are marked *