Don’t use PIVOT for non-aggregate-able data

I ran into a rather displeasurable experience where a separate team of people built some very important tools for us and they did a lackluster job and that’s being nice. They decided that instead of using what SSIS provides for parsing files, they would use tSQL to parse row data from flat files. This was baffling to say the least to me. I unfortunately had to debug a problem that their stellar code was producing. It was removing duplicate rows from input data. No one asked for this requirement, but it was happening. Upon further inspection I realize it is because they are using PIVOT to take the parsed row data and translate it into columns.

This is not what PIVOT is for and further more since PIVOT requires an aggregate these geniuses decided that they would just use the parsed token values as the aggregate column because “Eh it works” and they can claim they finished what they were asked to do. They obviously didn’t test anything because they would have realized that the data was being de-duplicated which isn’t what we asked for.

How to pivot data without an aggregation column

The most important thing to take away from using a PIVOT is that it requires a column to aggregate. If you don’t have that, it won’t work. Even if you force it to work, you can run into a situation where duplicate rows will be removed regardless of using MIN or MAX.

Credit where credit is due

I got the idea I am presenting below from this stack overflow answer, from the user Charles Bretana. I never would have thought this up myself.

Disclamer

I would not recommend using this approach for large data sets because what I am doing here is expensive. I found myself between a rock and a hard place requiring a solution for this problem yesterday, so this is what I came up with, but I know I have to replace it with something better later.

The basic approach

If you have data that doesn’t have any kind of natural grouping to it, such as a primary key or some kind of identifier, then you have to add one yourself.

The following sections below go through each part of the whole example I have at the bottom of the page.

This can be done using the following inefficient methodology:

Example table and data

CREATE TABLE #tmpParsedData 
( 
	 [ColumnPosition] INT
	,[Token] nvarchar(500)
	,[CreatedOnUtc] datetime2(7)
	,[Lineitem] nvarchar(500)
	,[FileName] varchar(500) 
)

INSERT INTO #tmpParsedData
VALUES
( 1, N'01385', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 2, N'61335', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 3, N'10072019', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 4, N'01:45', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 5, N'1', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 6, N'613', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 1, N'01385', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 2, N'61335', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 3, N'10072019', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 4, N'01:45', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 5, N'1', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 6, N'613', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 1, N'01385', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 2, N'61335', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 3, N'10072019', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 4, N'01:45', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 5, N'1', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 6, N'613', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' )

The following PIVOT won’t work because it takes the duplicate rows and gets the minimum value of the Token. There is nothing unique/distinct about each group of data. In other words there is no way to group the data properly so that the aggregate function will not just group all like data together. PIVOT by nature requires a column to aggregate on and
this data does not contain that.

SELECT
	 t.[1]
	,t.[2]
	,t.[3]
	,t.[4]
	,t.[5]
	,t.[6]
	,t.[FileName]
	,t.Lineitem
	,t.CreatedOnUtc
FROM #tmpParsedData PIVOT(MIN(Token) FOR ColumnPosition IN ([1], [2], [3], [4], [5], [6])) AS t

In order to group the data some columns have to added to the target table. An EntryId in order to access each column separately and a grouping column to group each set of data together logically.

-- EntryId indexed at zero so that each row can be accessed individually
ALTER TABLE #tmpParsedData ADD EntryId INT IDENTITY(0, 1)

-- GroupColumn so that each logical group of data (row of data) can be grouped together
-- This is what is missing in order for the PIVOT to work
ALTER TABLE #tmpParsedData ADD GroupColumn INT NULL

Each row will be iterated over and given a group number.

-- WHILE loop setup
DECLARE @i INT = 0;
DECLARE @count INT = (SELECT COUNT(EntryId) FROM #tmpParsedData);
DECLARE @g INT = 0;

-- For each row by EntryId
WHILE @i < @count
BEGIN
	DECLARE @col INT;

	-- Get the column position
	SELECT @col = ColumnPosition FROM #tmpParsedData WHERE EntryId = @i;
	
	-- Set the group number
	UPDATE #tmpParsedData SET GroupColumn = @g WHERE EntryId = @i;

	-- If the column position for this group is six, then increment the group number to start a new group
	IF @col = 6 SET @g += 1;
	
	SET @i += 1;
END

Finally the actual pivoting of the data happens without using the PIVOT function, instead a GROUP BY is used in conjunction with a broken CASE statement. Please be aware that this CASE only has a WHEN THEN, but it is purposely leaving out the ELSE condition which is bad practice.

-- Pivot the data without using the PIVOT function in order to pivot data that doesn't have a real aggregate
SELECT 
      MIN(CASE ColumnPosition WHEN 1 THEN [Token] END) AS Col1
     ,MIN(CASE ColumnPosition WHEN 2 THEN [Token] END) AS Col2
     ,MIN(CASE ColumnPosition WHEN 3 THEN [Token] END) AS Col3
     ,MIN(CASE ColumnPosition WHEN 4 THEN [Token] END) AS Col4
     ,MIN(CASE ColumnPosition WHEN 5 THEN [Token] END) AS Col5
     ,MIN(CASE ColumnPosition WHEN 6 THEN [Token] END) AS Col6
FROM #tmpParsedData
GROUP BY GroupColumn -- This is what makes it work, group each set of data by the group number that was set in the loop above

Whole code example

So you don’t have to copy/paste each section, here is the whole script with comments in it so you can run it all in one shot. There is a drop statement commented out at the end.

CREATE TABLE #tmpParsedData 
( 
	 [ColumnPosition] INT
	,[Token] nvarchar(500)
	,[CreatedOnUtc] datetime2(7)
	,[Lineitem] nvarchar(500)
	,[FileName] varchar(500) 
)

INSERT INTO #tmpParsedData
VALUES
( 1, N'01385', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 2, N'61335', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 3, N'10072019', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 4, N'01:45', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 5, N'1', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 6, N'613', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 1, N'01385', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 2, N'61335', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 3, N'10072019', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 4, N'01:45', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 5, N'1', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 6, N'613', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 1, N'01385', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 2, N'61335', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 3, N'10072019', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 4, N'01:45', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 5, N'1', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' ), 
( 6, N'613', N'2019-10-18T02:03:05.8733333', N'01385|61335|10072019|01:45|1|613', 'ImportFile.txt' )

/* This won't work because it takes the duplicate rows and gets the minimum value of the Token 
   There is nothing unique/distinct about each group of data.
   In other words there is no way to group the data properly so that the aggregate function will
   not just group all like data together. PIVOT by nature requires a column to aggregate on and
   this data does not contain that. */
SELECT
	 t.[1]
	,t.[2]
	,t.[3]
	,t.[4]
	,t.[5]
	,t.[6]
	,t.[FileName]
	,t.Lineitem
	,t.CreatedOnUtc
FROM #tmpParsedData PIVOT(MIN(Token) FOR ColumnPosition IN ([1], [2], [3], [4], [5], [6])) AS t

-- The solution is to add a grouping component to the target table and data

-- EntryId indexed at zero so that each row can be accessed individually
ALTER TABLE #tmpParsedData ADD EntryId INT IDENTITY(0, 1)

-- GroupColumn so that each logical group of data (row of data) can be grouped together
-- This is what is missing in order for the PIVOT to work
ALTER TABLE #tmpParsedData ADD GroupColumn INT NULL

-- WHILE loop setup
DECLARE @i INT = 0;
DECLARE @count INT = (SELECT COUNT(EntryId) FROM #tmpParsedData);
DECLARE @g INT = 0;

-- For each row by EntryId
WHILE @i < @count
BEGIN
	DECLARE @col INT;

	-- Get the column position
	SELECT @col = ColumnPosition FROM #tmpParsedData WHERE EntryId = @i;
	
	-- Set the group number
	UPDATE #tmpParsedData SET GroupColumn = @g WHERE EntryId = @i;

	-- If the column position for this group is six, then increment the group number to start a new group
	IF @col = 6 SET @g += 1;
	
	SET @i += 1;
END

-- Pivot the data without using the PIVOT function in order to pivot data that doesn't have a real aggregate
SELECT 
      MIN(CASE ColumnPosition WHEN 1 THEN [Token] END) AS Col1
     ,MIN(CASE ColumnPosition WHEN 2 THEN [Token] END) AS Col2
     ,MIN(CASE ColumnPosition WHEN 3 THEN [Token] END) AS Col3
     ,MIN(CASE ColumnPosition WHEN 4 THEN [Token] END) AS Col4
     ,MIN(CASE ColumnPosition WHEN 5 THEN [Token] END) AS Col5
     ,MIN(CASE ColumnPosition WHEN 6 THEN [Token] END) AS Col6
FROM #tmpParsedData
GROUP BY GroupColumn -- This is what makes it work, group each set of data by the group number that was set in the loop above

--DROP TABLE #tmpParsedData

Conclusion

I hope this helps. I plan on adding my version of the answer on Stack Overflow soon.

Leave a Reply

Your email address will not be published. Required fields are marked *