SQL Server has a very powerful feature that, if you know how to take advantage of it, can cut down on the amount of logic you have to write when performing deltas. This is very much a foundational implementation, not something you can necessarily introduce after the fact without going through a lot of pain. SQL Server provides a column type named rowversion, which is the successor to and replacement for the deprecated timestamp column type.
This methodology is great if you are importing a large amount of data every day and only want to promote what has actually changed since the last import. It can also be used to feed static reporting tables, because only the data that changed has to be isolated.
Facts about rowversion
- You cannot insert or update the rowversion of a row; it is controlled by SQL Server. The rowversion changes automatically upon every insert or update.
- Rowversion values can be sorted, and this is key for performing delta comparisons.
- A nonnullable rowversion column is semantically equivalent to a binary(8) column.
- A nullable rowversion column is semantically equivalent to a varbinary(8) column.
- Example of what a rowversion looks like: 0x00000000000007D1 (an illustrative value; it is an 8-byte binary value, displayed as 16 hexadecimal characters).
Rules of engagement
The way to think about using rowversion for deltas is in two parts:
- Add a rowversion column, which I typically name [Version], to a target table.
- The word “Version” is a reserved keyword, which is why I use the square brackets.
- Create a new table to track the max version of your target tables.
This will make more sense with the example that follows.
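Retrofitting the column onto an existing table is a single statement. The table name below is hypothetical, and note that SQL Server back-fills a rowversion value for existing rows automatically:

```sql
-- Hypothetical table name; a table may contain only one rowversion column
ALTER TABLE dbo.MyExistingTable
    ADD [Version] ROWVERSION NOT NULL;
```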
The following tables are simplistic examples that each play a key role in the real world.
|Example table|Real world example|
|---|---|
|dbo.DataThatChangesFrequently|Think about a table in your system that is highly transactional and where each row is truly unique in what it describes, such as: Employees, Departments, Buildings, Personnel, Facilities, Entities, Chart of Accounts, Companies, Residents, Visitors, etc. If for any reason you need to know when any row has changed so you can do something with it or copy it elsewhere, then this table is your candidate for adding that [Version] column.|
|dbo.CollectionTracker|For any of the examples in the row above, you need to track what the max rowversion was the last time you processed your deltas. Therefore each table has to be tracked along with its last max [Version] value.|
|dbo.ICollectNumbersForFun|This is just a silly example destination table where we copy our deltas to. In the real world this can be any destination; it doesn't even have to be a table. There is no rowversion column in this table because it isn't needed. This is a one-way relationship that copies delta data from “dbo.DataThatChangesFrequently” to “dbo.ICollectNumbersForFun”.|
The following example is designed to demonstrate how rowversion works, using meaningless integer data. Here the tables are created and filled with random test data. My test output will be different from yours, but everything will work the same if you are following along.
Full code example located on GitHub
Below you will see that a WHILE loop is used to fill the table. This is done on purpose: when no seed is provided, the RAND() function is evaluated only once per query, so a single set-based insert would give every row the same value. Looping produces a new value for each row. For demonstration purposes this is okay, but of course not ideal for the real world.
```sql
CREATE TABLE dbo.DataThatChangesFrequently
(
    DataThatChangesFrequentlyId INT IDENTITY(1,1) NOT NULL,
    SomeNumber INT NOT NULL,
    [Version] ROWVERSION NOT NULL,
    CONSTRAINT [PK_dbo.DataThatChangesFrequently_DataThatChangesFrequentlyId]
        PRIMARY KEY (DataThatChangesFrequentlyId)
)
GO

DECLARE @i INT = 0;
DECLARE @count INT = 10;

WHILE @i < @count
BEGIN
    INSERT INTO dbo.DataThatChangesFrequently (SomeNumber)
    VALUES (RAND() * 100000)

    SET @i += 1;
END

SELECT * FROM dbo.DataThatChangesFrequently

-- There is a method to the madness
SELECT MIN([Version]), MAX([Version]) FROM dbo.DataThatChangesFrequently
```
The setup for this table always starts with an initial value because you need a baseline of where you are starting. If you are setting this up for the first time, then zero is safe because zero will be less than any rowversion provided by SQL Server.
```sql
CREATE TABLE dbo.CollectionTracker
(
    CollectionTrackerId INT IDENTITY(1,1) NOT NULL,
    [Key] VARCHAR(50) NOT NULL,
    [Description] VARCHAR(255) NULL,
    LastVersion CHAR(16) NOT NULL, -- Storing as char for the sake of portability
    CollectedOnUtc DATETIME2(0) NOT NULL,
    CONSTRAINT [PK_dbo.CollectionTracker_CollectionTrackerId]
        PRIMARY KEY (CollectionTrackerId)
)

-- Initialization is a good first step so you never have to handle zero rows;
-- zero rows will never matter after the first time, so why bother now.
-- Make this more sophisticated by adding in what it's tracking, and don't track the same thing more than once.
INSERT INTO dbo.CollectionTracker
(
    [Key]
    ,LastVersion
    ,CollectedOnUtc
)
VALUES
(
    'DataThatChangesFrequently'
    ,'0000000000000000' -- Smallest version for now; don't store the 0x prefix, it will mess up the conversion
    ,GETUTCDATE()       -- CollectedOnUtc - datetime2(0)
)
```
I have tried on many occasions to store the value as an actual binary data type, but for the life of me have had nothing but problems doing so. I feel like SQL Server has overcomplicated how binary is stored, but that's a different problem for another time. Therefore, to avoid making terrible mistakes with how the data is stored, you can bypass the binary data type entirely and substitute a CHAR(16) column instead.
Why 16 zeros?
Here you can see it is being initialized to 16 zeros. This is important because if you don't fill out all of the space provided by a CHAR field, you inherit trailing blanks; that's just how CHAR works.
Sixteen zeros are used because a BINARY(8) value always renders as exactly 16 hexadecimal characters, sans the 0x prefix.
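As a quick sanity check, here is a minimal sketch of the round trip between BINARY(8) and CHAR(16). Style 2 of CONVERT is the variant that works without the 0x prefix:

```sql
DECLARE @v BINARY(8) = 0x00000000000007D1; -- Illustrative rowversion-like value

-- Binary to its 16-character hex string (style 2 = no 0x prefix)
DECLARE @asChar CHAR(16) = CONVERT(CHAR(16), @v, 2);
SELECT @asChar;

-- And back again; style 2 expects the string WITHOUT the 0x prefix
SELECT CONVERT(BINARY(8), @asChar, 2);
```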
As mentioned earlier, this table is a destination for the data in the source. This is where the power of the [Version] column and the tracked last version come into play. You recall your “last max version”, which by now, with the passage of time and changes in the system, is your new minimum version. Comparing the last version to your [Version] column as shown below will get you everything with a greater version than your last (max) version. This literally exposes your delta.
```sql
CREATE TABLE dbo.ICollectNumbersForFun
(
    ICollectNumbersForFunId INT IDENTITY(1,1) NOT NULL,
    OooANumber INT NOT NULL,
    CONSTRAINT [PK_dbo.ICollectNumbersForFun_ICollectNumbersForFunId]
        PRIMARY KEY (ICollectNumbersForFunId)
)
GO

-- Ok let's collect some numbers! First time synchronization usually means let's get everything
DECLARE @lastVersion BINARY(8);
DECLARE @maxVersion BINARY(8);

SELECT @lastVersion = CONVERT(BINARY(8), LastVersion, 2)
FROM dbo.CollectionTracker
WHERE [Key] = 'DataThatChangesFrequently'

SELECT @maxVersion = MAX([Version])
FROM dbo.DataThatChangesFrequently
WHERE [Version] > @lastVersion

SELECT @lastVersion AS LastVersion
      ,@maxVersion AS MaxVersion

-- This is for demonstration purposes only - you would really want to do a manual upsert or use the merge construct
INSERT INTO dbo.ICollectNumbersForFun (OooANumber)
SELECT SomeNumber
FROM dbo.DataThatChangesFrequently
WHERE [Version] > @lastVersion

-- Mark down what the latest version was
UPDATE dbo.CollectionTracker
SET CollectedOnUtc = GETUTCDATE(),
    LastVersion = CONVERT(CHAR(16), @maxVersion, 2)
WHERE [Key] = 'DataThatChangesFrequently'

SELECT * FROM dbo.CollectionTracker ct
SELECT * FROM dbo.ICollectNumbersForFun
```
This is just a demonstration, but let's assume you do something meaningful with your delta data. After you perform your work, it is important to capture your new “last max version” and store it in the tracking table for the next time you process deltas.
Understanding how to process your deltas
There are three types of deltas, and it is entirely up to the designer to implement the types of deltas needed, as explained in the following table.
|Delta Type|Processing method|
|---|---|
|New data|Data exists in the source but not in the destination. You need to perform an insert with this data on the destination. This can be achieved by querying the deltas versus the destination using an anti-join (such as NOT EXISTS).|
|Updated data|Easiest scenario. You match incoming deltas to your destination using a common key to join on and perform an update as normal. Easily done using a standard inner join.|
|Deleted data|The exact opposite of the “New data” scenario, and not always necessary to act upon, which is why this usually comes down to design requirements. Data no longer exists in the source but is still present in the destination. You can hard or soft delete that data in the destination by using the reverse of the query used for “New data”.|
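Sketches of the three processing methods, assuming (hypothetically) that the destination table carried the source key in a SourceId column. The demo table dbo.ICollectNumbersForFun does not, so treat these as templates rather than runnable continuations of the example:

```sql
-- Hypothetical: assumes the destination carries the source key in a SourceId column
DECLARE @lastVersion BINARY(8) = 0x0000000000000000; -- loaded from dbo.CollectionTracker in practice

-- New data: rows in the delta that are not yet in the destination (anti-join)
INSERT INTO dbo.ICollectNumbersForFun (OooANumber)
SELECT s.SomeNumber
FROM dbo.DataThatChangesFrequently s
WHERE s.[Version] > @lastVersion
  AND NOT EXISTS (SELECT 1
                  FROM dbo.ICollectNumbersForFun d
                  WHERE d.SourceId = s.DataThatChangesFrequentlyId);

-- Updated data: delta rows that already exist in the destination (inner join)
UPDATE d
SET d.OooANumber = s.SomeNumber
FROM dbo.ICollectNumbersForFun d
INNER JOIN dbo.DataThatChangesFrequently s
    ON s.DataThatChangesFrequentlyId = d.SourceId
WHERE s.[Version] > @lastVersion;

-- Deleted data: destination rows whose source row no longer exists (reverse anti-join)
DELETE d
FROM dbo.ICollectNumbersForFun d
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.DataThatChangesFrequently s
                  WHERE s.DataThatChangesFrequentlyId = d.SourceId);
```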
Beyond initial load
The code above only demonstrates the initial load and inserts the deltas into an empty destination table. That is not usually the real world, but I think the idea of how to use the rowversion data type is clearer now. Below is an example of coercing the rowversion to change in the source; however, no processing code is provided.
```sql
SELECT * FROM dbo.DataThatChangesFrequently dtcf

DECLARE @i INT = 1;
DECLARE @count INT = 4;

WHILE @i <= @count
BEGIN
    UPDATE dbo.DataThatChangesFrequently
    SET SomeNumber = RAND() * 100000
    WHERE DataThatChangesFrequentlyId = @i

    SET @i += 1;
END

SELECT * FROM dbo.CollectionTracker ct
SELECT * FROM dbo.DataThatChangesFrequently dtcf
SELECT * FROM dbo.ICollectNumbersForFun
GO

-- No way to reprocess unless there is a change
SELECT * FROM dbo.DataThatChangesFrequently
WHERE [Version] > 0x000000000005862F -- Example of the last maximum value stored. Your value will be different.
```
Benefits of this methodology
If you are performing imports of large datasets and the data is not already a delta, this is very beneficial for not taxing your production systems with queries that perform redundant “Did this data change?” checks. By using this method together with a staging table, you can figure out exactly what changed and distribute only those changes to your production environment. The first run of this methodology is usually the slowest because there is no baseline to compare against, but subsequent runs should be orders of magnitude faster because you will only be processing what changed.
What this methodology does not do
This methodology is great for telling you that a row of data has changed, but it will NOT tell you WHAT on the row changed. In order to know what changed on your row you can either:
- Perform a column by column comparison which is okay to do when you don’t have a lot of columns to compare.
- Use a hashing algorithm to hash all of your relevant columns together and perform a single comparison on those hashes. The only drawback to the hashing approach is that a hashing algorithm can occasionally produce the same hash for different inputs (a collision), but this is very unlikely.
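A minimal sketch of the hashing approach using HASHBYTES. The table and column names here are illustrative; CONCAT_WS requires SQL Server 2017+ and is used so that NULLs are handled and adjacent values cannot run together and collide:

```sql
-- Illustrative wide table: hash the relevant columns into one comparable value
SELECT SomeWideTableId,
       HASHBYTES('SHA2_256',
                 CONCAT_WS('|', ColumnA, ColumnB, ColumnC)) AS RowHash
FROM dbo.SomeWideTable;

-- Compare RowHash between source and destination on the shared key;
-- rows whose hashes differ are the ones that actually changed.
```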