SQL Server has a very powerful feature that, if you know how to take advantage of it, can cut down on the amount of logic you have to write when performing deltas. This is very much a foundational implementation, not something you can necessarily introduce after the fact without going through a lot of pain. SQL Server provides a column type named rowversion, which is the successor to and replacement for the deprecated timestamp column type.
This methodology is great if you are importing a large amount of data every day and only want to promote what has actually changed since the last import. It can also be used to feed static reporting tables, because only the data that changed has to be isolated.
Facts about rowversion
- You cannot insert or update the rowversion of a row; it is controlled by SQL Server. The rowversion changes automatically upon every insert or update.
- Rowversion values can be sorted, and this is key for performing delta comparisons.
- A nonnullable rowversion column is semantically equivalent to a binary(8) column.
- A nullable rowversion column is semantically equivalent to a varbinary(8) column.
- Example of what a rowversion looks like: 0x00000000000007D1 (an illustrative value; it is an 8-byte binary value, displayed as 16 hexadecimal characters).
Rules of engagement
The way to think about using rowversion for deltas is in two parts:
- Add a rowversion column, which I typically name [Version], to a target table.
- The word “Version” is a reserved keyword, which is why I use the square brackets.
- Create a new table to track the max version of your target tables.
This will make more sense with the example that follows.
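Retrofitting the column onto an existing table is a single statement. The table name below is hypothetical, and note that SQL Server back-fills a rowversion value for existing rows automatically:

```sql
-- Hypothetical table name; a table may contain only one rowversion column
ALTER TABLE dbo.MyExistingTable
    ADD [Version] ROWVERSION NOT NULL;
```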
The following tables are simplistic examples that each play a key role in the real world.
|Example table|Real world example|
|---|---|
|dbo.DataThatChangesFrequently|Think about a table in your system that is highly transactional and where each row is truly unique in what it describes, such as: Employees, Departments, Buildings, Personnel, Facilities, Entities, Chart of Accounts, Companies, Residents, Visitors, etc. If for any reason you need to know when any row has changed so you can do something with it or copy it elsewhere, then this table is your candidate for adding that [Version] column.|
|dbo.CollectionTracker|For any of the examples in the row above, you need to track what the max rowversion was the last time you processed your deltas. Therefore each table has to be tracked along with its last max [Version] value.|
|dbo.ICollectNumbersForFun|This is just a silly example destination table where we copy our deltas to. In the real world this can be any destination; it doesn't even have to be a table. There is no rowversion column in this table because it isn't needed. This is a one-way relationship that copies delta data from “dbo.DataThatChangesFrequently” to “dbo.ICollectNumbersForFun”.|
The following example is designed to demonstrate how rowversion works, using meaningless integer data. Here the tables are created and filled with random test data. My test output will be different from yours, but everything will work the same if you are following along.
Full code example located on GitHub
Below you will see that a WHILE loop is used to fill the table. This is done on purpose: when no seed is provided, the RAND() function is evaluated only once per query, so a single set-based insert would give every row the same value. Looping produces a new value for each row. For demonstration purposes this is okay, but of course not ideal for the real world.
```sql
CREATE TABLE dbo.DataThatChangesFrequently
(
    DataThatChangesFrequentlyId INT IDENTITY(1,1) NOT NULL,
    SomeNumber INT NOT NULL,
    [Version] ROWVERSION NOT NULL,
    CONSTRAINT [PK_dbo.DataThatChangesFrequently_DataThatChangesFrequentlyId]
        PRIMARY KEY (DataThatChangesFrequentlyId)
)
GO

DECLARE @i INT = 0;
DECLARE @count INT = 10;

WHILE @i < @count
BEGIN
    INSERT INTO dbo.DataThatChangesFrequently (SomeNumber)
    VALUES (RAND() * 100000)

    SET @i += 1;
END

SELECT * FROM dbo.DataThatChangesFrequently

-- There is a method to the madness
SELECT MIN([Version]), MAX([Version]) FROM dbo.DataThatChangesFrequently
```
The setup for this table always starts with an initial value because you need a baseline of where you are starting. If you are setting this up for the first time, then zero is safe because zero will be less than any rowversion provided by SQL Server.
```sql
CREATE TABLE dbo.CollectionTracker
(
    CollectionTrackerId INT IDENTITY(1,1) NOT NULL,
    [Key] VARCHAR(50) NOT NULL,
    [Description] VARCHAR(255) NULL,
    LastVersion CHAR(16) NOT NULL, -- Storing as char for the sake of portability
    CollectedOnUtc DATETIME2(0) NOT NULL,
    CONSTRAINT [PK_dbo.CollectionTracker_CollectionTrackerId]
        PRIMARY KEY (CollectionTrackerId)
)

-- Initialization is a good first step so you never have to handle zero rows;
-- zero rows will never matter after the first time, so why bother now.
-- Make this more sophisticated by adding in what it's tracking, and don't track the same thing more than once.
INSERT INTO dbo.CollectionTracker
(
    [Key]
    ,LastVersion
    ,CollectedOnUtc
)
VALUES
(
    'DataThatChangesFrequently'
    ,'0000000000000000' -- Smallest version for now; don't store the 0x prefix, it will mess up the conversion
    ,GETUTCDATE()       -- CollectedOnUtc - datetime2(0)
)
```
I have tried on many occasions to store the value as an actual binary data type, but for the life of me have had nothing but problems doing so. I feel like SQL Server has overcomplicated how binary is stored, but that's a different problem for another time. Therefore, to avoid making terrible mistakes with how the data is stored, you can bypass the binary data type entirely and substitute a CHAR(16) column instead.
Why 16 zeros?
Here you can see it is being initialized to 16 zeros. This is important because if you don't fill out all of the space provided by a CHAR field, you inherit trailing blanks; that's just how CHAR works.
Sixteen zeros are used because a BINARY(8) value always renders as exactly 16 hexadecimal characters, sans the 0x prefix.
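As a quick sanity check, here is a minimal sketch of the round trip between BINARY(8) and CHAR(16). Style 2 of CONVERT is the variant that works without the 0x prefix:

```sql
DECLARE @v BINARY(8) = 0x00000000000007D1; -- Illustrative rowversion-like value

-- Binary to its 16-character hex string (style 2 = no 0x prefix)
DECLARE @asChar CHAR(16) = CONVERT(CHAR(16), @v, 2);
SELECT @asChar;

-- And back again; style 2 expects the string WITHOUT the 0x prefix
SELECT CONVERT(BINARY(8), @asChar, 2);
```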
As mentioned earlier, this table is a destination for the data in the source. This is where the power of the [Version] column and the tracked last version come into play. You recall your “last max version”, which by now, with the passage of time and changes in the system, is your new minimum version. Comparing the last version to your [Version] column as shown below will get you everything with a greater version than your last (max) version. This literally exposes your delta.
```sql
CREATE TABLE dbo.ICollectNumbersForFun
(
    ICollectNumbersForFunId INT IDENTITY(1,1) NOT NULL,
    OooANumber INT NOT NULL,
    CONSTRAINT [PK_dbo.ICollectNumbersForFun_ICollectNumbersForFunId]
        PRIMARY KEY (ICollectNumbersForFunId)
)
GO

-- Ok let's collect some numbers! First time synchronization usually means let's get everything
DECLARE @lastVersion BINARY(8);
DECLARE @maxVersion BINARY(8);

SELECT @lastVersion = CONVERT(BINARY(8), LastVersion, 2)
FROM dbo.CollectionTracker
WHERE [Key] = 'DataThatChangesFrequently'

SELECT @maxVersion = MAX([Version])
FROM dbo.DataThatChangesFrequently
WHERE [Version] > @lastVersion

SELECT @lastVersion AS LastVersion
      ,@maxVersion AS MaxVersion

-- This is for demonstration purposes only - you would really want to do a manual upsert or use the merge construct
INSERT INTO dbo.ICollectNumbersForFun (OooANumber)
SELECT SomeNumber
FROM dbo.DataThatChangesFrequently
WHERE [Version] > @lastVersion

-- Mark down what the latest version was
UPDATE dbo.CollectionTracker
SET CollectedOnUtc = GETUTCDATE(),
    LastVersion = CONVERT(CHAR(16), @maxVersion, 2)
WHERE [Key] = 'DataThatChangesFrequently'

SELECT * FROM dbo.CollectionTracker ct
SELECT * FROM dbo.ICollectNumbersForFun
```
This is just a demonstration, but let's assume you do something meaningful with your delta data. After you perform your work, it is important to capture your new “last max version” and store it in the tracking table for the next time you process deltas.
Understanding how to process your deltas
There are three types of deltas, and it is entirely up to the designer to implement the types of deltas needed, as explained in the following table.
|Delta Type|Processing method|
|---|---|
|New data|Data exists in the source but not in the destination. You need to perform an insert with this data on the destination. This can be achieved by querying the deltas versus the destination using an anti-join (such as NOT EXISTS).|
|Updated data|Easiest scenario. You match incoming deltas to your destination using a common key to join on and perform an update as normal. Easily done using a standard inner join.|
|Deleted data|The exact opposite of the “New data” scenario, and not always necessary to act upon, which is why this usually comes down to design requirements. Data no longer exists in the source but is still present in the destination. You can hard or soft delete that data in the destination by using the reverse of the query used for “New data”.|
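Sketches of the three processing methods, assuming (hypothetically) that the destination table carried the source key in a SourceId column. The demo table dbo.ICollectNumbersForFun does not, so treat these as templates rather than runnable continuations of the example:

```sql
-- Hypothetical: assumes the destination carries the source key in a SourceId column
DECLARE @lastVersion BINARY(8) = 0x0000000000000000; -- loaded from dbo.CollectionTracker in practice

-- New data: rows in the delta that are not yet in the destination (anti-join)
INSERT INTO dbo.ICollectNumbersForFun (OooANumber)
SELECT s.SomeNumber
FROM dbo.DataThatChangesFrequently s
WHERE s.[Version] > @lastVersion
  AND NOT EXISTS (SELECT 1
                  FROM dbo.ICollectNumbersForFun d
                  WHERE d.SourceId = s.DataThatChangesFrequentlyId);

-- Updated data: delta rows that already exist in the destination (inner join)
UPDATE d
SET d.OooANumber = s.SomeNumber
FROM dbo.ICollectNumbersForFun d
INNER JOIN dbo.DataThatChangesFrequently s
    ON s.DataThatChangesFrequentlyId = d.SourceId
WHERE s.[Version] > @lastVersion;

-- Deleted data: destination rows whose source row no longer exists (reverse anti-join)
DELETE d
FROM dbo.ICollectNumbersForFun d
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.DataThatChangesFrequently s
                  WHERE s.DataThatChangesFrequentlyId = d.SourceId);
```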
Beyond initial load
The code above only demonstrates the initial load and inserts the deltas into an empty destination table. That is not usually the real world, but I think the idea of how to use the rowversion data type is clearer now. Below is an example of coercing the rowversion to change in the source; however, no processing code is provided.
```sql
SELECT * FROM dbo.DataThatChangesFrequently dtcf

DECLARE @i INT = 1;
DECLARE @count INT = 4;

WHILE @i <= @count
BEGIN
    UPDATE dbo.DataThatChangesFrequently
    SET SomeNumber = RAND() * 100000
    WHERE DataThatChangesFrequentlyId = @i

    SET @i += 1;
END

SELECT * FROM dbo.CollectionTracker ct
SELECT * FROM dbo.DataThatChangesFrequently dtcf
SELECT * FROM dbo.ICollectNumbersForFun
GO

-- No way to reprocess unless there is a change
SELECT * FROM dbo.DataThatChangesFrequently
WHERE [Version] > 0x000000000005862F -- Example of the last maximum value stored. Your value will be different.
```
Benefits of this methodology
If you are performing imports of large datasets and the data is not already a delta, this is very beneficial for not taxing your production systems with queries that perform redundant “Did this data change?” checks. By using this method together with a staging table, you can figure out exactly what changed and distribute only those changes to your production environment. The first run of this methodology is usually the slowest because there is no baseline to compare against, but subsequent runs should be orders of magnitude faster because you will only be processing what changed.
What this methodology does not do
This methodology is great for telling you that a row of data has changed, but it will NOT tell you WHAT on the row changed. In order to know what changed on your row you can either:
- Perform a column by column comparison which is okay to do when you don’t have a lot of columns to compare.
- Use a hashing algorithm to hash all of your relevant columns together and perform a single comparison on those hashes. The only drawback to the hashing approach is that a hashing algorithm can occasionally produce the same hash for different inputs (a collision), but this is very unlikely.
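A minimal sketch of the hashing approach using HASHBYTES. The table and column names here are illustrative; CONCAT_WS requires SQL Server 2017+ and is used so that NULLs are handled and adjacent values cannot run together and collide:

```sql
-- Illustrative wide table: hash the relevant columns into one comparable value
SELECT SomeWideTableId,
       HASHBYTES('SHA2_256',
                 CONCAT_WS('|', ColumnA, ColumnB, ColumnC)) AS RowHash
FROM dbo.SomeWideTable;

-- Compare RowHash between source and destination on the shared key;
-- rows whose hashes differ are the ones that actually changed.
```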