2026-02-15

A couple of years ago I made a sizable mistake when choosing how to handle hashing of strings. This was incredibly embarrassing for me, but hey, I’m human; I never told anyone I was perfect or infallible. I have been wanting to write about this for a while and just didn’t have the time to dedicate to it until now. My goal in writing this is to warn others about what not to do and the trade-offs of requiring unique hashes for a collection of strings. Since this was a few years ago (three-ish?), the source I was modeling my hashing on has since changed.

Don’t do this, but here’s what I did

I will explain further in this article why you shouldn’t do what I am about to demonstrate, but first I have to set up how I got here. The following code is a direct copy from Microsoft’s reference source repository, from mscorlib and the System namespace. I am copying it here because I noticed that after three-ish years it has changed; the implementation in the latest version, which is .NET 10 at the time of writing, is totally different.

https://github.com/microsoft/referencesource/blob/main/mscorlib/system/string.cs

// Gets a hash code for this string.  If strings A and B are such that A.Equals(B), then
// they will return the same hash code.
[System.Security.SecuritySafeCritical]  // auto-generated
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
public override int GetHashCode() 
{

#if FEATURE_RANDOMIZED_STRING_HASHING
    if(HashHelpers.s_UseRandomizedStringHashing)
    {
        return InternalMarvin32HashString(this, this.Length, 0);
    }
#endif // FEATURE_RANDOMIZED_STRING_HASHING

    unsafe 
    {
        fixed (char *src = this) 
        {
            Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");
            Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");

#if WIN32
            int hash1 = (5381<<16) + 5381;
#else
            int hash1 = 5381;
#endif
            int hash2 = hash1;

#if WIN32
            // 32 bit machines.
            int* pint = (int *)src;
            int len = this.Length;
            while (len > 2)
            {
                hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
                hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
                pint += 2;
                len  -= 4;
            }

            if (len > 0)
            {
                hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
            }
#else
            int     c;
            char *s = src;
            while ((c = s[0]) != 0) {
                hash1 = ((hash1 << 5) + hash1) ^ c;
                c = s[1];
                if (c == 0)
                    break;
                hash2 = ((hash2 << 5) + hash2) ^ c;
                s += 2;
            }
#endif
#if DEBUG
            // We want to ensure we can change our hash function daily.
            // This is perfectly fine as long as you don't persist the
            // value from GetHashCode to disk or count on String A 
            // hashing before string B.  Those are bugs in your code.
            hash1 ^= ThisAssembly.DailyBuildNumber;
#endif
            return hash1 + (hash2 * 1566083941);
        }
    }
}

Forewarnings

I took the above code and noted a few things:

  1. It’s running in an unsafe block, which would be my ultimate undoing. I didn’t fully appreciate what this meant at the time, and well, I do now 🤦.
  2. It’s architecture specific: depending on whether you are running x86 or x64, different branches of this code execute, which means it produces different hash values for identical inputs.
  3. There is a good possibility the maintainers would change this hashing algorithm without notice.
  4. This hashing algorithm shouldn’t be used to produce predictable hashes (there’s a quick demo of this right after the list).
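
On modern .NET (Core and later), forewarning #4 is easy to see for yourself: string hash codes are randomized per process. A minimal demo, assuming a .NET 5+ console app:

  // Run this program twice: the printed value changes between runs because
  // modern .NET seeds string hashing randomly per process.
  Console.WriteLine("Hello, World!".GetHashCode());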

Arrogant decisions

Unfortunately, I handled the above forewarnings with a careless attitude, because I arrogantly thought I could make it work for my needs. For each of the forewarnings, this was my response/thinking:

  1. 🤷 Ah, should be fine, I will test it extremely well so that I won’t have any surprises.
  2. No problem, I will burn in the x64 flavor.
  3. No problem, I will copy today’s code so that it cannot change without notice. This way I am not dependent on the System.String.GetHashCode() implementation, it will be my own implementation that I can test thoroughly.
  4. Well, I need a predictable hashing algorithm that I can rely on: concatenate several strings together, then produce a hash of the resultant string.
    • It’s not a secret or a password, therefore it’s okay if it doesn’t have a high degree of entropy.
    • All of those warnings are there because of the switch in architecture and the warning that the maintainers will purposely change the DailyBuildNumber property for testing purposes.

🤡 What’s the worst that can happen? I am going to make sure to test this extremely well. 🤡

Behold, my mistake was born

For I am a stupid dumb dumb, I created this syntactically correct, compilable mistake that no one should repeat:

  public int GetHashCode(string input) 
  {
    // Requires compiling with unsafe code enabled (<AllowUnsafeBlocks>true</AllowUnsafeBlocks>).
    unsafe 
    {
      // Pinning the string gives us a raw char pointer; the loop relies on the
      // null terminator the runtime places after a string's characters.
      fixed (char *src = input) 
      {  
        int hash1 = 5381;
        int hash2 = hash1;
        int c;
        
        char *s = src;
        
        // Alternate characters between hash1 and hash2 until the terminator.
        while ((c = s[0]) != 0) 
        {
            hash1 = ((hash1 << 5) + hash1) ^ c;
            c = s[1];
            if (c == 0)
                break;
            hash2 = ((hash2 << 5) + hash2) ^ c;
            s += 2;
        }

        // The multiply and add below silently wrap around on overflow;
        // C# integer arithmetic is unchecked by default.
        return hash1 + (hash2 * 1566083941);
      }
    }
  }

This code:

  1. Compiles
  2. Runs
  3. Produces the same hash every time
  4. Works specifically for x64 architecture, by design

Here is an example of the input and output:

  GetHashCode("Hello, World!");
  // Output: 562640209

So what went wrong?

This output is repeatable: in goes “Hello, World!” and out comes 562640209 every time! Success!

Ok, on to my real-world use case. I needed to hash mailing addresses so that anytime a mailing address was provided to an API, we could determine whether it was an already-verified address. Once verification is performed, it doesn’t need to be performed again, because re-verifying would just needlessly incur a cost. Therefore, we have the usual seven components of an address:

Segment     | Max char length | Notes
------------|-----------------|------
Address1    | 50              |
Address2    | 50              |
City        | 40              |
State       | 2               | Using the Alpha ISO 2 standard
Postal Code | 10              | Supporting plus 4 codes
County      | 50              | Not part of mailing address, but needed for the project
Country     | 3               | Using the Alpha ISO 3 standard

Conditions

  • I protected my inputs from the diagonal edge case by prepending a single lowercase character marker to each of the seven segments. This way you know when something was intentionally left blank (or null). There’s a sketch of this scheme right after the test address below.
  • Segment order matters, so it has to stay consistent.
  • I purposely lowercased all of the addresses so we didn’t have to bother distinguishing two otherwise-identical addresses that differ only in case. Whether an address is written in uppercase or lowercase doesn’t change what the address actually means.
  • Let’s ignore the null base case for now just to keep this simple. I did handle that too, but it’s just a base case.
  • Max concatenated character length is 205

Test address

For argument’s sake let’s say that the test address is “1600 Pennsylvania Avenue NW, Washington, DC 20500”, which is the address for the White House, a very publicly well known address.

Segment     | Value                       | Marker | Marker + Lowercase(Value)
------------|-----------------------------|--------|--------------------------
Address1    | 1600 Pennsylvania Avenue NW | a      | a1600 pennsylvania avenue nw
Address2    |                             | b      | b
City        | Washington                  | c      | cwashington
State       | DC                          | d      | ddc
Postal Code | 20500                       | e      | e20500
County      |                             | f      | f
Country     | USA                         | g      | gusa
  • Input string: “a1600 pennsylvania avenue nwbcwashingtonddce20500fgusa”
  • Output value: -1169433873
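
To make the concatenation concrete, here is a minimal sketch of how the input string could be built. The BuildHashInput helper, its signature, and the StringBuilder approach are mine for illustration, not the original project’s code:

  using System.Text;

  // Hypothetical reconstruction of the marker + lowercase concatenation scheme.
  // Segments must arrive in the fixed order Address1..Country; markers run 'a' through 'g'.
  static string BuildHashInput(params string?[] segments)
  {
    var sb = new StringBuilder();
    for (int i = 0; i < segments.Length; i++)
    {
      sb.Append((char)('a' + i));                                   // marker
      sb.Append((segments[i] ?? string.Empty).ToLowerInvariant());  // blank stays blank
    }
    return sb.ToString();
  }

  // BuildHashInput("1600 Pennsylvania Avenue NW", "", "Washington", "DC", "20500", "", "USA")
  // => "a1600 pennsylvania avenue nwbcwashingtonddce20500fgusa"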

Observations

The output value is negative. 🤔 This should have given me pause, but it didn’t, because I had seen string hash code values come out negative before, so I just took it for granted without giving it a second thought. This is where I fucked up 🥺.

Why the negative value should give you pause

Remember the unsafe context modifier in the hashing example? The unsafe block is really about the raw pointer access; the part that produces negative values is that C# integer arithmetic is unchecked by default, so c, hash1, and hash2 silently wrap around on overflow instead of throwing an OverflowException. Like a hardware register rolling over from all ones back to zero to start counting again, the value wraps, except here we are in signed 32-bit territory, where the ceiling is int.MaxValue, or 2,147,483,647.

So what happens when you add 1 to 2,147,483,647? You wrap around to -2,147,483,648, which is int.MinValue.
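
You can see the wraparound in two lines. The explicit unchecked keyword below is redundant at runtime (it is already the default) and is only there to make the intent obvious:

  int max = int.MaxValue;            // 2,147,483,647
  int wrapped = unchecked(max + 1);  // no exception; the bits simply wrap
  Console.WriteLine(wrapped);        // -2147483648, which is int.MinValue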

Why is this bad?

This isn’t inherently bad, but it is terrible if you were expecting UNIQUE hash codes per unique string. A 32-bit hash has only 4,294,967,296 possible values, so by the pigeonhole principle collisions are guaranteed once your data set is large enough, and the birthday problem makes them likely far sooner.

I thought I had performed thorough testing, but I quickly found out I was 100% wrong as soon as I started working with production data. I ran a number of unit tests to probe the mins and maxes, and I really tried to cause a collision, but I couldn’t. My production data set was 38 million rows at the time (it’s larger now), and it produced collisions at a much higher rate than I would have ever expected.
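
In hindsight, a back-of-the-envelope birthday calculation predicts exactly that. Assuming the hash behaved like a uniform random function over 32 bits (DJB2-style hashes on structured inputs like addresses do worse, not better), the expected number of colliding pairs is roughly n(n - 1)/2 divided by 2^32:

  long n = 38_000_000;
  double expectedCollidingPairs = (double)n * (n - 1) / 2 / Math.Pow(2, 32);
  Console.WriteLine(expectedCollidingPairs); // ≈ 168,000 colliding pairs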

How to deal with collisions

This depends on your goals. The collisions themselves are not exactly a bad thing depending on what you are doing. For example, if you don’t depend on one hash code equaling one entry in your database, then this works fine: you compute the input hash, perform a lookup on that hash to locate the set of matches, then perform a granular comparison on each segment to find what you are looking for. This is perfectly acceptable.
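
Here is a sketch of that lookup-then-verify pattern over a simple in-memory index; the Dictionary shape and the Find helper are illustrative, not the original code:

  using System;
  using System.Collections.Generic;
  using System.Linq;

  // The hash only narrows the search to a bucket of candidates;
  // a full string comparison resolves any collisions inside the bucket.
  static string? Find(Dictionary<int, List<string>> index, string input, Func<string, int> hash)
  {
    return index.TryGetValue(hash(input), out var bucket)
      ? bucket.FirstOrDefault(stored => stored == input)
      : null;
  }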

However, if you were expecting to have one unique hash per database entry, this won’t work. Because the hash codes wrap around within 32 bits, there is a good likelihood, rising to a certainty as the data set grows, that hash codes get reused.

How to get unique hash codes

This, unfortunately 😢, is one of the most sought-after problems in the hashing and cryptography space. When you map arbitrary-length strings to a fixed-size output, collisions cannot be eliminated entirely; the only thing you can do is mitigate how many occur. This is a very complicated topic, so I am not going to get into it here. Essentially what is needed is a perfect hash function, and those can only be constructed when the complete set of keys is known in advance.

Now what?

With this terrible realization 😱, I had to move quickly to find a solution. I did a lot of research on what I needed from a hashing algorithm to cover a string of up to 212 characters (205 + 7 markers). I discovered that I would need to use a cryptographic algorithm, which I thought was overkill because, like I said originally, this data isn’t sensitive on its own.

Well I tried using a SHA256 algorithm:

using System.Security.Cryptography;
using System.Text;

public string CryptographicHash(string input)
{
  // SHA256 produces a 32-byte digest; ToHexString renders it as 64 hex characters.
  using var sha = SHA256.Create();
  
  byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(input));
  
  return Convert.ToHexString(hash);
}
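
As an aside, on .NET 5 and later the static helper SHA256.HashData(Encoding.UTF8.GetBytes(input)) computes the same digest without creating and disposing an instance.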

Same input as before and here’s the output:

7C0C0214FB317747FD56AE98309BE5E716242AC300FB026A2DEDF3D5AFE71043

Oh boy… Let’s line those up next to each other for the full effect:

                     | Value                                                            | Length
---------------------|------------------------------------------------------------------|-------
Input                | a1600 pennsylvania avenue nwbcwashingtonddce20500fgusa           | 54
string.GetHashCode() | -1169433873                                                      | 11
SHA256 Hash          | 7C0C0214FB317747FD56AE98309BE5E716242AC300FB026A2DEDF3D5AFE71043 | 64

Now we have a new problem: the output length is LONGER 🤯 than the input length! (The raw digest is 32 bytes, but hex encoding doubles it to 64 characters.) This makes storing the hash in the database pointless, because we might as well just use the original input itself for identification. Very disappointing. I ran statistics on my data set, and an overwhelming number of the addresses were about 60 characters long, so I had to find a different way forward.

The analogue

I ended up just replacing the original hash code with the input string itself and calling it “the analogue”. This worked, but it is a less-than-impressive way of doing things. Another complication that made this so much harder was the fact that I had to use MongoDB, and, well, that was a complete disaster in itself. We made plans to get off of MongoDB and onto SQL Server (where we should have started). The new design forgoes any kind of hash; all searches are performed using all seven segments, with an index to match. Again, not at all sexy, but it works.

I could have used MongoDB’s built-in functions, but I refused to do that so that we weren’t married to MongoDB, and I stand by my decision. It’s the same reason I refused to use the MongoDB ObjectId and instead used a common GUID: it lets us break away from MongoDB easily.

Conclusion

Don’t make this dumbass mistake that I made. Learn from my mistakes.

I don’t feel comfortable publishing the collisions I found because they come from production data, but I assure you, on too many occasions two or more completely different addresses ended up with the same hash. It was a nightmare. I learned my lesson: never again will I take something as important as unsafe code and unchecked arithmetic for granted.

Generated the thumbnail image with CoPilot (because I am lazy and unimaginative).
