Understanding How UUIDs Are Generated
You’ve likely used UUIDs in projects before and assumed them to be unique. Today, we’ll take a look at the main aspects of the implementation and understand why UUIDs are practically unique, though an incredibly small potential for duplication exists.
The modern-day implementation of UUIDs can be tied back to RFC 4122 which introduced 5 different approaches for generating these identifiers. We’ll take a look at each one and we’ll step through the implementation details of Version 1 & Version 4 in a moment.
UUIDs, or universally unique IDentifiers, are simply 128-bit numbers used to uniquely identify items in software development. Their canonical textual representation is as a series of 32 hexadecimal characters which are separated into five groups by hyphens in the form 8-4-4-4-12.
Embedded inside this seemingly random series of hexadecimal characters is information about the UUID implementation.
The values in the M and N locations uniquely identify the UUID version and variant respectively.
The version number is identified by looking at the most significant 4 bits of the value in the M position.
The following table lists the currently defined versions:
The variant field determines the layout of the information embedded in the UUID. The interpretation of all other bits in the UUID depends on the value of the variant.
We identify the variant by considering the first 1-3 most significant bits of N.
The most common implementation nowadays is Variant 1 where the MSB0 is fixed as 1 and MSB1 is fixed as 0. This means that considering our wildcard values – the bits marked with an
x above – the only possible values are 8, 9, A, or B.
1 0 0 0 = 8
1 0 0 1 = 9
1 0 1 0 = A
1 0 1 1 = B
So, if you ever see a UUID with these values in the N position, you’ll know it’s the common Variant 1 type.
Version 1 (Time Based + Unique or Random Host Identifier)
In this version, the UUID is generated by taking the current timestamp and some identifying property of the device generating the UUID – most commonly, the MAC address (also called the node ID).
This UUID is generated by concatenating the 48 bit MAC address, a 60 bit-timestamp, and a 14 bit “uniquifying” clock sequence, along with the 6 reserved bits for version and variant to generate a unique UUID.
The clock sequence is simply a value incremented every time the clock is modified.
The timestamp used in this version is the number of 100 nanosecond time intervals since October 15, 1582 – the date of Gregorian reform to the Christian calendar.
You may be familiar with Unix systems and time since epoch. This is simply just a different Day 0. There are several algorithms online that allow you to convert one time representation to the other, so we won’t go over this here.
Though this implementation seems fairly straightforward and robust, because it reveals the MAC address of the machine it was generated on, this approach is not suitable for all use cases. Especially, when security is a major concern. Instead, some implementations will use 6 random bytes sourced from a cryptographically secure random number generator as a replacement for the node ID.
To assemble a Version 1 UUID, we’ll do the following steps:
- Take the low 32 bits of the current UTC timestamp. This will be the first 4 bytes / 8 hex characters of our UUID [TimeLow]
- Take the middle 16 bits of the current UTC timestamp. These will be the following 2 bytes / 4 hex characters. [TimeMid]
- The next 2 bytes / 4 hex characters will concat the 4 bit UUID version with the remaining high 12 bits of the current UTC timestamp (which is 60 bits in total). [TimeHighAndVersion]
- Now, the next 1-3 bits will specify the variant of the UUID version. The remaining bits will contain the clock sequence which is meant to contribute some small amount of randomness to this implementation. In doing so, it helps avoid collisions in the event multiple UUID generators are running on the same system, a system clock for a UUID generator is set backward, or the system clock doesn’t advance fast enough. [ClockSequenceHiAndRes && ClockSequenceLow]
- The final 6 bytes / 12 hex characters / 48 bits are the “node id” which is most commonly the MAC address of the issuing device. [NodeID]
Putting everything together, our Version 1 UUID is generated by concatenating:
TimeLow + TimeMid + TimeHighAndVersion + (ClockSequenceHiAndRes && ClockSequenceLow) + NodeID
Due to this implementation’s reliance on the clock, there are some edge cases we’ll need to handle. Firstly, to minimize correlation across systems, the clock sequence defaults to a random number – this will only happen once in the lifetime of a system. This has the added benefit of allowing us to support node identifiers that may move from system to system as the initial value of the clock sequence is completely agnostic of the node identifier.
Remember that the main purpose of the clock sequence is to introduce some amount of randomness into the equation. The bits allocated for the clock sequence help us extend the timestamp and account for situations where multiple UUIDs are generated before the processor clock advances. This helps us avoid the potential creation of duplicates when the clock is set backward in time (the device is powered off) or the node ID changes. If the clock is set backward or might have been set backward (e.g., while the system was powered off), and the UUID generator can not be sure that no UUIDs were generated with timestamps larger than the value to which the clock was set, then the clock sequence has to be changed. If the previous value of the clock sequence is known, it can just be incremented; otherwise, it should be set to a random or high-quality pseudo-random value.
Version 2 (Distributed Computing Environment Security)
The main difference between this version and Version 1 is that this implementation uses some identifier specific to the system in place of the “randomness” generated by using the least significant bits of the clock sequence. This value is often just the current user’s ID. This version is less common and only a small deviation from Version 1, so we won’t explore it any further.
Version 3 (Name-based + MD5 Hash)
In the event that you want to have unique identifiers for information within a namespace or for “nameable” information more generally, UUID version 3 and version 5 are your preferred options.
They’ll encode any “nameable” entity (website, DNS information, plain text, etc) into the UUID value. The main takeaway here is that for the same namespace and text, the generated UUID will be the same.
Take note that the namespace itself is a UUID.
let namespace = “digitalbunker.dev” let namespaceUUID = UUID3(.DNS, namespace) // Ex: UUID3(namespaceUUID, “/category/things-you-should-know-1/”) 4896c91b-9e61-3129-87b6-8aa299028058 UUID3(namespaceUUID, “/category/things-you-should-know-2/”) 29be0ee3-fe77-331e-a1bf-9494ec18c0ba UUID3(namespaceUUID, “/category/things-you-should-know-3/”) 33b06619-1ee7-3db5-827d-0dc85df1f759
In this implementation, the UUID of the namespace is transformed to a string of bytes, concatenated with the input name, then hashed with MD5, yielding the 128 bits for the UUID. Then, we’ll overwrite some of those bits to accurately reflect the version and variant information leaving the rest unchanged.
It’s also important to understand that neither the namespace nor the inputted name can be generated from the UUID. This is a one-way operation. The only exception here would be through a brute force approach if one of the values (namespace or text) were known by the attacker.
For Version 3 and Version 5, as long as you use the same inputs, the generated UUID will be deterministic.
Version 4 (PRNG)
This is probably the simplest implementation of the bunch.
As we’ve now seen, 6 bits are reserved for the version and variant information leaving us 122 bits free to decide. This version simply generates all 128 random bits and then fills in the values for the version and variant information as a secondary step.
UUIDs of this variety rely heavily on the quality of the PRNG in use (pseudo-random number generator). If the PRNG is lacking a sophisticated algorithm or the correct seed and initialization values, the likelihood of a duplicate can increase. To better understand how computers generate random numbers, check out my previous article.
Version 4 is what you’ll find most commonly implemented in modern programming languages.
Its implementation is reasonably simple.
- Generate 128 random bits
- Now, we’ll need to overwrite some of these bits with the correct version and variant information
- Take the 7th byte and perform an AND operation with
0x0Fto clear out the high nibble. Then, OR it with
0x40to set the version number to 4.
- Next, take the 9th byte and perform an AND operation with
0x3Fand then OR it with
- Convert the 128 bits to hexadecimal representation and insert the hyphens to achieve the canonical text representation.
Version 5 (Name-based + SHA-1 Hash)
This version is no different than Version 3 with the exception that the
SHA-1 hashing algorithm is used here in place of
MD5. This version is preferred to Version 3 (SHA-1 > MD5).
One of the notable benefits of UUIDs is that their uniqueness does not depend on a central authority or coordination between different systems. Anyone can create UUIDs with reasonable assurance that a duplicate value does not exist and will conceivably not be made in the future.
This has the added benefit of allowing UUIDs generated by independent parties to be combined into a single database or moved across databases with a trivial probability of duplication/collision.
Due to this uniqueness, you can use UUIDs as primary keys in databases, unique filenames for uploaded files, unique names for any web resource, or allow vendors to create and register UUIDs without the need of a central authority. This is a double-edged sword, however. Due to this lack of a central authority, it’s impossible to keep track of what UUIDs have previously been issued.
There are also a few shortcomings to address. Though this inherent randomness aids in security, it complicates matters like debugging. Additionally, UUIDs may be overkill in certain situations. For example, it wouldn’t make sense to use 128 bits to uniquely identify some data that itself may be less than 128 bits in size.
Looking at these implementations, it may appear that given enough time, you’d eventually repeat a value. Especially in the case of Version 4, with its reliance on random numbers. However, in practice, the amount and frequency with which you’d need to generate UUIDs in order to have a chance at a duplicate is completely impractical.
If you were to generate 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. This is of course assuming that the PRNG used in generating the UUID introduced sufficient entropy (“real randomness”) into the mix, otherwise, the probability of duplicates count be higher. In a slightly more tangible example, if you were to generate 10,000,000,000,000 [10 trillion] UUIDs, the chance of 2 UUIDs being the same is 0.00000006 %.
And in the case of Version 1, the clock would roll over only in 3603 A.D. So, unless you’re planning on having your service run for another 1,583 years, you’re safe on this front too.
The potential for duplicates is still there and some systems do try to account for this, but for the overwhelming majority of use cases, UUIDs can be treated as if they are truly unique. If you still need more convincing, here’s a simple visualization/proof of how unlikely a collision would be in practice.
Hope you enjoyed this article!
Sign up for my newsletter to be notified when new articles are published!