Unique IDs generator across shards
Let's start off from this following implementation by the Instagram team:
CREATE SEQUENCE IF NOT EXISTS table_id_seq
AS bigint
START WITH 1000;
CREATE OR REPLACE FUNCTION next_id(OUT result bigint) AS $$
DECLARE
our_epoch bigint := 1568851200000;
seq_id bigint;
now_millis bigint;
shard_id int := 5;
BEGIN
SELECT nextval('table_id_seq') % 1024 INTO seq_id;
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
result := (now_millis - our_epoch) << 23;
result := result | (shard_id << 10);
result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
The above code creates a sequence called table_id_seq
that starts at 1000, and a function called next_id
that generates unique IDs.
The function uses the clock_timestamp
function to get the current time in milliseconds, and subtracts a fixed epoch value (1568851200000
) to create a timestamp that is relative to the epoch. This relative timestamp is then shifted left 23 bits to make room for the shard ID and sequence ID, which are then added to the timestamp using bitwise OR operations. Finally, the generated ID is returned as the output of the function.
The IDs generated by this function will be unique across all shards, and will be roughly sortable by the time they were generated.
But my problem with this implementation is that the generated IDs are very long and their range of uniqueness is not that long.
So how would we go about this?
We can use more bits for the relative timestamp and fewer bits for the shard ID and sequence ID. For example, we could use 24 bits for the relative timestamp, 4 bits for the shard ID, and 12 bits for the sequence ID, like this:
CREATE OR REPLACE FUNCTION next_id(OUT result bigint) AS $$
DECLARE
our_epoch bigint := 1568851200000;
seq_id bigint;
now_millis bigint;
shard_id int := 5;
BEGIN
SELECT nextval('table_id_seq') % 4096 INTO seq_id;
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
result := (now_millis - our_epoch) << 14;
result := result | (shard_id << 12);
result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
In this modified version of the function, 24 bits are used for the relative timestamp, 4 bits are used for the shard ID, and 12 bits are used for the sequence ID. This means that the IDs generated by the function will be unique for the next 16,777,216 seconds (about 194 days) after the epoch, across 16 different shards, and for up to 4096 unique IDs within a single shard and timestamp. The IDs will be 3 bits shorter than the IDs generated by the original function, and they will have a longer range of uniqueness.
You can adjust the number of bits used for each part of the ID to find a balance between the range of uniqueness and the length of the IDs that works best for your use case.
Another way would be by reducing the value of the our_epoch
variable. The our_epoch
variable specifies the point in time (in milliseconds) from which the relative timestamp in the IDs is calculated. By reducing the value of this variable, we can increase the range of uniqueness of the IDs, because the relative timestamp will have more bits available to represent the time elapsed since the epoch.
For example, if ywe reduce the value of the our_epoch
variable from 1568851200000
to 1500000000000
, the relative timestamp will be able to represent the time elapsed since the epoch using 22 bits instead of 21 bits.
But one probelem still persists. After the range of uniqueness of the IDs generated by the next_id
function has been reached, the function will start generating duplicate IDs. This will happen if the current time in milliseconds is equal to or greater than the epoch value plus the range of uniqueness. For example, if the epoch value is 1568851200000 and the range of uniqueness is 194 days, the function will start generating duplicate IDs after 1568851200000 + (194 * 24 * 60 * 60 * 1000) = 1573569600000 milliseconds.
Once the function starts generating duplicate IDs, it will no longer be possible to use the IDs to uniquely identify entities. We will need to come up with a new strategy for generating unique IDs, such as increasing the range of uniqueness by using more bits for the relative timestamp, or using a different epoch value.
It's important to note that the range of uniqueness of the IDs generated by the next_id
function is only approximate, because it depends on the accuracy of the clock_timestamp
function and the precision of the nextval
function. The actual range of uniqueness may be shorter or longer than the value calculated based on the number of bits used for each part of the ID.
A remedy to this would be to automatically change the epoch value when the range of uniqueness has been reached. This will allow the function to continue generating unique IDs without any manual intervention.
Here is the final next_id
function:
CREATE OR REPLACE FUNCTION next_id(OUT result bigint) AS $$
DECLARE
our_epoch bigint := 1568851200000;
seq_id bigint;
now_millis bigint;
shard_id int := 5;
max_millis bigint := our_epoch + (194 * 24 * 60 * 60 * 1000);
BEGIN
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
IF now_millis >= max_millis THEN
our_epoch := our_epoch + (194 * 24 * 60 * 60 * 1000);
max_millis := our_epoch + (194 * 24 * 60 * 60 * 1000);
END IF;
SELECT nextval('judiye.table_id_seq') % 4096 INTO seq_id;
result := (now_millis - our_epoch) << 14;
result := result | (shard_id << 12);
result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
In this modified version, the our_epoch
and max_millis
variables are updated whenever the current time in milliseconds is equal to or greater than the max_millis
value. This means that the epoch value will be automatically incremented by the range of uniqueness whenever the function reaches the end of its current range of uniqueness. The function will then be able to generate unique IDs for the next 194 days after the updated epoch value.
This approach has the advantage of allowing the function to generate unique IDs for an indefinite period of time, as long as the epoch value is updated whenever the range of uniqueness is reached.
However, it also has the disadvantage of potentially generating IDs that are not strictly increasing over time, because the epoch value can be updated multiple times. This may not be a problem for some use cases, but it could be an issue for others.