RFC: Future-proofed cryptographic hash values. #1

Open
@jbenet

Problem

As time passes, software that uses a particular hash function will often need to upgrade to a better, faster, stronger, ... one. This introduces large costs: systems may assume a particular hash size, or call sha1 all over the place.

It's already common to see hashes prefixed with a function id:

sha1-a651ec3c4cc479977777f916fcedb221f38aaba1
sha256-aec71a4d4a8f44bc0c3e1133d5544d724b857cf20fe5aaeb1bc4d6e7c1ee68f1

Is this the best way? Maybe it is. But there are some problems:

  1. Hashes tend to be transferred/printed encoded in hex, base32, base58, base64, etc. The name of the hash function may not be compatible with your encoding (e.g. the hashes above are hex values, but 's' is not a valid hex char). This introduces annoying complexity when merely encoding/decoding hashes for storing, transferring, or printing out to users. (Ugh!) It gets worse when "things expecting a hex hash" that you don't control cannot be used with this scheme.
  2. When storing millions of hashes, the extra byte cost of something like blake2b- may matter, so we might want a much narrower prefix, particularly given that the set of "widely used and accepted secure cryptographic hash functions" changes very little over time (as of 2014 there are fewer than 256 that you might seriously consider).
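Both problems are easy to reproduce. The snippet below is a minimal sketch, not part of any existing tool: `is_plain_hex` is a hypothetical stand-in for a consumer that only accepts hex, and the input `b"foo"` is an arbitrary example.

```python
import hashlib

digest = hashlib.sha1(b"foo").digest()   # 20 raw bytes
name_prefixed = "sha1-" + digest.hex()   # the common "name-hex" form

def is_plain_hex(s: str) -> bool:
    """Return True if s decodes as hex, as a hex-only consumer would require."""
    try:
        bytes.fromhex(s)
        return True
    except ValueError:
        return False

print(is_plain_hex(digest.hex()))    # the bare digest decodes fine
print(is_plain_hex(name_prefixed))   # the "sha1-" prefix breaks decoding

# Per-hash storage overhead when the prefix is stored as raw bytes.
name_overhead = len(b"blake2b-")     # eight bytes for the name and dash
id_overhead = len(bytes([0x01]))     # one byte for a numeric id
print(name_overhead, id_overhead)
```

Over a few million stored hashes, the difference between the two overheads is tens of megabytes.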

Is there an RFC for this? I haven't found a "Hash Function Suite" like the "Cipher Suite" in TLS (RFC 5246/A.5).

Potential solutions

Use a short prefix that maps into a "cryptographic hash function" suite. A mapping like this already has to exist: the sha1- prefix is more human-readable, but it is probably not a good idea to blindly dispatch to a function based on the string sha1. Whitelisting specific strings (a blessed table) already happens.
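A blessed table keyed by a one-byte id might look like the sketch below. Everything here is illustrative: the names `HASH_SUITE`, `tag`, and `verify` are made up, and the ids 0x01 (sha1) and 0x02 (sha256) are example values, not an assigned registry.

```python
import hashlib
import hmac

# Hypothetical blessed table: one-byte id -> (name, constructor, digest size).
HASH_SUITE = {
    0x01: ("sha1", hashlib.sha1, 20),
    0x02: ("sha256", hashlib.sha256, 32),
}

def tag(func_id: int, data: bytes) -> bytes:
    """Hash data and prepend the one-byte function id."""
    _name, ctor, _size = HASH_SUITE[func_id]
    return bytes([func_id]) + ctor(data).digest()

def verify(tagged: bytes, data: bytes) -> bool:
    """Check data against a tagged hash; ids outside the table are rejected."""
    if not tagged or tagged[0] not in HASH_SUITE:
        raise ValueError("unknown hash function id")
    _name, ctor, size = HASH_SUITE[tagged[0]]
    digest = tagged[1:]
    if len(digest) != size:
        raise ValueError("digest length does not match function id")
    # Constant-time comparison of the recomputed and stored digests.
    return hmac.compare_digest(ctor(data).digest(), digest)

tagged = tag(0x01, b"foo")
print(tagged.hex())
print(verify(tagged, b"foo"))
```

Dispatch goes through the table only, so a value carrying an unlisted id fails loudly instead of selecting an arbitrary function.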

So what would this look like? For example, suppose sha1 is 0x01

# name prefixed
sha1-0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33 # hex
sha1-bpxmpnpkh4h5xsk5bxkh6pc3yj25vcrt # base32
sha1-aef1qioaay9wkladm4f7a6nubty # base58
sha1-c+7hteo/d9vjxq3ufzxbwnxaijm # base64

sha256-2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae # hex
sha256-fqtli23i77di76m3iu6b2mcbgqjuellqmsb37ihzrjpiqytg46xa # base32
sha256-3ymapqcucjxdwprbjfr5mjcpthqfg8pux1txqrem35jj # base58
sha256-lca0a2j/xo/5m0u8htbbnbnclxbkg7+g+ypeigjm564 # base64

# id prefixed (`0x01` for sha1, and `0x02` for sha256)
010beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33 # sha1 in hex
aef65r5v5i7q7w6jlug5i7z4lpbhlwukgm # sha1 in base32
4jvyy9wgauheuckrj7szdd2e1vqs # sha1 in base58
aqvux7xqpw/byv0n1h88w8j12ooz # sha1 in base64

022c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae # sha256 in hex
aiwcnndlnd74nd7ztnctyhjqie2bgqrnobsihp5a7gff5cdcm3t24 # sha256 in base32
eryupqi6npzkezrsc1mgaaorxh7tkyy6v7nc8h5t4zeh # sha256 in base58
aiwmtgto/8ap+ztfpb0wqtqtqi1wzio/opmkxohizueu # sha256 in base64
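The id-prefixed forms can be generated by prepending the id byte to the raw digest and only then encoding. The hex digests above are the SHA-1 and SHA-256 of the ASCII string foo, so that is the input used here. This is a rough sketch with assumptions: `encodings` and `b58encode` are illustrative helpers, padding is stripped, base32 is lowercased, and base58 uses the Bitcoin alphabet (the standard library has no base58).

```python
import base64
import hashlib

# Assumption: Bitcoin-style base58 alphabet.
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58encode(data: bytes) -> str:
    """Encode bytes as base58; each leading zero byte becomes a '1'."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out

def encodings(func_id: int, digest: bytes) -> dict:
    """Prefix the digest with a one-byte id, then encode the whole value."""
    raw = bytes([func_id]) + digest
    return {
        "hex": raw.hex(),
        "base32": base64.b32encode(raw).decode().rstrip("=").lower(),
        "base58": b58encode(raw),
        "base64": base64.b64encode(raw).decode().rstrip("="),
    }

sha1_foo = encodings(0x01, hashlib.sha1(b"foo").digest())
sha256_foo = encodings(0x02, hashlib.sha256(b"foo").digest())
for name, value in sha1_foo.items():
    print(value, "# sha1 in", name)
```

Because the id is part of the bytes being encoded, the result is always a valid string in the chosen encoding. Note that base58 and base64 output is case-sensitive.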

Pros:

  • hash values are consistent with the encoding :)
  • shorter

Cons:

  • numbers are hard to human-read. This is a weaker objection, as some ids would quickly become recognizable, e.g. 0x01 is sha1, 0x02 is sha256, etc.
  • prefixing bytes makes encoded values change altogether. :(

On varints

Ideally, for proper future-proofing, we want a varint. Note, though, that varints are annoying to parse and slower than fixed-width ints. There are so few "widely used...hash functions" that we may be able to get away with one byte. Luckily, we can wait until we reach 127 functions before we have to decide which one :)
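To make the trade-off concrete, here is a minimal sketch of one possible varint scheme. It assumes the unsigned little-endian base-128 format (7 payload bits per byte, high bit set when more bytes follow); the issue does not pick a scheme, and `encode_varint` and `decode_varint` are illustrative names.

```python
from typing import Tuple

def encode_varint(n: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("varints here are unsigned")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)          # high bit clear: last byte
            return bytes(out)

def decode_varint(buf: bytes) -> Tuple[int, int]:
    """Decode a varint from the start of buf; return (value, bytes consumed)."""
    value = 0
    shift = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")

print(encode_varint(0x01).hex())   # ids up to 127 stay a single byte
print(encode_varint(300).hex())    # larger ids spill into a second byte
print(decode_varint(encode_varint(300) + b"digest bytes"))
```

Under this scheme every id up to 127 encodes to the same single byte a fixed-width prefix would use, so values written today would still parse if varints were adopted later.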

We may be able to repurpose UTF-8 implementations for this.

Random UTF-8 question: why do the subsequent bytes waste two bits each? See the '10' prefix below.

First      Last        Bytes  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
U+0000     U+007F      1      0xxxxxxx
U+0080     U+07FF      2      110xxxxx  10xxxxxx
U+0800     U+FFFF      3      1110xxxx  10xxxxxx  10xxxxxx
U+10000    U+1FFFFF    4      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
U+200000   U+3FFFFFF   5      111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
U+4000000  U+7FFFFFFF  6      1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

From http://en.wikipedia.org/wiki/UTF-8#Description

Is it to keep the code point ranges nice and rounded-ish?
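One commonly cited reason is self-synchronization: since every continuation byte starts with '10' and no lead byte does, a reader dropped at an arbitrary offset can find the start of the current character by backing up. A small sketch of that property (`is_continuation` and `char_start` are illustrative helpers, not a library API):

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000

def char_start(buf: bytes, pos: int) -> int:
    """Back up from pos to the first byte of the character containing it."""
    while pos > 0 and is_continuation(buf[pos]):
        pos -= 1
    return pos

data = "a€b".encode("utf-8")              # '€' is U+20AC, three bytes
print([format(b, "08b") for b in data])   # lead byte 1110xxxx, then two 10xxxxxx
print(char_start(data, 3))                # offset 3 is mid-character
```

A hash-function id read from the very start of a value may not need this property, which is part of why a denser varint layout is an option there.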
