A hash function primer I: SHA-1

Today I am starting a new series on hash algorithms, beginning with SHA-1. The goal is to review the algorithm and watch a data example being digested. It can be used as a reference and to test your own implementations for correctness.

So what is a hash function? Well, a hash function maps data of arbitrary length onto a data space of fixed length. For example, mapping every string of characters to its first character. That means “Dolan” is mapped to “D”, “Elmu Fadd” to “E”, “bogs” to “b” etc. This example, apart from being very unimaginative, has the problem that one can easily find two inputs that will produce the same hash, e.g. “Dolan” and “Daisu”. This peculiar property is known as a hash collision and is extremely undesirable when it comes to cryptographic applications, like signatures and authentication.

A lot of better suggestions for hash function have been proposed over the last decades. Among those is SHA-1, which has still a lot of relevance today. Although a theoretical collision attack is known since 2011, there is no published collision available. Instead of repeating all the trivia about SHA-1 I will skip right to the algorithm itself.

In the first step the input m (sometimes called the message) has to be padded such that the total length is a multiple of 64 bytes. It has to be padded by at least one bit and 8 bytes. If the message is already a multiple of 64 bytes to begin with, you need to append another 64 bytes. Let’s assume that m is a multiple of 8 bits long and that the number of bytes is l. That is usually the case. If (l+1)%64 is smaller than 56 then you need to append 65-((l+1)%64) bytes and otherwise 129-((l+1)%64) bytes.
Consider an example

m = "Teh nawt so kwik bronw bogs jumpz ovr teh lzy pruto with an caret."

The length l in bytes is 66. Then (l+1)%64 = 3 and therefore we need to extend m by 65-3 = 62 bytes. The first bit of the pad (or the extension) is set to one. The last 8 bytes of the pad are set equal to the length (in bits) of the message. The rest of the pad is filled with zeroes.
So here is our example in binary representation before padding:

01010100 01100101 01101000 00100000 01101110 01100001 01110111 01110100
00100000 01110011 01101111 00100000 01101011 01110111 01101001 01101011
00100000 01100010 01110010 01101111 01101110 01110111 00100000 01100010
01101111 01100111 01110011 00100000 01101010 01110101 01101101 01110000
01111010 00100000 01101111 01110110 01110010 00100000 01110100 01100101
01101000 00100000 01101100 01111010 01111001 00100000 01110000 01110010
01110101 01110100 01101111 00100000 01110111 01101001 01110100 01101000
00100000 01100001 01101110 00100000 01100011 01100001 01110010 01100101
01110100 00101110

The padded message looks like

01010100 01100101 01101000 00100000 01101110 01100001 01110111 01110100
00100000 01110011 01101111 00100000 01101011 01110111 01101001 01101011
00100000 01100010 01110010 01101111 01101110 01110111 00100000 01100010
01101111 01100111 01110011 00100000 01101010 01110101 01101101 01110000
01111010 00100000 01101111 01110110 01110010 00100000 01110100 01100101
01101000 00100000 01101100 01111010 01111001 00100000 01110000 01110010
01110101 01110100 01101111 00100000 01110111 01101001 01110100 01101000
00100000 01100001 01101110 00100000 01100011 01100001 01110010 01100101
01110100 00101110 10000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000010 00010000

It is now 128 bytes long. The complete last line 0000000000000000000000000000000000000000000000000000001000010000 is the binary representation of 512+16=528, i.e. the bit length of our example string. Now step number one is complete.

The final hash will be a concatenation of five integers h0, h1, h2, h3, h4. At the very beginning they are initialized as (in hexadecimal)

h0 = 0x67452301
h1 = 0xEFCDAB89
h2 = 0x98BADCFE
h3 = 0x10325476
h4 = 0xC3D2E1F0

You may call this an initialization vector. The values will change throughout the algorithm.

For the rest of the algorithm the padded message is subsequently processed in blocks of 64 bytes, which is now possible by construction. Our example has two of those blocks. Let’s consider the first block:

01010100 01100101 01101000 00100000 01101110 01100001 01110111 01110100
00100000 01110011 01101111 00100000 01101011 01110111 01101001 01101011
00100000 01100010 01110010 01101111 01101110 01110111 00100000 01100010
01101111 01100111 01110011 00100000 01101010 01110101 01101101 01110000
01111010 00100000 01101111 01110110 01110010 00100000 01110100 01100101
01101000 00100000 01101100 01111010 01111001 00100000 01110000 01110010
01110101 01110100 01101111 00100000 01110111 01101001 01110100 01101000
00100000 01100001 01101110 00100000 01100011 01100001 01110010 01100101

On now constructs a list of 80 integers, each of size 32 bit. The first 16 entries consist just of our block, where one simply regroups the bit sequences, such that the entries 0 to 15 read

00: 01010100011001010110100000100000
01: 01101110011000010111011101110100
02: 00100000011100110110111100100000
03: 01101011011101110110100101101011
04: 00100000011000100111001001101111
05: 01101110011101110010000001100010
06: 01101111011001110111001100100000
07: 01101010011101010110110101110000
08: 01111010001000000110111101110110
09: 01110010001000000111010001100101
10: 01101000001000000110110001111010
11: 01111001001000000111000001110010
12: 01110101011101000110111100100000
13: 01110111011010010111010001101000
14: 00100000011000010110111000100000
15: 01100011011000010111001001100101

The other entries 16 to 80 are calculated according to

entry(i) = lrotby1(entry[i-3]^entry[i-8]^entry[i-14]^entry[i-16])

where the ‘^‘ stands for the ‘exclusive or’ and ‘lrotby1‘ makes a cyclic rotation by one bit to the left. That means that the leftmost bit is cut from the expression and prepended at the beginning. Example: for the 16th integer one needs


  01110111011010010111010001101000^
  01111010001000000110111101110110^
  00100000011100110110111100100000^
  01010100011001010110100000100000
= 01111001010111110001110000011110

The cyclic permutation of the bits gives 11110010101111100011100000111100. Ok. So let’s give the full list from 16 on:

16: 11110010101111100011100000111100
17: 10101110101011100000100010110100
18: 00010110101000000000011010100000
19: 00011101001111000000001010001111
20: 00101001101111101100110110110111
21: 11001011100101100111111110110100
22: 01010000001101001110000111110010
23: 10100101000101010100110110001110
24: 01010110010100001000100100001000
25: 11101011001101011101101010100011
26: 01011101110000101001000011101001
27: 10001010010010110001111100111010
28: 00101111001111000010110000101001
29: 00000100101110011101001010100001
30: 00010001010000010101000110101000
31: 10001111110011000011011011101100
32: 01101101111011101100101001101011
33: 10010011110011010000001001100000
34: 11011010001000001101101000100101
35: 01100010000111110101000011010100
36: 10001010111101100000010000011001
37: 01100000001101000111010101111101
38: 11101010011101001101001100001100
39: 10010110001101010100101110110000
40: 00001100100100010100110111101110
41: 00110001100011100010100111101010
42: 01111101110101100101101010101010
43: 11000000111110011010000101000011
44: 00001010000010101010000011100101
45: 00101101001011111001011100110101
46: 10101100010001011101001100011000
47: 00000000011111011011111110110011
48: 00101100111000011001010100101011
49: 11011000001100110101000010001100
50: 01011010111110100111011001001010
51: 11011100011001100010001110000011
52: 01100101011101100100111011111001
53: 00000011101010011011111101100101
54: 00101101100011001101110011110011
55: 10000101011000010010011000100001
56: 10111100000111100111101000010100
57: 00001001100100000000100110101100
58: 01010000100011110101010001001001
59: 00011011010111001101111111000011
60: 10010101010100100110100101010001
61: 11111100111010011000011101010100
62: 01101100111010001000101000000111
63: 10010000111110110100000010011111
64: 01101101110110000011110001000010
65: 11000010010110111110000101001000
66: 11111111111100000101100011001011
67: 01010010100101101111111011001111
68: 00111111111001100011010000100110
69: 00001011101000101000110110110111
70: 01011111110110011010010001011111
71: 01000111110110001011011001101000
72: 00010101110101110011111101010001
73: 00011110100111010010011011110001
74: 11111011111010111010011101110110
75: 01000001111010010011001000010011
76: 10110001100000111110001100000011
77: 00111000101101111101101000010101
78: 00111110000000000100000000010010
79: 01001001111101111110100101111001

All right, so that was kind of preparatory. Now we need 7 integer variables and call them

a, b, c, d, e, f, t

We initialize

a = h0
b = h1
c = h2
d = h3
e = h4

t is a temporary variable. Now there will be 80 operations on a, b, c, d, e making use of our 80 element list we just prepared.

  • In the first 20 rounds (i = 0, ..., 19)
    t = lrotby5(a) + ((b & c) | ((~b) & d)) + e + 0x5A827999 + listof80[i]
    e = d
    d = c
    c = lrotby30(b)
    b = a
    a = t

    lrotby30 and lrotby5 cyclically rotate 30 and 5 bits to the left respectively.

  • The next 20 rounds (i= 20, ..., 39) continue with
    t = lrotby5(a) + (b ^ c ^ d) + e + 0x6ED9EBA1 + listof80[i]
    e = d
    d = c
    c = lrotby30(b)
    b = a
    a = t

    It is similar to the first 20 rounds. The difference lies in the red contributions of the sum.

  • The same is true for the rounds i = 40, ..., 59:
    t = lrotby5(a) + ((b & c) | (b & d) | (c & d)) + e + 0x8F1BBCDC + listof80[i]
    e = d
    d = c
    c = lrotby30(b)
    b = a
    a = t
  • The rounds i = 60, ..., 79 are again the same as in the first sector:
    t = lrotby5(a) + (b ^ c ^ d) + e + 0xCA62C1D6 + listof80[i]
    e = d
    d = c
    c = lrotby30(b)
    b = a
    a = t
  • Then the values a, ..., e are added to h0, ..., h4
    h0 += a
    h1 += b
    h2 += c
    h3 += d
    h4 += e

And one continues with the next block with the newly acquired starting values.

Now back to our example. Initially a, ... e are equal to h0, ..., h4, where the binary representation reads

a = 01100111010001010010001100000001
b = 11101111110011011010101110001001
c = 10011000101110101101110011111110
d = 00010000001100100101010001110110
e = 11000011110100101110000111110000

After the first round (i=0) one gets

ROUND 0
a = 11110100000110100000000011010011
b = 01100111010001010010001100000001
c = 01111011111100110110101011100010
d = 10011000101110101101110011111110
e = 00010000001100100101010001110110

So e is now the old d, d is now the old c, b is now the old a, c is obtained by the old b and a rotation of 30 bits to the left. A little less trivial is a, for which one needs

lrotby5(a) + ((b & c) | ((~b) & d)) + e + 0x5A827999 + listof80[0] (old values)

Let us walk through the second contribution

(b & c) = 11101111110011011010101110001001 &
          10011000101110101101110011111110
        = 10001000100010001000100010001000

(~b) & d) = 00010000001100100101010001110110 &
            00010000001100100101010001110110
          = 00010000001100100101010001110110

Bitwise ORing the two expressions

  10001000100010001000100010001000 |
  00010000001100100101010001110110
= 10011000101110101101110011111110

OK. Now we need to add

     11101000101001000110000000101100 [=lrotby5(a)]
   + 10011000101110101101110011111110 [=((b & c) | ((~b) & d))]
   + 11000011110100101110000111110000 [=e]
   + 01011010100000100111100110011001 [=0x5A827999]
   + 01010100011001010110100000100000 [=listof80[0]]
=   110000001010111110011110100101010 [= lrotby5(a)+((b & c)|((~b)&d))]
  + 100011110010101010101101110001001 [=e+0x5A827999]
  +  01010100011001010110100000100000 [=listof80[0]]
=   110000001010111110011110100101010
  + 101110010101110101100001110101001 [=e+0x5A827999+listof80[0]]
=  1011110100000110100000000011010011

The result is too large to be stored in a 32bit integer, so the leftmost two bits are simply cut away and thus one obtains the above value for a. In C/C++ you can simply add the stuff provided you use unsigned integers. You will produce an overflow, but the result is defined just in the way we want it. If (for some reason) you used signed integers, the overflow behavior is not defined and will produce compiler dependent results. Make sure you understand this point! Now one repeats this process according the above prescription until

...
ROUND 19
a = 10001010001111000110001010011001
b = 00110000010111100001101001100001
c = 11011011101111101000110100111001
d = 11100010011001111010101011110100
e = 00101111111110111101111101100011
ROUND 20
a = 00011001101010000010100110011000
b = 10001010001111000110001010011001
c = 01001100000101111000011010011000
d = 11011011101111101000110100111001
e = 11100010011001111010101011110100
...
ROUND 39
a = 01100110001000011100000000011010
b = 10110110110110111000111001111100
c = 00101000110010100000001010111011
d = 00101010111101000011001011100110
e = 10111011101001110101111100111010
ROUND 40
a = 01000110011001100111000001001110
b = 01100110001000011100000000011010
c = 00101101101101101110001110011111
d = 00101000110010100000001010111011
e = 00101010111101000011001011100110
...
ROUND 59
a = 11111010111110000110001110001011
b = 00000010110000110001001101111011
c = 01110100110001010001001101010100
d = 10101010111011110101111010011100
e = 01100010111010111110000110010000
ROUND 60
a = 11111110100101101101110011101001
b = 11111010111110000110001110001011
c = 11000000101100001100010011011110
d = 01110100110001010001001101010100
e = 10101010111011110101111010011100
...
ROUND 79
a = 11011101010011011111101110101111
b = 00010011111101100110011101100011
c = 00101000110011000001101111110000
d = 00101101010110100011100011101111
e = 10110000010010101110000110100100

I have indicated the thresholds where the algorithm changes pace.
Then one adds (and cut the bits that exceed the 32 bit integer range)

h0 + a =  01100111010001010010001100000001
        + 11011101010011011111101110101111
       = 101000100100100110001111010110000
h0 = 01000100100100110001111010110000 = 44931EB0
h1 = h1 + b = 03C412EC
h2 = h2 + c = C186F8EE
h3 = h3 + d = 3D8C8D65
h4 = h4 + e = 741DC394

Fine. We still have a second block to do! It is the second half of our padded message

01110100 00101110 10000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000010 00010000

Our “list of 80 integers” then reads

00: 01110100001011101000000000000000
01: 00000000000000000000000000000000
02: 00000000000000000000000000000000
03: 00000000000000000000000000000000
04: 00000000000000000000000000000000
05: 00000000000000000000000000000000
06: 00000000000000000000000000000000
07: 00000000000000000000000000000000
08: 00000000000000000000000000000000
09: 00000000000000000000000000000000
10: 00000000000000000000000000000000
11: 00000000000000000000000000000000
12: 00000000000000000000000000000000
13: 00000000000000000000000000000000
14: 00000000000000000000000000000000
15: 00000000000000000000001000010000
16: 11101000010111010000000000000000
17: 00000000000000000000000000000000
18: 00000000000000000000010000100000
19: 11010000101110100000000000000001
20: 00000000000000000000000000000000
21: 00000000000000000000100001000000
22: 10100001011101000000000000000011
23: 00000000000000000000010000100000
24: 11010000101110100001000010000001
25: 01000010111010000000000000000111
26: 00000000000000000000000000000000
27: 00000000000000000010000100000000
28: 10000101110100000000000000001110
29: 00000000000000000001010010100000
30: 10010010010100100100001000000110
31: 00001011101000000000110001111101
32: 01110001110011100000000000000010
33: 00000000000000001000010000000000
34: 00010111010000000001000010111010
35: 01000010111010000101001010000111
36: 01001001010010010000100000011010
37: 00101110100000000001000011110100
38: 01000010111010000000000000000111
39: 00000000000000100000000010000000
40: 00011111111010000000000011101111
41: 00000000000000010100101000000000
42: 00100101001001000010000001101001
43: 10111010000000001100111110010000
44: 10111101100101000000000000100100
45: 00000000000010000101000010000000
46: 00110110111010010001101100100110
47: 01101100011011010001000110110011
48: 11110010110111001000000110101110
49: 11101000000000010000111101000010
50: 00101110100000000101001011110100
51: 01001001011010000001100010011010
52: 10111100011010000000111011110110
53: 00000000000101000010010000000000
54: 01000101000000100000011010101000
55: 10100000000011001010101110001011
56: 10010000000010010000101001010001
57: 00101110000001010001100011110100
58: 00101100011110011011001001100100
59: 11000110110100110001101110110110
60: 00110010001000000001101000000000
61: 10000000000100011010111010101110
62: 10001111110010010000111100101100
63: 00101100100000010101111011110100
64: 10011000100010001110111101001011
65: 00000001010010100000000000000000
66: 00100100001000010110000100100101
67: 00101110010011111001000011001110
68: 10010100000000000010010010111101
69: 00001000010100001000000000000000
70: 11101001000110110010011000110110
71: 01101101000100011001001001101100
72: 01011001010100011010111011111100
73: 00000001000011100100101011101000
74: 10101110110100101011011001011010
75: 01100011101110010001011001010100
76: 01010001110011101111011011110010
77: 00010100001001011000110000000000
78: 00111011110001111010000000001011
79: 00100010001010000111010011010100

I think we can be a little less detailed now. The whole machinery runs again with now updated values of h0, ..., h4. Ok, I show one more step:

INITALLY
a = 01000100100100110001111010110000
b = 00000011110001000001001011101100
c = 11000001100001101111100011101110
d = 00111101100011001000110101100101
e = 01110100000111011100001110010100
ROUND 0
a = 00010010101111110011000100100010
b = 01000100100100110001111010110000
c = 00000000111100010000010010111011
d = 11000001100001101111100011101110
e = 00111101100011001000110101100101
...
ROUND 79
a = 01101001011101101000110110001100
b = 01111010100001011100101010100011
c = 00010011111001110100001010111110
d = 10001111010110001010011111101111
e = 01111001110101011010101010011001

and finally one simply concatenates the five hs to obtain in hexadecimal representation (usually lowercase)

(h0, h2, h2, h3, h4) = ae09ac3c7e49dd8fd56e3baccce53554edf36e2d

which is the final hash:

SHA-1("Teh nawt so kwik bronw bogs jumpz ovr teh lzy pruto with an caret.") = ae09ac3c7e49dd8fd56e3baccce53554edf36e2d

Note that this blockwise treatment of the input, i.e. starting from an initialization vector, scrambling with the first block of the message, feeding the output into the scrambling with the second block, etc. is called a Merkle–Damgård construction. It also underlies other hash functions, like MD5 or SHA-2. Let’s call it a day. We have seen how a hash function of practical relevance for today works. If you like, you can go ahead and implement it in a language of your choice. Maybe I will discuss more properties of SHA-1/Merkle–Damgård in a more general context of this series. Stay tuned.

Advertisements

About goobypl5

pizza baker, autodidact, particle physicist
This entry was posted in Hash functions and tagged , , , , . Bookmark the permalink.

Share your thoughts

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s