WoW:StringHash/Analysis: Difference between revisions
(Typo fixing, typos fixed: abberations → aberrations, intepreted → interpreted AWB) |
m (Move page script moved page StringHash/Analysis to StringHash/Analysis without leaving a redirect) |
||
(One intermediate revision by one other user not shown) | |||
Line 1: | Line 1: | ||
<small>[[User Defined Functions]] » [[StringHash]] » Analysis</small> | |||
''This is an analysis of bit transmutation and distribution patterns of the [[StringHash]] function, mainly in comparison to the "Java hash", a variant of Dan Bernstein's old "33" hash posted long ago in the comp.lang.c newsgroup.'' | |||
= Reference code = | |||
<div style="margin-left: 3%;"> | |||
== Java hash == | |||
<div style="margin-left: 3%;"> | |||
for i = 1, len do | |||
counter = (counter*31) + string.byte(text, i) | |||
counter = mod(counter, 4294967296); -- 2^32 | |||
end; | |||
return mod(counter, 4294967296); -- 2^32 | |||
This is the function implemented in the Java [http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#hashCode() String.hashCode] library call. | |||
The multiplication by 31 basically translates to "new = old<<5 - old", i.e. you never completely lose data from earlier bytes as you would by only shifting data up. | |||
The relation to Bernstein's original "33" function is pretty clear; it simply uses 33 instead of 31. | |||
The mod() is necessary in Lua, otherwise the value turns floating-point and you lose precision after ~10 characters; Lua, at least under x86, uses 64-bit ("double-precision") IEEE 764 floats, which have a mantissa of 52 bits. | |||
This function would be slightly better with a prime modulus, but 2^32 is chosen as a modulus to emulate 32-bit integer math as used in the Java SDK. | |||
</div> | |||
== StringHash == | |||
<div style="margin-left: 3%;"> | |||
for i = 1, len, 3 do | |||
counter = mod(counter*8161, 4294967279) + -- 2^32 - 17: Prime! | |||
(string.byte(text,i)*16776193) + | |||
((string.byte(text,i+1) or (len-i+256))*8372226) + | |||
((string.byte(text,i+2) or (len-i+256))*3932164); | |||
end; | |||
return mod(counter, 4294967291); -- 2^32 - 5: Prime (and different from the prime in the loop) | |||
This string hash is ~30% faster than the Java hash due to working with 3 characters at a time; the modulus call is fairly expensive compared to the other operations. | |||
The multiplications are all chosen to affect 3 areas of the accumulator at a time: | |||
8161 = 8192 - 32 + 1 : 2<<13, 2<< 5, 2<<0 | |||
16776193 = 16777216 - 1024 + 1 : 2<<24, 2<<10, 2<<0 | |||
8372226 = 8388608 - 16384 + 2 : 2<<23, 2<<14, 2<<5 | |||
3932164 = 4194304 - 262144 + 4 : 2<<22, 2<<18, 2<<10 | |||
It is tempting to make the first multiplication much larger, but beware that the result of (4294967279+(256*29080583))*8161 must fit in a 52-bit mantissa without losing precision. The result is currently a 48-bit number so meets that contraint. | |||
</div> | |||
</div> | |||
= Tests performed and result analysis = | |||
<div style="margin-left: 3%;"> | |||
Below is an array of tests that both algorithms have been submitted to. For a hash analysis, these tests are fairly basic and the result analysis is somewhat unscientific, but they will do to illustrate the difference between the Java hash and StringHash, which is all this document set out to do. | |||
* [[#Performance test|Performance test]] - raw measurement of relative hash speed | |||
* [[#Bit distribution test|Bit distribution test]] - how often is each bit of the 32-bit output used? | |||
* [[#Collision frequency test|Collision frequency test]] - compare observed collision statistics to expected numbers | |||
* [[#Bit transmutation test|Bit transmutation test]] - determine how many bits are changed as a result of 1 input bit changing | |||
* [[#Collision finding test|Collision finding test]] - find and count collisions for exhaustive input sets | |||
== Performance test == | |||
<div style="margin-left: 3%;"> | |||
This test simply hashes the string "abcdefghijklmnopqrstuvwxyz" in a loop 20 000 times and clocks the start and stop time. The loop is included in the resulting time. | |||
* Java hash execution time: 1.719s | |||
* StringHash execution time: 1.313s | |||
=== Result analysis === | |||
<div style="margin-left: 3%;"> | |||
StringHash is slightly faster than the Java hash; likely not enough to matter, but certainly enough that the perceived "increased size" of StringHash can be discounted as an argument against it - it's actually faster! | |||
The reason is of course that the math.mod() call is relatively expensive compared to extracting characters from a string and performing multiplications. StringHash does 3 extract + multiply per mod() call. The Java hash in its standard implementation has to do one mod() call per character. | |||
On the testing computer, a P4 at ~3Ghz, StringHash hashes in the neighborhood of half a megabyte per second; certainly not bad for a scripted language using only standard math library operations! | |||
</div> | |||
</div> | |||
== Bit distribution test == | |||
<div style="margin-left: 3%;"> | |||
This test measures how many times each bit is 1 for a given set of inputs. The target value is for each bit to be 1 50% of the time, and 0 50% of the time, though there will obviously be deviations from this, which are expected and accepted, up to a certain point. | |||
In this test, we feed each hash the same set of inputs: 10000 random strings consisting of "aaaaaaaaa" followed by 15 random bytes. | |||
The results show the absolute difference between 50% and the result, i.e. 49.4 and 50.6 will both be reported as "0.6". Only differences of 0.5% or greater are shown (0.5% is in no way a ceiling of the expected difference; it just cuts down on the amount of output). | |||
=== Inputs 0-255 === | |||
<div style="margin-left: 3%;"> | |||
{| | |||
|- | |||
||Java||StringHash | |||
|- valign=top | |||
| | |||
2: 0.8% | |||
4: 0.6% | |||
9: 0.6% | |||
11: 0.5% | |||
13: 0.5% | |||
14: 0.5% | |||
15: 1.0% | |||
17: 1.1% | |||
20: 0.5% | |||
21: 0.9% | |||
23: 0.8% | |||
25: 0.9% | |||
28: 0.6% | |||
31: 0.7% | |||
32: 1.2% | |||
| | |||
2: 0.5% | |||
3: 0.5% | |||
7: 0.7% | |||
12: 0.6% | |||
15: 0.9% | |||
16: 1.0% | |||
17: 0.9% | |||
20: 0.7% | |||
21: 1.1% | |||
25: 0.6% | |||
29: 0.8% | |||
30: 0.6% | |||
31: 0.7% | |||
|} | |||
</div> | |||
=== Inputs 32-127 === | |||
<div style="margin-left: 3%;"> | |||
{| | |||
|- | |||
||Java||StringHash | |||
|- valign=top | |||
| | |||
2: 0.9% | |||
7: 0.7% | |||
8: 0.6% | |||
9: 0.7% | |||
10: 0.6% | |||
11: 0.6% | |||
12: 1.5% | |||
15: 0.8% | |||
16: 0.5% | |||
18: 0.7% | |||
22: 0.8% | |||
27: 0.5% | |||
29: 0.5% | |||
30: 0.6% | |||
32: 0.5% | |||
| | |||
1: 1.0% | |||
3: 0.5% | |||
6: 0.8% | |||
14: 0.6% | |||
19: 0.5% | |||
23: 0.5% | |||
25: 0.7% | |||
28: 0.5% | |||
|} | |||
</div> | |||
=== Result analysis === | |||
<div style="margin-left: 3%;"> | |||
Both algorithms perform roughly as expected. With these samples, StringHash performs a little bit better, but that is to be expected as it mixes more bits; either way, the differences are not really meaningful. The test only attempts to evaluate if all possible result bits are used efficiently without unexpected "holes" in the output. | |||
A more thorough test would attempt to find biases in different sets of bits, but that is really more meaningful for [[wikipedia:PRNG|PRNG]] testing and [[wikipedia:cryptographic hash function|secure hashes]] than it is for this type of light string hash. | |||
</div> | |||
</div> | |||
== Collision frequency test == | |||
<div style="margin-left: 3%;"> | |||
This test takes the result of 20000 random strings consisting of ASCII characters 0-255, and selects 15 bits for testing. The resulting value becomes 0--32767, which, for anything other than a [[wikipedia:perfect hash|perfect hash]], will result in a fair amount of collisions. We expect the frequency of collisions to match a [[wikipedia:Poisson distribution|Poisson distribution]] and compare the test results to it. | |||
=== Long random string 32-127 bits 1-15 === | |||
<div style="margin-left:3%;"> | |||
Java SH Expected | |||
1: 10940 10981 10863 | |||
2: 6552 6554 6630 | |||
3: 2043 1998 2023 | |||
4: 388 396 411 | |||
5: 65 65 62 | |||
6: 12 6 7 | |||
</div> | |||
=== Long random string 0-255 bits 1-15 === | |||
<div style="margin-left:3%;"> | |||
Java SH Expected | |||
1: 10717 10728 10863 | |||
2: 6796 6724 6630 | |||
3: 1995 2073 2023 | |||
4: 376 420 411 | |||
5: 110 55 62 | |||
6: 6 7 | |||
</div> | |||
=== Long random string 0-255 bits 6-20 === | |||
<div style="margin-left:3%;"> | |||
Java SH Expected | |||
1: 10788 10875 10863 | |||
2: 6768 6628 6630 | |||
3: 1989 1971 2023 | |||
4: 400 460 411 | |||
5: 55 60 62 | |||
6: 6 7 | |||
</div> | |||
=== Long random string 0-255 bits 12-26 === | |||
<div style="margin-left:3%;"> | |||
Java SH Expected | |||
1: 10917 10791 10863 | |||
2: 6630 6612 6630 | |||
3: 1881 2049 2023 | |||
4: 492 452 411 | |||
5: 80 90 62 | |||
6: 6 7 | |||
</div> | |||
=== Long random string 0-255 bits 18-32 === | |||
<div style="margin-left:3%;"> | |||
Java SH Expected | |||
1: 10828 10795 10863 | |||
2: 6590 6860 6630 | |||
3: 2127 1932 2023 | |||
4: 380 360 411 | |||
5: 75 35 62 | |||
6: 18 7 | |||
</div> | |||
=== Result analysis === | |||
<div style="margin-left:3%;"> | |||
Both hashes perform roughly as expected without showing too great variations over different bits; there are one or two aberrations in the higher collision counts, but that may just be the result of a too-small sample size. I will not investigate further as this test is only a rough indication of performance. | |||
A more complete analysis would include testing smaller modulii, and certainly ones that are not a power of 2. A modulus constructed around series involving 31 and 32 will most likely produce "interesting" results for the Java hash, but that is an expected side effect of its construction. | |||
</div> | |||
</div> | |||
== Bit transmutation test == | |||
<div style="margin-left: 3%;"> | |||
This test measures how many bits of the hash changes as a result of a single input bit changing, i.e. how close to an [[wikipedia:avalanche effect|avalanche effect]] we're getting. The result is described as the number of times a given number of bits have been seen to change for the entire run. | |||
The input starts out as a string of 0s (ASCII NULs), flipping bits on and off one at a time. Then a number of runs are executed starting out with random values. | |||
All the below tests work with 13-byte strings, and execute for 50 runs, i.e. one run based on ASCII NULs, and 49 runs with fresh random numbers each time. | |||
A good generic hash algorithm will tend toward flipping 50% of the bits for each input bit changed, though that is obviously unattainable. We do however expect a fair hump around 13-17 bits changed, and very low counts toward the extremes. 0 bits flipping may indicate a broken hash algorithm (the hash result didn't change at all as a result of changing the input). | |||
=== Java, flip all bits in bytes 1 through 13 === | |||
<div style="margin-left: 3%;"> | |||
0: 0 1: 173 2: 208 3: 194 | |||
4: 188 5: 159 6: 160 7: 150 | |||
8: 200 9: 231 10: 228 11: 292 | |||
12: 343 13: 476 14: 503 15: 469 | |||
16: 418 17: 318 18: 210 19: 143 | |||
20: 69 21: 45 22: 11 23: 9 | |||
24: 2 25: 1 26: 0 27: 0 | |||
28: 0 29: 0 30: 0 31: 0 | |||
32: 0 | |||
This test does not show 0 bits changing, but the changes we do see are spread quite flat over the spectrum. There is a hump around 13-17 but not as marked as it should be. We get quite a few single bit changes, which is quite expected given the algorithm. The last few bytes in the string do not end up affecting very much of the end result. | |||
</div> | |||
=== StringHash, flip all bits in bytes 1 through 13 === | |||
<div style="margin-left: 3%;"> | |||
0: 0 1: 0 2: 0 3: 45 | |||
4: 70 5: 77 6: 71 7: 56 | |||
8: 45 9: 53 10: 115 11: 203 | |||
12: 327 13: 405 14: 533 15: 588 | |||
16: 620 17: 593 18: 499 19: 371 | |||
20: 259 21: 146 22: 76 23: 32 | |||
24: 12 25: 3 26: 1 27: 0 | |||
28: 0 29: 0 30: 0 31: 0 | |||
32: 0 | |||
StringHash has a much more pronounced hump around the 16 bits changed mark. There are no single bit changes, and no two-bit changes in the tests. This is expected as each bit changed is guaranteed to affect at least three bits of the hash, even for the last few bytes of the input. | |||
</div> | |||
=== Java, flip all bits in bytes 12 through 13 === | |||
<div style="margin-left: 3%;"> | |||
0: 0 1: 173 2: 208 3: 154 | |||
4: 108 5: 75 6: 53 7: 17 | |||
8: 6 9: 1 10: 3 11: 2 | |||
12: 0 (rest 0) | |||
Here is where the weakness of the hash shows the most: changes in the last few bytes affect relatively few bits, thereby increasing chances for collisions. | |||
</div> | |||
=== StringHash, flip all bits in bytes 12 through 13 === | |||
<div style="margin-left: 3%;"> | |||
0: 0 1: 0 2: 0 3: 45 | |||
4: 70 5: 77 6: 71 7: 56 | |||
8: 36 9: 26 10: 36 11: 40 | |||
12: 54 13: 51 14: 57 15: 49 | |||
16: 42 17: 39 18: 23 19: 15 | |||
20: 9 21: 3 22: 1 23: 0 | |||
24: 0 (rest 0) | |||
While not precisely shining compared to strong hashes, StringHash stirs quite a few more bits over a much larger spectrum. | |||
</div> | |||
=== Result analysis === | |||
<div style="margin-left: 3%;"> | |||
StringHash outperforms the Java hash by distributing bit changes over more result bits, thus reducing the collision chance. | |||
</div> | |||
</div> | |||
== Collision finding test == | |||
<div style="margin-left: 3%;"> | |||
This test exhaustively computes all possible values of a given input set and tests for collisions. The size of the input sets is unfortunately constrained by RAM and CPU, but it is believed to be a fair indication of hash strength. | |||
=== ASCII codes 64-89 (upper case A-Z), 3 bytes === | |||
<div style="margin-left: 3%;"> | |||
JavaHash | |||
17576 non-colliding values | |||
StringHash | |||
17576 non-colliding values | |||
</div> | |||
=== ASCII codes 64-89 (upper case A-Z), 4 bytes === | |||
<div style="margin-left: 3%;"> | |||
JavaHash | |||
456976 non-colliding values | |||
StringHash | |||
456976 non-colliding values | |||
</div> | |||
=== ASCII codes 32-127 (all printable), 2 bytes === | |||
<div style="margin-left: 3%;"> | |||
JavaHash | |||
62 non-colliding values | |||
62 instances of 2 inputs producing the same hash (124 strings) | |||
2638 instances of 3 inputs producing the same hash (7914 strings) | |||
279 instances of 4 inputs producing the same hash (1116 strings) | |||
StringHash | |||
9216 non-colliding values | |||
Already here, with only searching all possible combinations of 2 letters of printable ASCII, the Java hash is showing poor results. | |||
</div> | |||
=== ASCII codes 32-127 (all printable), 3 bytes === | |||
<div style="margin-left: 3%;"> | |||
This is about as exhaustive we can go in Lua without resulting to data compression techniques for the result sets: all possible combinations of 3 letters of printable ASCII for a total of 884736 outputs. | |||
JavaHash | |||
62 non-colliding values | |||
62 instances of 2 inputs producing the same hash (124 strings) | |||
1630 instances of 3 inputs producing the same hash (4890 strings) | |||
224 instances of 4 inputs producing the same hash (896 strings) | |||
62 instances of 5 inputs producing the same hash (310 strings) | |||
1630 instances of 6 inputs producing the same hash (9780 strings) | |||
62 instances of 7 inputs producing the same hash (434 strings) | |||
224 instances of 8 inputs producing the same hash (1792 strings) | |||
68606 instances of 9 inputs producing the same hash (617454 strings) | |||
5214 instances of 10 inputs producing the same hash (52140 strings) | |||
5214 instances of 11 inputs producing the same hash (57354 strings) | |||
9672 instances of 12 inputs producing the same hash (116064 strings) | |||
558 instances of 13 inputs producing the same hash (7254 strings) | |||
558 instances of 14 inputs producing the same hash (7812 strings) | |||
558 instances of 15 inputs producing the same hash (8370 strings) | |||
StringHash | |||
884736 non-colliding values | |||
</div> | |||
=== ASCII codes 32-127 (all printable), 3 bytes, with prefix === | |||
<div style="margin-left: 3%;"> | |||
The same input values as the 32-127 x 3 test, except all prefixed with the string "01234567890123456789". This is to test for loss of accuracy. | |||
JavaHash | |||
62 non-colliding values | |||
62 instances of 2 inputs producing the same hash (124 strings) | |||
1630 instances of 3 inputs producing the same hash (4890 strings) | |||
224 instances of 4 inputs producing the same hash (896 strings) | |||
62 instances of 5 inputs producing the same hash (310 strings) | |||
1630 instances of 6 inputs producing the same hash (9780 strings) | |||
62 instances of 7 inputs producing the same hash (434 strings) | |||
224 instances of 8 inputs producing the same hash (1792 strings) | |||
68606 instances of 9 inputs producing the same hash (617454 strings) | |||
5214 instances of 10 inputs producing the same hash (52140 strings) | |||
5214 instances of 11 inputs producing the same hash (57354 strings) | |||
9672 instances of 12 inputs producing the same hash (116064 strings) | |||
558 instances of 13 inputs producing the same hash (7254 strings) | |||
558 instances of 14 inputs producing the same hash (7812 strings) | |||
558 instances of 15 inputs producing the same hash (8370 strings) | |||
StringHash | |||
884736 non-colliding values | |||
</div> | |||
=== ASCII codes 32-127 (all printable), 3 bytes, with suffix === | |||
<div style="margin-left: 3%;"> | |||
The same input values as the 32-127 x 3 test, except all suffixed with the string "01234567890123456789". This is to test for loss of "old" information (data from earlier on in the string) in the mixing function. | |||
JavaHash | |||
62 non-colliding values | |||
62 instances of 2 inputs producing the same hash (124 strings) | |||
1630 instances of 3 inputs producing the same hash (4890 strings) | |||
224 instances of 4 inputs producing the same hash (896 strings) | |||
62 instances of 5 inputs producing the same hash (310 strings) | |||
1630 instances of 6 inputs producing the same hash (9780 strings) | |||
62 instances of 7 inputs producing the same hash (434 strings) | |||
224 instances of 8 inputs producing the same hash (1792 strings) | |||
68606 instances of 9 inputs producing the same hash (617454 strings) | |||
5214 instances of 10 inputs producing the same hash (52140 strings) | |||
5214 instances of 11 inputs producing the same hash (57354 strings) | |||
9672 instances of 12 inputs producing the same hash (116064 strings) | |||
558 instances of 13 inputs producing the same hash (7254 strings) | |||
558 instances of 14 inputs producing the same hash (7812 strings) | |||
558 instances of 15 inputs producing the same hash (8370 strings) | |||
StringHash | |||
884736 non-colliding values | |||
</div> | |||
=== ASCII codes 64-82 (A-S), 5 bytes === | |||
<div style="margin-left: 3%;"> | |||
The Lua memory churn becomes somewhat ugly here, consuming several hundred megabytes. Even attempting A-Z with the same length brings a machine with 1GB RAM to its knees. | |||
JavaHash | |||
2476099 non-colliding values | |||
StringHash | |||
2476099 non-colliding values | |||
</div> | |||
=== Result analysis === | |||
<div style="margin-left: 3%;"> | |||
The Java hash is shown to produce many more collisions even for rather simple inputs. The only cases where it behaves well are the A-Z cases; as long as the input subset is constricted to 32 unique values, it is collision free for 6 bytes (6 * 5 bits = 30 bits), in other words, it is a perfect hash for e.g. short strings consisting of only upper case ASCII letters A-Z. This is an expected property of the algorithm. Though taken with the rather abysmal results for more generic strings above (or, indeed, any sort of string longer than 6 bytes), the Java hash demonstrably makes for a poor choice for general input. | |||
</div> | |||
</div> | |||
</div> | |||
= Result summary = | |||
<div style="margin-left: 3%;"> | |||
In the two simplest cases: collision frequency and bit distribution, the two hashes seem comparable. But on comparing bit transmutation patterns, StringHash tends toward much better spreads of input bits to output bits. | |||
The theory is confirmed in the exhaustive collision search tests, where StringHash remains collision free for the small data sets used, while the Java hash produces more collisions (double, tripe, quadruple, etc) than it does collision-free results. | |||
Added to this, the Lua implementation of StringHash is slightly faster than the Lua implementation of the Java hash (at least in its most straight-forward implementation). | |||
</div> | |||
= Applicability to other languages = | |||
<div style="margin-left: 3%;"> | |||
[[StringHash]] was designed to be reasonably collision-resistant and fast in a Lua standard library environment. Other interpreted languages capable of 50-bit integer math (i.e. through using 64-bit floats) should also be able to make a fair use of it. | |||
Languages with access to lower level primitives (XOR, ANDs, rotations) would do better to look at e.g. Bob Jenkins' [http://www.burtleburtle.net/bob/c/lookup3.c lookup3] hash, or at least read his [http://www.burtleburtle.net/bob/hash/doobs.html page on hash analysis]. | |||
</div> | |||
= Appendix: Lua hash analysis code = | |||
<div style="margin-left: 3%;"> | |||
<div style="max-width: 80em; margin-right: 2em; height: 20em; overflow: scroll;"> | |||
mod=math.mod; | |||
-- This is the "Java hash", a variant of Dan Bernstein's old "33" hash posted long ago in the comp.lang.c newsgroup. | |||
-- The *31 basically translates to "new = old<<5 - old", i.e. you never completely lose data from earlier bytes as you would by only shifting | |||
-- The mod is necessary in Lua; otherwise the value turns floating-point and you lose precision after ~10 characters | |||
-- This function would be better with a prime modulus, but 2^32 is chosen as a mod to emulate the 32-bit math used in the original | |||
function JavaHash(text) | |||
local counter = 0; | |||
local len = string.len(text); | |||
for i = 1, len do | |||
counter = (counter*31) + string.byte(text, i) | |||
counter = mod(counter, 4294967296); -- 2^32 | |||
end; | |||
return mod(counter, 4294967296); -- 2^32 | |||
end | |||
-- This string hash is ~30% faster due to working with 3 characters at a time; the mod expression is fairly expensive | |||
-- The multiplications are all chosen to affect 3 areas of the accumulator at a time: | |||
-- 8161 = 8192 - 32 + 1 : 2<<13, 2<< 5, 2<<0 | |||
-- 16776193 = 16777216 - 1024 + 1 : 2<<24, 2<<10, 2<<0 | |||
-- 8372226 = 8388608 - 16384 + 2 : 2<<23, 2<<14, 2<<5 | |||
-- 3932164 = 4194304 - 262144 + 4 : 2<<22, 2<<18, 2<<10 | |||
function StringHash(text) | |||
local counter = 1; | |||
local len = string.len(text); | |||
for i = 1, len, 3 do | |||
counter = mod(counter*8161, 4294967279) + -- 2^32 - 17: Prime! (counter*8161 is at most a 48-bit number, which easily fits in the 52-bit mantissa of a double precision float) | |||
(string.byte(text,i)*16776193) + | |||
((string.byte(text,i+1) or (len-i+256))*8372226) + | |||
((string.byte(text,i+2) or (len-i+256))*3932164); | |||
end; | |||
return mod(counter, 4294967291); -- 2^32 - 5: Prime (and different from the prime in the loop) | |||
end | |||
-- Utility function: return number of bits changed between a and b | |||
function bitsdiff(a,b) | |||
local n=0; | |||
for i=1,32 do | |||
if(mod(a,2) ~= mod(b,2)) then n=n+1; end | |||
a=math.floor(a/2); | |||
b=math.floor(b/2); | |||
end | |||
return n; | |||
end | |||
function tcopy(to,from) | |||
for k,v in pairs(from) do | |||
to[k]=v; | |||
end | |||
end | |||
-- Bit transmutation test: Count how many bits change as a result of a single bit of input flipping | |||
function transtest(func,num, len, first, last) | |||
local bitsdiffhits = {} | |||
local origstrbytes = { }; | |||
for c=1,len do | |||
table.insert(origstrbytes, 0); | |||
end | |||
math.randomseed(2); | |||
for rep=1,num do | |||
local strbytes = {}; | |||
for c=first,last do | |||
tcopy(strbytes, origstrbytes); | |||
local bitval=1; | |||
for b=1,8 do | |||
local before = func(string.char(unpack(strbytes))); | |||
strbytes[c] = origstrbytes[c]; | |||
strbytes[c] = math.mod(origstrbytes[c] + bitval, 256); | |||
local after = func(string.char(unpack(strbytes))); | |||
local n = bitsdiff(before,after); | |||
bitsdiffhits[n] = (bitsdiffhits[n] or 0) + 1; | |||
bitval=bitval*2; | |||
end | |||
end | |||
for c=1,table.getn(origstrbytes) do | |||
origstrbytes[c]=math.random(0,255); | |||
end | |||
end | |||
local score=0; | |||
for i=0,32 do | |||
score = score - (bitsdiffhits[i] or 0)*(math.abs(16-i)/16); -- 16 = 50% of the bits, which is how many bits we want to change in a perfect distribution | |||
print(string.format(" %2u: %5u", i, bitsdiffhits[i] or 0)); | |||
end | |||
print(" SCORE: "..score); -- 0 is unimaginatively good. negative is bad. | |||
end | |||
-- Performance test: Just measure the time it takes to execute a hash on a fixed string lots of times (including the cost of the loop, obviously) | |||
function perftest(func, num) | |||
local x=os.clock(); | |||
for i=1,num do | |||
func("abcdefghijklmnopqrstuvwxyz"); | |||
end | |||
return os.clock()-x; | |||
end | |||
-- Count the number of single, double, triple etc collisions given a certain manipulation of the resultant hash: | |||
-- div by X, take modulus Y | |||
-- The modulus is there to create an artificial range clamping; testing would take too long otherwise. | |||
-- Should match expected Poisson distribution closely for an unbiased hash. | |||
function collisiontest(func, num, div,modulus, randfunc) | |||
local x = os.clock(); | |||
local outputs = {} | |||
local collisions = {} | |||
math.randomseed(1); | |||
for i=1,num do | |||
str = randfunc(); | |||
local n = math.mod(math.floor(func(str)/div), modulus); | |||
outputs[n] = (outputs[n] or 0)+1; | |||
end | |||
for _,n in pairs(outputs) do | |||
collisions[n] = (collisions[n] or 0) + n; | |||
end | |||
local y = os.clock(); | |||
local sortcol = {} | |||
for k,v in pairs(collisions) do | |||
table.insert(sortcol, k); | |||
end | |||
table.sort(sortcol); | |||
-- Compute expected Poisson distribution | |||
expect = {} | |||
lambda = num/modulus; | |||
fly = lambda * math.exp(-lambda); | |||
for c=1,10 do | |||
expect[c] = fly * modulus; | |||
fly = fly * lambda / c; | |||
end | |||
-- Output results and expected results | |||
for _,k in ipairs(sortcol) do | |||
print(string.format(" %2u: %5u (expected %5u)", k, collisions[k], expect[k] or -1)); | |||
end | |||
return y-x; | |||
end | |||
-- Bit distribution test: Measure number of times each bit is set, should average out to 50% chance in the end | |||
-- Will be presented as the absolute difference from 50% | |||
function bitdistribtest(func,num,lo,hi,dispmin) | |||
math.randomseed(1); | |||
local bitsset={}; | |||
for i=1,num do | |||
str = string.char(math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi), | |||
math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi), | |||
math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi)); | |||
local n = func(str); | |||
for k=1,32 do | |||
if(mod(n,2)==1) then | |||
bitsset[k] = (bitsset[k] or 0)+1; | |||
end | |||
n=math.floor(n/2); | |||
end | |||
end | |||
for k=1,32 do | |||
local diff = math.abs(50-(bitsset[k] or 0)/num*100); | |||
if(diff > dispmin) then | |||
print(string.format(" %2u: %.1f%%", k, diff)); | |||
end | |||
end | |||
end | |||
-- Collision finder test: Loop all possible values in given set and count collisions (yes, this is slow, and uses quite a bit of ram) | |||
function collisionfinder(func, lo, hi, bytes, prefix, suffix) | |||
local str; | |||
local coll={}; | |||
prefix = prefix or ""; | |||
suffix = suffix or ""; | |||
if(bytes==2) then | |||
for b=lo,hi do | |||
for a=lo,hi do | |||
str = prefix .. string.char(a,b) .. suffix; | |||
local n = func(str); | |||
coll[n] = (coll[n] or 0) + 1; | |||
end | |||
end | |||
elseif(bytes==3) then | |||
for c=lo,hi do | |||
for b=lo,hi do | |||
for a=lo,hi do | |||
str = prefix .. string.char(a,b,c) .. suffix; | |||
local n = func(str); | |||
coll[n] = (coll[n] or 0) + 1; | |||
end | |||
end | |||
end | |||
elseif(bytes==4) then | |||
for d=lo,hi do | |||
for c=lo,hi do | |||
for b=lo,hi do | |||
for a=lo,hi do | |||
str = prefix .. string.char(a,b,c,d) .. suffix; | |||
local n = func(str); | |||
coll[n] = (coll[n] or 0) + 1; | |||
end | |||
end | |||
end | |||
end | |||
elseif(bytes==5) then | |||
for e=lo,hi do | |||
for d=lo,hi do | |||
for c=lo,hi do | |||
for b=lo,hi do | |||
for a=lo,hi do | |||
str = prefix .. string.char(a,b,c,d,e) .. suffix; | |||
local n = func(str); | |||
coll[n] = (coll[n] or 0) + 1; | |||
end | |||
end | |||
end | |||
end | |||
end | |||
end | |||
local collcnt = {}; | |||
for k,n in coll do | |||
collcnt[n] = (collcnt[n] or 0) + 1; | |||
end | |||
for colls, count in pairs(collcnt) do | |||
if(colls==1) then | |||
print(string.format(" %5u non-colliding values", count)); | |||
else | |||
print(string.format(" %5u instances of %u inputs producing the same hash (%u strings)", count, colls, count*colls)); | |||
end | |||
end | |||
end | |||
--------------------------------------------------------------------------- | |||
-- Run the tests | |||
if(true) then | |||
local colls; | |||
local function longrand(lo,hi) | |||
return function() | |||
local ret = ""; | |||
while colls[ret] do | |||
ret = "aaaaaaaaa"..string.char(math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi), | |||
math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi), | |||
math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi)); | |||
end | |||
colls[ret] = true; | |||
return ret; | |||
end | |||
end | |||
print ""; | |||
print ""; | |||
print "Collision test - long random string 32-127 (20000 values into 32768 buckets)"; | |||
print "------------------------------------------" | |||
print("JavaHash:"); | |||
colls = {}; | |||
print(collisiontest(JavaHash, 20000, 1,32768, longrand(32,127))); | |||
print("StringHash:"); | |||
colls = {}; | |||
print(collisiontest(StringHash, 20000, 1,32768, longrand(32,127))); | |||
print ""; | |||
print "Collision test - long random string 0-255 (20000 values into 32768 buckets, bits 1-15)"; | |||
print "------------------------------------------" | |||
print("JavaHash:"); | |||
colls = {}; | |||
print(collisiontest(JavaHash, 20000, 1,32768, longrand(0,255))); | |||
print("StringHash:"); | |||
colls = {}; | |||
print(collisiontest(StringHash, 20000, 1,32768, longrand(0,255))); | |||
print ""; | |||
print "Collision test - long random string 0-255 (20000 values into 32768 buckets, bits 6-20)"; | |||
print "------------------------------------------" | |||
print("JavaHash:"); | |||
colls = {}; | |||
print(collisiontest(JavaHash, 20000, math.pow(2,5),32768, longrand(0,255))); | |||
print("StringHash:"); | |||
colls = {}; | |||
print(collisiontest(StringHash, 20000, math.pow(2,5),32768, longrand(0,255))); | |||
print ""; | |||
print "Collision test - long random string 0-255 (20000 values into 32768 buckets, bits 12-26)"; | |||
print "------------------------------------------" | |||
print("JavaHash:"); | |||
colls = {}; | |||
print(collisiontest(JavaHash, 20000, math.pow(2,11),32768, longrand(0,255))); | |||
print("StringHash:"); | |||
colls = {}; | |||
print(collisiontest(StringHash, 20000, math.pow(2,11),32768, longrand(0,255))); | |||
print ""; | |||
print "Collision test - long random string 0-255 (20000 values into 32768 buckets, bits 18-32)"; | |||
print "------------------------------------------" | |||
print("JavaHash:"); | |||
colls = {}; | |||
print(collisiontest(JavaHash, 20000, math.pow(2,17),32768, longrand(0,255))); | |||
print("StringHash:"); | |||
colls = {}; | |||
print(collisiontest(StringHash, 20000, math.pow(2,17),32768, longrand(0,255))); | |||
end | |||
if(true) then | |||
print ""; | |||
print ""; | |||
print "Performance test"; | |||
print "-----------------" | |||
print("JavaHash speed: ", perftest(JavaHash, 20000)); | |||
print("StringHash speed: ", perftest(StringHash, 20000)); | |||
end | |||
if(true) then | |||
print ""; | |||
print ""; | |||
print "Bit transmutation test (how many bits change if a single bit changes?)"; | |||
print "----------------------"; | |||
print(""); | |||
print("JavaHash 13 bytes total, change 1-13:"); | |||
transtest(JavaHash,50, 13, 1,13); | |||
print(""); | |||
print("StringHash 13 bytes total, change 1-13:"); | |||
transtest(StringHash,50, 13, 1,13); | |||
print(""); | |||
print("JavaHash 13 bytes total, change 1-2:"); | |||
transtest(JavaHash,50, 13, 1,2); | |||
print(""); | |||
print("StringHash 13 bytes total, change 1-2:"); | |||
transtest(StringHash,50, 13, 1,2); | |||
print(""); | |||
print("JavaHash 13 bytes total, change 12-13:"); | |||
transtest(JavaHash,50, 13, 12,13); | |||
print(""); | |||
print("StringHash 13 bytes total, change 12-13:"); | |||
transtest(StringHash,50, 13, 12,13); | |||
end | |||
if(true) then | |||
print ""; | |||
print ""; | |||
print "Bit distribution test (what's the chance of each bit being set? expressed as abs diff from 50%)"; | |||
print "---------------------"; | |||
print(""); | |||
print("JavaHash 0-255, 10000 loops: (showing diffs>0.4%)"); | |||
bitdistribtest(JavaHash,10000,0,255, 0.4); | |||
print(""); | |||
print("StringHash 0-255, 10000 loops: (showing diffs>0.4%)"); | |||
bitdistribtest(StringHash,10000,0,255, 0.4); | |||
print(""); | |||
print("JavaHash 32-127, 10000 loops: (showing diffs>0.4%)"); | |||
bitdistribtest(JavaHash,10000,32,127, 0.4); | |||
print(""); | |||
print("StringHash 32-127, 10000 loops: (showing diffs>0.4%)"); | |||
bitdistribtest(StringHash,10000,32,127, 0.4); | |||
end | |||
if(true) then | |||
print ""; | |||
print ""; | |||
print "Collision finder (run through all possible values of a set and count collisions)"; | |||
print "----------------"; | |||
print(""); | |||
print("JavaHash, 64-89 x 3 bytes"); | |||
collisionfinder(JavaHash, 64, 89, 3); | |||
print(""); | |||
print("StringHash, 64-89 x 3 bytes"); | |||
collisionfinder(StringHash, 64, 89, 3); | |||
print(""); | |||
print(""); | |||
print("JavaHash, 64-89 x 4 bytes"); | |||
collisionfinder(JavaHash, 64, 89, 4); | |||
print(""); | |||
print("StringHash, 64-89 x 4 bytes"); | |||
collisionfinder(StringHash, 64, 89, 4); | |||
print(""); | |||
print(""); | |||
print("JavaHash, 32-127 x 2 bytes"); | |||
collisionfinder(JavaHash, 32, 127, 2); | |||
print(""); | |||
print("StringHash, 32-127 x 2 bytes"); | |||
collisionfinder(StringHash, 32, 127, 2); | |||
print(""); | |||
print(""); | |||
print("JavaHash, 32-127 x 3 bytes"); | |||
collisionfinder(JavaHash, 32, 127, 3); | |||
print(""); | |||
print("StringHash, 32-127 x 3 bytes"); | |||
collisionfinder(StringHash, 32, 127, 3); | |||
print(""); | |||
print(""); | |||
print("JavaHash, 32-127 x 3 bytes, prefix '01234567890123456789'"); | |||
collisionfinder(JavaHash, 32, 127, 3, "01234567890123456789"); | |||
print(""); | |||
print("StringHash, 32-127 x 3 bytes, prefix '01234567890123456789'"); | |||
collisionfinder(StringHash, 32, 127, 3, "01234567890123456789"); | |||
print(""); | |||
print(""); | |||
print("JavaHash, 32-127 x 3 bytes, suffix '01234567890123456789'"); | |||
collisionfinder(JavaHash, 32, 127, 3, "", "01234567890123456789"); | |||
print(""); | |||
print("StringHash, 32-127 x 3 bytes, suffix '01234567890123456789'"); | |||
collisionfinder(StringHash, 32, 127, 3, "", "01234567890123456789"); | |||
print(""); -- This test takes a looooong time | |||
if(true) then | |||
print(""); | |||
print("JavaHash, 64-82 x 5 bytes"); | |||
collisionfinder(JavaHash, 64, 82, 5); | |||
print(""); | |||
print("StringHash, 64-82 x 5 bytes"); | |||
collisionfinder(StringHash, 64, 82, 5); | |||
end | |||
end | |||
</div> | |||
</div> | |||
[[Category:User defined functions]] |
Latest revision as of 04:48, 15 August 2023
User Defined Functions » StringHash » Analysis
This is an analysis of bit transmutation and distribution patterns of the StringHash function, mainly in comparison to the "Java hash", a variant of Dan Bernstein's old "33" hash posted long ago in the comp.lang.c newsgroup.
Reference code[edit]
Java hash[edit]
for i = 1, len do counter = (counter*31) + string.byte(text, i) counter = mod(counter, 4294967296); -- 2^32 end; return mod(counter, 4294967296); -- 2^32
This is the function implemented in the Java String.hashCode library call.
The multiplication by 31 basically translates to "new = old<<5 - old", i.e. you never completely lose data from earlier bytes as you would by only shifting data up. The relation to Bernstein's original "33" function is pretty clear; it simply uses 33 instead of 31.
The mod() is necessary in Lua, otherwise the value turns floating-point and you lose precision after ~10 characters; Lua, at least under x86, uses 64-bit ("double-precision") IEEE 764 floats, which have a mantissa of 52 bits.
This function would be slightly better with a prime modulus, but 2^32 is chosen as a modulus to emulate 32-bit integer math as used in the Java SDK.
StringHash[edit]
for i = 1, len, 3 do counter = mod(counter*8161, 4294967279) + -- 2^32 - 17: Prime! (string.byte(text,i)*16776193) + ((string.byte(text,i+1) or (len-i+256))*8372226) + ((string.byte(text,i+2) or (len-i+256))*3932164); end; return mod(counter, 4294967291); -- 2^32 - 5: Prime (and different from the prime in the loop)
This string hash is ~30% faster than the Java hash due to working with 3 characters at a time; the modulus call is fairly expensive compared to the other operations.
The multiplications are all chosen to affect 3 areas of the accumulator at a time:
8161 = 8192 - 32 + 1 : 2<<13, 2<< 5, 2<<0 16776193 = 16777216 - 1024 + 1 : 2<<24, 2<<10, 2<<0 8372226 = 8388608 - 16384 + 2 : 2<<23, 2<<14, 2<<5 3932164 = 4194304 - 262144 + 4 : 2<<22, 2<<18, 2<<10
It is tempting to make the first multiplication much larger, but beware that the result of (4294967279+(256*29080583))*8161 must fit in a 52-bit mantissa without losing precision. The result is currently a 48-bit number so meets that contraint.
Tests performed and result analysis[edit]
Below is an array of tests that both algorithms have been submitted to. For a hash analysis, these tests are fairly basic and the result analysis is somewhat unscientific, but they will do to illustrate the difference between the Java hash and StringHash, which is all this document set out to do.
- Performance test - raw measurement of relative hash speed
- Bit distribution test - how often is each bit of the 32-bit output used?
- Collision frequency test - compare observed collision statistics to expected numbers
- Bit transmutation test - determine how many bits are changed as a result of 1 input bit changing
- Collision finding test - find and count collisions for exhaustive input sets
Performance test[edit]
This test simply hashes the string "abcdefghijklmnopqrstuvwxyz" in a loop 20 000 times and clocks the start and stop time. The loop is included in the resulting time.
- Java hash execution time: 1.719s
- StringHash execution time: 1.313s
Result analysis[edit]
StringHash is slightly faster than the Java hash; likely not enough to matter, but certainly enough that the perceived "increased size" of StringHash can be discounted as an argument against it - it's actually faster!
The reason is of course that the math.mod() call is relatively expensive compared to extracting characters from a string and performing multiplications. StringHash does 3 extract + multiply per mod() call. The Java hash in its standard implementation has to do one mod() call per character.
On the testing computer, a P4 at ~3Ghz, StringHash hashes in the neighborhood of half a megabyte per second; certainly not bad for a scripted language using only standard math library operations!
Bit distribution test[edit]
This test measures how many times each bit is 1 for a given set of inputs. The target value is for each bit to be 1 50% of the time, and 0 50% of the time, though there will obviously be deviations from this, which are expected and accepted, up to a certain point.
In this test, we feed each hash the same set of inputs: 10000 random strings consisting of "aaaaaaaaa" followed by 15 random bytes.
The results show the absolute difference between 50% and the result, i.e. 49.4 and 50.6 will both be reported as "0.6". Only differences of 0.5% or greater are shown (0.5% is in no way a ceiling of the expected difference; it just cuts down on the amount of output).
Inputs 0-255[edit]
Java | StringHash |
2: 0.8% 4: 0.6% 9: 0.6% 11: 0.5% 13: 0.5% 14: 0.5% 15: 1.0% 17: 1.1% 20: 0.5% 21: 0.9% 23: 0.8% 25: 0.9% 28: 0.6% 31: 0.7% 32: 1.2% |
2: 0.5% 3: 0.5% 7: 0.7% 12: 0.6% 15: 0.9% 16: 1.0% 17: 0.9% 20: 0.7% 21: 1.1% 25: 0.6% 29: 0.8% 30: 0.6% 31: 0.7% |
Inputs 32-127[edit]
Java | StringHash |
2: 0.9% 7: 0.7% 8: 0.6% 9: 0.7% 10: 0.6% 11: 0.6% 12: 1.5% 15: 0.8% 16: 0.5% 18: 0.7% 22: 0.8% 27: 0.5% 29: 0.5% 30: 0.6% 32: 0.5% |
1: 1.0% 3: 0.5% 6: 0.8% 14: 0.6% 19: 0.5% 23: 0.5% 25: 0.7% 28: 0.5% |
Result analysis[edit]
Both algorithms perform roughly as expected. With these samples, StringHash performs a little bit better, but that is to be expected as it mixes more bits; either way, the differences are not really meaningful. The test only attempts to evaluate if all possible result bits are used efficiently without unexpected "holes" in the output.
A more thorough test would attempt to find biases in different sets of bits, but that is really more meaningful for PRNG testing and secure hashes than it is for this type of light string hash.
Collision frequency test[edit]
This test takes the result of 20000 random strings consisting of ASCII characters 0-255, and selects 15 bits for testing. The resulting value becomes 0--32767, which, for anything other than a perfect hash, will result in a fair amount of collisions. We expect the frequency of collisions to match a Poisson distribution and compare the test results to it.
Long random string 32-127 bits 1-15[edit]
Java SH Expected 1: 10940 10981 10863 2: 6552 6554 6630 3: 2043 1998 2023 4: 388 396 411 5: 65 65 62 6: 12 6 7
Long random string 0-255 bits 1-15[edit]
Java SH Expected 1: 10717 10728 10863 2: 6796 6724 6630 3: 1995 2073 2023 4: 376 420 411 5: 110 55 62 6: 6 7
Long random string 0-255 bits 6-20[edit]
Java SH Expected 1: 10788 10875 10863 2: 6768 6628 6630 3: 1989 1971 2023 4: 400 460 411 5: 55 60 62 6: 6 7
Long random string 0-255 bits 12-26[edit]
Java SH Expected 1: 10917 10791 10863 2: 6630 6612 6630 3: 1881 2049 2023 4: 492 452 411 5: 80 90 62 6: 6 7
Long random string 0-255 bits 18-32[edit]
Java SH Expected 1: 10828 10795 10863 2: 6590 6860 6630 3: 2127 1932 2023 4: 380 360 411 5: 75 35 62 6: 18 7
Result analysis[edit]
Both hashes perform roughly as expected without showing too great variations over different bits; there are one or two aberrations in the higher collision counts, but that may just be the result of a too-small sample size. I will not investigate further as this test is only a rough indication of performance.
A more complete analysis would include testing smaller modulii, and certainly ones that are not a power of 2. A modulus constructed around series involving 31 and 32 will most likely produce "interesting" results for the Java hash, but that is an expected side effect of its construction.
Bit transmutation test[edit]
This test measures how many bits of the hash changes as a result of a single input bit changing, i.e. how close to an avalanche effect we're getting. The result is described as the number of times a given number of bits have been seen to change for the entire run.
The input starts out as a string of 0s (ASCII NULs), flipping bits on and off one at a time. Then a number of runs are executed starting out with random values.
All the below tests work with 13-byte strings, and execute for 50 runs, i.e. one run based on ASCII NULs, and 49 runs with fresh random numbers each time.
A good generic hash algorithm will tend toward flipping 50% of the bits for each input bit changed, though that is obviously unattainable. We do however expect a fair hump around 13-17 bits changed, and very low counts toward the extremes. 0 bits flipping may indicate a broken hash algorithm (the hash result didn't change at all as a result of changing the input).
Java, flip all bits in bytes 1 through 13[edit]
0: 0 1: 173 2: 208 3: 194 4: 188 5: 159 6: 160 7: 150 8: 200 9: 231 10: 228 11: 292 12: 343 13: 476 14: 503 15: 469 16: 418 17: 318 18: 210 19: 143 20: 69 21: 45 22: 11 23: 9 24: 2 25: 1 26: 0 27: 0 28: 0 29: 0 30: 0 31: 0 32: 0
This test does not show 0 bits changing, but the changes we do see are spread quite flat over the spectrum. There is a hump around 13-17 but not as marked as it should be. We get quite a few single bit changes, which is quite expected given the algorithm. The last few bytes in the string do not end up affecting very much of the end result.
StringHash, flip all bits in bytes 1 through 13[edit]
0: 0 1: 0 2: 0 3: 45 4: 70 5: 77 6: 71 7: 56 8: 45 9: 53 10: 115 11: 203 12: 327 13: 405 14: 533 15: 588 16: 620 17: 593 18: 499 19: 371 20: 259 21: 146 22: 76 23: 32 24: 12 25: 3 26: 1 27: 0 28: 0 29: 0 30: 0 31: 0 32: 0
StringHash has a much more pronounced hump around the 16 bits changed mark. There are no single bit changes, and no two-bit changes in the tests. This is expected as each bit changed is guaranteed to affect at least three bits of the hash, even for the last few bytes of the input.
Java, flip all bits in bytes 12 through 13[edit]
0: 0 1: 173 2: 208 3: 154 4: 108 5: 75 6: 53 7: 17 8: 6 9: 1 10: 3 11: 2 12: 0 (rest 0)
Here is where the weakness of the hash shows the most: changes in the last few bytes affect relatively few bits, thereby increasing chances for collisions.
StringHash, flip all bits in bytes 12 through 13[edit]
0: 0 1: 0 2: 0 3: 45 4: 70 5: 77 6: 71 7: 56 8: 36 9: 26 10: 36 11: 40 12: 54 13: 51 14: 57 15: 49 16: 42 17: 39 18: 23 19: 15 20: 9 21: 3 22: 1 23: 0 24: 0 (rest 0)
While not precisely shining compared to strong hashes, StringHash stirs quite a few more bits over a much larger spectrum.
Result analysis[edit]
StringHash outperforms the Java hash by distributing bit changes over more result bits, thus reducing the collision chance.
Collision finding test[edit]
This test exhaustively computes all possible values of a given input set and tests for collisions. The size of the input sets is unfortunately constrained by RAM and CPU, but it is believed to be a fair indication of hash strength.
ASCII codes 64-89 (upper case A-Z), 3 bytes[edit]
JavaHash 17576 non-colliding values StringHash 17576 non-colliding values
ASCII codes 64-89 (upper case A-Z), 4 bytes[edit]
JavaHash 456976 non-colliding values StringHash 456976 non-colliding values
ASCII codes 32-127 (all printable), 2 bytes[edit]
JavaHash 62 non-colliding values 62 instances of 2 inputs producing the same hash (124 strings) 2638 instances of 3 inputs producing the same hash (7914 strings) 279 instances of 4 inputs producing the same hash (1116 strings) StringHash 9216 non-colliding values
Already here, with only searching all possible combinations of 2 letters of printable ASCII, the Java hash is showing poor results.
ASCII codes 32-127 (all printable), 3 bytes[edit]
This is about as exhaustive we can go in Lua without resulting to data compression techniques for the result sets: all possible combinations of 3 letters of printable ASCII for a total of 884736 outputs.
JavaHash 62 non-colliding values 62 instances of 2 inputs producing the same hash (124 strings) 1630 instances of 3 inputs producing the same hash (4890 strings) 224 instances of 4 inputs producing the same hash (896 strings) 62 instances of 5 inputs producing the same hash (310 strings) 1630 instances of 6 inputs producing the same hash (9780 strings) 62 instances of 7 inputs producing the same hash (434 strings) 224 instances of 8 inputs producing the same hash (1792 strings) 68606 instances of 9 inputs producing the same hash (617454 strings) 5214 instances of 10 inputs producing the same hash (52140 strings) 5214 instances of 11 inputs producing the same hash (57354 strings) 9672 instances of 12 inputs producing the same hash (116064 strings) 558 instances of 13 inputs producing the same hash (7254 strings) 558 instances of 14 inputs producing the same hash (7812 strings) 558 instances of 15 inputs producing the same hash (8370 strings) StringHash 884736 non-colliding values
ASCII codes 32-127 (all printable), 3 bytes, with prefix[edit]
The same input values as the 32-127 x 3 test, except all prefixed with the string "01234567890123456789". This is to test for loss of accuracy.
JavaHash 62 non-colliding values 62 instances of 2 inputs producing the same hash (124 strings) 1630 instances of 3 inputs producing the same hash (4890 strings) 224 instances of 4 inputs producing the same hash (896 strings) 62 instances of 5 inputs producing the same hash (310 strings) 1630 instances of 6 inputs producing the same hash (9780 strings) 62 instances of 7 inputs producing the same hash (434 strings) 224 instances of 8 inputs producing the same hash (1792 strings) 68606 instances of 9 inputs producing the same hash (617454 strings) 5214 instances of 10 inputs producing the same hash (52140 strings) 5214 instances of 11 inputs producing the same hash (57354 strings) 9672 instances of 12 inputs producing the same hash (116064 strings) 558 instances of 13 inputs producing the same hash (7254 strings) 558 instances of 14 inputs producing the same hash (7812 strings) 558 instances of 15 inputs producing the same hash (8370 strings) StringHash 884736 non-colliding values
ASCII codes 32-127 (all printable), 3 bytes, with suffix[edit]
The same input values as the 32-127 x 3 test, except all suffixed with the string "01234567890123456789". This is to test for loss of "old" information (data from earlier on in the string) in the mixing function.
JavaHash 62 non-colliding values 62 instances of 2 inputs producing the same hash (124 strings) 1630 instances of 3 inputs producing the same hash (4890 strings) 224 instances of 4 inputs producing the same hash (896 strings) 62 instances of 5 inputs producing the same hash (310 strings) 1630 instances of 6 inputs producing the same hash (9780 strings) 62 instances of 7 inputs producing the same hash (434 strings) 224 instances of 8 inputs producing the same hash (1792 strings) 68606 instances of 9 inputs producing the same hash (617454 strings) 5214 instances of 10 inputs producing the same hash (52140 strings) 5214 instances of 11 inputs producing the same hash (57354 strings) 9672 instances of 12 inputs producing the same hash (116064 strings) 558 instances of 13 inputs producing the same hash (7254 strings) 558 instances of 14 inputs producing the same hash (7812 strings) 558 instances of 15 inputs producing the same hash (8370 strings) StringHash 884736 non-colliding values
ASCII codes 64-82 (A-S), 5 bytes[edit]
The Lua memory churn becomes somewhat ugly here, consuming several hundred megabytes. Even attempting A-Z with the same length brings a machine with 1GB RAM to its knees.
JavaHash 2476099 non-colliding values StringHash 2476099 non-colliding values
Result analysis[edit]
The Java hash is shown to produce many more collisions even for rather simple inputs. The only cases where it behaves well are the A-Z cases; as long as the input subset is constricted to 32 unique values, it is collision free for 6 bytes (6 * 5 bits = 30 bits), in other words, it is a perfect hash for e.g. short strings consisting of only upper case ASCII letters A-Z. This is an expected property of the algorithm. Though taken with the rather abysmal results for more generic strings above (or, indeed, any sort of string longer than 6 bytes), the Java hash demonstrably makes for a poor choice for general input.
Result summary[edit]
In the two simplest cases: collision frequency and bit distribution, the two hashes seem comparable. But on comparing bit transmutation patterns, StringHash tends toward much better spreads of input bits to output bits.
The theory is confirmed in the exhaustive collision search tests, where StringHash remains collision free for the small data sets used, while the Java hash produces more collisions (double, tripe, quadruple, etc) than it does collision-free results.
Added to this, the Lua implementation of StringHash is slightly faster than the Lua implementation of the Java hash (at least in its most straight-forward implementation).
Applicability to other languages[edit]
StringHash was designed to be reasonably collision-resistant and fast in a Lua standard library environment. Other interpreted languages capable of 50-bit integer math (i.e. through using 64-bit floats) should also be able to make a fair use of it.
Languages with access to lower level primitives (XOR, ANDs, rotations) would do better to look at e.g. Bob Jenkins' lookup3 hash, or at least read his page on hash analysis.
Appendix: Lua hash analysis code[edit]
mod=math.mod; -- This is the "Java hash", a variant of Dan Bernstein's old "33" hash posted long ago in the comp.lang.c newsgroup. -- The *31 basically translates to "new = old<<5 - old", i.e. you never completely lose data from earlier bytes as you would by only shifting -- The mod is necessary in Lua; otherwise the value turns floating-point and you lose precision after ~10 characters -- This function would be better with a prime modulus, but 2^32 is chosen as a mod to emulate the 32-bit math used in the original function JavaHash(text) local counter = 0; local len = string.len(text); for i = 1, len do counter = (counter*31) + string.byte(text, i) counter = mod(counter, 4294967296); -- 2^32 end; return mod(counter, 4294967296); -- 2^32 end -- This string hash is ~30% faster due to working with 3 characters at a time; the mod expression is fairly expensive -- The multiplications are all chosen to affect 3 areas of the accumulator at a time: -- 8161 = 8192 - 32 + 1 : 2<<13, 2<< 5, 2<<0 -- 16776193 = 16777216 - 1024 + 1 : 2<<24, 2<<10, 2<<0 -- 8372226 = 8388608 - 16384 + 2 : 2<<23, 2<<14, 2<<5 -- 3932164 = 4194304 - 262144 + 4 : 2<<22, 2<<18, 2<<10 function StringHash(text) local counter = 1; local len = string.len(text); for i = 1, len, 3 do counter = mod(counter*8161, 4294967279) + -- 2^32 - 17: Prime! (counter*8161 is at most a 48-bit number, which easily fits in the 52-bit mantissa of a double precision float) (string.byte(text,i)*16776193) + ((string.byte(text,i+1) or (len-i+256))*8372226) + ((string.byte(text,i+2) or (len-i+256))*3932164); end; return mod(counter, 4294967291); -- 2^32 - 5: Prime (and different from the prime in the loop) end -- Utility function: return number of bits changed between a and b function bitsdiff(a,b) local n=0; for i=1,32 do if(mod(a,2) ~= mod(b,2)) then n=n+1; end a=math.floor(a/2); b=math.floor(b/2); end return n; end function tcopy(to,from) for k,v in pairs(from) do to[k]=v; end end -- Bit transmutation test: Count how many bits change as a result of a single bit of input flipping function transtest(func,num, len, first, last) local bitsdiffhits = {} local origstrbytes = { }; for c=1,len do table.insert(origstrbytes, 0); end math.randomseed(2); for rep=1,num do local strbytes = {}; for c=first,last do tcopy(strbytes, origstrbytes); local bitval=1; for b=1,8 do local before = func(string.char(unpack(strbytes))); strbytes[c] = origstrbytes[c]; strbytes[c] = math.mod(origstrbytes[c] + bitval, 256); local after = func(string.char(unpack(strbytes))); local n = bitsdiff(before,after); bitsdiffhits[n] = (bitsdiffhits[n] or 0) + 1; bitval=bitval*2; end end for c=1,table.getn(origstrbytes) do origstrbytes[c]=math.random(0,255); end end local score=0; for i=0,32 do score = score - (bitsdiffhits[i] or 0)*(math.abs(16-i)/16); -- 16 = 50% of the bits, which is how many bits we want to change in a perfect distribution print(string.format(" %2u: %5u", i, bitsdiffhits[i] or 0)); end print(" SCORE: "..score); -- 0 is unimaginatively good. negative is bad. end -- Performance test: Just measure the time it takes to execute a hash on a fixed string lots of times (including the cost of the loop, obviously) function perftest(func, num) local x=os.clock(); for i=1,num do func("abcdefghijklmnopqrstuvwxyz"); end return os.clock()-x; end -- Count the number of single, double, triple etc collisions given a certain manipulation of the resultant hash: -- div by X, take modulus Y -- The modulus is there to create an artificial range clamping; testing would take too long otherwise. -- Should match expected Poisson distribution closely for an unbiased hash. function collisiontest(func, num, div,modulus, randfunc) local x = os.clock(); local outputs = {} local collisions = {} math.randomseed(1); for i=1,num do str = randfunc(); local n = math.mod(math.floor(func(str)/div), modulus); outputs[n] = (outputs[n] or 0)+1; end for _,n in pairs(outputs) do collisions[n] = (collisions[n] or 0) + n; end local y = os.clock(); local sortcol = {} for k,v in pairs(collisions) do table.insert(sortcol, k); end table.sort(sortcol); -- Compute expected Poisson distribution expect = {} lambda = num/modulus; fly = lambda * math.exp(-lambda); for c=1,10 do expect[c] = fly * modulus; fly = fly * lambda / c; end -- Output results and expected results for _,k in ipairs(sortcol) do print(string.format(" %2u: %5u (expected %5u)", k, collisions[k], expect[k] or -1)); end return y-x; end -- Bit distribution test: Measure number of times each bit is set, should average out to 50% chance in the end -- Will be presented as the absolute difference from 50% function bitdistribtest(func,num,lo,hi,dispmin) math.randomseed(1); local bitsset={}; for i=1,num do str = string.char(math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi), math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi), math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi)); local n = func(str); for k=1,32 do if(mod(n,2)==1) then bitsset[k] = (bitsset[k] or 0)+1; end n=math.floor(n/2); end end for k=1,32 do local diff = math.abs(50-(bitsset[k] or 0)/num*100); if(diff > dispmin) then print(string.format(" %2u: %.1f%%", k, diff)); end end end -- Collision finder test: Loop all possible values in given set and count collisions (yes, this is slow, and uses quite a bit of ram) function collisionfinder(func, lo, hi, bytes, prefix, suffix) local str; local coll={}; prefix = prefix or ""; suffix = suffix or ""; if(bytes==2) then for b=lo,hi do for a=lo,hi do str = prefix .. string.char(a,b) .. suffix; local n = func(str); coll[n] = (coll[n] or 0) + 1; end end elseif(bytes==3) then for c=lo,hi do for b=lo,hi do for a=lo,hi do str = prefix .. string.char(a,b,c) .. suffix; local n = func(str); coll[n] = (coll[n] or 0) + 1; end end end elseif(bytes==4) then for d=lo,hi do for c=lo,hi do for b=lo,hi do for a=lo,hi do str = prefix .. string.char(a,b,c,d) .. suffix; local n = func(str); coll[n] = (coll[n] or 0) + 1; end end end end elseif(bytes==5) then for e=lo,hi do for d=lo,hi do for c=lo,hi do for b=lo,hi do for a=lo,hi do str = prefix .. string.char(a,b,c,d,e) .. suffix; local n = func(str); coll[n] = (coll[n] or 0) + 1; end end end end end end local collcnt = {}; for k,n in coll do collcnt[n] = (collcnt[n] or 0) + 1; end for colls, count in pairs(collcnt) do if(colls==1) then print(string.format(" %5u non-colliding values", count)); else print(string.format(" %5u instances of %u inputs producing the same hash (%u strings)", count, colls, count*colls)); end end end --------------------------------------------------------------------------- -- Run the tests if(true) then local colls; local function longrand(lo,hi) return function() local ret = ""; while colls[ret] do ret = "aaaaaaaaa"..string.char(math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi), math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi), math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi),math.random(lo,hi)); end colls[ret] = true; return ret; end end print ""; print ""; print "Collision test - long random string 32-127 (20000 values into 32768 buckets)"; print "------------------------------------------" print("JavaHash:"); colls = {}; print(collisiontest(JavaHash, 20000, 1,32768, longrand(32,127))); print("StringHash:"); colls = {}; print(collisiontest(StringHash, 20000, 1,32768, longrand(32,127))); print ""; print "Collision test - long random string 0-255 (20000 values into 32768 buckets, bits 1-15)"; print "------------------------------------------" print("JavaHash:"); colls = {}; print(collisiontest(JavaHash, 20000, 1,32768, longrand(0,255))); print("StringHash:"); colls = {}; print(collisiontest(StringHash, 20000, 1,32768, longrand(0,255))); print ""; print "Collision test - long random string 0-255 (20000 values into 32768 buckets, bits 6-20)"; print "------------------------------------------" print("JavaHash:"); colls = {}; print(collisiontest(JavaHash, 20000, math.pow(2,5),32768, longrand(0,255))); print("StringHash:"); colls = {}; print(collisiontest(StringHash, 20000, math.pow(2,5),32768, longrand(0,255))); print ""; print "Collision test - long random string 0-255 (20000 values into 32768 buckets, bits 12-26)"; print "------------------------------------------" print("JavaHash:"); colls = {}; print(collisiontest(JavaHash, 20000, math.pow(2,11),32768, longrand(0,255))); print("StringHash:"); colls = {}; print(collisiontest(StringHash, 20000, math.pow(2,11),32768, longrand(0,255))); print ""; print "Collision test - long random string 0-255 (20000 values into 32768 buckets, bits 18-32)"; print "------------------------------------------" print("JavaHash:"); colls = {}; print(collisiontest(JavaHash, 20000, math.pow(2,17),32768, longrand(0,255))); print("StringHash:"); colls = {}; print(collisiontest(StringHash, 20000, math.pow(2,17),32768, longrand(0,255))); end if(true) then print ""; print ""; print "Performance test"; print "-----------------" print("JavaHash speed: ", perftest(JavaHash, 20000)); print("StringHash speed: ", perftest(StringHash, 20000)); end if(true) then print ""; print ""; print "Bit transmutation test (how many bits change if a single bit changes?)"; print "----------------------"; print(""); print("JavaHash 13 bytes total, change 1-13:"); transtest(JavaHash,50, 13, 1,13); print(""); print("StringHash 13 bytes total, change 1-13:"); transtest(StringHash,50, 13, 1,13); print(""); print("JavaHash 13 bytes total, change 1-2:"); transtest(JavaHash,50, 13, 1,2); print(""); print("StringHash 13 bytes total, change 1-2:"); transtest(StringHash,50, 13, 1,2); print(""); print("JavaHash 13 bytes total, change 12-13:"); transtest(JavaHash,50, 13, 12,13); print(""); print("StringHash 13 bytes total, change 12-13:"); transtest(StringHash,50, 13, 12,13); end if(true) then print ""; print ""; print "Bit distribution test (what's the chance of each bit being set? expressed as abs diff from 50%)"; print "---------------------"; print(""); print("JavaHash 0-255, 10000 loops: (showing diffs>0.4%)"); bitdistribtest(JavaHash,10000,0,255, 0.4); print(""); print("StringHash 0-255, 10000 loops: (showing diffs>0.4%)"); bitdistribtest(StringHash,10000,0,255, 0.4); print(""); print("JavaHash 32-127, 10000 loops: (showing diffs>0.4%)"); bitdistribtest(JavaHash,10000,32,127, 0.4); print(""); print("StringHash 32-127, 10000 loops: (showing diffs>0.4%)"); bitdistribtest(StringHash,10000,32,127, 0.4); end if(true) then print ""; print ""; print "Collision finder (run through all possible values of a set and count collisions)"; print "----------------"; print(""); print("JavaHash, 64-89 x 3 bytes"); collisionfinder(JavaHash, 64, 89, 3); print(""); print("StringHash, 64-89 x 3 bytes"); collisionfinder(StringHash, 64, 89, 3); print(""); print(""); print("JavaHash, 64-89 x 4 bytes"); collisionfinder(JavaHash, 64, 89, 4); print(""); print("StringHash, 64-89 x 4 bytes"); collisionfinder(StringHash, 64, 89, 4); print(""); print(""); print("JavaHash, 32-127 x 2 bytes"); collisionfinder(JavaHash, 32, 127, 2); print(""); print("StringHash, 32-127 x 2 bytes"); collisionfinder(StringHash, 32, 127, 2); print(""); print(""); print("JavaHash, 32-127 x 3 bytes"); collisionfinder(JavaHash, 32, 127, 3); print(""); print("StringHash, 32-127 x 3 bytes"); collisionfinder(StringHash, 32, 127, 3); print(""); print(""); print("JavaHash, 32-127 x 3 bytes, prefix '01234567890123456789'"); collisionfinder(JavaHash, 32, 127, 3, "01234567890123456789"); print(""); print("StringHash, 32-127 x 3 bytes, prefix '01234567890123456789'"); collisionfinder(StringHash, 32, 127, 3, "01234567890123456789"); print(""); print(""); print("JavaHash, 32-127 x 3 bytes, suffix '01234567890123456789'"); collisionfinder(JavaHash, 32, 127, 3, "", "01234567890123456789"); print(""); print("StringHash, 32-127 x 3 bytes, suffix '01234567890123456789'"); collisionfinder(StringHash, 32, 127, 3, "", "01234567890123456789"); print(""); -- This test takes a looooong time if(true) then print(""); print("JavaHash, 64-82 x 5 bytes"); collisionfinder(JavaHash, 64, 82, 5); print(""); print("StringHash, 64-82 x 5 bytes"); collisionfinder(StringHash, 64, 82, 5); end end