I have been on a bit of a pointless optimization kick lately, and decided to see what I could do with string search. Most of the fast string search algorithms work on the principle of a sliding window for the purpose of skipping characters which don’t need to be checked. The best of these use a fair (fair being relative) amount of storage and extra cycles in the loop to make sure they are skipping as many characters as possible.
I am sure this has been done before, but I haven’t seen it. In the code below I have implemented an unintrusive extra level of skips to the well known Boyer-Moore-Horspool algorithm. Basically, each character of the key is AND’ed together and the result stored. If the result is zero, which happens if the alphabet is large/sparse enough, the extra checks are conditionalized away. In the event that the result is non-zero, we very quickly check for mismatches in the haystack by AND’ing the haystack character being checked against our previous result, and checking to see if the result of that is equal to our previous result. If the two results are equal, we have just checked a potentially matching character, and we need to fall back to our regular checking routine. In some cases we will match a non-matching character and our efforts will have been wasted, but in other cases we will have determined a non-match and be able to skip the full length of the needle in just a few instructions.
Original source for Boyer-Moore-Horspool lifted from Wikipedia.
boyer-moore-horspool.c
boyer-moore-horspool-sjg.c
Below are some quickie results from my X3220, compiled with GCC 4.2.1.
$ gcc -O2 boyer-moore-horspool.c -o boyer-moore-horspool
$ gcc -O2 boyer-moore-horspool-sjg.c -o boyer-moore-horspool-sjg
$ time ./boyer-moore-horspool
61
28
16
./boyer-moore-horspool 5.53s user 0.00s system 99% cpu 5.530 total
$ time ./boyer-moore-horspool-sjg
61
28
16
./boyer-moore-horspool-sjg 5.21s user 0.00s system 99% cpu 5.210 total
$ gcc -O3 -mtune=nocona boyer-moore-horspool.c -o boyer-moore-horspool
$ gcc -O3 -mtune=nocona boyer-moore-horspool-sjg.c -o boyer-moore-horspool-sjg
$ time ./boyer-moore-horspool
61
28
16
./boyer-moore-horspool 5.28s user 0.00s system 99% cpu 5.282 total
$ time ./boyer-moore-horspool-sjg
61
28
16
./boyer-moore-horspool-sjg 5.02s user 0.01s system 99% cpu 5.034 total
The, dare I say “elegant”, thing about this addition is that it could relatively easily be applied to many other string search algorithms and completely conditionalized away from the inner loop if the results are going to be ineffectual.
Hi sjg,
These are the results for a UltraSparc 500Mhz with 1GB RAM running FBSD 7.1 (gcc-4.2.1)
> gcc boyer-moore-horspool.c -o boyer-moore-horspool
> gcc -o boyer-moore-horspool-sjg boyer-moore-horspool-sjg.c
> time ./boyer-moore-horspool
61
28
16
357.268u 0.040s 5:58.95 99.5% 80+7886k 0+0io 0pf+0w
> time ./boyer-moore-horspool-sjg
61
28
16
361.988u 0.036s 6:03.90 99.4% 80+8101k 0+0io 0pf+0w
—-
> gcc -O2 -o boyer-moore-horspool-sjg boyer-moore-horspool-sjg.c
> gcc -O2 -o boyer-moore-horspool boyer-moore-horspool.c
> time ./boyer-moore-horspool
61
28
16
45.401u 0.008s 0:45.62 99.5% 80+7782k 0+0io 0pf+0w
> time ./boyer-moore-horspool-sjg
61
28
16
49.295u 0.012s 0:49.53 99.5% 80+7976k 0+0io 0pf+0w
———–
> gcc -O3 -o boyer-moore-horspool boyer-moore-horspool.c
> gcc -O3 -o boyer-moore-horspool-sjg boyer-moore-horspool-sjg.c
> time ./boyer-moore-horspool
61
28
16
46.027u 0.013s 0:46.55 98.8% 80+7627k 0+0io 0pf+0w
> time ./boyer-moore-horspool-sjg
61
28
16
41.996u 0.012s 0:42.20 99.5% 80+7765k 0+0io 0pf+0w
>
———
> gcc -O3 -mcpu=ultrasparc -o boyer-moore-horspool boyer-moore-horspool.c
> gcc -O3 -mcpu=ultrasparc -o boyer-moore-horspool-sjg boyer-moore-horspool-sjg.c
> time ./boyer-moore-horspool
61
28
16
45.901u 0.012s 0:46.17 99.4% 80+7626k 0+0io 0pf+0w
> time ./boyer-moore-horspool-sjg
61
28
16
42.014u 0.011s 0:42.21 99.5% 80+7765k 0+0io 0pf+0w