A few weeks ago I saw this forum question where a user asked help on making an algorithm that:
“replaces all the vowels in a string with the character ‘*’”
The other users in the forum quickly replied and helped him with that question. For some reason though, that problem stuck to my head. Sure it’s a simple algorithm, but I thought what if I added a new constraint: It has to be done fast, really, really fast.
To me this created a whole new approach to the problem, and I was interested in seeing how far I could take it. So, for starters I took the best reply to the forum thread from user Dominuz (who himself stated that he focused more on clarity and ease of understanding rather than optimization )
Below is his sample code:
#include <stdio.h>
char vogais[] = {'a','e','i','o','u'};
void subst(char* nome)
{
for( unsigned int i = 0; nome[i] != '\0'; ++i )
{
for( unsigned int j = 0; j < 5; ++j )
{
if( nome[i] == vogais[j] )
{
nome[i] = '*';
break;
}
}
}
}
int main()
{
char criatura[] = "12 Criatura aeiou 123456";
printf( "%s \n", criatura );
subst( criatura );
printf( "%s \n", criatura );
return 0;
}
For starters, I knew a simple text string would not be enough to measure different algorithms performance. So I decided to take a larger text as input, in this case Michael Jackson’s Wikipedia page as a tribute to the late king of pop. The total file was 273kb.
I modified Dominuz code to open the file, perform the subst function a thousand times, while recording how long it took to do that. The result was stored in a text file. Here’s my initial modification of his code:
#include <iostream>
#include <fstream>
#include <Windows.h>
using namespace std;
char vogais[] = {'a','e','i','o','u'};
void trunc_double()
{
fstream file( "out.txt", ios::out | ios::trunc ) ;
file.close() ;
}
void save_float( double d )
{
char buffer[256] ;
fstream file("out.txt", ios::out | ios::app ) ;
sprintf_s(buffer,"%.15f\n", d ) ;
file.write( buffer, strlen(buffer) ) ;
file.close() ;
}
char* malloc_and_setup_buffer()
{
char *b = NULL ;
int b_size = 0 ;
fstream file ;
file.open( "mj.txt", ios::in | ios::binary ) ;
if( !file )
return NULL ;
file.seekg( 0, ios_base::end ) ;
b_size = (int) file.tellg() ;
file.seekg( 0, ios_base::beg ) ;
b = new char[b_size];
file.read( b, b_size ) ;
file.close() ;
b[b_size-1] = '\0' ;
return b ;
}
void dealloc_buffer( char* b )
{
if( !b )
return ;
delete[] b ;
}
void subst(char* nome)
{
if( !nome )
return ;
for( unsigned int i = 0; nome[i] != '\0'; ++i )
{
for( unsigned int j = 0; j < 5; ++j )
{
if( nome[i] == vogais[j] )
{
nome[i] = '*';
break;
}
}
}
}
int main()
{
LARGE_INTEGER lg0, lg1, frequency ;
QueryPerformanceFrequency(&frequency) ;
trunc_double() ;
for( int i=0; i<1000; i++ )
{
char* b = malloc_and_setup_buffer() ;
if( !b )
return 0;
QueryPerformanceCounter(&lg0) ;
subst( b );
QueryPerformanceCounter(&lg1) ;
float dt = (float)(lg1.QuadPart - lg0.QuadPart)/(float)frequency.QuadPart;
save_float(dt) ;
dealloc_buffer( b ) ;
}
return 0;
}
As you can see, I’m only worried on how long it takes to run the subst function. After putting this data into excel, I got that the average for sample one is 0.002205584 seconds. From that basic setup, I started the optimizations.
A quickly saw that you could force inline the subst function, making a call to it faster. However the biggest issue I had with it was accessing ‘vogais’ as a global variable, instead of a local function variable.
What’s the big deal about that? Well when you reference a global variable you are referencing an area of memory, contrary to when you access a local variable, which implies you are referencing the stack. And in terms of access speeds, stack beats memory.
So after re-implementing subst I got this:
__inline void subst(char* nome)
{
if( !nome )
return ;
char vogais[] = {'a','e','i','o','u'};
for( unsigned int i = 0; nome[i] != '\0'; ++i )
{
for( unsigned int j = 0; j < 5; ++j )
{
if( nome[i] == vogais[j] )
{
nome[i] = '*';
break;
}
}
}
}
After running it again and calculating the average I got 0.001782024 seconds. Not bad, I got a bit of an improvement over the previous algorithm. It still wasn’t enough to say I had a significant impact though.
If you look inside subst we have two loops, one iterates through each letter while the other iterates through each vowel. Well loops translate into assembly as jump instructions and those are expensive. So in order to get rid of jump instructions I performed a technique called ‘loop unrolling’. Below is the result:
__inline void subst(char* nome)
{
if( !nome )
return ;
for( unsigned int i = 0; nome[i] != '\0'; ++i )
{
if( nome[i] == 'a' || nome[i] == 'e' || nome[i] == 'i' || nome[i] == 'o' || nome[i] == 'u' )
nome[i] = '*';
}
}
Its average was 0.001161438 seconds. Not bad! Almost twice the performance when compared to the first one. But I wasn’t ready to finish yet.
You see my big issue with this model is that we are performing five vowel comparisons per letter. This translates to several compare and jump instructions in assembly and that’s just slow. I had to think of a way of getting rid of those comparisons.
After giving it some thought, I found a way out! I used a lookup table. Since bytes can only have 256 different values, I created a lookup table with 256 bytes in size. All bytes were set to 0, except indexes 97, 101, 105, 111 and 117, who were set to 1. What’s special with those values? Well they are exactly the indexes for the vowels a, e, i, o and u in the ascii chart.
I thus use the letter itself as the index in the lookup table, which indicates if it’s a vowel or not. Here’s the code:
__inline void subst(char* nome)
{
if( !nome )
return ;
char table[256] ;
memset( table, 0, sizeof table ) ;
table['a'] = 1 ;
table['e'] = 1 ;
table['i'] = 1 ;
table['o'] = 1 ;
table['u'] = 1 ;
for( unsigned int i = 0; nome[i] != '\0'; ++i )
if( table[ nome[i] ] )
nome[i] = '*';
}
The result? 0.000951169seconds. Not bad, managed to go in under 1 microsecond.
I kept thinking though that there was something else that I could add to this code… somehow I was missing something obvious. After a few minutes it hit me! I could do loop unrolling again and perform even less jump instructions!
After some testing, I decided to unroll the loop 4 times. For that I had to split the loop into two parts. The first part would have to iterate all the way up to the closest multiple of four, but not greater than the string’s size. The second loop I would need to take care of the last 3 potential characters left in the text
Now to reach the closest multiple of a number using integer arithmetic you do:
Number = ( number * multiple ) / multiple
Due to the nature of integer division, since it naturally rounds down the numbers, we reach our value. The problem though is that multiplication and division are expensive and should be avoided in fast algorithm construction. Is there a way to get rid of them?
Well yes! We just have to use bitwise operations. I didn’t just choose to unroll the loop four times for no reason. Since four is a power of two, it thus has the following property: adding it to a number, then performing the bitwise AND with its negative bitwise gives the closer or greater multiple of that number. We just subtract the number once from the result and alas, we have the closest or lower multiple of four from the original number.
So finally here’s the result:
__inline void subst( buffer_data bd )
{
char *b = bd.b ;
char table[256] ;
memset( table, 0, sizeof table ) ;
table['a'] = 1 ;
table['e'] = 1 ;
table['i'] = 1 ;
table['o'] = 1 ;
table['u'] = 1 ;
int i ;
const int upper_s = (bd.size_b + 3) & ~0x03 - 4 ;
for( i = 0; i < upper_s ; i+=4 )
{
if( table[ bd.b[i] ] )
bd.b[i] = '*';
if( table[ bd.b[i+1] ] )
bd.b[i+1] = '*';
if( table[ bd.b[i+2] ] )
bd.b[i+2] = '*';
if( table[ bd.b[i+3] ] )
bd.b[i+3] = '*';
}
for( ; i < bd.size_b ; i++ )
{
if( table[ bd.b[i] ] )
bd.b[i] = '*';
}
}
The result is then 0.000845947 seconds. I was almost ready to settle with it but I remembered one last detail: cache.
I’m using a 256 size lookup table. But I only care about the alphabet characters in the ascii chart. I could thus reduce the table to 32, which is the closest power of two multiple from 23. The bright side of having a smaller lookup table is that we manage to maintain it longer in the CPU’s cache. Less cache misses, faster algorithm. So here’s the code with that in mind:
__inline void subst( buffer_data bd )
{
char *b = bd.b ;
char table[32] ;
memset( table, 0, sizeof table ) ;
table['a'-97] = 1 ;
table['e'-97] = 1 ;
table['i'-97] = 1 ;
table['o'-97] = 1 ;
table['u'-97] = 1 ;
int i ;
const int upper_s = (bd.size_b + 3) & ~0x03 - 4 ;
for( i = 0; i < upper_s ; i+=4 )
{
if( bd.b[i] < 97 || bd.b[i] > 122 )
continue ;
if( table[ bd.b[i] - 97 ] )
bd.b[i] = '*';
if( table[ bd.b[i+1] - 97 ] )
bd.b[i+1] = '*';
if( table[ bd.b[i+2] - 97 ] )
bd.b[i+2] = '*';
if( table[ bd.b[i+3] - 97 ] )
bd.b[i+3] = '*';
}
for( ; i < bd.size_b ; i++ )
{
if( bd.b[i] < 97 || bd.b[i] > 122 )
continue ;
if( table[ bd.b[i] - 97 ] )
bd.b[i] = '*';
}
}
And the final result is 0.00084498. Not much faster from the previous algorithm but a lot faster when compared to the first one. In fact it’s over two and half times faster.
Now I’m sure that if I kept at it I’d find even better ways to optimize this algorithm, but I’m settling with what I got for now. With this exercise I just wanted to prove a few points:
1. Knowing computer architecture and how code translates into assembly can be quite useful.
2. Like Michael Abrash pointed out, there’s no such thing as the fastest code in the west.
3. Knowing bitwise and pointer arithmetic helps as well.
4. I should have a better social life.
Well I guess that’s it for now! Click here: Fast Vowel (87) to download the sample code and take a look at it yourself. Do keep in mind that I only tested this in one machine and in different environments the results may vary.
The comparisson table:
- Dominuz code: 0.002205584
- Local variable + inline : 0.001782024
- vowel loop unrolling : 0.001161438
- 256 lookup table : 0.000951169
- 256 lookup table + loop unrolling : 0.000845947
- 32 lookup table + loop unrolling : 0.00084498