Step by Step fast algorithm construction in native code

By J.Raza On January 28th, 2011

A few weeks ago I saw a forum question where a user asked for help writing an algorithm that:

“replaces all the vowels in a string with the character ‘*’”

The other users in the forum quickly replied and helped him with that question. For some reason though, that problem stuck in my head. Sure it’s a simple algorithm, but I thought: what if I added a new constraint? It has to be done fast, really, really fast.

To me this created a whole new approach to the problem, and I was interested in seeing how far I could take it. So, for starters I took the best reply to the forum thread, from user Dominuz (who himself stated that he focused more on clarity and ease of understanding than on optimization).

Below is his sample code:

#include <stdio.h>

char vogais[] = {'a','e','i','o','u'};

void subst(char* nome)
{
        for( unsigned int i = 0; nome[i] != '\0'; ++i )
        {
                for( unsigned int j = 0; j < 5; ++j )
                {
                        if( nome[i] == vogais[j] )
                        {
                                nome[i] = '*';
                                break;
                        }
                }
        }       
}

int main()
{
        char criatura[] = "12 Criatura aeiou 123456";
        printf( "%s \n", criatura );
        
        subst( criatura );
        
        printf( "%s \n", criatura );
        
        return 0;
}

First, I knew a simple text string would not be enough to measure the performance of different algorithms. So I decided to take a larger text as input, in this case Michael Jackson’s Wikipedia page, as a tribute to the late King of Pop. The total file was 273 KB.

I modified Dominuz’s code to open the file and perform the subst function a thousand times, while recording how long each run took. The results were stored in a text file. Here’s my initial modification of his code:

#include <iostream>
#include <fstream>
#include <cstdio>	// sprintf_s
#include <cstring>	// strlen
#include <Windows.h>

using namespace std; 

char vogais[] = {'a','e','i','o','u'};

void trunc_double()
{
	fstream file( "out.txt", ios::out | ios::trunc ) ;
	file.close() ;
}

void save_float( double d )
{
	char buffer[256] ;
	fstream file("out.txt", ios::out | ios::app ) ;

	sprintf_s(buffer,"%.15f\n", d ) ;
	file.write( buffer, strlen(buffer) ) ;
	file.close() ;
}

char* malloc_and_setup_buffer()
{
	char *b = NULL ;
	int b_size = 0 ;
	fstream file ;

	file.open( "mj.txt", ios::in | ios::binary ) ;
	if( !file ) 
		return NULL ;

	file.seekg( 0, ios_base::end ) ;
	b_size = (int) file.tellg() ;
	file.seekg( 0, ios_base::beg ) ;
	b = new char[b_size + 1];	// one extra byte for the terminator
	file.read( b, b_size ) ;
	file.close() ;
	b[b_size] = '\0' ;	// terminate after the last byte instead of overwriting it

	return b ;
}

void dealloc_buffer( char* b )
{
	if( !b ) 
		return ;

	delete[] b ;
}

void subst(char* nome)
{
	if( !nome ) 
		return ;

    for( unsigned int i = 0; nome[i] != '\0'; ++i )
	{
        for( unsigned int j = 0; j < 5; ++j )
        {
                if( nome[i] == vogais[j] )
                {
                        nome[i] = '*';
                        break;
                }
        }
    }      
}


int main()
{
	LARGE_INTEGER lg0, lg1, frequency ;	
	QueryPerformanceFrequency(&frequency) ;

	trunc_double() ;

	for( int i=0; i<1000; i++ )
	{
		char* b = malloc_and_setup_buffer() ;
		if( !b ) 
			return 0;
		QueryPerformanceCounter(&lg0) ;
		subst( b );
		QueryPerformanceCounter(&lg1) ;

		float dt = (float)(lg1.QuadPart - lg0.QuadPart)/(float)frequency.QuadPart;
	
		save_float(dt) ;
	
		dealloc_buffer( b ) ;		    	
	}

	return 0;
}

As you can see, I’m only concerned with how long it takes to run the subst function. After putting this data into Excel, I got an average of 0.002205584 seconds for this first sample. From that basic setup, I started the optimizations.

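I quickly saw that you could force-inline the subst function, making a call to it faster. However, the biggest issue I had with it was accessing ‘vogais’ as a global variable instead of a local function variable.

What’s the big deal about that? Strictly speaking both live in memory, since the stack is memory too. The difference is that a global array can be modified from anywhere in the program, so the compiler may have to reload it from its fixed address every time, whereas a local array belongs to this function alone: it sits on the stack, which is almost always hot in the cache, and the compiler is free to keep its values in registers or fold them into constants. In practice, local access tends to win.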

So after re-implementing subst I got this:

__inline void subst(char* nome)
{
	if( !nome ) 
		return ;

	char vogais[] = {'a','e','i','o','u'};

    for( unsigned int i = 0; nome[i] != '\0'; ++i )
	{
        for( unsigned int j = 0; j < 5; ++j )
        {
                if( nome[i] == vogais[j] )
                {
                        nome[i] = '*';
                        break;
                }
        }
    }      
}

After running it again and calculating the average I got 0.001782024 seconds. Not bad, I got a bit of an improvement over the previous algorithm. It still wasn’t enough to say I had a significant impact though.

If you look inside subst we have two loops: one iterates through each letter while the other iterates through each vowel. Loops translate into assembly as compare and jump instructions, and branches can be expensive, especially when the CPU mispredicts them. So in order to get rid of jump instructions I applied a technique called ‘loop unrolling’. Below is the result:

__inline void subst(char* nome)
{
	if( !nome ) 
		return ;

	for( unsigned int i = 0; nome[i] != '\0'; ++i )
	{
        if( nome[i] == 'a' || nome[i] == 'e' || nome[i] == 'i' || nome[i] == 'o' || nome[i] == 'u' )
			nome[i] = '*';
    }      
}

Its average was 0.001161438 seconds. Not bad! Almost twice the performance when compared to the first one. But I wasn’t ready to finish yet.

You see my big issue with this model is that we are performing five vowel comparisons per letter. This translates to several compare and jump instructions in assembly and that’s just slow. I had to think of a way of getting rid of those comparisons.

After giving it some thought, I found a way out! I used a lookup table. Since a byte can only have 256 different values, I created a lookup table 256 bytes in size. All bytes were set to 0, except indexes 97, 101, 105, 111 and 117, which were set to 1. What’s special about those values? Well, they are exactly the codes for the vowels a, e, i, o and u in the ASCII chart.

I thus use the letter itself as the index in the lookup table, which indicates if it’s a vowel or not. Here’s the code:

__inline void subst(char* nome)
{
	if( !nome ) 
		return ;

	char table[256] ;
	memset( table, 0, sizeof table ) ;
	table['a'] = 1 ;
	table['e'] = 1 ;
	table['i'] = 1 ;
	table['o'] = 1 ;
	table['u'] = 1 ;

	for( unsigned int i = 0; nome[i] != '\0'; ++i )
		if( table[ (unsigned char) nome[i] ] )	// plain char may be signed; a negative index would fall outside the table
			nome[i] = '*';       
}

The result? 0.000951169 seconds. Not bad, I managed to get under 1 millisecond.

I kept thinking though that there was something else that I could add to this code… somehow I was missing something obvious. After a few minutes it hit me! I could do loop unrolling again and perform even fewer jump instructions!

After some testing, I decided to unroll the loop 4 times. For that I had to split the loop into two parts. The first part iterates in steps of four, up to a multiple of four that does not exceed the string’s size. The second loop takes care of the few characters left over at the end of the text.

Now, to round a number down to the closest multiple using integer arithmetic you do:

number = ( number / multiple ) * multiple

Integer division discards the remainder, so dividing first and then multiplying back gives us our value. The problem though is that multiplication and division are expensive and should be avoided in fast algorithm construction. Is there a way to get rid of them?

Well yes! We just have to use bitwise operations. I didn’t choose to unroll the loop four times for no reason. Since four is a power of two, it has the following property: adding three (the multiple minus one) to a number and then performing a bitwise AND with the bitwise complement of three gives the closest multiple of four that is greater than or equal to that number. We just subtract four once from the result and voilà, we have a multiple of four that is safely below the original number.
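A few concrete values make the bit trick easier to follow. The helper names below are hypothetical and exist only for illustration; the expressions assume a non-negative size.

```cpp
// Hypothetical helpers, for illustration only. All of them rely on 4 being
// a power of two, so that ~3 clears exactly the two lowest bits.

// Smallest multiple of 4 that is >= n (for n >= 0).
inline int round_up_to_4(int n)   { return (n + 3) & ~3; }

// Largest multiple of 4 that is <= n (for n >= 0).
inline int round_down_to_4(int n) { return n & ~3; }

// Round up, then step back one block of four.
// The parentheses matter: binary minus binds tighter than &, so
// (n + 3) & ~3 - 4 would be parsed as (n + 3) & (~3 - 4).
inline int unrolled_bound(int n)  { return ((n + 3) & ~3) - 4; }
```

For a size of 5, round_up_to_4 gives 8, round_down_to_4 gives 4 and unrolled_bound gives 4. For a size of 8, unrolled_bound still gives 4, so the clean-up loop handles the last four characters, whereas round_down_to_4 would let the unrolled loop cover all eight.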

So finally here’s the result:

__inline void subst( buffer_data bd )
{
	char *b = bd.b ;	
	char table[256] ;

	memset( table, 0, sizeof table ) ;
	table['a'] = 1 ;
	table['e'] = 1 ;
	table['i'] = 1 ;
	table['o'] = 1 ;
	table['u'] = 1 ;

	int i ;
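	// Round the size up to a multiple of four, then step back one block so
	// that i+3 never runs past the end. The parentheses are required:
	// binary minus binds tighter than &.
	const int upper_s = ((bd.size_b + 3) & ~0x03) - 4 ;
	for( i = 0; i < upper_s ; i+=4 )
	{
		// Cast to unsigned char: plain char may be signed, and a negative index is out of bounds.
		if( table[ (unsigned char) bd.b[i] ] )
			bd.b[i] = '*';       

		if( table[ (unsigned char) bd.b[i+1] ] )
			bd.b[i+1] = '*';       

		if( table[ (unsigned char) bd.b[i+2] ] )
			bd.b[i+2] = '*';       

		if( table[ (unsigned char) bd.b[i+3] ] )
			bd.b[i+3] = '*';       
	}

	// Clean-up loop for the characters the unrolled loop did not reach.
	for( ; i < bd.size_b ; i++ )
	{
		if( table[ (unsigned char) bd.b[i] ] )
			bd.b[i] = '*';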
	}
}

The result is then 0.000845947 seconds. I was almost ready to settle with it but I remembered one last detail: cache.

I’m using a 256-entry lookup table, but I only care about the lowercase letters in the ASCII chart. I could thus reduce the table to 32 entries, which is the closest power of two above the 26 letters of the alphabet. The bright side of having a smaller lookup table is that it is easier to keep in the CPU’s cache. Fewer cache misses, faster algorithm. So here’s the code with that in mind:

__inline void subst( buffer_data bd )
{
	char *b = bd.b ;	
	char table[32] ;

	memset( table, 0, sizeof table ) ;
	table['a'-97] = 1 ;
	table['e'-97] = 1 ;
	table['i'-97] = 1 ;
	table['o'-97] = 1 ;
	table['u'-97] = 1 ;

	int i ;
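	const int upper_s = ((bd.size_b + 3) & ~0x03) - 4 ;	// parentheses needed: minus binds tighter than &
	for( i = 0; i < upper_s ; i+=4 )
	{
		// Range-check each character on its own: only 'a'..'z' (97..122) may index the table.
		if( bd.b[i] >= 97 && bd.b[i] <= 122 && table[ bd.b[i] - 97 ] )
			bd.b[i] = '*';       

		if( bd.b[i+1] >= 97 && bd.b[i+1] <= 122 && table[ bd.b[i+1] - 97 ] )
			bd.b[i+1] = '*';       

		if( bd.b[i+2] >= 97 && bd.b[i+2] <= 122 && table[ bd.b[i+2] - 97 ] )
			bd.b[i+2] = '*';       

		if( bd.b[i+3] >= 97 && bd.b[i+3] <= 122 && table[ bd.b[i+3] - 97 ] )
			bd.b[i+3] = '*';       
	}

	for( ; i < bd.size_b ; i++ )
	{
		if( bd.b[i] >= 97 && bd.b[i] <= 122 && table[ bd.b[i] - 97 ] )
			bd.b[i] = '*';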
	}
}

And the final result is 0.00084498 seconds. Not much faster than the previous algorithm, but a lot faster than the first one. In fact it’s over two and a half times faster.

Now I’m sure that if I kept at it I’d find even better ways to optimize this algorithm, but I’m settling for what I’ve got for now. With this exercise I just wanted to prove a few points:

1.      Knowing computer architecture and how code translates into assembly can be quite useful.

2.      Like Michael Abrash pointed out, there’s no such thing as the fastest code in the west.

3.      Knowing bitwise and pointer arithmetic helps as well.

4.      I should have a better social life.

Well I guess that’s it for now! Download the sample code (Fast Vowel) and take a look at it yourself. Do keep in mind that I only tested this on one machine, and in different environments the results may vary.

The comparison table (average time per run, in seconds):

  • Dominuz’s code: 0.002205584
  • Local variable + inline : 0.001782024
  • vowel loop unrolling : 0.001161438
  • 256 lookup table : 0.000951169
  • 256 lookup table + loop unrolling : 0.000845947
  • 32 lookup table + loop unrolling : 0.00084498
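The "times faster" figures quoted in the text are simply the baseline average divided by the average of the variant in question. As a tiny sketch (speedup is a hypothetical helper name):

```cpp
// Hypothetical helper: how many times faster `seconds` is than `baseline_seconds`.
inline double speedup(double baseline_seconds, double seconds)
{
    return baseline_seconds / seconds;
}
```

With the numbers from the list, speedup(0.002205584, 0.00084498) comes out at roughly 2.61, and speedup(0.002205584, 0.001161438) at roughly 1.90.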

Reading, re-reading and grokking

By J.Raza On August 7th, 2010

Well, it took a while, but I finally managed to finish reading Michael Abrash’s Graphics Programming Black Book.

In some ways I feel ashamed to say that I ‘read’ it, because it’s a behemoth of knowledge spanning over 1200 pages. Of course I learned a lot from it, but it’s the sort of book I’ll read again and again and again until I can finally grok it. It’s a collection of over 10 years of Abrash’s papers and I doubt one can absorb it in a matter of months.

You see, to learn programming concepts in a self-taught manner I think it’s crucial to not only read the code in the book, but also write it down and play around with it, to truly understand what is being taught in its finer details. With my current project I intend to do just that, since it’s an FPS and the last chapters of the book deal directly with Quake’s development at id Software, where Abrash worked.

What’s more interesting is that the author doesn’t focus only on the development aspect of programming, but on the general mentality behind it: not just how one solves a problem, but the mind that solves it. As you become better at development I think you ask yourself less “how do I solve this problem” and more “what is the best way to solve this problem”. Abrash shows us several ways to solve a problem in the book, be it linked lists, spatial visibility or making a faster Game of Life, each one consistently faster than the last thanks to assembly optimizations, algorithm optimizations or rethinking the whole approach to the problem. The idea is to not expect that there is only one way to handle an issue. In his own words: “Assume nothing”.

The book is also quite pleasant to read, since the author narrates the development cycle more as a journal than as a tech book. It’s quite interesting to read the last chapters, where he focuses on making a faster back-to-front polygon rendering approach for Quake. It almost goes like this:

March 14, 1941. We begin our approach to the BSP tree. We’re still taking heavy losses trying to figure out a way to make the spatial visibility problem faster. The worlds we want for Quake feature at least 5000 polygons, and in the worst case scenario we redraw each pixel 5 times. It’s too slow, we must take a better approach.

May 22, 1941. We successfully managed to create a potentially visible set (PVS) that managed to break through enemy lines. We will now proceed to use it to flank their defenses.

June 10, 1941. We have now conquered the enemy’s battlefield. I’ve reduced the inner loop of the rasterizer to 2.5 cycles per texel. We decided to use z-buffering for drawing the enemy meshes, since it’s faster and not as big of a problem as we expected. Victory is imminent.

And so on. Overall the book can be divided into 3 parts:

  • General assembly optimization techniques
  • 3D rendering done via software
  • Common 3D engine development problems and solutions.

I recommend it to anyone who’s interested in approaching game development or programming seriously yet elegantly. I learned a lot from it, and still intend to learn more.

Assembly stuff

By J.Raza On March 10th, 2010

I’ve always admired Michael Abrash. If you don’t know who he is, the man is sort of a legend. He helped develop the original Quake engine and wrote tons of articles on how to get the maximum speed out of your PC in the early 90s, when cycle counting used to be something respected.

I’ve always been interested in learning from him and reading his articles, so I plucked up the courage and started reading assembly books for the 8086. I picked up a real nice one, Assembly Language Step-by-Step 2nd edition, and started studying.

I could say I “learned” assembly for the 8086 in college, but it was a basic course that taught mostly the beginner stuff, you know: mov this, add that, inc this, int 21h that. Understanding the instructions individually was not enough for me; I wanted to get to the level Abrash talked about.

So I finished that book and have been reading and re-reading it as I go through Zen of Code Optimization and Write Great Code, Volume 2.

I must say it’s been a bumpy ride, going back and forward between these books. Reading a chapter from one, going back to the other, while trying to understand what I just read. No wonder it takes time to master this sort of stuff.

Personally, I also find it very satisfying to be able to read these sorts of books and at least understand them. It gives a great sense of improvement. When I finish all 3 of them, let’s see how I am.

But right now I want to share with you a small victory I feel I just had. One basic memory handling function in C is

memset( dest, val, size ) ;

I stepped into its assembly code and managed to understand it. The most important instruction inside this function is

rep stosd

which is what causes the memory to be set once all the registers have been set up. Inside there’s a bunch of checks for redundancies and type safety, so I wrote the following, which does what memset does and only the memory setting. No type checking, no register juggling, no nothing. This is what I got:

mov eax, 10            ; value stored into each dword
mov ecx, 16            ; number of dwords to write
lea edi, dword ptr a   ; destination address
rep stosd              ; store EAX at [EDI], advance EDI by 4 (direction flag clear), repeat ECX times

Which is basically what Assembly Language Step-by-Step teaches in one of its chapters. In the end this ends up being 2 cycles faster than memset.

2 cycles faster. I am proud of myself ahah.

Yes, a small victory, but hopefully one of many to come. Let’s see how this goes.