Step by Step fast algorithm construction in native code

By J.Raza On January 28th, 2011

A few weeks ago I saw this forum question where a user asked help on making an algorithm that:

“replaces all the vowels in a string with the character ‘*’”

The other users in the forum quickly replied and helped him with that question. For some reason though, that problem stuck to my head. Sure it’s a simple algorithm, but I thought what if I added a new constraint: It has to be done fast, really, really fast.

To me this created a whole new approach to the problem, and I was interested in seeing how far I could take it. So, for starters I took the best reply to the forum thread from user Dominuz (who himself stated that he focused more on clarity and ease of understanding rather than optimization )

Below is his sample code:

#include <stdio.h>

char vogais[] = {'a','e','i','o','u'};

void subst(char* nome)
{
        for( unsigned int i = 0; nome[i] != '\0'; ++i )
        {
                for( unsigned int j = 0; j < 5; ++j )
                {
                        if( nome[i] == vogais[j] )
                        {
                                nome[i] = '*';
                                break;
                        }
                }
        }       
}

int main()
{
        char criatura[] = "12 Criatura aeiou 123456";
        printf( "%s \n", criatura );
        
        subst( criatura );
        
        printf( "%s \n", criatura );
        
        return 0;
}

For starters, I knew a simple text string would not be enough to measure different algorithms performance. So I decided to take a larger text as input, in this case Michael Jackson’s Wikipedia page as a tribute to the late king of pop. The total file was 273kb.

I modified Dominuz code to open the file, perform the subst function a thousand times, while recording how long it took to do that. The result was stored in a text file. Here’s my initial modification of his code:

#include <iostream>
#include <fstream>
#include <Windows.h>

using namespace std; 

char vogais[] = {'a','e','i','o','u'};

void trunc_double()
{
	fstream file( "out.txt", ios::out | ios::trunc ) ;
	file.close() ;
}

void save_float( double d )
{
	char buffer[256] ;
	fstream file("out.txt", ios::out | ios::app ) ;

	sprintf_s(buffer,"%.15f\n", d ) ;
	file.write( buffer, strlen(buffer) ) ;
	file.close() ;
}

char* malloc_and_setup_buffer()
{
	char *b = NULL ;
	int b_size = 0 ;
	fstream file ;

	file.open( "mj.txt", ios::in | ios::binary ) ;
	if( !file ) 
		return NULL ;

	file.seekg( 0, ios_base::end ) ;
	b_size = (int) file.tellg() ;
	file.seekg( 0, ios_base::beg ) ;
	b = new char[b_size];
	file.read( b, b_size ) ;
	file.close() ;
	b[b_size-1] = '\0' ;

	return b ;
}

void dealloc_buffer( char* b )
{
	if( !b ) 
		return ;

	delete[] b ;
}

void subst(char* nome)
{
	if( !nome ) 
		return ;

    for( unsigned int i = 0; nome[i] != '\0'; ++i )
	{
        for( unsigned int j = 0; j < 5; ++j )
        {
                if( nome[i] == vogais[j] )
                {
                        nome[i] = '*';
                        break;
                }
        }
    }      
}


int main()
{
	LARGE_INTEGER lg0, lg1, frequency ;	
	QueryPerformanceFrequency(&frequency) ;

	trunc_double() ;

	for( int i=0; i<1000; i++ )
	{
		char* b = malloc_and_setup_buffer() ;
		if( !b ) 
			return 0;
		QueryPerformanceCounter(&lg0) ;
		subst( b );
		QueryPerformanceCounter(&lg1) ;

		float dt = (float)(lg1.QuadPart - lg0.QuadPart)/(float)frequency.QuadPart;
	
		save_float(dt) ;
	
		dealloc_buffer( b ) ;		    	
	}

	return 0;
}

As you can see, I’m only worried on how long it takes to run the subst function. After putting this data into excel, I got that the average for sample one is 0.002205584 seconds. From that basic setup, I started the optimizations.

A quickly saw that you could force inline the subst function, making a call to it faster. However the biggest issue I had with it was accessing ‘vogais’ as a global variable, instead of a local function variable.

What’s the big deal about that? Well when you reference a global variable you are referencing an area of memory, contrary to when you access a local variable, which implies you are referencing the stack. And in terms of access speeds, stack beats memory.

So after re-implementing subst I got this:

__inline void subst(char* nome)
{
	if( !nome ) 
		return ;

	char vogais[] = {'a','e','i','o','u'};

    for( unsigned int i = 0; nome[i] != '\0'; ++i )
	{
        for( unsigned int j = 0; j < 5; ++j )
        {
                if( nome[i] == vogais[j] )
                {
                        nome[i] = '*';
                        break;
                }
        }
    }      
}

After running it again and calculating the average I got 0.001782024 seconds. Not bad, I got a bit of an improvement over the previous algorithm. It still wasn’t enough to say I had a significant impact though.

If you look inside subst we have two loops, one iterates through each letter while the other iterates through each vowel. Well loops translate into assembly as jump instructions and those are expensive. So in order to get rid of jump instructions I performed a technique called ‘loop unrolling’. Below is the result:

__inline void subst(char* nome)
{
	if( !nome ) 
		return ;

	for( unsigned int i = 0; nome[i] != '\0'; ++i )
	{
        if( nome[i] == 'a' || nome[i] == 'e' || nome[i] == 'i' || nome[i] == 'o' || nome[i] == 'u' )
			nome[i] = '*';
    }      
}

Its average was 0.001161438 seconds. Not bad! Almost twice the performance when compared to the first one. But I wasn’t ready to finish yet.

You see my big issue with this model is that we are performing five vowel comparisons per letter. This translates to several compare and jump instructions in assembly and that’s just slow. I had to think of a way of getting rid of those comparisons.

After giving it some thought, I found a way out! I used a lookup table. Since bytes can only have 256 different values, I created a lookup table with 256 bytes in size. All bytes were set to 0, except indexes 97, 101, 105, 111 and 117, who were set to 1. What’s special with those values? Well they are exactly the indexes for the vowels a, e, i, o and u in the ascii chart.

I thus use the letter itself as the index in the lookup table, which indicates if it’s a vowel or not. Here’s the code:

__inline void subst(char* nome)
{
	if( !nome ) 
		return ;

	char table[256] ;
	memset( table, 0, sizeof table ) ;
	table['a'] = 1 ;
	table['e'] = 1 ;
	table['i'] = 1 ;
	table['o'] = 1 ;
	table['u'] = 1 ;

	for( unsigned int i = 0; nome[i] != '\0'; ++i )
		if( table[ nome[i] ] )
			nome[i] = '*';       
}

The result? 0.000951169seconds. Not bad, managed to go in under 1 microsecond.

I kept thinking though that there was something else that I could add to this code… somehow I was missing something obvious. After a few minutes it hit me! I could do loop unrolling again and perform even less jump instructions!

After some testing, I decided to unroll the loop 4 times. For that I had to split the loop into two parts. The first part would have to iterate all the way up to the closest multiple of four, but not greater than the string’s size. The second loop I would need to take care of the last 3 potential characters left in the text

Now to reach the closest multiple of a number using integer arithmetic you do:

Number = ( number * multiple ) / multiple

Due to the nature of integer division, since it naturally rounds down the numbers, we reach our value. The problem though is that multiplication and division are expensive and should be avoided in fast algorithm construction. Is there a way to get rid of them?

Well yes! We just have to use bitwise operations. I didn’t just choose to unroll the loop four times for no reason. Since four is a power of two, it thus has the following property: adding it to a number, then performing the bitwise AND with its negative bitwise gives the closer or greater multiple of that number. We just subtract the number once from the result and alas, we have the closest or lower multiple of four from the original number.

So finally here’s the result:

__inline void subst( buffer_data bd )
{
	char *b = bd.b ;	
	char table[256] ;

	memset( table, 0, sizeof table ) ;
	table['a'] = 1 ;
	table['e'] = 1 ;
	table['i'] = 1 ;
	table['o'] = 1 ;
	table['u'] = 1 ;

	int i ;
	const int upper_s = (bd.size_b + 3) & ~0x03 - 4 ;
	for( i = 0; i < upper_s ; i+=4 )
	{
		if( table[ bd.b[i] ] )
			bd.b[i] = '*';       

		if( table[ bd.b[i+1] ] )
			bd.b[i+1] = '*';       

		if( table[ bd.b[i+2] ] )
			bd.b[i+2] = '*';       

		if( table[ bd.b[i+3] ] )
			bd.b[i+3] = '*';       
	}

	for( ; i < bd.size_b ; i++ )
	{
		if( table[ bd.b[i] ] )
			bd.b[i] = '*';
	}
}

The result is then 0.000845947 seconds. I was almost ready to settle with it but I remembered one last detail: cache.

I’m using a 256 size lookup table. But I only care about the alphabet characters in the ascii chart. I could thus reduce the table to 32, which is the closest power of two multiple from 23. The bright side of having a smaller lookup table is that we manage to maintain it longer in the CPU’s cache. Less cache misses, faster algorithm. So here’s the code with that in mind:

__inline void subst( buffer_data bd )
{
	char *b = bd.b ;	
	char table[32] ;

	memset( table, 0, sizeof table ) ;
	table['a'-97] = 1 ;
	table['e'-97] = 1 ;
	table['i'-97] = 1 ;
	table['o'-97] = 1 ;
	table['u'-97] = 1 ;

	int i ;
	const int upper_s = (bd.size_b + 3) & ~0x03 - 4 ;
	for( i = 0; i < upper_s ; i+=4 )
	{
		if( bd.b[i] < 97 || bd.b[i] > 122 )
			continue ;

		if( table[ bd.b[i] - 97 ] )
			bd.b[i] = '*';       

		if( table[ bd.b[i+1] - 97 ] )
			bd.b[i+1] = '*';       

		if( table[ bd.b[i+2] - 97 ] )
			bd.b[i+2] = '*';       

		if( table[ bd.b[i+3] - 97 ] )
			bd.b[i+3] = '*';       
	}

	for( ; i < bd.size_b ; i++ )
	{
		if( bd.b[i] < 97 || bd.b[i] > 122 )
			continue ;

		if( table[ bd.b[i] - 97 ] )
			bd.b[i] = '*';
	}
}

And the final result is 0.00084498. Not much faster from the previous algorithm but a lot faster when compared to the first one. In fact it’s over two and half times faster.

Now I’m sure that if I kept at it I’d find even better ways to optimize this algorithm, but I’m settling with what I got for now. With this exercise I just wanted to prove a few points:

1.      Knowing computer architecture and how code translates into assembly can be quite useful.

2.      Like Michael Abrash pointed out, there’s no such thing as the fastest code in the west.

3.      Knowing bitwise and pointer arithmetic helps as well.

4.      I should have a better social life.

Well I guess that’s it for now! Click here: Fast Vowel (87) to download the sample code and take a look at it yourself. Do keep in mind that I only tested this in one machine and in different environments the results may vary.

The comparisson table:

  • Dominuz code: 0.002205584
  • Local variable + inline : 0.001782024
  • vowel loop unrolling : 0.001161438
  • 256 lookup table : 0.000951169
  • 256 lookup table + loop unrolling : 0.000845947
  • 32 lookup table + loop unrolling : 0.00084498

Learning and re-learning

By J.Raza On April 4th, 2010

Today I’ll review two books I read:  Professional Assembly and Assembly Language Step-By-Step.

One interesting thing is that both approach the same topic, teaching assembly language, from different perspectives. Step-by-step takes care of explaining the history of Intel’s CPU architecture, from the segmented mode to flat mode, detailing the intricacies of segmented programming along the way. It takes over 150 pages just to get to the first line of assembly code in the book.

Professional assembly on the other hand is a rocket ship, taking no apologies and going full throttle into Intel’s assembly structures and the GNU’s assembler (gas) syntax.

Step-by-step takes care of explaining basic computer architecture so that in the end you can understand assembly and it’s logic. Professional assumes you know that and blasts off, which I think in the end makes it a better book.

You see I’ll probably go back to Professional Assembly when I need a reference or review a topic, because Step-by-step while good at explaining things, once you got them there ain’t much left to go back to. Now you may ask me, which one should you buy? My answer is:

Both.

You see, one thing that I learned is that it’s good to study a particular topic several times, even if you are already familiar with it. An authors approach, no matter how good it is, will not be universal. That’s because it’s his approach to the topic, there are other things that can either be better explained or better elaborated upon.

And the more you look and study at a particular topic, the more universal your approach to it will be.

Book Review

By J.Raza On March 22nd, 2010

I finish reading Write Great Code Vol2 yesterday. I must say it’s a good book, although with one complaint.

Let’s start with the good side. It’s very well written and throughly elaborate on explaining how your compiler turns the high-level language statements into low level assembly. It then goes on for hundreds of pages explaining how to optmize that, from function calls,  arrays, structures to small if/else jump conditions, local, static and global variables differences. Basically covers most of the issues a programmer will face or would ask himself “how will this turn into assembly”.

The only complaint I have is not in the book itself but the author over and over and over and over again telling you that if you want to look further into a particular topic that the book doesn’t cover (like line caches) he states that you should get his other book, Write Great Code vol1. I don’t really have anything against good advertisement, but it just gets tiresome after the 10th time he does that.

Still It’s a good read and I enjoyed it. My copy of WGCvol1 is already on the way from Amazon and i’m pretty sure it’ll be a good read as well.

Reading and Writing

By J.Raza On March 19th, 2010

Like I said on my previous post, I`ve been reading Write Great Code Vol2. One of the ideas behind reading it was gaining more assembly knoledge and being able to write and optmize better assembly code. That is happening, at a small pace but gradually, though something more interesting began to happen that I wasn`t really expecting.

I`m now am able to read my compilers assembly output and have a better understandment of whats going on behind the scene. Now this may not sound like a big thing, but take the following example:

I was coding a particle engine for a new game i`m working on. Now I wanted this game to be a particle fest, like Geometry Wars. You could describe the basic particle loop engine as:

  1. update particle position
  2. render particle
  3. repeat for every particle

Steps 2 and 3 are a bit of self explanatory, but i was spending way to much time on step 1. Out of curiosity I started looking at the assembly code on that section and saw that the debug build was taking way more overhead than necessary.

You see each particle is an object, and I was accessing its position via setters and getters. That was slowing way to much the debug build because of all the function calls. On the release the compiler would optmize it away, but since I tested the debug version more than the release, it slowed down the development process as a whole.

So I made a decision that most software engineers wouldn`t recommend and made the particles position public instead of private, and accessed them directly. Alas the debug version speed up!

I thought to myself that when trying to write fast code, sometimes giving up good software practices might do the trick. I was a bit skeptic at first, but ironically I ended up reading the exact same statemnt on Write Great Code Vol2. The author even used the same example of setters/getters.

Regardless of if this is a good practice or not, I`m glad I`m able to have a better undestandment of my code, how it works and to know what kind of choices should I take as each situtation arises.

I`m defiantly going to pick up Write Great Code Vol1 after finishing this one.

Assembly stuff

By J.Raza On March 10th, 2010

I’ve always admired Michael Abrash. If you don’t know who he is the man is sort of a legend. He helped develop the original quake engine, wrote tons of articles on how to get the maximum speed out of your pc in the early 90s when cycle counting used to be something respected.

I’ve always had interest in learning from him and reading his articles so I picked up courage and started reading assembly books for the 8086. I picked up a real nice one, Assembly Language Step-by-Step 2nd edition, and started studying.

I could say I “learned” this in college, assembly for the 8086 but it was a basic course and taught mostly the beginning stuff you know. mov this, add that, inc this, int 21h that. Just because I could understand the instructions from an individual point of view for me it was not enough, I wanted to get to the level Abrash talked about.

So I finished that book and have been reading and re-reading it as I go through Zen of Code Optmization and Write Great Code vol2.

I must say it’s being a bumpy ride, going back and foward with these books. Reading a chapter from one, going back to the other, while trying to understand what I just read. No wonder it takes time to master this sort of stuff.

Personally to me also it’s very satisfying to be able to read these sorts of books and be able to at least understand them. Gives a great sense of improvement. When I finish all 3 of them let’s see how I am.

But right now I want to share with you a small victory I feel I just had. One basic memory handling function in C is

memset( dest, val, size ) ;

I went inside it in it’s assembly code and managed to understand it. The most important instruction inside this function is

rep stosd

which is what causes the memory to be set once all the registors have been setup. Inside theres a bunch of checks for redundancies and type safeties, so I wrote the following that does that a memset and only the memory settting. No type checking, register juggling, no nothing. This is what I got:

mov eax, 10
mov ecx, 16
lea edi, dword ptr a
rep stosd

Which is basically what the Assembly book step by step teaches in one of its chapters. In the end this ends being 2 cycles faster than memset.

2 cycles faster. I am proud of myself ahah.

yes a small victory, but hopefully one of many to come. Let’s see how this goes.