Archive for the ‘optmization’ tag
Step by Step fast algorithm construction in native code 3 comments
A few weeks ago I saw this forum question where a user asked help on making an algorithm that:
“replaces all the vowels in a string with the character ‘*’”
The other users in the forum quickly replied and helped him with that question. For some reason though, that problem stuck to my head. Sure it’s a simple algorithm, but I thought what if I added a new constraint: It has to be done fast, really, really fast.
To me this created a whole new approach to the problem, and I was interested in seeing how far I could take it. So, for starters I took the best reply to the forum thread from user Dominuz (who himself stated that he focused more on clarity and ease of understanding rather than optimization )
Below is his sample code:
#include <stdio.h>
char vogais[] = {'a','e','i','o','u'};
void subst(char* nome)
{
for( unsigned int i = 0; nome[i] != '\0'; ++i )
{
for( unsigned int j = 0; j < 5; ++j )
{
if( nome[i] == vogais[j] )
{
nome[i] = '*';
break;
}
}
}
}
int main()
{
char criatura[] = "12 Criatura aeiou 123456";
printf( "%s \n", criatura );
subst( criatura );
printf( "%s \n", criatura );
return 0;
}
For starters, I knew a simple text string would not be enough to measure different algorithms performance. So I decided to take a larger text as input, in this case Michael Jackson’s Wikipedia page as a tribute to the late king of pop. The total file was 273kb.
I modified Dominuz code to open the file, perform the subst function a thousand times, while recording how long it took to do that. The result was stored in a text file. Here’s my initial modification of his code:
#include <iostream>
#include <fstream>
#include <Windows.h>
using namespace std;
char vogais[] = {'a','e','i','o','u'};
void trunc_double()
{
fstream file( "out.txt", ios::out | ios::trunc ) ;
file.close() ;
}
void save_float( double d )
{
char buffer[256] ;
fstream file("out.txt", ios::out | ios::app ) ;
sprintf_s(buffer,"%.15f\n", d ) ;
file.write( buffer, strlen(buffer) ) ;
file.close() ;
}
char* malloc_and_setup_buffer()
{
char *b = NULL ;
int b_size = 0 ;
fstream file ;
file.open( "mj.txt", ios::in | ios::binary ) ;
if( !file )
return NULL ;
file.seekg( 0, ios_base::end ) ;
b_size = (int) file.tellg() ;
file.seekg( 0, ios_base::beg ) ;
b = new char[b_size];
file.read( b, b_size ) ;
file.close() ;
b[b_size-1] = '\0' ;
return b ;
}
void dealloc_buffer( char* b )
{
if( !b )
return ;
delete[] b ;
}
void subst(char* nome)
{
if( !nome )
return ;
for( unsigned int i = 0; nome[i] != '\0'; ++i )
{
for( unsigned int j = 0; j < 5; ++j )
{
if( nome[i] == vogais[j] )
{
nome[i] = '*';
break;
}
}
}
}
int main()
{
LARGE_INTEGER lg0, lg1, frequency ;
QueryPerformanceFrequency(&frequency) ;
trunc_double() ;
for( int i=0; i<1000; i++ )
{
char* b = malloc_and_setup_buffer() ;
if( !b )
return 0;
QueryPerformanceCounter(&lg0) ;
subst( b );
QueryPerformanceCounter(&lg1) ;
float dt = (float)(lg1.QuadPart - lg0.QuadPart)/(float)frequency.QuadPart;
save_float(dt) ;
dealloc_buffer( b ) ;
}
return 0;
}
As you can see, I’m only worried on how long it takes to run the subst function. After putting this data into excel, I got that the average for sample one is 0.002205584 seconds. From that basic setup, I started the optimizations.
A quickly saw that you could force inline the subst function, making a call to it faster. However the biggest issue I had with it was accessing ‘vogais’ as a global variable, instead of a local function variable.
What’s the big deal about that? Well when you reference a global variable you are referencing an area of memory, contrary to when you access a local variable, which implies you are referencing the stack. And in terms of access speeds, stack beats memory.
So after re-implementing subst I got this:
__inline void subst(char* nome)
{
if( !nome )
return ;
char vogais[] = {'a','e','i','o','u'};
for( unsigned int i = 0; nome[i] != '\0'; ++i )
{
for( unsigned int j = 0; j < 5; ++j )
{
if( nome[i] == vogais[j] )
{
nome[i] = '*';
break;
}
}
}
}
After running it again and calculating the average I got 0.001782024 seconds. Not bad, I got a bit of an improvement over the previous algorithm. It still wasn’t enough to say I had a significant impact though.
If you look inside subst we have two loops, one iterates through each letter while the other iterates through each vowel. Well loops translate into assembly as jump instructions and those are expensive. So in order to get rid of jump instructions I performed a technique called ‘loop unrolling’. Below is the result:
__inline void subst(char* nome)
{
if( !nome )
return ;
for( unsigned int i = 0; nome[i] != '\0'; ++i )
{
if( nome[i] == 'a' || nome[i] == 'e' || nome[i] == 'i' || nome[i] == 'o' || nome[i] == 'u' )
nome[i] = '*';
}
}
Its average was 0.001161438 seconds. Not bad! Almost twice the performance when compared to the first one. But I wasn’t ready to finish yet.
You see my big issue with this model is that we are performing five vowel comparisons per letter. This translates to several compare and jump instructions in assembly and that’s just slow. I had to think of a way of getting rid of those comparisons.
After giving it some thought, I found a way out! I used a lookup table. Since bytes can only have 256 different values, I created a lookup table with 256 bytes in size. All bytes were set to 0, except indexes 97, 101, 105, 111 and 117, who were set to 1. What’s special with those values? Well they are exactly the indexes for the vowels a, e, i, o and u in the ascii chart.
I thus use the letter itself as the index in the lookup table, which indicates if it’s a vowel or not. Here’s the code:
__inline void subst(char* nome)
{
if( !nome )
return ;
char table[256] ;
memset( table, 0, sizeof table ) ;
table['a'] = 1 ;
table['e'] = 1 ;
table['i'] = 1 ;
table['o'] = 1 ;
table['u'] = 1 ;
for( unsigned int i = 0; nome[i] != '\0'; ++i )
if( table[ nome[i] ] )
nome[i] = '*';
}
The result? 0.000951169seconds. Not bad, managed to go in under 1 microsecond.
I kept thinking though that there was something else that I could add to this code… somehow I was missing something obvious. After a few minutes it hit me! I could do loop unrolling again and perform even less jump instructions!
After some testing, I decided to unroll the loop 4 times. For that I had to split the loop into two parts. The first part would have to iterate all the way up to the closest multiple of four, but not greater than the string’s size. The second loop I would need to take care of the last 3 potential characters left in the text
Now to reach the closest multiple of a number using integer arithmetic you do:
Number = ( number * multiple ) / multiple
Due to the nature of integer division, since it naturally rounds down the numbers, we reach our value. The problem though is that multiplication and division are expensive and should be avoided in fast algorithm construction. Is there a way to get rid of them?
Well yes! We just have to use bitwise operations. I didn’t just choose to unroll the loop four times for no reason. Since four is a power of two, it thus has the following property: adding it to a number, then performing the bitwise AND with its negative bitwise gives the closer or greater multiple of that number. We just subtract the number once from the result and alas, we have the closest or lower multiple of four from the original number.
So finally here’s the result:
__inline void subst( buffer_data bd )
{
char *b = bd.b ;
char table[256] ;
memset( table, 0, sizeof table ) ;
table['a'] = 1 ;
table['e'] = 1 ;
table['i'] = 1 ;
table['o'] = 1 ;
table['u'] = 1 ;
int i ;
const int upper_s = (bd.size_b + 3) & ~0x03 - 4 ;
for( i = 0; i < upper_s ; i+=4 )
{
if( table[ bd.b[i] ] )
bd.b[i] = '*';
if( table[ bd.b[i+1] ] )
bd.b[i+1] = '*';
if( table[ bd.b[i+2] ] )
bd.b[i+2] = '*';
if( table[ bd.b[i+3] ] )
bd.b[i+3] = '*';
}
for( ; i < bd.size_b ; i++ )
{
if( table[ bd.b[i] ] )
bd.b[i] = '*';
}
}
The result is then 0.000845947 seconds. I was almost ready to settle with it but I remembered one last detail: cache.
I’m using a 256 size lookup table. But I only care about the alphabet characters in the ascii chart. I could thus reduce the table to 32, which is the closest power of two multiple from 23. The bright side of having a smaller lookup table is that we manage to maintain it longer in the CPU’s cache. Less cache misses, faster algorithm. So here’s the code with that in mind:
__inline void subst( buffer_data bd )
{
char *b = bd.b ;
char table[32] ;
memset( table, 0, sizeof table ) ;
table['a'-97] = 1 ;
table['e'-97] = 1 ;
table['i'-97] = 1 ;
table['o'-97] = 1 ;
table['u'-97] = 1 ;
int i ;
const int upper_s = (bd.size_b + 3) & ~0x03 - 4 ;
for( i = 0; i < upper_s ; i+=4 )
{
if( bd.b[i] < 97 || bd.b[i] > 122 )
continue ;
if( table[ bd.b[i] - 97 ] )
bd.b[i] = '*';
if( table[ bd.b[i+1] - 97 ] )
bd.b[i+1] = '*';
if( table[ bd.b[i+2] - 97 ] )
bd.b[i+2] = '*';
if( table[ bd.b[i+3] - 97 ] )
bd.b[i+3] = '*';
}
for( ; i < bd.size_b ; i++ )
{
if( bd.b[i] < 97 || bd.b[i] > 122 )
continue ;
if( table[ bd.b[i] - 97 ] )
bd.b[i] = '*';
}
}
And the final result is 0.00084498. Not much faster from the previous algorithm but a lot faster when compared to the first one. In fact it’s over two and half times faster.
Now I’m sure that if I kept at it I’d find even better ways to optimize this algorithm, but I’m settling with what I got for now. With this exercise I just wanted to prove a few points:
1. Knowing computer architecture and how code translates into assembly can be quite useful.
2. Like Michael Abrash pointed out, there’s no such thing as the fastest code in the west.
3. Knowing bitwise and pointer arithmetic helps as well.
4. I should have a better social life.
Well I guess that’s it for now! Click here: Fast Vowel (74) to download the sample code and take a look at it yourself. Do keep in mind that I only tested this in one machine and in different environments the results may vary.
The comparisson table:
- Dominuz code: 0.002205584
- Local variable + inline : 0.001782024
- vowel loop unrolling : 0.001161438
- 256 lookup table : 0.000951169
- 256 lookup table + loop unrolling : 0.000845947
- 32 lookup table + loop unrolling : 0.00084498
Fuck the ‘jeitinho brasileiro’ 1 comment
In Brazil I usually hear in social circles, or even in the news, the expression ‘jeitinho brasileiro’. It’s normally used to express a sort of “way of getting things done via the right person” attitude. It’s seen as a trait of Brazilians, that describes our social cordiality to one another, for it is our attempt to bend a system or process to aid someone in need. It’s usually regarded as a positive trait and encouraged for us Brazillians to have.
We shouldn’t.
The ‘jeitinho brasileiro’ displays a clear systematic failure in our capacity of getting things done. It displays our inability to create dynamic processes in a way that is fair, consistent and reliable to others. For if you need the cordiality of an individual at the right time, then clearly others will not have the same aid when they need it. There will be no basis to judge who really needs it or even if someone doesn’t deserves it all
The ‘jeitinho brasileiro’ thus creates a parallel system to the current processes being executed in which individuals can switch back and forth when possible to gain personal advantage over others. It thus induces people to follow the rules only as needed, as well as power play, and even to allow individuals the capacity to control such a system altogether. Let me give you an example;
A few months ago, I went to a bank to close an account, since I was leaving the country. The clerk told me that the only way to close it was if I made the request at the original agency where I opened it. The only problem was that this agency was 2 and a half hours away from where I was.
I left the bank and by luck found another one just a few blocks down the street. I entered the agency and used a very emotional ‘pwuese help mi mista’. They said I had to indeed go to the original agency to close the account, but after discussing a few minutes with one of his co-workers, he found a loop hole in the system. Basically he printed a check from the bank, put in my bank account information and with it I withdrew all the money I had in the account. When you empty an account, it prompts the clerk if he wants to close it and he used that trick to cancel my account.
Sure it was great he helped me with the ‘jeitinho’ but that’s exactly the problem. If a system inadvertently possesses a loop hole, why isn’t it being fixed? If a system possesses a way to aid users, why isn’t it being standardized?
That’s the problem with the ‘jeitinho’. With it, we can use it to keep and even maintain a clearly inefficient system. Since the system is ‘working’ why fix it? It hides weak links that need to be fixed….. and yes that implies firing people. It also implies in hiring people… trainning them… and all that comes with it.
Unless things keep being done in the ‘little way’ they’ll stay exactly like that: Little.
Book Review no comments
I finish reading Write Great Code Vol2 yesterday. I must say it’s a good book, although with one complaint.
Let’s start with the good side. It’s very well written and throughly elaborate on explaining how your compiler turns the high-level language statements into low level assembly. It then goes on for hundreds of pages explaining how to optmize that, from function calls, arrays, structures to small if/else jump conditions, local, static and global variables differences. Basically covers most of the issues a programmer will face or would ask himself “how will this turn into assembly”.
The only complaint I have is not in the book itself but the author over and over and over and over again telling you that if you want to look further into a particular topic that the book doesn’t cover (like line caches) he states that you should get his other book, Write Great Code vol1. I don’t really have anything against good advertisement, but it just gets tiresome after the 10th time he does that.
Still It’s a good read and I enjoyed it. My copy of WGCvol1 is already on the way from Amazon and i’m pretty sure it’ll be a good read as well.
Reading and Writing no comments
Like I said on my previous post, I`ve been reading Write Great Code Vol2. One of the ideas behind reading it was gaining more assembly knoledge and being able to write and optmize better assembly code. That is happening, at a small pace but gradually, though something more interesting began to happen that I wasn`t really expecting.
I`m now am able to read my compilers assembly output and have a better understandment of whats going on behind the scene. Now this may not sound like a big thing, but take the following example:
I was coding a particle engine for a new game i`m working on. Now I wanted this game to be a particle fest, like Geometry Wars. You could describe the basic particle loop engine as:
- update particle position
- render particle
- repeat for every particle
Steps 2 and 3 are a bit of self explanatory, but i was spending way to much time on step 1. Out of curiosity I started looking at the assembly code on that section and saw that the debug build was taking way more overhead than necessary.
You see each particle is an object, and I was accessing its position via setters and getters. That was slowing way to much the debug build because of all the function calls. On the release the compiler would optmize it away, but since I tested the debug version more than the release, it slowed down the development process as a whole.
So I made a decision that most software engineers wouldn`t recommend and made the particles position public instead of private, and accessed them directly. Alas the debug version speed up!
I thought to myself that when trying to write fast code, sometimes giving up good software practices might do the trick. I was a bit skeptic at first, but ironically I ended up reading the exact same statemnt on Write Great Code Vol2. The author even used the same example of setters/getters.
Regardless of if this is a good practice or not, I`m glad I`m able to have a better undestandment of my code, how it works and to know what kind of choices should I take as each situtation arises.
I`m defiantly going to pick up Write Great Code Vol1 after finishing this one.
Assembly stuff no comments
I’ve always admired Michael Abrash. If you don’t know who he is the man is sort of a legend. He helped develop the original quake engine, wrote tons of articles on how to get the maximum speed out of your pc in the early 90s when cycle counting used to be something respected.
I’ve always had interest in learning from him and reading his articles so I picked up courage and started reading assembly books for the 8086. I picked up a real nice one, Assembly Language Step-by-Step 2nd edition, and started studying.
I could say I “learned” this in college, assembly for the 8086 but it was a basic course and taught mostly the beginning stuff you know. mov this, add that, inc this, int 21h that. Just because I could understand the instructions from an individual point of view for me it was not enough, I wanted to get to the level Abrash talked about.
So I finished that book and have been reading and re-reading it as I go through Zen of Code Optmization and Write Great Code vol2.
I must say it’s being a bumpy ride, going back and foward with these books. Reading a chapter from one, going back to the other, while trying to understand what I just read. No wonder it takes time to master this sort of stuff.
Personally to me also it’s very satisfying to be able to read these sorts of books and be able to at least understand them. Gives a great sense of improvement. When I finish all 3 of them let’s see how I am.
But right now I want to share with you a small victory I feel I just had. One basic memory handling function in C is
memset( dest, val, size ) ;
I went inside it in it’s assembly code and managed to understand it. The most important instruction inside this function is
rep stosd
which is what causes the memory to be set once all the registors have been setup. Inside theres a bunch of checks for redundancies and type safeties, so I wrote the following that does that a memset and only the memory settting. No type checking, register juggling, no nothing. This is what I got:
mov eax, 10
mov ecx, 16
lea edi, dword ptr a
rep stosd
Which is basically what the Assembly book step by step teaches in one of its chapters. In the end this ends being 2 cycles faster than memset.
2 cycles faster. I am proud of myself ahah.
yes a small victory, but hopefully one of many to come. Let’s see how this goes.
