Wasn't sure which forum to put this... Sowwy...
It took some time to get the last one completed in Pure GDK, but that was because I've been in pretty bad shape lately. I got it done and managed to produce an "almost" realistic comparison between Dark Basic Pro, Dark GDK and Pure GDK.
The original plan was to write this simple game in Pure GDK first, but I ended up starting in Dark GDK (16 days), then converted it to Dark Basic Pro (less than 36 hours), and finally converted the Dark GDK version over to Pure GDK.
The results on my machine:
TGC Product Benchmark Test
==========================
Tested on:
HP Pavilion, AMD Turion 64 Mobile Technology
Windows XP SP3
ATI Radeon Xpress 200M
1 GB RAM
1.99 GHz processor
               | Level 1  | Level 3  | Level 5
               | Peak FPS | Peak FPS | Peak FPS
---------------+----------+----------+----------
Dark Basic Pro | 21       | 20       | 20
Dark GDK       | 19       | 14       | 12
Pure GDK       | 22       | 17       | 15
---------------+----------+----------+----------
Level 1 - 4000 trees
Level 3 - 6000 trees
level 5 - 8000 trees
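To put the table into one number per product, here is a small sketch that works out how much of the Level 1 peak FPS is lost by Level 5, where the tree count has doubled. The function name fps_drop_percent is just made up for this post; it isn't part of any of the three products.

```cpp
#include <cassert>
#include <cmath>

// Percentage of the starting peak FPS that is lost by the later level.
// Hypothetical helper, only here to do the arithmetic on the table above.
double fps_drop_percent(int fps_start, int fps_end)
{
    assert(fps_start > 0); // avoid dividing by zero
    return 100.0 * (fps_start - fps_end) / fps_start;
}
```

Feeding in the Level 1 and Level 5 figures gives a drop of roughly 4.8% for Dark Basic Pro (21 to 20), 36.8% for Dark GDK (19 to 12) and 31.8% for Pure GDK (22 to 15).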
Converting the code to Dark Basic Pro was very easy, and whilst doing it I threw in some of my own optimisations on the major "bottlenecking" routines. One such routine which bottlenecked the game was calculating the distance to the closest tree. In both GDK versions I re-wrote this routine entirely in assembler code:
Dark Basic Pro version (using nested IFs and local variables):
function TREES_FIND_CLOSEST()
    local count as integer
    local cdist as integer : local dist as integer
    local xd as integer : local yd as integer
    local bx as integer : local by as integer
    bx = bike.xpos : by = bike.ypos
    cdist = 8000000
    count = tree_count
    while count >= 0
        xd = bx - trees(count).x
        if xd > -4
            if xd < 4
                yd = by - trees(count).z
                if yd > -4
                    if yd < 4
                        dist = (xd*xd) + (yd*yd)
                        if dist < cdist
                            cdist = dist
                        endif
                    endif
                endif
            endif
        endif
        dec count
    endwhile
endfunction cdist
The GDK version (both Dark GDK and Pure GDK) in C++ (optimised almost as far as it will go, but still slow):
int TREES::find_closest() // BOTTLENECKING ROUTINE (Optimised slightly)
{
    int a = count;
    int curr_dist = 8000000;
    register int xd, yd, dist;
    // optimise the bikes positions for local vars and using ints...
    int bx = (int)gg->bike.xpos;
    int by = (int)gg->bike.ypos;
    //vector<TREEXY>::iterator p = trees_xy.begin();
    TREEXY *p = &trees_xy[0]; // not using the slow STL stuff (standard ansi C is faster)
    while ( a )
    {
        // use the pointer and get the distance
        xd = bx - p->x;
        if ( xd > -4 )
        {
            if ( xd < 4 )
            {
                yd = by - p->z; // z and not y
                if ( yd > -4 )
                {
                    if ( yd < 4 )
                    {
                        dist = ( xd * xd ) + ( yd * yd );
                        if ( dist < curr_dist )
                        {
                            // store the current closest distance
                            curr_dist = dist;
                        }
                    }
                }
            }
        }
        p++; // next tree
        a--;
    }
    // return the distance of the closest tree
    return curr_dist;
}
When I got to that point and looked through the disassembled code, there were tons of unnecessary instructions. So I re-wrote it completely in assembler, using just the registers (apart from curr_dist) to iterate through the loop. It ended up like this:
int TREES::find_closest_opt() // BOTTLENECKING ROUTINE (inner loop re-written in assembler)
{
    int a = count;
    int curr_dist = 8000000;
    // the asm loop below assumes at least one tree, so bail out early if there are none
    if ( a <= 0 ) return curr_dist;
    // optimise the bikes positions for local vars and using ints...
    int bikex = (int)gg->bike.xpos;
    int bikey = (int)gg->bike.ypos;
    //vector<TREEXY>::iterator p = trees_xy.begin();
    TREEXY *p = &trees_xy[0];
    __asm {
        mov ecx, [a]            // use reg for main loop
        mov edx, [p]            // use reg for array pointer
    while_loop:
        mov eax, [bikex]
        sub eax, [edx+4]        // p->x
        cmp eax, -4
        jle next_loop           // skip unless xd > -4 (same test as the C++ version)
        cmp eax, 4
        jge next_loop           // skip unless xd < 4
        mov ebx, [bikey]
        sub ebx, [edx+12]       // p->z
        cmp ebx, -4
        jle next_loop           // skip unless yd > -4
        cmp ebx, 4
        jge next_loop           // skip unless yd < 4
        imul eax, eax
        imul ebx, ebx           // cache pause here on second imul
        add eax, ebx            // eax = dist = xd*xd + yd*yd
        cmp eax, [curr_dist]
        jge next_loop           // only store when dist < curr_dist
        mov [curr_dist], eax
    next_loop:
        add edx, 16             // sizeof(TREEXY)
        dec ecx
        jnz while_loop
    }
    // return the distance of the closest tree
    return curr_dist;
}
Three different versions of the same function.
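When hand-writing a loop like this in assembler it helps to have a plain C++ reference to check the results against. Below is a minimal sketch of one, not the code from the game: TreeXZ is a hypothetical stand-in for TREEXY (only x and z matter here), and the bike position is passed in as parameters instead of being read from gg->bike. It uses the same strict window test (greater than -4, less than 4) and the same 8000000 start value as the C++ version above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for TREEXY; the real struct has more fields.
struct TreeXZ { int x; int z; };

// Squared distance to the closest tree inside the +/-3 unit window around
// (bx, bz), or 8000000 if no tree is in range.
int find_closest_ref(const std::vector<TreeXZ>& trees, int bx, int bz)
{
    int curr_dist = 8000000;
    for (std::size_t i = 0; i < trees.size(); ++i)
    {
        const int xd = bx - trees[i].x;
        if (xd <= -4 || xd >= 4) continue;   // outside the window on x
        const int zd = bz - trees[i].z;
        if (zd <= -4 || zd >= 4) continue;   // outside the window on z
        const int dist = (xd * xd) + (zd * zd);
        if (dist < curr_dist) curr_dist = dist;
    }
    return curr_dist;
}
```

Running both the reference and the optimised routine over the same tree list and comparing the return values is a cheap way to catch a wrong register or a wrong jump condition.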
The original compiled C++ code came out at over 3 times the amount of assembly shown above. I've used ordinary C++ to set up the local variables, then jumped straight into the assembler. The only time the pipeline is held up is when it actually finds a tree within range; apart from that, a CPU with several execution pipelines can run more than one of these instructions per clock cycle, and probably more on later processors. This is actually much faster than even drawing a 32 by 32 pixel sprite to the screen, even at up to 10,000 iterations, as it's cache friendly.
Other optimisations in asm were also made, mainly to the tree functions. Unfortunately these made no difference to the Dark GDK version, but they did lift the Pure GDK version by up to an extra 2 FPS. To be honest I was actually hoping that Pure GDK wouldn't be that far behind.
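If anybody wants to time a routine like this on its own rather than going by the FPS counter, here is a rough sketch using std::chrono from standard C++11. The name average_microseconds is made up for this post, and it isn't how the figures in this thread were measured.

```cpp
#include <cassert>
#include <chrono>

// Calls fn() 'repeats' times and returns the average time per call in
// microseconds. Hypothetical helper for quick comparisons only.
template <typename Fn>
double average_microseconds(Fn fn, int repeats)
{
    assert(repeats > 0);
    typedef std::chrono::steady_clock clock;
    const clock::time_point start = clock::now();
    for (int i = 0; i < repeats; ++i)
    {
        fn();
    }
    const std::chrono::duration<double, std::micro> elapsed = clock::now() - start;
    return elapsed.count() / repeats;
}
```

Pass it a lambda that calls the routine you want to measure and keep the result of that call somewhere (add it to a running total, for example), otherwise an optimising compiler may throw the whole call away.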
I believe the reason for Dark Basic Pro coming out on top is that the GDK versions have to "call a function that calls a function". Pure GDK has been faster than Dark GDK from the beginning, but having recently used it to convert this project over I have found that Dark GDK does offer a lot more internal function calls, although with a lot of persistence and advisably a very clear head (not half tired), you can still get access to these.
Pure GDK is the one I'm going to stick with for my upcoming projects, and I'll still use DBPro to test out various things.
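To show what I mean by "call a function that calls a function", here is a rough sketch. It is purely illustrative and an assumption on my part: it is not the actual internals of Dark GDK or Pure GDK, and all of the names (engine_object_exists, g_object_exists, wrapper_object_exists) are made up.

```cpp
#include <cassert>

// Hypothetical engine routine, standing in for a function inside the engine.
static int engine_object_exists(int id)
{
    return id > 0 ? 1 : 0;
}

// Pointer that a wrapper library would fill in at start-up (assumption).
typedef int (*ObjectExistsFn)(int);
static ObjectExistsFn g_object_exists = &engine_object_exists;

// What the game code calls: one extra call plus an indirect jump on every use.
int wrapper_object_exists(int id)
{
    return g_object_exists(id);
}
```

One extra hop like this costs next to nothing on its own, but a routine that goes through it thousands of times per frame pays for it every single time.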
Gold - Dark Basic Pro
Silver - Pure GDK
Bronze - Dark GDK
Attached are all three versions (DBP, Dark GDK and Pure GDK) of the bike game. I might also look into benchmarking some other function groups if anybody has any ideas of what to try out.
Thank you...