Code Snippets / [GDK] - [C++] Visual Studio Inline Assembler

WLGfx
16
Years of Service
User Offline
Joined: 1st Nov 2007
Location: NW United Kingdom
Posted: 18th Sep 2011 14:24
Visual Studio Inline Assembler x86

I've finally come back to experimenting with the inline assembler in Visual Studio, mainly because I need an extremely fast random number generator, as a lot of my future code will be working with procedural content. Now that I've got a complete but very simple random number generator working, I thought I'd post it.

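When using the inline assembler in Visual Studio (32-bit x86 builds) you are free to use the EAX, EBX, ECX, EDX, ESI and EDI registers. In a normal function the compiler saves and restores EBX, ESI and EDI for you if your __asm block touches them; in a __declspec(naked) function you have to preserve them yourself. Anything that's left in EAX will be the return value for integer results of up to 32 bits.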
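As a rough illustration of the register rules, here is a minimal sketch rather than code from this thread. The function name addValues is made up for the example, and the __asm path only compiles with 32-bit MSVC, so a plain C++ fallback is included for every other compiler.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the thread): adds two 32-bit values.
std::uint32_t addValues(std::uint32_t a, std::uint32_t b)
{
    std::uint32_t result = 0;
#if defined(_MSC_VER) && defined(_M_IX86)
    // 32-bit MSVC only: EAX is a scratch register, so it needs no saving.
    __asm {
        mov eax, a        // load the first argument
        add eax, b        // add the second argument
        mov result, eax   // hand the value back through a C++ variable
    }
#else
    // Portable fallback for compilers without MSVC-style inline assembler.
    result = a + b;
#endif
    return result;
}

// Usage: both paths should agree with ordinary C++ addition.
void demoAddValues()
{
    assert(addValues(2u, 3u) == 5u);
}
```

Storing the result in a variable and returning it normally avoids relying on whatever happens to be left in EAX, at the cost of one extra move.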

Why the inline assembler?

Because hand-tuned inline assembler code can be extremely fast! Simple...

Inline assembler example code

I've left some of the old code commented out so you can see where I started with this.


You may note the '//' in the above code. That's because whilst I was learning the inline assembler I originally mixed it with C++ itself, but at the end of the day I wanted a pure inline assembler function.

The above function is defined as:



This allows two types of call: one with an argument to set the seed, and one with empty parentheses to return a random number.
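The two call styles can be sketched in plain C++ with a default argument. This is only an illustration and not the assembler function from this post: the name fastRand, the linear congruential formula and its constants (the common "Numerical Recipes" pair) are all assumptions, and a seed of 0 is treated as "no seed given".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the generator: a 32-bit linear congruential
// generator. The constants are assumed, not taken from the thread.
std::uint32_t fastRand(std::uint32_t newSeed = 0)
{
    static std::uint32_t state = 1;          // seed kept between calls
    if (newSeed != 0) {
        state = newSeed;                     // fastRand(n): reseed only
        return state;
    }
    state = state * 1664525u + 1013904223u;  // wraps modulo 2^32
    return state;                            // fastRand(): next value
}

// Usage: reseeding with the same value replays the same sequence.
void demoFastRand()
{
    fastRand(1234u);
    const std::uint32_t first = fastRand();
    fastRand(1234u);
    assert(fastRand() == first);
}
```

One function with a default argument keeps the seed private to the generator, which is the same effect the post describes.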

Using the inline assembler it is going to be so much faster than using:



To test whether this inline assembler function did actually work I threw a simple test in using Pure GDK (can use GDK too of course):



As I experiment as well as re-learn assembler programming I will post more examples up and hopefully others will find them useful too.

Warning! May contain Nuts!
WLGfx
16
Years of Service
User Offline
Joined: 1st Nov 2007
Location: NW United Kingdom
Posted: 18th Sep 2011 21:46
Okay, this next function is a bit rubbish but now I'm getting used to using the inline assembler. This time I've mixed C++ and assembly (note the removal of __declspec(naked)).

A few times in the past, and in these forums, I've been told not to worry about optimising my code, but bad habits don't die so quick. So now I'm teaching myself x86 after many years of z80 and 680x0.

In DBPro, GDK or PGDK you could not get close to the speed of this:



Mr Bigglesworth
16
Years of Service
User Offline
Joined: 4th Mar 2008
Location:
Posted: 19th Sep 2011 05:53
This looks like something I may want to learn, given its complexity
WLGfx
16
Years of Service
User Offline
Joined: 1st Nov 2007
Location: NW United Kingdom
Posted: 20th Sep 2011 03:10
It's not actually that complex, it just takes a few more instructions to write out to actually do something simple like addition.

To get the instruction reference just google "x86 reference".

I'm just starting to play about with the FPU (Floating Point Unit) so that I can do heavy duty maths calculations much faster than C++ can.

Leaving some variables in registers also speeds up execution time over small pieces of code instead of reading and writing from memory locations.

WLGfx
16
Years of Service
User Offline
Joined: 1st Nov 2007
Location: NW United Kingdom
Posted: 25th Sep 2011 18:31 Edited at: 25th Sep 2011 18:51
Simple use of the FPU:

After a good few hours of fiddling with this tiny little bit of code I did manage to get a result. I had to remove the (naked) and have it return a value in a variable.



The reason I'm learning the basics of asm and the use of the FPU is that I want to be able to optimise some of the heavier calculations at the end of my project.

The internals of the FPU give you direct access to Cosine, Sine, Square Root, Pi, Power of, and Tangents, which can obviously give a vast speed up of heavy calculations.

If the above code was a function that calculated a 3D distance formula, and it was called hundreds or thousands of times during a game loop, then the speed increase would help.
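For reference, the 3D distance formula itself looks like this in plain C++. This is a sketch with made-up names (distance3D, distanceSquared3D), not the FPU code from this post. An optimising compiler will usually turn std::sqrt into a hardware square-root instruction anyway, so it is worth timing before hand-converting.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: squared distance between two 3D points.
// Enough on its own when distances are only being compared.
float distanceSquared3D(float x1, float y1, float z1,
                        float x2, float y2, float z2)
{
    const float dx = x2 - x1;
    const float dy = y2 - y1;
    const float dz = z2 - z1;
    return dx * dx + dy * dy + dz * dz;
}

// Hypothetical helper: Euclidean distance, one square root per call.
float distance3D(float x1, float y1, float z1,
                 float x2, float y2, float z2)
{
    return std::sqrt(distanceSquared3D(x1, y1, z1, x2, y2, z2));
}

// Usage: the points (0,0,0) and (1,2,2) are exactly 3 units apart.
void demoDistance3D()
{
    assert(distance3D(0.0f, 0.0f, 0.0f, 1.0f, 2.0f, 2.0f) == 3.0f);
}
```

In a game loop, skipping the square root altogether with the squared version is often the bigger win.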

Bad habits die hard...

EDIT: The only problem I'm having at the moment is that because I've had to remove the (naked) part the compiler is now adding extra code to save registers which I'm not using in the function. I'm determined to get around this because if a function returns a float value then it should be left on the FPU's stack in st(0).



Da_Rhyno
12
Years of Service
User Offline
Joined: 25th May 2011
Location:
Posted: 25th Sep 2011 19:40 Edited at: 25th Sep 2011 20:00
If you need any assistance/tutelage on the FPU instructions, I have a link to a website which, while rather old, is very usable and helped me learn.

Here you are: http://www.website.masmforum.com/tutorials/fptute/

(This is mainly for anyone who may need some help with it, though I probably should have put an ASM tutorial there as well, but the one I used growing up is no longer online.)
WLGfx
16
Years of Service
User Offline
Joined: 1st Nov 2007
Location: NW United Kingdom
Posted: 25th Sep 2011 23:52
I'm not getting very far with using the __declspec(naked) and getting a return from the FPU in st(0) so I'm having to stick with the compiler inserting the prolog and epilog code. Ah well, a few clock cycles won't matter that much. At least I can get away with (naked) functions with returns as integers.

Apparently you can do it if you're just using MASM but I'm just figuring out x86 after being so used to the 680x0.

@Da_Rhyno - I've got that site already bookmarked and I've been referring to it quite a bit lately. Thank you.

Da_Rhyno
12
Years of Service
User Offline
Joined: 25th May 2011
Location:
Posted: 26th Sep 2011 02:32
I'm wondering if it's possible that the compiler is sticking something else in that register. The reason I bring it up is because I think that the FPU shares the same registers as the MMX routines IIRC.
WLGfx
16
Years of Service
User Offline
Joined: 1st Nov 2007
Location: NW United Kingdom
Posted: 26th Sep 2011 02:47 Edited at: 26th Sep 2011 02:48
@Da_Rhyno - On the disassembly everything seems fine when using the __declspec (naked) but I just don't get the results expected. I also get two different sets of results when I run my test code in Debug and Release. For the time being I'll have to stick with the restrictions of the MSVC inline assembler.

I did also notice that the st(0) register would be empty after returning from the function during debugging.

On a positive note, I've managed to convert the perlin noise function that is called the most when generating height maps and textures.

The original C++ code:


And converted to ASM: (not using memory addressing variables constantly)


After a review of this there isn't actually any noticeable difference in speed as the MSVC compiler does do a commendable job. Saying that though, if the removal of the prolog and epilog code was possible in MSVC there would be another slight increase. The difference only really comes from being able to use registers instead of memory located variables during large math formulas.
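For anyone unfamiliar with the kind of function being talked about: the noise lookup at the bottom of a Perlin noise generator is usually a small integer hash. The sketch below uses the version widely circulated in Perlin noise tutorials, which is an assumption and may differ from the function converted in this post. The name noise2D is made up, and unsigned arithmetic is used so that the wrap-around is well defined in C++.

```cpp
#include <cassert>
#include <cstdint>

// Assumed noise function (not confirmed as the one in the thread):
// hashes a 2D integer coordinate to a value between -1 and 1.
float noise2D(int x, int y)
{
    std::uint32_t n = static_cast<std::uint32_t>(x)
                    + static_cast<std::uint32_t>(y) * 57u;
    n = (n << 13) ^ n;
    // Integer mixing step; every multiply wraps modulo 2^32.
    const std::uint32_t hashed =
        (n * (n * n * 15731u + 789221u) + 1376312589u) & 0x7fffffffu;
    const float value = 1.0f - static_cast<float>(hashed) / 1073741824.0f;
    assert(value >= -1.0f && value <= 1.0f);
    return value;
}
```

Everything up to the final divide is integer work, which is why keeping the intermediate values in registers is the main thing a hand conversion can offer here.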

WLGfx
16
Years of Service
User Offline
Joined: 1st Nov 2007
Location: NW United Kingdom
Posted: 27th Sep 2011 00:27 Edited at: 27th Sep 2011 00:34
Converting C++ to Inline Assembler - More FPU code

This time I've managed to figure out what I was doing with converting the Interpolate function from the Perlin Noise class. I had to reverse the 'fmul' as well as pop from the FPU stack.

Before I did convert to inline assembler, I decided to use a different interpolate method, using the Cosine instead of Cubic method so I get an even better speed increase and still get the detail required.
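Cosine interpolation in its usual textbook form is sketched below in plain C++. The name cosineInterpolate and the argument order are assumptions, and this is not the function from this post.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: blends from a to b as t goes from 0 to 1,
// easing in and out along half a cosine wave.
float cosineInterpolate(float a, float b, float t)
{
    assert(t >= 0.0f && t <= 1.0f);
    const float pi = 3.14159265358979f;
    // f runs smoothly from 0 (at t = 0) to 1 (at t = 1).
    const float f = (1.0f - std::cos(t * pi)) * 0.5f;
    return a * (1.0f - f) + b * f;
}
```

For example, cosineInterpolate(2.0f, 6.0f, 0.5f) gives the midpoint, 4, to within float rounding. The single cosine plus two multiplies is what makes this cheaper than a cubic blend over four samples.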

The original C++ function:


The final converted version in Inline Assembler:


Tracking down errors in inline assembler code!!! A nightmare!!!

In the above code I have these lines, which are now fixed.


I originally used this:


'fmulp st(1),st' - Multiplies st(0) with st(1) and leaves the result in st(1) and then pops the stack leaving just the result in st(0). And that was my error. I was leaving values on the stack and getting the wrong results. It was through looking at the disassembled code I realised this as the FPU stack wouldn't cause the program to crash. The learning curve of inline assembler and studying the FPU is proving difficult but not impossible.
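The difference between the popping and non-popping multiply is easier to see with a toy model of the FPU stack. This is plain C++ rather than real FPU code, and all the names are made up for the illustration. The exact instruction used originally is not shown in this post, so the non-popping fmul here is only an example of how a value gets left behind.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of the x87 register stack; back() plays the role of st(0).
struct FpuStackModel {
    std::vector<double> regs;

    // fld: push a value, which becomes the new st(0).
    void fld(double v) { assert(regs.size() < 8); regs.push_back(v); }

    // fmul st(1),st: st(1) = st(1) * st(0), nothing is popped.
    void fmul_st1_st0() {
        assert(regs.size() >= 2);
        regs[regs.size() - 2] *= regs.back();
    }

    // fmulp st(1),st: same multiply, then pop, so the product is st(0).
    void fmulp_st1_st0() {
        fmul_st1_st0();
        regs.pop_back();
    }

    double st0() const { assert(!regs.empty()); return regs.back(); }
    std::size_t depth() const { return regs.size(); }
};
```

Loading 3 and 4 and calling fmul_st1_st0() leaves two entries with 4 still on top, while fmulp_st1_st0() leaves a single entry holding 12. A function that returns with the extra entry still on the stack does not crash, it just hands back the wrong value.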

IanM
Retired Moderator
21
Years of Service
User Offline
Joined: 11th Sep 2002
Location: In my moon base
Posted: 3rd Oct 2011 22:05
Quote: "I had to remove the (naked) and have it return a value in a variable."

IIRC, any floating-point type is returned in ST(0). In addition, you need to ensure that your floating-point stack is empty when your function returns (except for the possible return value).

You can do that by issuing an 'finit' at some point (again taking care not to lose any floating-point return value).

Here's a quick breakdown of the return registers.
float / double - returned in ST(0)
up to 32 bits - returned in EAX
up to 64 bits - returned in EAX (low) / EDX (high)
over 64 bits - done by storing the value in memory and returning a pointer to that memory in EAX.

DBPro's float values are returned as if they were 32 bit values rather than floats (ie, in EAX rather than ST(0)).
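A small sketch of what those rules mean for the bits involved, in plain C++. The names floatBits, RegisterPair and split64 are made up for the illustration. It only shows how a float's bit pattern fits in a 32-bit integer and how a 64-bit value splits into the two halves that would travel in EAX and EDX; it does not touch any real registers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bit pattern of a float as a 32-bit integer, i.e. the value that
// would sit in EAX when a float is returned "as if" it were 32 bits.
std::uint32_t floatBits(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);   // well-defined type pun
    return bits;
}

// The two halves of a 64-bit return value.
struct RegisterPair {
    std::uint32_t eax;   // low 32 bits
    std::uint32_t edx;   // high 32 bits
};

RegisterPair split64(std::uint64_t value)
{
    return { static_cast<std::uint32_t>(value & 0xFFFFFFFFu),
             static_cast<std::uint32_t>(value >> 32) };
}

// Usage: 1.0f has the IEEE 754 bit pattern 0x3F800000.
void demoReturnValues()
{
    assert(floatBits(1.0f) == 0x3F800000u);
}
```

Using memcpy rather than a pointer cast keeps the reinterpretation legal C++, and compilers normally reduce it to a single register move.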

Quote: "if the removal of the prolog and epilog code was possible in MSVC"

Project properties -> C/C++ -> Optimisation -> Omit Frame Pointers = Yes

This frees up the EBP register too, so potentially means more variables in registers during calculations and therefore faster processing.

TBH though, although it's a great learning experience, unless you know the actual timing of machine code instructions and their optional modifiers, and how each one affects instruction pipelining, caching etc, then you might as well stick with highly optimised C++ - YMMV.

For instance, I don't use machine code in my plug-ins, except where it can do something that C++ itself can't (for instance, returning both double, 8 byte and 4 byte results to DBPro at the same time when there's no way to know what the caller is expecting).

WLGfx
16
Years of Service
User Offline
Joined: 1st Nov 2007
Location: NW United Kingdom
Posted: 4th Oct 2011 04:33
It's definitely an old habit of mine as all my old stuff on the Atari and the Amiga was done in C and 680x0 assembler code. And while I've been reading up on this new stuff, I've learned a lot about the pipelines and caches. Mainly using integer registers I've found you can get almost blitter like speed with optimised code, which has made me think about memory manipulation and transfers, especially with bitmaps. Which has made me start to learn some more. Now with the dual core etc out, some of the code can run better than almost "ghost mode" like.

Just like the old days (20 years plus ago), most of my code will be in C but end of the line optimisations I will do in assembly. Only when I need to speed something important up. This learning curve has opened my eyes to this new technology and being able to have up to 5 or 6 instructions run at only 1 clock cycle is just amazing.

I've got lots more ideas for DBP plugin features but I will have to study such things as accessing DBP itself from within the plugin, ie for bitmaps, images, sounds, object, etc.

My main project is soon about to start after many months of toying with this, learning that, and it has come in very useful finally re-learning assembly code. Although the C++ side is still an issue, as one of the game programming guides says, C code is good enough. I've enough knowledge about C++ to pull it off. And along the way I'll very likely still be trying out new things.

I'm actually going to start using a separate assembler soon instead of using the inline assembler for some of the changes, as the inline assembler is actually restricted on many things that MASM isn't. Such as creating a normal function to return FPU ST(0). Slightly difficult in MSVC. It also has better control for segments. All still a part of re-learning it all over again.

The FPU I've almost mastered after many experiments. Something I never used to do on the Amiga. MMX and SSE instructions I've not looked into yet but will get around to it.

As always, I'll code in C/C++ first and polish it later on.

Mental arithmetic? Me? (That's for computers) I can't subtract a fart from a plate of beans!
