@MikeS
As promised, an in-depth (somewhat) analysis of the papers that I've found. I'll summarise them and point out what I think is useful about each of them. Some of these you might find rather interesting.
[NOTE]
While I was writing this over the past few days I was able to download several papers for free from IEEE's site. Although the page said you needed a subscription to download, I thought what the hell, may as well try. And lo and behold, I was able to download quite a few texts from IEEE. I figured all this talk about buying articles must have changed and they just hadn't updated their site. Well, two days later, when I'm checking the links for these papers, I notice that I can't download anything from them without an account.
So it appears they had a bit of a security glitch, and now quite a few of the links to the papers I mentioned are dead. I've searched the net for copies, but apparently only IEEE had some of them. I've replaced those links with a [No Link] marker to signify the papers I got from IEEE and couldn't replace.
Sadly, many of the good papers I have came from IEEE, so that leaves me with a bit of a dilemma. I guess if you are interested in checking out these papers yourself you could email me and I could give them to you, but I don't know if it is worth the hassle.
Since I spent a fair bit of time writing these little synopses and semi-reviews, I'll leave in the parts that mention the now unobtainable papers. After all, I wouldn't want all of that to go to waste, eh?
Anyway, without further ado, here are the semi-somewhat in depth analyses of the papers I've found recently.
A VIRTUAL MACHINE FOR INTERPRETING PROGRAMS IN STATIC SINGLE ASSIGNMENT FORM
http://www.ics.uci.edu/~wangn/uci-ics-tr-03-12.pdf
This paper describes how to build a virtual machine that interprets SSA form. SSA form is a widely used and fairly simple intermediate representation that compilers use for optimizations. I think this could serve as a stepping stone toward building an optimizing compiler, by letting you get acquainted with SSA form without having to build a full-blown compiler.
It could also serve as a useful debugging tool for your compiler when you do build one.
An interesting note: this document contains the C source code for the inner core of the virtual machine. It also has several references that you may be interested in.
This one is definitely worth a look sometime soon.
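In case you haven't run into SSA before, here's a tiny example of my own (not from the paper) of what it looks like. The idea is that every variable gets assigned exactly once, and points where control flow merges use "phi" functions to pick the right value:

/* Plain C on the left, a rough SSA rendering in the comments. */
int abs_diff(int a, int b)
{
    int d = a - b;   /* d1 = a0 - b0                   */
    if (d < 0)       /* branch if d1 < 0               */
        d = -d;      /* d2 = 0 - d1                    */
    return d;        /* d3 = phi(d1, d2); return d3    */
}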
ACCELERATING SINE AND COSINE EVALUATION WITH COMPILER ASSISTANCE
[No Link]
This one is more of interest to me, as Basic4GL is designed to run graphical programs and sine and cosine are abundant in those. It works by using approximations of sine and cosine and splitting their code into two parts: a first part that is common to both, and a second part that adds the small finishing pieces specific to each.
The method described has several advantages. First and foremost, existing compiler optimizations will let a sine and a cosine of the same data be optimized to the point that the pair executes in roughly the same time as a single sine or cosine operation would!
The method is also extended to cover tangents and hyperbolic functions.
It looks like this could be of great use to me.
On a side note: this technique isn't purely theoretical either, as it's been implemented in a production compiler, the HP-UX compiler for the Itanium. So it has some solid performance statistics to back it up as well.
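Just to make the idea concrete, here's a rough sketch of my own (not the paper's actual algorithm or coefficients) of how the work splits into a common part and two small finishing pieces. Once the common part is its own piece of code, an ordinary common sub-expression elimination pass can share it between a sin(x) and a cos(x) of the same argument:

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Common part: range reduction, shared by both functions. */
static double reduce(double x)
{
    return x - 2.0 * M_PI * floor(x / (2.0 * M_PI) + 0.5);  /* bring x into [-pi, pi] */
}

/* Finishing piece for sine: a short polynomial in the reduced argument.
   (Low-order Taylor terms -- fine for a sketch, too coarse for production.) */
double approx_sin(double x)
{
    double r = reduce(x), r2 = r * r;
    return r * (1.0 - r2 / 6.0 + (r2 * r2) / 120.0);
}

/* Finishing piece for cosine. */
double approx_cos(double x)
{
    double r = reduce(x), r2 = r * r;
    return 1.0 - r2 / 2.0 + (r2 * r2) / 24.0;
}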
INTER-PROCEDURAL LOOP FUSION, ARRAY CONTRACTION AND ROTATION
[No Link]
This paper is straight out of Intel's Compiler Laboratory in Santa Clara, CA.
These techniques have also been implemented in a production compiler, and apparently gave a 12% improvement on the SPECfp 2000 benchmarks.
The essential problem these techniques are trying to solve is that processor speeds are increasing much faster than memory speeds. They try to improve the locality of the data referenced: basically more cache hits and fewer cache misses, which means less time spent fetching data from memory, which can be really slow compared to getting it out of the on-chip cache or a register.
The document then goes into a VERY detailed discussion of how to go about implementing these techniques. It nicely describes inter-procedural analysis and the steps involved in it. There are a lot of diagrams in here, and as a visual learner I really appreciate the thoroughness Intel achieved by littering the document with them.
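To give a feel for what loop fusion and array contraction actually do, here's a tiny before/after example of my own (not taken from the paper):

/* Before: two loops, plus a full temporary array t[] that has to live in
   memory between them. */
void unfused(double *a, double *b, double *c, double *d, double *t, int n)
{
    for (int i = 0; i < n; i++)
        t[i] = a[i] * b[i];
    for (int i = 0; i < n; i++)
        c[i] = t[i] + d[i];
}

/* After fusion and contraction: one loop, and t[] "contracts" down to a
   single scalar that can sit in a register -- fewer memory accesses and far
   better cache behaviour. */
void fused(double *a, double *b, double *c, double *d, int n)
{
    for (int i = 0; i < n; i++) {
        double t = a[i] * b[i];
        c[i] = t + d[i];
    }
}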
There are really only two drawbacks I can see to this paper. The first is the need for a profiler.
The paper does talk about static profiling methods as well, but clearly the best info comes from dynamic profiling, since that's your best shot at accurately measuring what will actually get executed and when.
The second drawback is that it is discussed with the Itanium processor in mind. This isn't too bad as the techniques mentioned are general enough that one could easily apply them to Intel's more popular processor.
It's hard to sum up all that is covered in this document. Although it is just 11 pages, it is jammed full of useful, practical info.
Even if your compiler-writing days are still a ways away, I'd suggest giving this doc a read, as it is really interesting and informative.
AN EXPERIMENTAL EVALUATION OF DATA DEPENDENCE ANALYSIS TECHNIQUES
[No Link]
This is a more recent paper (March 2004) that deals with three different data dependence analysis techniques and their performance with regard to compile time, accuracy, etc.
This one isn't too interesting, but it does serve as an overview of what appear to be the three dominant data dependence analysis techniques out there and their various pros and cons.
Not too helpful right now, but it might prove useful in the future.
COMPILING LANGUAGE DEFINITIONS: THE ASF+SDF COMPILER
http://arxiv.org/PS_cache/cs/pdf/0007/0007008.pdf
This might be of interest to you.
Quote: "
The ASF+SDF Meta-Environment is an interactive language development environment whose main application areas are definition and implementation of domain-specific languages, generation of program analysis and transformation tools, and production of software renovation tools.
"
I think what it does is basically help you write functional languages. Sort of like YACC or LEX, only beefed up to include more.
I remember you saying that you were going to port your work over to C eventually, so since this outputs C source code I thought you might be interested.
COMPILER OPTIMIZATION-SPACE EXPLORATION
http://www.cs.princeton.edu/~nvachhar/papers/cgo01_ose.pdf
One of the major problems facing compiler optimization today is the unpredictable effect that various optimizations can have on each other. It's entirely possible for an optimization to increase performance in one case while decreasing performance in another.
Take Common Sub-Expression Elimination as an example. In Common Sub-Expression Elimination, CSE for short, you remove operations that will yield identical results and leave only one instance of that operation, with its result stored in a temporary variable. For example, examine the following code:
a = b + c
d = b + c + e * b / c
In this example b + c is computed twice, which is unnecessary since the values of b and c do not change between the time the first b + c is computed and the time the second one is. So a more optimal piece of code would look like the following:
a = b + c
d = a + e * b / c
But suppose that we were short on registers and, for some inexplicable reason, a needed to be evicted from the register it was residing in. So basically, we only have room for d, b, c, and e, but not a. Now, in the original code we wouldn't have had a problem, since a could safely be moved back to memory while d takes its place. But thanks to our "optimization", that is no longer the case.
We now have to push a value out of a register and into memory to make room, and then pop that value back into the register it was pushed out of to complete the computation. These relatively expensive memory accesses will completely wipe out any gain we obtained from only computing b + c once.
Now this may seem far-fetched, but in reality it can be pretty common. Remember, the x86 only gives you four truly general-purpose data registers to play with, and in the above not-too-complicated example we ended up needing five thanks to our "optimization".
So the effect of optimizations on each other is of great interest to the compiler writer, especially x86 compiler writers, who have to deal with a horribly small number of registers.
What this paper deals with is how to effectively ascertain the impact of your optimizations, not only on each other, but on the performance of your compiled code. It also outlines the key failings of the existing methods, mainly predictive heuristics and iterative compilation, and proposes a solution to the problem. By combining predictive heuristics with iterative compilation, they managed to achieve higher code quality while keeping compile time reasonable.
PRACTICAL DEPENDENCE ANALYSIS IN A SIMD VECTORIZING COMPILER
[No Link]
This is an interesting paper. It deals with detecting data dependence using the Bresenham line algorithm. A novel idea, and it appears to be quite quick and exact. It's rather short and to the point (it's only eight pages long, with the last page filled with references) but interesting and possibly quite applicable.
A REGION-BASED COMPILATION INFRASTRUCTURE
http://ipf-orc.sourceforge.net/region-final-version.pdf
This paper deals with dividing a program up into sections using something other than function scope as the boundaries. According to the paper, not only does the described technique speed up compilation significantly (one benchmark compiled 63% faster!), it also improved the execution time of the compiled programs.
This is definitely worth looking into, as a performance gain not only in compile time but in execution time is hard to shrug off as uninteresting.
However, the method doesn't do data-dependence analysis yet, so there is still some room for improvement.
COMPILER-DIRECTED RESOURCE MANAGEMENT FOR ACTIVE CODE REGIONS
http://api.ece.uic.edu/~Interact7/paper4.2.pdf
As mentioned earlier, memory latencies are becoming a real source of slowdown, and this is only going to get worse as time goes on.
This paper deals with "improving cache efficiency by utilizing compiler-directed adaptive cache management techniques". In short, it talks about a series of compiler optimizations that deal with cache efficiency in small "hot spot" regions of code.
Some interesting results came out of the experiments. In one benchmark, a section of code that was only executed 3% of the time was optimized, and this resulted in a 19% performance improvement in the program overall. They go on to describe this in detail and explain why it occurred.
It's an interesting paper if you would like to improve cache performance.
MINIMUM REGISTER INSTRUCTION SEQUENCING TO REDUCE REGISTER SPILLS IN OUT OF ORDER ISSUE SUPERSCALAR ARCHITECTURE
http://citeseer.ist.psu.edu/573385.html
Lately, I've been interested in register allocation and the various methodologies that are used.
While researching the subject I came across this paper, which claims that in 99.2% of the cases studied their solution was optimal. That's an impressive number, so naturally I decided to include this paper, even though it really focuses on non-x86 architectures.
Despite that, it shows promise, considering that the problem it solves is closely related to optimal code generation.
I'd look into this paper in the future, but not nearly as early as some of the others mentioned.
USING THE COMPILER TO IMPROVE CACHE REPLACEMENT DECISIONS
ftp://ftp.cs.umass.edu/pub/osl/papers/pact_zlwang_2002.ps.gz
[Note]
I could only find a PostScript version of the paper. Originally it was in PDF format, but I got that one from IEEE, so there's no chance of getting the PDF version now. If you don't have something that will display PostScript, Google for Ghostscript; that program should let you display it.
[End Note]
Yet another paper on reducing cache misses. This one is more general than the last and talks about varying architectures. It seems rather promising, and it claims to have seen reductions in execution time of 4.89%, 8.39%, and 6.94%. It also predicts that in 5 years' time we'll see reductions of 9.99%, 15.62%, and 12.31% for the respective programs.
It appears that memory speeds are significantly behind processor speeds, and as current trends go, that gap is only going to widen. To put this into perspective, processor speeds have been increasing by about 55% per year since 1987, while memory access latencies have only been improving by about 7% per year. So their predictions for performance improvements 5 years from now aren't all that far-fetched.
COST EFFECTIVE COMPILER DIRECTED MEMORY PREFETCHING AND BYPASSING
[Note]
This looks like a PowerPoint presentation turned into a PDF. I included it here for two reasons:
1. The original text version came from IEEE.
2. It looks neat when you use the scroll wheel on the mouse to scroll down.
http://moss.csc.ncsu.edu/pact2002/slides/ortega117.pdf
[Other Note]
I found a good copy of the original.
http://www.csc.ncsu.edu:8080/faculty/mueller/pact02/papers/ortega117.pdf
You can really tell I'm interested in caches lately.
As the name suggests, it deals with prefetching, which was introduced in the SSE extensions for the Pentium III (though the paper doesn't actually mention the Pentium III or SSE, the methods introduced apply to them as well).
The performance gained can be quite significant: one benchmark showed a speedup of about 102%, and on average there was a 43% speed increase over code with no software prefetching.
The methods outlined here, combined with the earlier ones I mentioned, could form a powerful optimizing force for data-intensive applications.
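For reference, the sort of thing software prefetching boils down to on the Pentium III and up looks roughly like this (my own toy example, not the paper's scheme; the prefetch distance of 16 is a made-up number you'd really tune per processor):

#include <xmmintrin.h>   /* _mm_prefetch, available with the SSE intrinsics */

void scale(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++) {
        /* Ask the cache to start pulling in data we'll need a few
           iterations from now, while we work on the current element. */
        if (i + 16 < n)
            _mm_prefetch((const char *)&src[i + 16], _MM_HINT_T0);
        dst[i] = src[i] * 2.0f;
    }
}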
PROCEDURE MERGING WITH INSTRUCTION CACHES
http://citeseer.ist.psu.edu/292876.html
This one is rather old. It was written in 1991, so if you want to follow its methods just be on the lookout for something better, because by now someone has probably come up with a superior technique.
Anyway, this paper deals with inter-procedural optimization with regard to instruction cache performance. It uses a combination of profiling info, data structures, the cache size, and the cache miss penalty as indicators of whether the procedures in question should be merged or not.
I believe this paper was the first to base the decision to merge a procedure on the effect the merge would have on the instruction cache. Procedure merging has always been a murky sort of business, with the rules for deciding whether a procedure should be merged or not being rather arbitrary. There are trade-offs to be made between code size and execution time that are not always easy to ascertain, as the little example below shows.
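For anyone not familiar with the term, procedure merging (inlining) is just this sort of transformation, shown here with a toy example of my own: the call overhead disappears, but the caller gets bigger, which is exactly the instruction-cache trade-off the paper is trying to quantify.

static int square(int x) { return x * x; }

/* Before merging: a call, with all its setup and return overhead. */
int caller_before(int a) { return square(a) + 1; }

/* After merging: the callee's body is folded into the caller. */
int caller_after(int a)  { return a * a + 1; }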
I think this is a nice start when it comes to procedure inlining, but the Intel one is way better in terms of quality, presentation, and relevance. Still, I wouldn't discount this paper yet. It could prove rather useful as a stepping stone into procedure merging. I didn't see anything too complex in it, so I guess it is worth a look, at least to compare it to Intel's methods.
EVALUATING INLINING TECHNIQUES
http://citeseer.ist.psu.edu/539953.html
This is an overview of various inlining techniques, plus a new hybrid approach that the authors introduce. It is a more formal treatment of the subject than the previous papers I've mentioned.
The paper is rather old (1996), but it is still useful. It uses a probabilistic model, though, and as a much newer paper I mentioned demonstrated, these models are inherently flawed. Figuring out whether a function should be inlined or not is, in all but the most basic cases, at best an NP-hard problem; it's prohibitively expensive to calculate the precise effect of inlining for every function.
FIAT - A FRAMEWORK FOR INTERPROCEDURAL ANALYSIS AND TRANSFORMATION
http://citeseer.ist.psu.edu/37218.html
This one is a keeper. It describes an abstract framework for interprocedural analysis. From the looks of it, it's a rather easy-to-integrate, modular design for an interprocedural analyzer.
It doesn't modify source code either. It only deals with identifying areas where transformations can be applied in a safe and profitable manner, so you can choose whether to inline a function or not based on your own internal algorithms.
The ease of use of this system seems pretty impressive as well: in a short time it was integrated into several major compiler systems with no changes needed to its core source code.
This and Intel's paper are probably the top two inter-procedural analysis papers to start with if you ever consider implementing inter-procedural analysis in your compiler.
OVERCOMING THE CHALLENGES TO FEEDBACK-DIRECTED OPTIMIZATION
http://www.eecs.harvard.edu/hube/publications/dynamo00.pdf
I've listed several papers so far that make mention of feedback-directed optimization techniques.
However, I haven't really given any papers that deal with just feedback-directed optimization. This one describes feedback-directed optimization and the challenges it currently faces.
Although I don't entirely agree with them, they do make some excellent points that are worth considering. I'd give this one a read if you are interested in feedback-directed optimization.
PRECISE REGISTER ALLOCATION FOR IRREGULAR ARCHITECTURES
http://citeseer.ist.psu.edu/435214.html
When it comes to register allocation you are going to hear a lot about graph coloring. Graph coloring register allocation was originally developed by Chaitin while he was working at IBM, and it was the first significant leap forward in the science of register allocation.
However, it was developed with RISC machines with plenty of registers in mind. Graph coloring algorithms don't do too well on architectures with fewer registers, like our poor x86.
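To give a rough idea of what a graph colorer does under the hood, here's a bare-bones greedy colorer of my own (nothing like the real Chaitin/Briggs algorithms, which also do simplification, coalescing, and proper spill handling). Nodes are virtual registers, an edge means two values are live at the same time, and a "color" is a physical register:

#include <stdio.h>

#define NVREGS 5   /* virtual registers to allocate         */
#define NPHYS  3   /* physical registers ("colors") on hand */

/* interferes[i][j] = 1 means virtual registers i and j are live at the
   same time, so they can't share a physical register. */
static int interferes[NVREGS][NVREGS] = {
    {0,1,1,0,0},
    {1,0,1,1,0},
    {1,1,0,1,0},
    {0,1,1,0,1},
    {0,0,0,1,0},
};

int main(void)
{
    int color[NVREGS];
    for (int v = 0; v < NVREGS; v++) {
        int used[NPHYS] = {0};
        for (int u = 0; u < v; u++)          /* colors already taken by neighbors */
            if (interferes[v][u] && color[u] >= 0)
                used[color[u]] = 1;
        color[v] = -1;                       /* -1 means "spill to memory" */
        for (int c = 0; c < NPHYS; c++)
            if (!used[c]) { color[v] = c; break; }
        if (color[v] < 0)
            printf("vreg %d -> spill\n", v);
        else
            printf("vreg %d -> physical register %d\n", v, color[v]);
    }
    return 0;
}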
Enter integer programming.
This paper describes an approach to register allocation that uses something called integer programming to solve the register allocation problem for irregular architectures. The best part about this paper is that the x86 was selected as a test case to see how well it did. Other papers so far have dealt with different architectures or described things in general terms, but with this paper we have a method that is directly applicable to the x86.
The authors set up their algorithm, test it against GCC's graph coloring allocator, and get some surprising results: their IP allocator reduces register allocation overhead by 61% and produces faster code!
This is a must read paper if you are going to build a compiler for the x86 architecture.
GRAPH-COLORING REGISTER ALLOCATION FOR IRREGULAR ARCHITECTURES
http://citeseer.ist.psu.edu/382779.html
Just to be fair, I thought I'd let these guys talk about how they allocate registers on irregular architectures using a modified graph coloring algorithm.
They discuss several modifications to the graph coloring algorithm, and their work seems to solve the problem of irregular architectures not being amenable to graph coloring. They don't mention any real performance statistics, though, so I'm a little suspicious as to whether this is better than the integer programming approach.
They mention the above paper in their references section, but they don't comment on its performance vis-à-vis their algorithm. The cynic in my head thinks that their algorithm, though better than other graph colorers, didn't measure up as favorably against an integer programming method. But that's just my suspicion.
OPTIMAL SPILLING FOR CISC MACHINES WITH FEW REGISTERS
http://citeseer.ist.psu.edu/339374.html
This is a more up-to-date approach to solving the register allocation problem for x86 architectures.
It mentions integer programming as a solution to register allocation and the current work being done with it. From what I can gather, they got pretty far, with promising results, but ran into a problem with optimal register coalescing. They gave two solutions, but neither was very satisfactory, as both traded compilation speed against the speed of the executable.
But this paper came two years after the other paper regarding IP register allocation, so it does bring you more up to speed with the current state of the field.
I'd suggest reading PRECISE REGISTER ALLOCATION FOR IRREGULAR ARCHITECTURES first and then this one, to get a better grasp of the challenges we face.
EFFECTIVE INSTRUCTION SCHEDULING WITH LIMITED REGISTERS
http://www.eecs.harvard.edu/hube/publications/cheng-thesis.pdf
This one is a big one. It's someone's thesis on optimal instruction scheduling. There is a whole lot here, and it contains a lot of useful information. Some of it applies to register allocation as well; in fact, there is an entire appendix devoted to graph coloring.
But that aside, instruction scheduling is very important, especially on the x86 with its multiple pipelines. Properly scheduled instructions can boost performance severalfold, depending on the architecture. For example, AMD's Athlon processors can execute multiple independent ADD instructions in a single clock cycle.
This paper, although really long (134 pages), could prove rather useful in boosting the performance of your compiled applications.
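As a quick illustration of why scheduling matters (my own example, nothing from the thesis): the first loop below is one long dependency chain, while the second splits the sum across independent accumulators so the processor's parallel execution units actually have something to work on each cycle. A good scheduler tries to arrange instructions along the same lines without being handed the rewritten version.

/* One long dependency chain: every add has to wait for the one before it. */
double sum_chained(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains: the adds can overlap in the pipelines. */
double sum_scheduled(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)       /* leftovers */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}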
EFFICIENT INSTRUCTION SCHEDULING USING FINITE STATE AUTOMATA
http://citeseer.ist.psu.edu/205264.html
This one I included just for fun. I found a reference to it in one of the other papers' bibliographies and thought it'd be interesting.
It deals with instruction scheduling using two finite state automata and produces some interesting results. I found the paper to be quite descriptive, if a little dated (1995).
You might want to give it a try when you have the spare time or become interested.
REGISTER ALLOCATION VIA GRAPH COLORING
http://citeseer.ist.psu.edu/43119.html
No mention of graph coloring would be complete without mentioning Briggs. I heard that this was recommended reading for anyone interested in graph coloring, and I found out why. This 154-page thesis details extensively the history of graph coloring and its modern application.
Although it was written in 1992, you'll see a lot of references to it in any paper that concerns graph coloring. It's a rather interesting read, and I'd suggest giving it a look even if you end up deciding against graph coloring as your register allocation scheme of choice.
PROCEEDINGS OF THE GCC DEVELOPER'S SUMMIT 2003
http://www.linux.org.uk/~ajh/gcc/gccsummit-2003-proceedings.pdf
I thought I'd add this in as the icing on the cake.
There are a ton of interesting papers in this PDF and I think it is definitely worth looking into.
There is just too much to summarise here. Check it out if you have some (okay, a lot of) spare time.
Now that those are out of the way, why don't I discuss some interesting links I've managed to dig up.
First up, a page on compiling Java code that gave some interesting references:
http://www.bearcave.com/software/java/comp_java.html
Look toward the bottom under "References". You should see several descriptions of books that may be of interest to you.
I also found someone's notes on the compiler that they implemented. The compiler is aimed at x86 architecture, and it looks like there could be several interesting tidbits of information here:
http://www.smlnj.org//compiler-notes/
Of particular note are his notes on implementing register allocation for the x86. The notes are in PostScript, though, so make sure you have something like Ghostscript to read them.
I don't know if you've seen this one before, but it is a very helpful page on code optimization.
http://www.azillionmonkeys.com/qed/optimize.html
Well that's all for now. Hopefully next time I'll be posting PureBasic code on how to build a .exe.