Geek Culture / Take on C ; (And all the great links on Compilers)

Author
Message
Neophyte
22
Years of Service
User Offline
Joined: 23rd Feb 2003
Location: United States
Posted: 9th Oct 2006 03:22
Creating a Compiler - Tutorial 1 - Linked Lists

One of the most important data structures in a compiler, and in programming in general, is the linked list. It is the basic means by which we store items of data in sequential fashion when we don't know ahead of time how many items we will have. It is also the first data structure we will encounter in our compiler.

Our Linked List code uses what's called a list header. A list header stores a pointer to the current item in our list as well as a pointer to the first item in the list, the number of items in the list, and the size of the data held in each item. The following type represents our header:
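A minimal sketch of what this header type might look like (the type and field names are assumptions based on the description; note that ItemH, shown a little further down, has to be defined above this type in the actual file):

Type ListH
    FirstItem As ItemH Ptr      ' first item in the list
    CurrentItem As ItemH Ptr    ' the item we are currently on
    ItemCount As UInteger       ' number of items in the list
    DataSize As UInteger        ' size of the data held in each item
End Type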


The ItemH type is the Item Header which prefaces each item. It contains a pointer to the next item and a pointer to the previous item. So each item in a list would look something like this:
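Roughly, each item is one block of memory with the header sitting in front of the data:

[ NextItem Ptr | PrevItem Ptr | ...the item's data (DataSize bytes)... ]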


Here is the type that represents the item header:
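A sketch of the item header:

Type ItemH
    NextItem As ItemH Ptr   ' pointer to the next item (0 if none)
    PrevItem As ItemH Ptr   ' pointer to the previous item (0 if none)
End Type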


Now here is the code we will use for initializing our linked list:
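Something along these lines (Callocate is used so all fields start out zeroed):

Function InitList(ByVal DataSize As UInteger) As ListH Ptr
    Dim As ListH Ptr L = Callocate(SizeOf(ListH))
    L->DataSize = DataSize   ' remember how big each item's data is
    Return L
End Function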


The above function returns a pointer to the list header which will be used by all of our linked list functions. We pass in an unsigned integer which represents the size of the data we will store in our linked list. Here is the code for the add item function:
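A sketch of AddItem that matches the description below, allocating the item header and its data as one block:

Sub AddItem(ByVal L As ListH Ptr)
    Dim As ItemH Ptr NewItem = Callocate(SizeOf(ItemH) + L->DataSize)
    If L->CurrentItem = 0 Then
        L->FirstItem = NewItem            ' very first item in the list
    Else
        ' Splice the new item in after the current one,
        ' fixing up the neighbours' pointers.
        NewItem->NextItem = L->CurrentItem->NextItem
        NewItem->PrevItem = L->CurrentItem
        If L->CurrentItem->NextItem <> 0 Then
            L->CurrentItem->NextItem->PrevItem = NewItem
        End If
        L->CurrentItem->NextItem = NewItem
    End If
    L->CurrentItem = NewItem              ' the new item becomes current
    L->ItemCount += 1
End Sub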



A good portion of the code is needed to fix up the addresses of the items ahead of and behind the item we are adding. Whenever we add an item, it becomes our current item. If we were to add a few items, then move to the middle of the list and add another, we would have to fix the previous item's pointer to point to our new item, and likewise fix the next item's pointer to point back to our new item.

An Example:
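Assuming a list holding A and B, with A as the current item, adding C gives:

Before AddItem:  A <-> B          (current = A)
After AddItem:   A <-> C <-> B    (current = C)

A's NextItem pointer and B's PrevItem pointer both had to be rewired to point at C.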


Now that we know how to add items we can move on to editing their data. The following function will get the pointer to the current item's data:
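A sketch, using the pointer arithmetic explained just below:

Function GetItemData(ByVal L As ListH Ptr) As Any Ptr
    ' Step over the item header; the +1 is scaled by SizeOf(ItemH)
    Return L->CurrentItem + 1
End Function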



An odd feature of FreeBasic that was introduced sometime around version .13 is that when you add an integer to a pointer, the integer is scaled by the size of the pointer's type. For example, say you have the following type:
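For instance (the name MyType is just for illustration):

Type MyType
    A As Integer   ' 4 bytes on 32-bit FreeBasic
    B As Integer   ' 4 bytes
End Type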


And you have a pointer for this type called MyPtr. If you were to add 1 to this pointer you would increment the pointer by 8, since the size of the type is 8 bytes (two 4-byte integers). So the following code:
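MyPtr += 1   ' advances the pointer by SizeOf(MyType) = 8 bytes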


Is equivalent to this in version .12 or less of the FreeBasic compiler:
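Something like (in .12 and earlier, pointer math was in plain bytes, so we add SizeOf(MyType) ourselves):

MyPtr += 8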


So when you see the following line of code:
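From the GetItemData sketch above:

Return L->CurrentItem + 1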

What is going on is that the list header's pointer to the current item is incremented by the size of the item header. The pointer then points to the item's data, and that is what gets returned.

Once we have a series of items we will then need to traverse our list to access them. There are two functions that we use for traversing a list: FirstItem() and NextItem().

Here is the fairly simple code for FirstItem():
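A sketch (both traversal routines just move the CurrentItem pointer):

Sub FirstItem(ByVal L As ListH Ptr)
    L->CurrentItem = L->FirstItem   ' rewind to the start of the list
End Sub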


Here is the code for NextItem():
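And its counterpart:

Sub NextItem(ByVal L As ListH Ptr)
    If L->CurrentItem <> 0 Then
        L->CurrentItem = L->CurrentItem->NextItem   ' becomes 0 once we walk off the end
    End If
End Sub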


Both are rather self-explanatory, so let's move on to an actual example of the above functions in action. The following piece of code demonstrates how to create an item, assign data to it, and read it back:
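Something like this, storing a plain integer in each item:

Dim As ListH Ptr MyList = InitList(SizeOf(Integer))
Dim As Integer Ptr D

AddItem(MyList)           ' create an item; it becomes the current item
D = GetItemData(MyList)   ' get a pointer to its data
*D = 42                   ' assign data to it
Print *D                  ' read it back
Sleep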



Finally, we move on to looping through each item. There are a variety of ways to walk through each item in a list, but we'll stick to the simplest for now.
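A sketch of the simplest traversal, continuing the integer-list example:

FirstItem(MyList)
Do While MyList->CurrentItem <> 0
    D = GetItemData(MyList)
    Print *D
    NextItem(MyList)
Loop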



The above piece of code will loop through all items in the list. It also safely skips over the loop if there are no items in the list.

We will be revisiting this file later, but for now the functions we've laid out will suffice for our purposes.
Neophyte
22
Years of Service
User Offline
Joined: 23rd Feb 2003
Location: United States
Posted: 9th Oct 2006 03:24
Here are the first two of my tutorials (well, the first one not counting the intro). I have more finished, but I'll start posting them next week. If you have any questions or think that something is lacking in the tutorials feel free to post. I'll try to answer any questions and edit the tutorials so that they make more sense as I go along.
Neophyte
22
Years of Service
User Offline
Joined: 23rd Feb 2003
Location: United States
Posted: 16th Oct 2006 07:06
Creating a Compiler - Tutorial 2 - Basic Lexing

Lexing is the process of taking a text file and breaking it up into pieces called "tokens". These tokens are the atomic pieces of a program that will be read by the parser and turned into our intermediate language. We will be working with these tokens quite a bit, as they form the basis on which we will evaluate the syntactic and semantic correctness of our program.

We'll start with the basic form of a token and in later tutorials move on to more complex definitions. The most basic form starts with any letter or a '_' followed by any number of alphanumeric characters or underscores. This is our standard ID token and it will be used for all of our variable and function names. We'll store our token in the following type:
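A sketch of the token type (a fixed 256-character buffer for the text, plus the line it came from; field names assumed):

Type Token
    TokenStr As ZString * 256   ' the token's text
    LineNumber As UInteger      ' the line it was found on
End Type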



Note: We will be modifying these types later in other tutorials.

Our main lexing function, which is called once per file, accepts a file name as input and returns a handle to a linked list containing our lexed file. Consequently, we will need to include our linked list file at the start of our lexer file.
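The include line, using the file name mentioned later in this thread:

#include "Linked List.bas"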



With that out of the way we can now move on to our main function:
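A sketch of this function (the name LexFile is assumed; LexLine is defined next):

Function LexFile(ByVal FileName As String) As ListH Ptr
    Dim As ListH Ptr TokenList = InitList(SizeOf(Token))
    Dim As String CurrentLine
    Dim As UInteger LineNumber
    Dim As Integer F = FreeFile

    Open FileName For Input As #F
    Do Until EOF(F)
        Line Input #F, CurrentLine
        LineNumber += 1
        LexLine(StrPtr(CurrentLine), TokenList, LineNumber)
    Loop
    Close #F

    Return TokenList
End Function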



As you can see, it is quite straightforward. After opening the file, the loop in the middle will read every line of text from the file and send it to a function called LexLine. The function LexLine is where the majority of the work in our lexer will occur.

Here is the LexLine function in its entirety:
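A sketch of the whole function, matching the walkthrough below:

Sub LexLine(ByVal CurrentLine As ZString Ptr, ByVal TokenList As ListH Ptr, ByVal LineNumber As UInteger)
    Dim As ZString * 256 TempToken   ' 255 characters + the null byte
    Dim As UInteger CurrentPosition  ' next write position in TempToken
    Dim As UByte CurrentChar         ' character currently being examined
    Dim As UInteger LineIndex        ' read position in CurrentLine
    Dim As Token Ptr T               ' points at each newly created list item

    ' Safety code: nothing in the current line means nothing to lex
    If CurrentLine = 0 OrElse CurrentLine[0] = 0 Then Exit Sub

    Do
        CurrentChar = CurrentLine[LineIndex]

        If CurrentChar >= Asc("A") And CurrentChar <= Asc("Z") Then
            TempToken[CurrentPosition] = CurrentChar : CurrentPosition += 1
        ElseIf CurrentChar >= Asc("a") And CurrentChar <= Asc("z") Then
            TempToken[CurrentPosition] = CurrentChar : CurrentPosition += 1
        ElseIf CurrentChar >= Asc("0") And CurrentChar <= Asc("9") Then
            TempToken[CurrentPosition] = CurrentChar : CurrentPosition += 1
        ElseIf CurrentChar = Asc(" ") Then
            TempToken[CurrentPosition] = 0       ' terminate the buffered token
            If TempToken[0] <> 0 Then            ' skip runs of spaces
                AddItem(TokenList)
                T = GetItemData(TokenList)
                T->TokenStr = TempToken
                T->LineNumber = LineNumber
            End If
            CurrentPosition = 0
        ElseIf CurrentChar = 0 Then              ' end of the line
            TempToken[CurrentPosition] = 0
            If TempToken[0] <> 0 Then
                AddItem(TokenList)
                T = GetItemData(TokenList)
                T->TokenStr = TempToken
                T->LineNumber = LineNumber
            End If
            Exit Sub
        Else
            Print "Error: unexpected character '"; Chr(CurrentChar); "' on line"; LineNumber; ", position"; LineIndex
            Sleep
            End
        End If

        LineIndex += 1
        If CurrentPosition > 255 Then
            Print "Error: token too long on line"; LineNumber
            Sleep
            End
        End If
    Loop
End Sub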


Now let's break it down piece by piece. The function has three parameters: a pointer to our current line, which contains the string of characters we want to turn into tokens; a pointer to a list header, which will allow us to add our tokens to a linked list; and finally, our current line number, which will be used by our error checking code to alert the programmer to the location of an offending token.

The next batch of code consists of a series of variable declarations that our function will be using. TempToken is the 256-byte buffer (255 characters + the null byte) which we will scan our tokens into as we build them up. We use a buffer because it is much faster than constantly resizing a dynamic string every time we add a new character to it. The downside to this approach is that we are limited to tokens that are less than 256 characters in length. Since longer tokens almost never occur in practice, our limit isn't much of a hindrance. However, if it becomes necessary to have tokens of greater length, it is possible to create a linked list of buffers. That is left to the reader as an exercise, though.

CurrentPosition is the variable that holds the position in TempToken where we will write our next character. It is incremented each time a character is added to TempToken, and set back to zero (the beginning of TempToken) each time a token is completed.

CurrentChar is the current character extracted from our line with each iteration of our main loop. It is tested against a set of rules to determine whether our lexer will add it to the current token or finalize that token.

LineIndex is the index into our current line. It is incremented every loop and is used to fetch our CurrentChar.

T is a temporary pointer of type "Token" used when we add a token to our linked list. In order to assign our token to its item in the list we must first get the pointer to that newly created item. That pointer is stored in T.

After the variable declarations is a single line of safety code. If we have nothing in our current line then we immediately exit. This occurs frequently, as every press of the enter key produces a line containing nothing but a null byte. Oftentimes, programmers hit enter repeatedly to space out their code and make it more readable, thus creating many empty lines.

We are now left with our main loop. This single DO-LOOP will iterate through every byte in our current line and break it up into a series of tokens. The first line of the loop fetches the current character that needs to be examined. After that, the character is tested against a series of rules to determine what the behavior of our lexer will be next.

Let's think of each If or ElseIf statement as a rule. The first two rules are for dealing with capital and lower-case letters of the alphabet. If we encounter them we add them to our TempToken buffer and move our CurrentPosition over to the next byte in the buffer to be written to. Fairly simple.

The third rule is for dealing with numbers. It is exactly like our first two rules, so we will move on to the next rule.

The next rule deals with what happens when we hit a space. A space is a delimiter for our lexer. A delimiter is a character (or, in some cases, a token) that indicates we have reached the end of the token we have been building up in TempToken, and that it is time to send it to our linked list. We add a null byte to our TempToken buffer so that only our most recent string is read, and not any junk after it that we may have accumulated in the course of lexing our file.

After that we encounter a conditional statement that checks for a null byte as the first character of our TempToken buffer. The reason for this is that multiple spaces can be packed together. After the first space there would be nothing in our TempToken, so there would be no point in creating a token which contains precisely nothing.

One of the fundamental goals of a lexer is to strip away all useless information and send only what is necessary to the syntax and semantic checker. This makes writing code for the syntax and semantic checker much easier, as fewer scenarios have to be considered and coded for.

Within our conditional statement are four lines of code that create our new item and assign our token and its line number to it. It is rather straightforward, so there is no real need to dwell on it.

After that, our CurrentPosition variable is reset to the start of our TempToken buffer, and that is the end of our rule concerning space characters.

Our final rule is for when we encounter the null byte at the end of our line. The null byte rule is quite similar to our space character rule. First we make sure the current token in our TempToken buffer is properly null terminated. Then, if there is anything in our TempToken buffer, we add it to our token list. Only instead of resetting our CurrentPosition variable, we exit our line lexing function, because we have no more line to lex.

At the end of our series of rules is a final else clause that serves as an all-purpose error catcher. If our lexer encounters a character it has no rule for, it throws an error and prints a message to the screen detailing the offending character and its position. The Sleep command is used as a simple wait-for-key function. After a key is pressed the program will simply exit. This is all rather basic at the moment and will be replaced later with more suitable error handling code, but for now it will suffice for our purposes.

At the tail-end of our main loop are two important pieces of code. The first merely increments our line index variable, allowing us to fetch the next character to be examined. The second is a test to see if our CurrentPosition variable has grown past the number of characters we can write into our buffer. If that error is detected, a simple message is displayed and the program exits.

Now that our simple lexer functions have been covered here is a simple demo program with a test file to put our code in action:
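A sketch of such a demo (the include name Lexer.bas is assumed for the lexing code above):

#include "Lexer.bas"

Dim As ListH Ptr Tokens = LexFile("Simple Example.txt")
Dim As Token Ptr T

FirstItem(Tokens)
Do While Tokens->CurrentItem <> 0
    T = GetItemData(Tokens)
    Print "Line"; T->LineNumber; ": "; T->TokenStr
    NextItem(Tokens)
Loop
Sleep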



File to test it on. Named "Simple Example.txt":
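For example, a file containing:

Hello World
this is a test 123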


The program should output the following:
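With the sketches above and the sample file above, something like:

Line 1: Hello
Line 1: World
Line 2: this
Line 2: is
Line 2: a
Line 2: test
Line 2: 123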



This concludes tutorial 2 in our series. Next up is a more advanced look at lexing where we finally get to lex our first actual assembly program!
Neophyte
22
Years of Service
User Offline
Joined: 23rd Feb 2003
Location: United States
Posted: 23rd Oct 2006 05:51
Creating a Compiler - Tutorial 3 - Advanced Lexing


Now that the basics have been covered it is time to move on to the more advanced aspects of lexing. In this installment of our tutorial series we'll go over lexing strings and the end-of-line character. This will allow us to lex our first complete, functional assembly program.

File named "Source.txt":


The first line of code imports the function ExitProcess from kernel32.lib using "__imp__ExitProcess@4" as the symbol to search for. The actual code for the program is just two instructions: a PUSH instruction, which pushes 0 onto the stack, and a CALL instruction, which calls the imported function ExitProcess. Basically, the program starts and then immediately quits. There is not much to it, but it is a valid program and Windows will run it, so it is a start.

There are a lot of new characters introduced in this source file. Let's start with the first new one we encounter: the quote character ("). When we encounter a quote mark we want the lexer to include everything after it in the same token, regardless of its content, up until we encounter another quote mark. So we will create a new mode for our lexer which skips the normal set of rules and uses a special set for when we encounter a quote character.

First we add the variable "Mode" at the top of our LexLine function:
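For instance:

Dim As UInteger Mode   ' 0 = normal lexing, 1 = string lexing mode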


Then we wrap our rules with a giant if clause like this one:
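Sketched as:

If Mode = 0 Then
    ' ...all of the existing rules from Tutorial 2...
ElseIf Mode = 1 Then
    ' ...the new string lexing rules, added below...
End If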


Now in the regular rules we add a new rule underneath our space rule:
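A sketch of the quote rule (an ElseIf alongside the others; the doubled quote is how FreeBasic embeds a quote character in a string literal):

ElseIf CurrentChar = Asc("""") Then
    TempToken[CurrentPosition] = 0
    If TempToken[0] <> 0 Then        ' finalize any token in progress
        AddItem(TokenList)
        T = GetItemData(TokenList)
        T->TokenStr = TempToken
        T->LineNumber = LineNumber
    End If
    CurrentPosition = 0
    Mode = 1                         ' enter string lexing mode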



This new rule starts out much like our space rule. Whatever is in our TempToken is finalized and sent to our token list if there is anything in it. The CurrentPosition variable is then reset so we can begin a new token. What makes this rule different from the space rule is that one little line at the end where Mode is set to 1. This is our flag variable that lets our lexer know when it is time to enter our string lexing mode.

Now we go down to the ElseIf clause for Mode = 1 that we added earlier. We add the following three rules:
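Sketched as:

If CurrentChar = Asc("""") Then          ' closing quote
    TempToken[CurrentPosition] = 0
    AddItem(TokenList)
    T = GetItemData(TokenList)
    T->TokenStr = TempToken
    T->LineNumber = LineNumber
    CurrentPosition = 0
    Mode = 0                             ' back to normal lexing mode
ElseIf CurrentChar = 0 Then              ' end of line inside a string
    Print "Error: unterminated string literal on line"; LineNumber
    Sleep
    End
Else                                     ' catch rule: keep everything
    TempToken[CurrentPosition] = CurrentChar
    CurrentPosition += 1
End If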



The first rule simply checks for another quote mark. When it encounters one, it takes whatever is in our TempToken and adds it to our token list. After resetting our CurrentPosition variable to point to the beginning of our TempToken buffer, it returns Mode to our normal lexing mode.

The second rule checks to see if we hit the end of the line. This shouldn't occur, because we need a second quote to close off our string literal; if it does, we have an error. An error message is printed and then the program will exit after a key press.

The third rule is just the catch rule. Whatever it encounters in our CurrentChar it will add to our TempToken buffer.

Now that we have our string literal the question arises, "how do we tell it apart from a normal token?" The answer is simple: with tags. Tags are bit flags that allow us to pass information about the tokens to later stages of our compiler. It is a way of pre-calculating certain identifying information for quick retrieval later. In this case, we will modify our above code to set a StringLiteral flag so we can determine whether we have a string literal or a regular token on our hands when we are parsing.

So let's add our Tag type at the top of our file:
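One way to write it is as an enum of bit flags:

Enum Tag
    StringLiteral = 1
End Enum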



If it looks a little sparse, that's because we haven't added all of our flags yet. Later on we will add more flags as they become necessary.

Add the following line of code to the bottom of our Token Type:
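That is:

Tags As UInteger   ' bit flags from the Tag enum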



After that we can begin modifying our string lexing rules. The only rule we really need to change is the first one. After this line of code:
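Presumably the line that fills in the token's line number:

T->LineNumber = LineNumber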


We insert this piece of code:
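Namely:

T->Tags Or= StringLiteral   ' mark this token as a string literal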


And that finishes that particular task.

Moving on to our next lexing task. In parsing there is a need to tell where one statement begins and another ends. In some languages, like C, there is a special character designated for this job: the ";" character. But in BASIC and assembly, the null byte at the end of the line must suffice. Therefore, we need to keep track of the null byte and give it its own token.

There is a danger in just blindly adding a new token every time we come across a new null byte. Frequently, we will encounter multiple blank lines between statements. If we add a new token every time we encounter a null byte, it will look like we have multiple statements consisting of nothing. In our pursuit of minimal wasted tokens, we will have to track how many tokens we add on each line. If none were added when we go to create a token for our null byte, then we don't add a null token. This keeps the number of end-of-line tokens down to one per statement.

Add this declaration to the top of our LexLine function:
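For instance (the name is an assumption):

Dim As UInteger TokensThisLine   ' how many tokens this line has produced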


Now, each time we add a new token, place this line of code after it:
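That is:

TokensThisLine += 1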



Simple, no?

We'll also add an EOL (end of line) flag to our Tag type.
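The Tag enum then becomes:

Enum Tag
    StringLiteral = 1
    EOL = 2
End Enum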


Now rewrite our NULL byte rule so it looks like this:
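A sketch of the rewritten rule:

ElseIf CurrentChar = 0 Then
    TempToken[CurrentPosition] = 0
    If TempToken[0] <> 0 Then
        AddItem(TokenList)
        T = GetItemData(TokenList)
        T->TokenStr = TempToken
        T->LineNumber = LineNumber
        TokensThisLine += 1
    End If
    If TokensThisLine > 0 Then   ' only emit an EOL token for non-empty lines
        AddItem(TokenList)
        T = GetItemData(TokenList)
        T->TokenStr = ""
        T->LineNumber = LineNumber
        T->Tags Or= EOL
    End If
    Exit Sub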



That finishes this tutorial's additions to our compiler. Below is a small demo program that you can use to test the new additions.
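A sketch of such a demo, again assuming the include name Lexer.bas:

#include "Lexer.bas"

Dim As ListH Ptr Tokens = LexFile("Source.txt")
Dim As Token Ptr T

FirstItem(Tokens)
Do While Tokens->CurrentItem <> 0
    T = GetItemData(Tokens)
    If T->Tags And EOL Then
        Print "<EOL>"
    ElseIf T->Tags And StringLiteral Then
        Print "String: "; T->TokenStr
    Else
        Print "Token : "; T->TokenStr
    End If
    NextItem(Tokens)
Loop
Sleep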



Should output:
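With the sketches above, roughly:

Token : IMPORT
String: __imp__ExitProcess@4
Token : AS
Token : ExitProcess
Token : FROM
String: kernel32.lib
<EOL>
Token : PUSH
Token : 0
<EOL>
Token : CALL
Token : ExitProcess
<EOL>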
Neophyte
22
Years of Service
User Offline
Joined: 23rd Feb 2003
Location: United States
Posted: 30th Oct 2006 13:23
Creating a Compiler - Tutorial 4 - Symbol Table

Before we move on to syntax checking we must first learn one of the most pivotal data structures in a compiler after the linked list: the symbol table. A symbol table is a data structure that allows you to store information by a "key." This is similar to how an array allows you to store information by an integer index number. The "key" for our symbol tables will be a string that holds an identifier. Each symbol table holds a different kind of identifier and its associated information.

Let's start with the initialization of the symbol table. Our symbol table will use a linked list to hold the necessary information. This is a simple and straightforward implementation of a symbol table. However, it is not very efficient. Right now we'll focus on simplicity and functionality and get to performance later.

Top of the symbol table file:
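A sketch:

#include "Linked List.bas"

Type Table
    Offset As UInteger        ' offset of the key field within the symbol type
    SymbolList As ListH Ptr   ' one list item per symbol
End Type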


Initialization function:
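Sketched as (InitTable is an assumed name):

Function InitTable(ByVal DataSize As UInteger, ByVal KeyOffset As UInteger) As Table Ptr
    Dim As Table Ptr T = Callocate(SizeOf(Table))
    T->SymbolList = InitList(DataSize)   ' symbols live in an ordinary linked list
    T->Offset = KeyOffset                ' where the key sits inside each symbol
    Return T
End Function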


A brief explanation of the Table type is in order. Offset is the offset into the symbol datatype of the field which holds our key. SymbolList is a pointer to the list header for all of our symbols. The items in a symbol list share a single datatype designed for the particular identifier that the symbol table will hold. For example, say we want a symbol table to hold all of our imported functions, which we will store by name. So we create a type called ImportFunc which includes a field called FuncName that will hold an imported function's name. When we create our symbol table we pass in two pieces of information. The first is the size of our type, which in our example is the size of ImportFunc. The second is the offset into that type where our symbol key will be stored, which in this case is the offset of FuncName.

So you would see something like the following line of code when initializing your symbol tables:
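For the ImportFunc example, assuming FreeBasic's OffsetOf is available:

Dim As Table Ptr ImportTable = InitTable(SizeOf(ImportFunc), OffsetOf(ImportFunc, FuncName))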


Adding symbols to your symbol tables is an easy process. Just pass in the pointer to your symbol table header and the new key to your symbol.

All that the AddSymbol function does internally is add another item to the SymbolList and fill in part of the item data with the key.
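A sketch, assuming the key field is a ZString buffer inside the symbol type:

Sub AddSymbol(ByVal T As Table Ptr, ByVal Key As String)
    AddItem(T->SymbolList)
    ' The key field lives Offset bytes into the new item's data
    Dim As ZString Ptr KeyField = Cast(ZString Ptr, _
        Cast(UByte Ptr, GetItemData(T->SymbolList)) + T->Offset)
    *KeyField = Key
End Sub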



Finding a symbol in a symbol table is very straightforward. You merely pass the pointer to your symbol table header and the key you are searching for.

Internally, all that the FindSymbol does is cycle through the linked list trying to match the key to any symbols in the list.
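A sketch; on success the list's current item is left on the match, so GetSymbolData can fetch it:

Function FindSymbol(ByVal T As Table Ptr, ByVal Key As String) As UInteger
    Dim As ZString Ptr KeyField
    FirstItem(T->SymbolList)
    Do While T->SymbolList->CurrentItem <> 0
        KeyField = Cast(ZString Ptr, _
            Cast(UByte Ptr, GetItemData(T->SymbolList)) + T->Offset)
        If *KeyField = Key Then Return 1
        NextItem(T->SymbolList)
    Loop
    Return 0
End Function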



Getting the symbol data is just like getting the item data from an item in a linked list. In fact, the GetSymbolData function is just a wrapper for the GetItemData function.
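That is:

Function GetSymbolData(ByVal T As Table Ptr) As Any Ptr
    Return GetItemData(T->SymbolList)   ' just forwards to the linked list
End Function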



That covers the basics of our symbol table functions. Below is a small demo program to demonstrate them.

Symbol Table Demo program:
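A sketch of such a demo (the include name Symbol Table.bas is assumed):

#include "Symbol Table.bas"

Type ImportFunc
    FuncName As ZString * 256   ' the key
    LibName As ZString * 256
End Type

Dim As Table Ptr Imports = InitTable(SizeOf(ImportFunc), OffsetOf(ImportFunc, FuncName))
Dim As ImportFunc Ptr F

AddSymbol(Imports, "ExitProcess")
F = GetSymbolData(Imports)        ' the new symbol is the current item
F->LibName = "kernel32.lib"

If FindSymbol(Imports, "ExitProcess") Then
    F = GetSymbolData(Imports)
    Print F->FuncName; " imported from "; F->LibName
Else
    Print "Symbol not found"
End If
Sleep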
MikeS
Retired Moderator
22
Years of Service
User Offline
Joined: 2nd Dec 2002
Location: United States
Posted: 1st Jan 2007 00:43
Wow Neophyte, although I'm months late in finding this, this is excellent to see all the information you've put up. I'm going to spend the rest of my holiday break, breaking this down.



A book? I hate book. Book is stupid.
(Formerly Yellow)
TKF15H
21
Years of Service
User Offline
Joined: 20th Jul 2003
Location: Rio de Janeiro
Posted: 1st Jan 2007 01:00
Whoa! I hadn't seen this either! o_O
*copies*

Three Score
20
Years of Service
User Offline
Joined: 18th Jun 2004
Location: behind you
Posted: 1st Jan 2007 08:50
Me neither!!

/me compiles it into a nice text file I can access offline

Neophyte
22
Years of Service
User Offline
Joined: 23rd Feb 2003
Location: United States
Posted: 22nd Jan 2007 05:32 Edited at: 22nd Jan 2007 05:34
Creating a Compiler - Tutorial 5 - Syntax Checking

Syntax is the form or structure of a language. It is a series of rules that determine what is and is not a valid sequence of tokens. Semantics is the meaning of a language. It is a series of rules that determine what exactly the meaning of the token sequence is and can be. Together they form what is called the grammar of the computer language.

In this compiler tutorial we will be looking at the former.

Before we begin writing our syntax checking functions it will help to first review the parsing expression grammar we will be using. Our parsing expression grammar is essentially a hierarchy of rules. Each rule has a one-to-one correspondence with a syntax checking function. For example, consider the following rule: PUSHOP -> 'PUSH' LIT_NUM. On the left side of the arrow is our function and rule name, PUSHOP. To the right of the arrow is the series of rules that are contained within our PUSHOP rule. The first rule encountered is the word "PUSH" enclosed in apostrophes. Whenever you see a word in apostrophes in our grammar, that means the token our function is examining must match that word exactly. Afterwards there is LIT_NUM. Since this isn't surrounded by apostrophes, it is the name of another rule used within our PUSHOP rule. In this case, it refers to the literal number function, which checks to see if the token is a literal number or not. Putting it all together, our token sequence conforms to the PUSHOP rule if it starts with a token containing 'PUSH' followed by a literal number.

Now let's try another. Our next rule is called CALLOP. It consists of the following: CALLOP -> 'CALL' ID. It is quite similar to PUSHOP except for one crucial difference. Instead of LIT_NUM we have ID. ID is a function like LIT_NUM except it checks to see if the token is a valid identifier. You will recall from our lexing tutorial that an identifier is a token that begins with either a letter of any case or the symbol "_".

Observant readers will note that by now we have just defined the syntax for the two assembly instructions in the source program we have been using for our tutorials so far. However, we haven't really demonstrated how they fit into the hierarchy of the grammar. So without further ado I present to you a more complete grammar:

UNARYOP -> PUSHOP | CALLOP
PUSHOP -> 'PUSH' LIT_NUM
CALLOP -> 'CALL' ID

The first rule is called the UNARYOP rule. Both arguments of the UNARYOP rule happen to be other syntax rules. When a new syntax rule is encountered it is usually defined immediately below where it was encountered. The lower a rule sits in the hierarchy, the less it relies on other syntax rules.

Of particular interest is the occurrence of the '|' operator. The '|' operator means that either the argument on the left of '|' or the argument on the right of '|' can be true in order for the rule to hold. So in order for a sequence of tokens to count as a UNARYOP it must be either a PUSHOP or a CALLOP.

There are many different kinds of operators that can occur in our grammar. They typically occur directly after an argument or in between two arguments. Here is a brief list of some of the most common operators:

Operator: ?
Occurrence: Right after an argument.
Meaning: The preceding argument can occur in a rule or it can be omitted. It is optional.

Operator: *
Occurrence: Right after an argument.
Meaning: There can be 0 or more occurrences of the rule.

Operator: +
Occurrence: Right after an argument.
Meaning: There can be 1 or more occurrences of the rule.

The last operator in the above list is the only other operator we will be using in our grammar for now. The complete grammar for our assembly program is as follows:

Program -> Line+ EOF
Line -> IMPORTDECL | UNARYOP EOL
IMPORTDECL -> 'IMPORT' STR_LIT 'AS' ID 'FROM' STR_LIT
UNARYOP -> PUSHOP | CALLOP
PUSHOP -> 'PUSH' LIT_NUM
CALLOP -> 'CALL' ID

The Line and IMPORTDECL rules are fairly self-explanatory, so let us focus on the first rule, the Program rule. Program consists of a Line rule and an EOF rule. EOF is rather simple: it stands for end of file. Line is defined below, but it is followed by something we haven't encountered before: the + operator. As explained previously, the + operator means 1 or more instances of the preceding argument can occur. In our case, this means that 1 or more lines can occur in a program. Simple, no?

Computer language design can get quite complicated so having a parsing expression grammar around can help to organize your thoughts and prevent you from making language design errors.

Now that the theory is out of the way we can get into the practical side of our tutorial: the coding.

There are two primary tasks that our syntax checking code must accomplish. The first is to gather any type or declaration information for the semantic checking code, which will be explained in a future tutorial. The second is to check the list of tokens to make sure that they are syntactically correct according to our parsing expression grammar. In this tutorial we will deal only with the syntax checking task. The information gathering task will be explained in the next tutorial, dealing with semantic checking.

Let's start with our main syntax checking function, appropriately titled SyntaxCheck:
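A sketch matching the description below:

Sub SyntaxCheck(ByVal TokenList As ListH Ptr)
    If TokenList->ItemCount = 0 Then Exit Sub   ' safety check: no tokens at all
    FirstItem(TokenList)
    Do
        LineSyntaxCheck(TokenList)   ' checks one line of tokens, ending on its EOL
        If IsLastItem(TokenList) Then Exit Do
        NextItem(TokenList)          ' step onto the first token of the next line
    Loop
End Sub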



SyntaxCheck takes a list header as an argument and checks all tokens in that list. It starts by doing a safety check to see if there are any tokens actually in the list. Once that is established it moves on to a basic for-each loop, iterating through each line of tokens in the list. The bulk of the work is done in the LineSyntaxCheck function. Very little will change about the SyntaxCheck function because it is quite basic.

Of note is a new linked list function called 'IsLastItem'. It checks to see if the item in the list is the last item. The code for it is rather simple and should be placed in the Linked List.bas file. It is as follows:
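A sketch:

Function IsLastItem(ByVal L As ListH Ptr) As UInteger
    If L->CurrentItem = 0 Then Return 1
    If L->CurrentItem->NextItem = 0 Then Return 1
    Return 0
End Function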



With that out of the way, we can move on to our LineSyntaxCheck function:
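A sketch:

Sub LineSyntaxCheck(ByVal TokenList As ListH Ptr)
    Dim As Token Ptr T

    ' Line -> IMPORTDECL | UNARYOP EOL
    If ImportDecl(TokenList) = 0 Then
        If UnaryOp(TokenList) = 0 Then
            T = GetItemData(TokenList)
            Print "Syntax error on line"; T->LineNumber
            Sleep
            End
        End If
    End If

    ' Both rules leave us on their last token; the next token must be EOL
    NextItem(TokenList)
    T = GetItemData(TokenList)
    If (T->Tags And EOL) = 0 Then
        Print "Syntax error: expected end of line on line"; T->LineNumber
        Sleep
        End
    End If
End Sub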


LineSyntaxCheck is structured in a very straightforward way. For each argument in our Line rule we have a function that corresponds to the rule name. Recall that our Line rule in our grammar currently looks like this:

Line -> IMPORTDECL | UNARYOP EOL

In English this would read something like this:

Check to see if our ImportDecl rule passes. If it fails, check to see if our UnaryOp rule passes. If both fail, then we have a failure of the Line rule. Otherwise, check to see if our next token is an End Of Line token. If it is, then our Line rule passes. This description closely matches the above code.

Before we get into the code for ImportDecl and UnaryOp, we should first cover two basic utility functions that will be used in both rules, but aren't defined elsewhere. The first is called "IsID" and basically checks to see whether a token is classified as an identifier or not. The second utility function is called "IsNumber" and is similar to "IsID", only it checks a token to see if it is a number or not.

Utility Functions:
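Sketches of both (a first-character test for IDs, an all-digits test for numbers):

Function IsID(ByVal T As Token Ptr) As UInteger
    Dim As UByte C = T->TokenStr[0]
    ' An identifier starts with a letter of either case or an underscore
    If C >= Asc("A") And C <= Asc("Z") Then Return 1
    If C >= Asc("a") And C <= Asc("z") Then Return 1
    If C = Asc("_") Then Return 1
    Return 0
End Function

Function IsNumber(ByVal T As Token Ptr) As UInteger
    Dim As UInteger I
    If Len(T->TokenStr) = 0 Then Return 0
    For I = 0 To Len(T->TokenStr) - 1
        If T->TokenStr[I] < Asc("0") Or T->TokenStr[I] > Asc("9") Then Return 0
    Next
    Return 1
End Function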


The code for the ImportDecl rule's function is quite long so we will break it down into parts. The first part is the function declaration and a variable declaration:
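Sketched as:

Function ImportDecl(ByVal TokenList As ListH Ptr) As UInteger
    Dim As Token Ptr S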


ImportDecl takes a list header as an argument. This list header is for our token list. The function also returns an unsigned integer. This is used for signaling whether the rule succeeded or failed (True or False, 1 or 0). The S variable that is declared will be a temporary variable that holds a pointer to the current token being checked.

Now there are two kinds of checks that will occur within this function: terminal and non-terminal. Non-terminal checks occur at the beginning, before any keywords are encountered. It is possible for these checks to fail but the program to still be valid. Case in point: say we have a statement that reads PUSH 0 and we run the ImportDecl function on it. Obviously, since the statement is not an import declaration, the ImportDecl function should fail. But we want to continue on with our syntax checking, so it shouldn't halt the compiler. Thus the check can be labeled non-terminal, since it allows future syntax checking functions to look over the code to see if it makes sense to them.

Terminal checks, however, must halt the compiler because there can be no alternative. The sequence is guaranteed to be invalid, so there is no need to run any further checks on it. A good example of this would be an import declaration that is missing an 'AS' keyword. After we encounter an 'IMPORT' keyword we can be sure that what follows must be an import declaration. So any failure to hold to the grammar should be viewed as a syntax error and not just a mismatched rule. Halting the compiler and reporting the error then becomes necessary.

Our first non-terminal check looks like this:
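Something like:

    ' Non-terminal check: no IMPORT keyword means this isn't our rule
    S = GetItemData(TokenList)
    If (S->Tags And EOL) <> 0 Then Return 0
    If S->TokenStr <> "IMPORT" Then Return 0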



The first thing our code does is fetch a pointer to the current token so we can read its contents. Then it checks the token tag to make sure that we aren't at the end of a line for some reason. After that comes our first and only non-terminal check. The first thing that should be encountered in an import declaration is the 'IMPORT' keyword. If we don't have that then we can't have an import declaration, and we should return false.

After our only non-terminal check we move on to a series of terminal checks for the rest of the statement:
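Sketched as:

    ' Terminal check: the symbol name must be a string literal
    NextItem(TokenList)
    S = GetItemData(TokenList)
    If (S->Tags And StringLiteral) = 0 Then
        Print "Syntax error: expected a symbol name string after IMPORT on line"; S->LineNumber
        Sleep
        End
    End If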



The above piece of code follows a pattern that will soon become familiar. First it advances to the next token. Each terminal check is responsible for placing itself on the correct token from where the preceding check left off. Next it will, of course, fetch a pointer to the data for the token. Afterwards we arrive at the terminal check itself. For the above check we see if we have a string literal or not. Recall that after our IMPORT keyword there is a string literal that represents the symbol name of the function we want to import.

The question of what to do when an error is encountered is a sticky one. Advanced compilers have the capacity to report the encountered error and then carry on to find all the other errors in the program. Accomplishing this is quite complicated and beyond the scope of this tutorial series. A more basic approach is to simply report the first error that is encountered and exit the compiler. We will be taking the simpler approach for this tutorial. It is left to the reader as an exercise to implement a more advanced approach if they think it is necessary.

The next check is for the keyword 'AS'. It follows the pattern as well.
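Sketched as:

    NextItem(TokenList)
    S = GetItemData(TokenList)
    If S->TokenStr <> "AS" Then
        Print "Syntax error: expected AS on line"; S->LineNumber
        Sleep
        End
    End If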



The following check is for an identifier. This identifier will be the ID of the symbol we will use to call the imported function.
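Sketched as:

    NextItem(TokenList)
    S = GetItemData(TokenList)
    If IsID(S) = 0 Then
        Print "Syntax error: expected an identifier after AS on line"; S->LineNumber
        Sleep
        End
    End If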



After our identifier we require the keyword 'FROM'.
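Sketched as:

    NextItem(TokenList)
    S = GetItemData(TokenList)
    If S->TokenStr <> "FROM" Then
        Print "Syntax error: expected FROM on line"; S->LineNumber
        Sleep
        End
    End If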



And finally, we require a string literal that will identify the library that the import declaration is importing from. If this check succeeds then we can exit the import declaration function with success.
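Sketched as:

    NextItem(TokenList)
    S = GetItemData(TokenList)
    If (S->Tags And StringLiteral) = 0 Then
        Print "Syntax error: expected a library name string after FROM on line"; S->LineNumber
        Sleep
        End
    End If
    Return 1
End Function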



With our import declaration code out of the way we can move on to the last syntax rule for this tutorial: the UnaryOp rule. Recall that our grammar for the UnaryOp rule looked like this:
UNARYOP -> PUSHOP | CALLOP

The UnaryOp rule consists of two rules lower down the hierarchy: PushOp and CallOp. So the rule should look like the following in code:
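A sketch:

Function UnaryOp(ByVal TokenList As ListH Ptr) As UInteger
    ' UNARYOP -> PUSHOP | CALLOP
    If PushOp(TokenList) Then Return 1
    If CallOp(TokenList) Then Return 1
    Return 0
End Function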



The grammar of the PushOp and CallOp rules consists of a keyword followed by an argument. For the PUSHOP rule the code will look like the following:
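A sketch; the keyword test is non-terminal, the argument test is terminal:

Function PushOp(ByVal TokenList As ListH Ptr) As UInteger
    Dim As Token Ptr S = GetItemData(TokenList)
    If (S->Tags And EOL) <> 0 Then Return 0
    If S->TokenStr <> "PUSH" Then Return 0
    NextItem(TokenList)
    S = GetItemData(TokenList)
    If IsNumber(S) = 0 Then
        Print "Syntax error: expected a literal number after PUSH on line"; S->LineNumber
        Sleep
        End
    End If
    Return 1
End Function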



For the CallOp rule, the code will look like this:
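And its twin:

Function CallOp(ByVal TokenList As ListH Ptr) As UInteger
    Dim As Token Ptr S = GetItemData(TokenList)
    If (S->Tags And EOL) <> 0 Then Return 0
    If S->TokenStr <> "CALL" Then Return 0
    NextItem(TokenList)
    S = GetItemData(TokenList)
    If IsID(S) = 0 Then
        Print "Syntax error: expected an identifier after CALL on line"; S->LineNumber
        Sleep
        End
    End If
    Return 1
End Function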



This concludes Tutorial 5 of our Creating a Compiler series. As always, please feel free to post about any non-working code, or helpful tips for improving the series. Next up, the Parsing Semantics phase and the data structures that support it.
Neophyte
22
Years of Service
User Offline
Joined: 23rd Feb 2003
Location: United States
Posted: 22nd Jan 2007 05:40
Sorry for the delay between posts in the series. I've been busy with work and this tutorial was proving harder to organize and write than I anticipated. On the plus side, I've reevaluated some of the early code that I wrote and improved it for this tutorial. It will be making its way into my compiler eventually.

The next few tutorials should be much quicker to write up until I get to the part about explaining how to write a COFF object file. I'm dreading having to explain that part of my code.

Also, let me know if some of this code isn't compiling. It is somewhat different than what I have in my current working version. I've made quite a few minor changes and I'm not 100% sure that it is in sync with what I have working on file.
TKF15H
21
Years of Service
User Offline
Joined: 20th Jul 2003
Location: Rio de Janeiro
Posted: 22nd Jan 2007 15:05
Really good work!
I'm waiting for what comes next, my compiler already gets code and turns it into an ARM program, I'd like to see the x86 equivalent. ^_^
Keep it up!

PowerSoft
20
Years of Service
User Offline
Joined: 10th Oct 2004
Location: United Kingdom
Posted: 25th Mar 2007 19:21 Edited at: 25th Mar 2007 19:22
So, a few years on... how are everyone's efforts going? I've started to get some design docs written...

The Innuendo's, 4 Piece Indie Rock Band
http://theinnuendos.tk:::http://myspace.com/theinnuendosrock
Kevin Picone
22
Years of Service
User Offline
Joined: 27th Aug 2002
Location: Australia
Posted: 25th Mar 2007 20:45 Edited at: 11th Aug 2010 22:13
PowerSoft
20
Years of Service
User Offline
Joined: 10th Oct 2004
Location: United Kingdom
Posted: 25th Mar 2007 21:12
Ah yes, PlayBASIC...

The Innuendo's, 4 Piece Indie Rock Band
http://theinnuendos.tk:::http://myspace.com/theinnuendosrock
MikeS
Retired Moderator
22
Years of Service
User Offline
Joined: 2nd Dec 2002
Location: United States
Posted: 25th Mar 2007 22:25
I've been extremely busy with school, so progress is pretty much at a halt. I'll probably have more time during spring break as it approaches. No progress has been lost at least; I have all the resources I need, it's really just a matter of building it now.



A book? I hate book. Book is stupid.
(Formerly Yellow)
Kentaree
22
Years of Service
User Offline
Joined: 5th Oct 2002
Location: Clonmel, Ireland
Posted: 26th Mar 2007 13:42
I wrote a simple compiler that integrates with GCC for my college project about a year and a half ago, and I'm designing another language at the moment which will hopefully be independent.

MikeS
Retired Moderator
22
Years of Service
User Offline
Joined: 2nd Dec 2002
Location: United States
Posted: 12th Jun 2007 06:49
Still adding some links to the beast. Finishing this compiler and doing some other 3D work are my main summer projects as far as coding goes. Sooner or later, it will be finished.

http://www.avhohlov.narod.ru/p9800en.htm
Pretty good site with lots of compilers/interpreters with source. A few of them are under 1000 lines as well, so a good starting point for beginners.

http://www.freetechbooks.com/forum-14.html
Many many books on compiler construction

A book? I hate book. Book is stupid.
(Formerly Yellow)
Three Score
20
Years of Service
User Offline
Joined: 18th Jun 2004
Location: behind you
Posted: 13th Jun 2007 08:44
wow...this thread is still alive!? lol

I've wanted to build a compiler quite a few times..though I just don't see the point when I know C and C++..

Open86 --My Emulator (now with it's first super alpha release
I'm addicted to placebo's...I would quit but it wouldn't mean anything! lol
MikeS
Retired Moderator
22
Years of Service
User Offline
Joined: 2nd Dec 2002
Location: United States
Posted: 13th Jun 2007 19:50
Yup, this thread will remain alive as long as I'm an active forum user (which I've been for roughly 5 years, so it's not going to die soon). I'm convinced that this thread alone is one of the best modern resources on the internet for compiler building.

Quote: "though I just don't see the point when I know C and C++.."

The chances of us making a compiler more optimized than either C or C++ compilers on the market are slim. That's not necessarily the point though. Think of DBP. Its compiler has been built specifically for game creation, making game creation a very simple process for users like us.

The most important part of compiler design however, would have to be parsing. That is something that can be applied to many many things, other than just compilers.




A book? I hate book. Book is stupid.
(Formerly Yellow)
PowerSoft
20
Years of Service
User Offline
Joined: 10th Oct 2004
Location: United Kingdom
Posted: 13th Jun 2007 21:18
Also it's good to know what makes things tick to enable you to use them in an even better way.

The Innuendo's, 4 Piece Indie Rock Band
http://theinnuendos.tk:::http://myspace.com/theinnuendosrock
PowerSoft
20
Years of Service
User Offline
Joined: 10th Oct 2004
Location: United Kingdom
Posted: 27th Aug 2007 12:04
Sorry, finger slipped.

The Innuendo's, 4 Piece Indie Rock Band
http://theinnuendos.tk:::http://myspace.com/theinnuendosrock
MikeS
Retired Moderator
22
Years of Service
User Offline
Joined: 2nd Dec 2002
Location: United States
Posted: 1st Apr 2009 06:13
Just a small update. I haven't given up on this very very long term goal. I am taking a compiler building class at university, and will post my knowledge and maybe a finished product after I've completed the course. Keep on working on all of your compilers, and don't give up!



A book? I hate book. Book is stupid.
(Formerly Yellow)
Kevin Picone
22
Years of Service
User Offline
Joined: 27th Aug 2002
Location: Australia
Posted: 1st Apr 2009 06:45 Edited at: 11th Aug 2010 22:13
Wow, talk about a blast from the past..

MikeS
Retired Moderator
22
Years of Service
User Offline
Joined: 2nd Dec 2002
Location: United States
Posted: 1st Apr 2009 07:14
Heh, yup, and we're all still here. I'm convinced that this thread is one of the most valuable resources on the internet for compiler design. Hope everyone can give their updates on their progress if they're still lurking around.



A book? I hate book. Book is stupid.
(Formerly Yellow)
MIDN90
16
Years of Service
User Offline
Joined: 11th Mar 2009
Location: Colville, Washington
Posted: 2nd Apr 2009 05:42
College?
JoelJ
21
Years of Service
User Offline
Joined: 8th Sep 2003
Location: UTAH
Posted: 2nd Apr 2009 06:06
Quote: "College? "

What?

Your mother has been erased by a mod because it's larger than 600x120
MIDN90
16
Years of Service
User Offline
Joined: 11th Mar 2009
Location: Colville, Washington
Posted: 2nd Apr 2009 06:07
Is MikeS in college?
MikeS
Retired Moderator
22
Years of Service
User Offline
Joined: 2nd Dec 2002
Location: United States
Posted: 4th Jun 2009 05:52
Haha, yes I am indeed in college. As of today I have completed my compiler for an assembly based language. It's been quite the journey, and getting started early (4ish years ago) helped me really cruise through this course.

I am going to continue to pursue some compiler design classes in college and develop some more advanced languages.

After I finish this term, I may write a brief tutorial on a simple compiler or interpreter now that I understand all of the pieces.

I haven't any designs on my own compiler, but I may toy around with some ideas, or at least build some tools to assist myself and others in writing their own compilers. I hope everyone else's projects are still coming along. It took me longer than expected, but I finally finished!

I might recommend to others looking to get started in compiler design to review any university's compiler building classes, especially if they can find PowerPoints or notes online. This is a nice way to guide yourself through building a compiler. For those who want some keywords to Google, search for them in this order: tokenizing, parsing, object file, linker/loader, and executable format. These are some nice keywords to look at, and I've listed them in roughly the order you'll need to learn them to have a finished product.

More to come soon!

A book? I hate book. Book is stupid.
(Formerly Yellow)
