Take on C ; (And all the great links on Compiliers)

Author

Message

Neophyte

22

Years of Service

User Offline

Joined: 23rd Feb 2003

Location: United States

Posted: 9th Oct 2006 03:22

Link

Creating a Compiler - Tutorial 1 - Linked Lists

One of the most important data structures in a compiler, and in programming in general, is the linked list. It is the basic means by which we store items of data in sequential fashion when we don't know ahead of time how many items we will have. It is also the first data structure we will encounter in our compiler.

Our Linked List code uses what's called a list header. A list header stores a pointer to the current item in our list as well as a pointer to the first item in the list, the number of items in the list, and the size of the data held in each item. The following type represents our header:

+ Code Snippet

Type List
    pFirst as ItemH ptr
    pCurrent as ItemH ptr
    Number as uinteger
    Size as uinteger
End Type

The ItemH type is the Item Header which prefaces each item. It contains a pointer to the next item and a pointer to the previous item. So each item in a list would look something like this:

+ Code Snippet

[List Header]--->[Item Header]--->[Item Header]--->[Item Header]
[    3      ]    [           ]<---[           ]<---[           ]
                 [   Data    ]    [   Data    ]    [   Data    ]

Here is the type that represents the item header. Because List refers to ItemH, this type has to be declared before List in the actual source file:

+ Code Snippet

Type ItemH
    pPrev as ItemH ptr
    pNext as ItemH ptr
End Type

Now here is the code we will use for initializing our linked list:

+ Code Snippet

Function MakeList(SizeOfListItem as uinteger) as List ptr
    Dim MyList as List ptr
    'make list header
    MyList = allocate(Len(List))
    MyList->Size = SizeOfListItem
    'initialize data
    MyList->pFirst = 0
    MyList->pCurrent = 0
    MyList->Number = 0
    MakeList = MyList
End Function

The above function returns a pointer to the list header which will be used by all of our linked list functions. We pass in an unsigned integer which represents the size of the data we will store in our linked list. Here is the code for the add item function:

+ Code Snippet

Sub AddItem(MyList as List ptr)
  Dim pItem as ItemH ptr
  
  'An item's total size in bytes is equal to the size of its header
  'and the size of the data it stores.
  pItem = callocate(MyList->Size + Len(ItemH))
  
  'If we have any items in our list
  If MyList->Number > 0 then
    'Fix up existing items
    'If we have an item ahead of our current one
    If MyList->pCurrent->pNext <> 0 then
      MyList->pCurrent->pNext->pPrev = pItem
    End If
    'Fix up our item to point to the next item.
    pItem->pNext = MyList->pCurrent->pNext
    
    'Make sure our new item points to our current (
    'soon to be previous)item.
    pItem->pPrev = MyList->pCurrent
    
    'Point our current(soon to be previous) item
    'to our next item.
    MyList->pCurrent->pNext = pItem
  Else
    'Update list header. Since we don't have any
    'items in our list this will be our first one
    'so make sure it is pointed to as the first one
    'by the list header
    MyList->pFirst = pItem
    'Make sure our fields are blank
    pItem->pPrev = 0
    pItem->pNext = 0
  End If
  
  'Update list header. Each item that we add becomes
  'the current one
  MyList->pCurrent = pItem
  MyList->Number += 1
End Sub

A good portion of the code is needed to fix up the addresses of the items ahead of and behind the item we are adding. Whenever we add an item it becomes our current item. If we were to add a few items, then move to the middle of the list and add another item, we would have to fix the address in the previous item to point to our new item, and we would have to fix the address in the next item to point to our new item as well.

An Example:

+ Code Snippet

   
Before:
[ A ] --> [ B ] --> [ C ]
[   ] <-- [   ] <-- [   ]
            ^
            | 
         Current

After:
[ A ] --> [ B ] --> [ NEW ] --> [ C ]
[   ] <-- [   ] <-- [     ] <-- [   ]
                       ^
                       |
                    Current
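
To see the fix-up in action, here is a sketch that rebuilds the Before/After picture. It repeats condensed copies of the tutorial's list routines so that it can be compiled on its own. The Letter type, the AddEntry helper and the Walk function are hypothetical additions made for this demonstration only and are not part of the tutorial's files.

```freebasic
'Condensed copies of the tutorial's list routines
Type ItemH
    pPrev As ItemH Ptr
    pNext As ItemH Ptr
End Type

Type List
    pFirst As ItemH Ptr
    pCurrent As ItemH Ptr
    Number As UInteger
    Size As UInteger
End Type

Function MakeList(SizeOfListItem As UInteger) As List Ptr
    Dim MyList As List Ptr
    MyList = CAllocate(SizeOf(List)) 'zeroed, so the list starts out empty
    MyList->Size = SizeOfListItem
    MakeList = MyList
End Function

Sub AddItem(MyList As List Ptr)
    Dim pItem As ItemH Ptr
    pItem = CAllocate(MyList->Size + SizeOf(ItemH))
    If MyList->Number > 0 Then
        'Splice the new item in directly after the current one
        If MyList->pCurrent->pNext <> 0 Then
            MyList->pCurrent->pNext->pPrev = pItem
        End If
        pItem->pNext = MyList->pCurrent->pNext
        pItem->pPrev = MyList->pCurrent
        MyList->pCurrent->pNext = pItem
    Else
        MyList->pFirst = pItem
    End If
    MyList->pCurrent = pItem
    MyList->Number += 1
End Sub

'Hypothetical item type and helpers, for this demonstration only
Type Letter
    Caption As ZString * 8
End Type

Sub AddEntry(MyList As List Ptr, NewCaption As String)
    Dim E As Letter Ptr
    AddItem(MyList)
    E = CPtr(Letter Ptr, MyList->pCurrent + 1) 'the data sits right after the header
    E->Caption = NewCaption
End Sub

'Follow the pNext links from the first item and collect the captions
Function Walk(MyList As List Ptr) As String
    Dim Result As String
    Dim pItem As ItemH Ptr
    Dim E As Letter Ptr
    pItem = MyList->pFirst
    Do While pItem <> 0
        E = CPtr(Letter Ptr, pItem + 1)
        Result &= E->Caption & " "
        pItem = pItem->pNext
    Loop
    Walk = Result
End Function

Dim Letters As List Ptr
Letters = MakeList(SizeOf(Letter))
AddEntry(Letters, "A")
AddEntry(Letters, "B")
AddEntry(Letters, "C")
Print "Before: "; Walk(Letters)

'Move the current pointer back to B, then add in the middle
Letters->pCurrent = Letters->pFirst->pNext
AddEntry(Letters, "NEW")
Print "After:  "; Walk(Letters)
```

Walking the pNext links shows NEW sitting between B and C, which is only possible because AddItem patched both neighbours. The sketch never frees its memory, which is acceptable for a throwaway test but not for the real compiler.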

Now that we know how to add items we can move on to editing their data. The following function will get the pointer to the current item in the list:

+ Code Snippet

Function GetItemData(List as List ptr) as any ptr
    If List->pCurrent <> 0 then
        'This ought to increment the pCurrent pointer by the
        'size of the Item header thus placing it at our Item data
        GetItemData = List->pCurrent + 1
    Else
        GetItemData = 0
    End If
End Function

An odd feature of FreeBasic that was introduced sometime around version 0.13 is that when you add an integer to a pointer, the integer is scaled by the size of the type the pointer points to. For example, say you have the following type:

+ Code Snippet

Type MyType
  A as integer
  B as integer
End Type

And you have a pointer to this type called MyPtr. If you were to add 1 to this pointer, you would advance it by the size of the type: 8 bytes when an integer is four bytes wide, as it is on 32-bit targets. So the following code:

+ Code Snippet

MyPtr = MyPtr + 1

Is equivalent to this in version 0.12 or earlier of the FreeBasic compiler:

+ Code Snippet

MyPtr = MyPtr + (1 * SizeOf(MyType))

So when you see the following line of code:

+ Code Snippet

GetItemData = List->pCurrent + 1

What is going on is that the List header is accessing the pointer to our current item and incrementing it by the size of the item header. The pointer is now pointing to the item's data and is returned.
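
If you want to check the scaling on your own compiler version, a small test along these lines should do. It is only a sketch: the Items array and the ByteGap variable are names made up for the test, and the byte counts it prints depend on how wide Integer is on your target.

```freebasic
Type MyType
    A As Integer
    B As Integer
End Type

Dim Items(0 To 1) As MyType
Dim MyPtr As MyType Ptr
Dim NextPtr As MyType Ptr
Dim ByteGap As Integer

MyPtr = @Items(0)
NextPtr = MyPtr + 1 'one whole MyType further on, not one byte

'Subtracting byte pointers gives the distance in bytes
ByteGap = CPtr(UByte Ptr, NextPtr) - CPtr(UByte Ptr, MyPtr)

Print "SizeOf(MyType):     "; SizeOf(MyType)
Print "Bytes moved by + 1: "; ByteGap
If NextPtr = @Items(1) Then Print "MyPtr + 1 points at the second element"
```

The two numbers printed should match, which is exactly the property GetItemData relies on to step over the item header.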

Once we have a series of items we will then need to traverse our list to access them. There are two functions that we use for traversing a list: FirstItem() and NextItem().

Here is the fairly simple code for FirstItem():

+ Code Snippet

Sub FirstItem(List as List ptr)

    List->pCurrent = List->pFirst
End Sub

Here is the code for NextItem():

+ Code Snippet

Sub NextItem(ListH as List ptr)
  'If we have any list items
  If ListH->pCurrent <> 0 then
    'If we have an item after our current one
    'set the current pointer to the next one
    If ListH->pCurrent->pNext <> 0 then
      ListH->pCurrent = ListH->pCurrent->pNext
    End If
  End If  
End Sub

Both are rather self-explanatory, so let's move on to an actual example of the above functions in action. The following piece of code demonstrates how to create an item, assign data to it, and read it back:

+ Code Snippet

'assumes linked list file is included

Type MyType
  A as integer
  B as integer
End Type

DIM OurList as List ptr
DIM pData as MyType ptr

'Make our linked list
OurList = MakeList(SizeOf(MyType))

'Create an item in our linked list
AddItem(OurList)

'Assign data to our new item
pData = GetItemData(OurList)
pData->A = 1
pData->B = 2

'Next item
AddItem(OurList)

pData = GetItemData(OurList)
pData->A = 3
pData->B = 4

'Now we go back to the start of the list to read our items
FirstItem(OurList)

pData = GetItemData(OurList)
Print "Item 1: Field A: " & pData->A
Print "Item 1: Field B: " & pData->B

NextItem(OurList)

pData = GetItemData(OurList)
Print "Item 2: Field A: " & pData->A
Print "Item 2: Field B: " & pData->B

sleep
end

Finally, we move on to looping through each item. There are a variety of ways to tour through each item in a list, but we'll stick to the simplest for now.

+ Code Snippet

'For Each loop over items
Dim i as uinteger
FirstItem(MyList)

For i = 1 to MyList->Number
  'Code for reading and writing to the items goes here.

  NextItem(MyList) 'Advance to the following item
Next

The above piece of code will loop through all of the items in the list. If the list is empty, the body of the loop is skipped entirely.

We will be revisiting this file later, but for now the functions we've laid out will suffice for our purposes.

Neophyte

Posted: 9th Oct 2006 03:24

Here are the first two of my tutorials(well first one not counting the intro). I have more finished, but I'll start posting them next week. If you have any questions or think that something is lacking in the tutorials feel free to post. I'll try to answer any questions and edit the tutorials so that they make more sense as I go along.

Neophyte

Posted: 16th Oct 2006 07:06

Creating a Compiler - Tutorial 2 - Basic Lexing

Lexing is the process of taking a text file and breaking it up into pieces called "Tokens". These tokens are the atomic pieces of a program that will be read by the parser and turned into our intermediate language. We will be working with these tokens quite a bit, as they form the basis on which we will evaluate the syntactic and semantic correctness of our program.

We'll start with the basic form of a token and in later tutorials move on to more complex definitions. The most basic form starts with any letter or a '_', followed by any number of alphanumeric symbols or underscores. This is our standard ID token and it will be used for all of our variable and function names. We'll store our token in the following type:

+ Code Snippet

Type Token
  Token as zstring * 256 '255 characters + the null byte
  LineNumber as uinteger
End Type

Note: We will be modifying these types later in other tutorials.

Our main lexing function, which is called once per file, accepts a file name as input and returns a handle to a linked list containing our lexed file. Consequently, we will need to include our linked list file at the start of our lexer file.
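
The ID rule described above (a letter or '_' first, then any mix of letters, digits and underscores) can be sketched as a small checker. IsIdentifier is a hypothetical helper written for illustration; it is not part of the tutorial's lexer, which at this stage does not enforce the rule.

```freebasic
'Returns -1 (true) if Candidate follows the ID rule, 0 (false) otherwise
Function IsIdentifier(Candidate As String) As Integer
    Dim i As Integer
    Dim c As UByte

    IsIdentifier = 0
    If Len(Candidate) = 0 Then Exit Function

    For i = 0 To Len(Candidate) - 1
        c = Candidate[i]
        Select Case c
        Case 65 To 90, 97 To 122, 95   'A-Z, a-z, _
            'Allowed in any position
        Case 48 To 57                  '0-9
            'Digits are not allowed in the first position
            If i = 0 Then Exit Function
        Case Else
            Exit Function
        End Select
    Next

    IsIdentifier = -1
End Function

If IsIdentifier("_count1") Then Print "_count1 is a valid ID"
If IsIdentifier("9lives") = 0 Then Print "9lives is not a valid ID"
If IsIdentifier("two words") = 0 Then Print "two words is not a valid ID"
```

The character codes are the same ASCII ranges that the lexer's rules test against.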

+ Code Snippet

#include once "Linked List.bas"

With that out of the way we can now move on to our main function:

+ Code Snippet

'LexLine is defined further down, so declare it before we call it
Declare Sub LexLine(CurrentLine as ubyte ptr, TokenList as List ptr, LineNumber as uinteger)

Function LexFile(FileName as string) as List ptr
    Dim CurrentLine as string
    Dim LineNumber as uinteger
    
    Open FileName For Input as #1
    'Holds lexed tokens
    Dim TokenList as List ptr
    TokenList = MakeList(Len(Token))
    
    'Read off every line from our text file
    Do Until Eof(1) 
      Line Input #1, CurrentLine
      LineNumber += 1
      
      'The goal of the lexer is to strip away all useless information
      'and format the characters in our text file into an easily manipulatable
      'format for the parsing routines of the syntax and semantic checkers.
      'LexLine expects a pointer to the characters of the line.
      LexLine(CPtr(ubyte ptr, StrPtr(CurrentLine)), TokenList, LineNumber) 

    Loop

    Close #1 'Close our file
    LexFile = TokenList 'Return our lexed file.
End Function

As you can see, it is quite straightforward. After opening the file, the loop in the middle will read every line of text from the file and send it to a function called LexLine. The function LexLine is where the majority of the work in our lexer will occur.

Here is the LexLine function in its entirety:

+ Code Snippet

Sub LexLine(CurrentLine as ubyte ptr, TokenList as List ptr, LineNumber as uinteger)
    Dim TempToken as zstring * 256 '255 characters + the null byte
    Dim CurrentPosition as uinteger 'Current position in our temporary token
    Dim CurrentChar as ubyte
    Dim LineIndex as uinteger
    
    Dim T as Token ptr 'Temp token data pointer
    
    'If there is nothing on our current line then exit.
    If CurrentLine = 0 then Exit Sub

    'Scan through our entire line until we hit a null byte
    Do
        CurrentChar = CurrentLine[LineIndex]
        
        'Rules for turning a character string into a token
        
        If CurrentChar >= 65 and CurrentChar <= 90 then 'A-Z
            TempToken[CurrentPosition] = CurrentChar
            CurrentPosition += 1
        ElseIf CurrentChar >= 97 and CurrentChar <= 122 then 'a - z
            TempToken[CurrentPosition] = CurrentChar
            CurrentPosition += 1
        ElseIf CurrentChar >= 48 and CurrentChar <= 57 then '0-9
            TempToken[CurrentPosition] = CurrentChar
            CurrentPosition += 1
        ElseIf CurrentChar = 32 then 'Space
            'We add this null terminator to our zstring
            'so it can recognize just our token and not
            'any garbage after it.
            TempToken[CurrentPosition] = 0 
            
            'Add a token to our token list and give it our
            'string if there is anything in our current token
            If TempToken[0] <> 0 then
                AddItem(TokenList)
                T = GetItemData(TokenList)
                T->Token = TempToken
                T->LineNumber = LineNumber
            End If
            
            CurrentPosition = 0 'Reset to beginning. New Token.
        ElseIf CurrentChar = 0 then 'NULL
            'If there is anything in our temp token
            'make it into a token.
            TempToken[CurrentPosition] = 0
            If TempToken[0] <> 0 then 
                AddItem(TokenList)
                T = GetItemData(TokenList)
                T->Token = TempToken
                T->LineNumber = LineNumber
            End If
            
            'Exit sub since we've hit the end of the line
            Exit Sub
        Else
            'Error. Unknown symbol.
            Print "ERROR. Unknown Symbol: " + Chr(CurrentChar)
            Print "Character Value: " + Str(CurrentChar)
            Print "Line Number: " + Str(LineNumber)
            sleep
            end
        End If
        
        LineIndex += 1 'Move on to next character in line
        If CurrentPosition = 256 then 'Buffer is full with no room left for the null byte
            TempToken[255] = 0 'NULL terminate our too-long token for printing
            Print "ERROR. Token exceeds maximum limit of 255 characters."
            Print "Token: " + TempToken
            Print "Line Number: " + Str(LineNumber)
            sleep
            end
        End If
    Loop
    
End Sub

Now let's break it down piece by piece. The function has three parameters. A pointer to our current line which contains the string of characters that we want to turn into a token. A pointer to a list header which will allow us to add our tokens to a linked list. And finally, our current line number. This will be used by our error checking code to alert the programmer to the location of the offending token.

The next batch of code consists of a series of variable declarations that our function will be using. TempToken is the 256-byte buffer (255 characters + the null byte) into which we scan our tokens as we build them up. We use a buffer because it is much faster than constantly resizing a dynamic string every time we add a new character to it. The downside to this approach is that we are limited to tokens that are less than 256 characters in length. Since longer tokens almost never occur in practice, our limit isn't much of a hindrance. However, if it becomes necessary to have tokens of a greater length, it is possible to create a linked list of buffers. It will be left to the reader to carry out that exercise though.

CurrentPosition is the variable that holds the current position in our TempToken where we will write our next character to. It is incremented each time a character is added to our TempToken. It is set to zero(the beginning of the TempToken) each time a token is completed.

CurrentChar is the current character that is extracted from our line with each iteration of our main loop. It is tested against a set of rules to determine whether our lexer will add it to our current token or finalize our current token.

LineIndex is the index into our current line. It is incremented every loop and is used to fetch our CurrentChar.

T is a temporary pointer of type "Token" used when we add a token to our linked list. In order to assign our token to its item in the list we must first get the pointer to that newly created item. That pointer is stored in T.

After the variable declarations is a single line of safety code. If we have nothing in our current line then we immediately exit. This occurs frequently, because every blank line in the source file reaches LexLine as an empty line. Programmers often hit enter repeatedly to space out their code and make it more readable, thus creating many empty lines.

We are now left with our main loop. This single DO-LOOP will iterate through every byte in our current line and break it up into a series of tokens. The first line of the loop fetches the current character that needs to be examined. After that, our character is tested against a series of rules to determine what the behavior of our lexer will be next.

Let's think of each If or ElseIf statement as a rule. The first two rules deal with capital and lower-case letters of the alphabet. If we encounter them we add them to our TempToken buffer and move our CurrentPosition over to the next byte in our buffer to be written to. Fairly simple.

The third rule is for dealing with numbers. It is exactly like our first two rules, so we will move on to our next rule.

The next rule deals with what happens when we hit a space. A space is a delimiter for our lexer: a character (or, in some cases, a token) that tells the lexer we have reached the end of the token we have been building up in TempToken and that it is time to send it to our linked list. We add a null byte to our TempToken buffer so that only our most recent string is read, and not any junk left after it from earlier tokens.

Right after that comes a conditional statement that checks for a null byte as the first character of our TempToken buffer. The reason is that several spaces can be packed together. After the first space there would be nothing in our TempToken, so there would be no point in creating a token which contains precisely nothing.

One of the fundamental goals of a lexer is to strip away all useless information and only send what is necessary to the syntax and semantic checkers. This makes writing code for the syntax and semantic checkers much easier, as fewer scenarios have to be considered and coded for.

Within our conditional statement are four lines of code that create our new item and assign our token and its line number to it. It is rather straightforward, so there is no real need to dwell on it.

After that our CurrentPosition variable is reset to the start of our TempToken buffer and that is the end of our rule concerning space characters.

Our final rule is for when we encounter the null byte at the end of our line. The null byte rule is quite similar to our space character rule. First we make sure the current token in our TempToken buffer is properly null terminated. Then, if there is anything in our TempToken buffer, we add it to our token list. Only instead of resetting our CurrentPosition variable we exit our line lexing function, because we now have no more line to lex.

At the end of our series of rules is a final else clause that serves as an all-purpose error catcher. If our lexer encounters a character which it has no rule for, it throws an error and prints a message to the screen detailing the offending character and the line it occurred on. The sleep command is used as a simple wait-for-key type function. After a key is pressed the program will simply exit. This is all rather basic at the moment and will be changed at a later date to a more suitable error catching piece of code, but for now it will suffice for our purposes.

At the tail end of our main loop are two important pieces of code. The first merely increments our line index variable, allowing us to fetch the next character that needs to be examined. The second is a test to see whether our CurrentPosition variable has run past the number of characters we can write into our buffer. If the error is detected, a simple message is displayed and the program exits.

Now that our simple lexer functions have been covered here is a simple demo program with a test file to put our code in action:

+ Code Snippet

option explicit 

#include once "Lexer.bas"
'Lex our example file
Dim lexlist as list ptr
lexlist = LexFile("Simple Example.txt")

'Print out all of the lexed tokens.
Dim i as uinteger
Dim t as token ptr
FirstItem(lexlist)
For i = 1 to lexlist->Number 'FOR EACH LOOP
    t = GetItemData(lexlist)
    Print t->LineNumber; " "; 'Preface each token with the line it occured on
    Print t->Token
    NextItem(lexlist)    
Next 
sleep 'Wait for key

File to test it on. Named "Simple Example.txt":

+ Code Snippet

Hello world
I am a source file
I contain 3 lines

The program should output the following:

+ Code Snippet

1 Hello
1 world
2 I
2 am
2 a
2 source
2 file
3 I
3 contain
3 3
3 lines

This concludes tutorial 2 in our series. Next up is a more advanced look at lexing where we finally get to lex our first actual assembly program!

Neophyte

Posted: 23rd Oct 2006 05:51

Creating a Compiler - Tutorial 3 - Advanced Lexing

Now that the basics have been covered it is time to move on to the more advanced aspects of lexing. In this installment of our tutorial series we'll go over lexing strings, and the end of line character. This will allow us to lex our first complete assembly program. Here is our first functional assembly program.

File named "Source.txt":

+ Code Snippet

IMPORT "__imp__ExitProcess@4" AS ExitProcess FROM "kernel32.lib"

PUSH 0
CALL ExitProcess

The first line of code imports the function ExitProcess from kernel32.lib using "__imp__ExitProcess@4" as the symbol to search for. The actual code for the program is just two instructions: a PUSH instruction which pushes 0 onto the stack, and a CALL instruction which calls the imported function ExitProcess. Basically, the program starts and then immediately quits. There is not much to it, but it is a valid program and Windows will run it, so it is a start.

There are a lot of new characters introduced in this source file. Let's start with the first new one we encounter: the quote character ("). When we encounter a quote mark we want the lexer to include everything after it in the same token, regardless of its content, up until we encounter another quote mark. So we will create a new mode for our lexer which skips the normal set of rules and uses a special set for when we encounter a quote character.

First we add the variable "Mode" at the top of our LexLine function:

+ Code Snippet

Dim Mode as uinteger

Then we wrap our rules with a giant if clause like this one:

+ Code Snippet

If Mode = 0 then 'Normal Mode

   'Regular rules are here

ElseIf Mode = 1 then 'String literal Mode

   'Rules when string lexing

End If

Now in the regular rules we add a new rule underneath our space rule:

+ Code Snippet

ElseIf CurrentChar = 34 then '"
    'If there is anything in our current token
    'make it into a token.
    TempToken[CurrentPosition] = 0
    If TempToken[0] <> 0 then
        AddItem(TokenList)
        T = GetItemData(TokenList)
        T->Token = TempToken
        T->LineNumber = LineNumber
    End If
                
    'From here on out we will be 
    'in string parsing mode where everything between
    'this quote and the next is added to the temp token
    'regardless of its content.
    CurrentPosition = 0
    Mode = 1

This new rule starts out much like our space rule. Whatever is in our TempToken is finalized and sent to our token list, if there is anything in it. The CurrentPosition variable is then reset so we can begin a new token. What makes this rule different from the space rule is that one little line at the end where Mode is set to 1. This is our flag variable that lets our lexer know when it is time to enter our string lexing mode.

Now we go down to the ElseIf clause for Mode = 1 that we added earlier. We add the following three rules:

+ Code Snippet

'Stop adding characters to the string literal when we
'hit a quote mark.
If CurrentChar = 34 then '"
    TempToken[CurrentPosition] = 0 'NULL
                
    AddItem(TokenList)
    T = GetItemData(TokenList)
    T->Token = TempToken

    T->LineNumber = LineNumber
                
    CurrentPosition = 0
    Mode = 0 'Return to normal mode
ElseIf CurrentChar = 0 then 'NULL
    'We shouldn't hit the end of a line without hitting
    'another string quote. So throw an error
    Print "ERROR! COULDN'T FIND NEEDED QUOTE BEFORE END OF LINE"
    Print "Line Number: " + Str(LineNumber)
    sleep
    end
Else
    'Add whatever we can find into our current token
    TempToken[CurrentPosition] = CurrentChar
    CurrentPosition +=1              
End If

The first rule simply checks for another quote mark. When it encounters one, it takes whatever is in our TempToken and adds it to our token list. After resetting our CurrentPosition variable to point to the beginning of our TempToken buffer, it returns Mode to our normal lexing mode.

The second rule checks to see if we hit the end of the line. This shouldn't occur because we need a second quote to close off our string literal which means we have an error. An error message is printed and then the program will exit after a key press.

The third rule is just the catch-all rule. Whatever it encounters in our CurrentChar it will add to our TempToken buffer.

Now that we have our string literal the question arises, "how do we tell it apart from a normal token?" The answer is simple: with tags. Tags are bit flags that allow us to pass information about the tokens to later stages of our compiler. It is a way of pre-calculating certain identifying information for quick retrieval later. In this case, we will modify our above code to set a StringLiteral flag so we can determine whether we have a string literal or a regular token on our hands when we are parsing.

So let's add our Tag type at the top of our file:
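
The two-mode idea can be tried out in isolation before touching LexLine. The following is only a sketch under simplifying assumptions: SplitLine is a hypothetical function, it collects characters in a dynamic string instead of the fixed TempToken buffer, it accepts any character in normal mode, and it returns the tokens joined with "|" instead of adding them to a linked list.

```freebasic
Function SplitLine(Source As String) As String
    Dim Mode As UInteger    '0 = normal mode, 1 = string literal mode
    Dim Result As String
    Dim Current As String
    Dim i As Integer
    Dim c As UByte

    For i = 0 To Len(Source) - 1
        c = Source[i]
        If Mode = 0 Then
            If c = 32 Or c = 34 Then 'a space or a quote ends the current token
                If Len(Current) > 0 Then Result &= Current & "|"
                Current = ""
                If c = 34 Then Mode = 1 'opening quote: switch rule sets
            Else
                Current &= Chr(c)
            End If
        Else
            If c = 34 Then 'closing quote: emit the literal, back to normal
                Result &= "'" & Current & "'|"
                Current = ""
                Mode = 0
            Else
                Current &= Chr(c) 'anything else belongs to the literal
            End If
        End If
    Next

    If Mode = 1 Then
        Result &= "<ERROR: missing closing quote>"
    ElseIf Len(Current) > 0 Then
        Result &= Current & "|"
    End If
    SplitLine = Result
End Function

Dim Q As String
Q = Chr(34) 'the quote character
Print SplitLine("IMPORT " & Q & "__imp__ExitProcess@4" & Q & " AS ExitProcess FROM " & Q & "kernel32.lib" & Q)
```

Spaces inside the quotes would be kept as part of the literal, since string literal mode never looks at the space rule.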

+ Code Snippet

Type TokenInfo
    StringLiteral : 1 as integer
End Type

If it looks a little sparse it's because we haven't added all of our flags in yet. Later on we will add more flags in when it becomes necessary.

Add the following line of code to the bottom of our Token Type:

+ Code Snippet

Tag as TokenInfo

After that we can begin modifying our string lexing rules. The only rule we really need to change is the first one. After this line of code:

+ Code Snippet

T->Token = TempToken

We insert this piece of code:

+ Code Snippet

T->Tag.StringLiteral = 1 'This allows us to precalculate
                         'that this is a string literal.
                         'This will allow us to quickly
                         'determine whether it is a string
                         'literal later.

And that finishes that particular task.

Moving on to our next lexing task. In parsing there is a need to tell where one statement ends and the next begins. In some languages, like C, there is a special character designated for this job: the ";" character. But in BASIC and assembly the null byte at the end of the line must suffice. Therefore, we need to keep track of the null byte and give it its own token.

There is a danger in just blindly adding a new token every time we come across a null byte. Frequently, we will encounter multiple blank lines between statements. If we add a new token every time we encounter a null byte, it will look like we have multiple statements consisting of nothing. In our pursuit of minimal wasted tokens, we will have to track how many tokens we add on each line. If none were added by the time we go to create a token for our null byte, then we don't add a null token. This keeps the number of null tokens down to one per statement.
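
As a rough illustration of how a later stage might read the flag, here is a stand-alone sketch. DemoInfo, DemoToken and Describe are hypothetical names chosen so the fragment compiles without the lexer files, and the flag is declared unsigned here as an assumption, so that it reads back as plain 0 or 1.

```freebasic
'Hypothetical stand-ins for the tutorial's TokenInfo and Token types
Type DemoInfo
    StringLiteral : 1 As UInteger 'assumption: unsigned one-bit flag
End Type

Type DemoToken
    Text As ZString * 256
    Tag As DemoInfo
End Type

'Decide how to treat a token by looking only at its tag
Function Describe(T As DemoToken) As String
    If T.Tag.StringLiteral = 1 Then
        Describe = "string literal '" & T.Text & "'"
    Else
        Describe = "plain token " & T.Text
    End If
End Function

Dim Plain As DemoToken
Dim Quoted As DemoToken

Plain.Text = "ExitProcess"
Quoted.Text = "kernel32.lib"
Quoted.Tag.StringLiteral = 1 'what the string rule does when it closes a literal

Print Describe(Plain)
Print Describe(Quoted)
```

The check is a single comparison, which is the point of precalculating the flag in the lexer instead of re-inspecting the token text later.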

Add this declaration to the top of our LexLine function:

+ Code Snippet

Dim NumberOfTokens as uinteger

Now, each time we add a new token, place this line of code after it:

+ Code Snippet

NumberOfTokens += 1

Simple, no?

We'll also add an EOL(end of line) flag to our tag.
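
The snippet for this step appears to be missing from the post. Judging by the T->Tag.EOL reference in the rewritten rule, it is presumably the TokenInfo type with one more one-bit flag added; the exact original is an assumption.

```freebasic
Type TokenInfo
    StringLiteral : 1 as integer
    EOL : 1 as integer 'assumed field: set on the end of line token
End Type
```

This replaces the earlier definition of TokenInfo rather than sitting alongside it.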

Now rewrite our NULL byte rule so it looks like this:

+ Code Snippet

ElseIf CurrentChar = 0 then 'NULL
    'If there is anything in our temp token
    'make it into a token.
    TempToken[CurrentPosition] = 0
    If TempToken[0] <> 0 then 
        AddItem(TokenList)
        T = GetItemData(TokenList)
        T->Token = TempToken
        T->LineNumber = LineNumber
        NumberOfTokens += 1
    End If
                
    'If we haven't made any tokens yet, that means there
    'was nothing in this line. So don't bother with creating
    'another end of line token
    If NumberOfTokens > 0 then
        'Add a special end of line token to mark where the
        'end of our line occurred
        AddItem(TokenList)
        T = GetItemData(TokenList)
        T->Tag.EOL = 1
        T->LineNumber = LineNumber
        NumberOfTokens += 1
    End If
                
    'Exit function since we've hit the end of the line
    Exit Function
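
The blank-line rule can also be tried in isolation. The following is only a sketch and not part of the tutorial's lexer: it assumes a hypothetical CountWords helper that splits on spaces in place of the real lexing rules, and a hypothetical CountEOLMarkers function that applies the "no tokens, no end of line token" test to each line.

```freebasic
'Hypothetical helper: returns how many words a line holds.
'Splitting on spaces stands in for the tutorial's real lexing rules.
Function CountWords(ByRef Text as string) as uinteger
    Dim i as integer
    Dim Length as integer
    Dim InWord as integer
    Dim Total as uinteger

    Length = Len(Text)
    For i = 0 To Length - 1
        If Text[i] = 32 then 'A space ends the current word
            InWord = 0
        ElseIf InWord = 0 then 'First character of a new word
            InWord = 1
            Total += 1
        End If
    Next
    Function = Total
End Function

'Hypothetical driver: returns how many end of line markers
'would be emitted for the given lines.
Function CountEOLMarkers(SourceLines() as string) as uinteger
    Dim i as integer
    Dim NumberOfTokens as uinteger
    Dim Markers as uinteger

    For i = LBound(SourceLines) To UBound(SourceLines)
        'The count starts over for every line
        NumberOfTokens = CountWords(SourceLines(i))
        'Only a line that produced tokens gets an end of line marker
        If NumberOfTokens > 0 then Markers += 1
    Next
    Function = Markers
End Function

Dim Source(0 To 3) as string
Source(0) = "IMPORT a AS b"
Source(1) = ""
Source(2) = "PUSH 0"
Source(3) = "CALL ExitProcess"

Print "End of line markers: "; CountEOLMarkers(Source())
```

Of the four lines above one is blank, so three markers are counted.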

That finishes this tutorial's additions to our compiler. Below is a small demo program that you can use to test the new additions to the compiler.

+ Code Snippet

'Lexer Demo 2

option explicit 

#include once "Lexer.bas"
'Lex our example file
Dim lexlist as list ptr
lexlist = LexFile("Source.txt")

'Print out all of the lexed tokens.
Dim i as uinteger
Dim t as token ptr
FirstItem(lexlist)
For i = 1 to lexlist->Number 'FOR EACH LOOP
    t = GetItemData(lexlist)
    Print t->LineNumber; " "; 'Preface each token with the line it occurred on
    If t->Tag.StringLiteral = 1 then Print "'"; 'If string literal preface w/ '
    Print t->Token;
    If t->Tag.EOL then Print "End Of Line"; 'If End Of Line token print this
    If t->Tag.StringLiteral = 1 then Print "'"; 'If string literal wrap w/ '
    Print "" 'New line
    NextItem(lexlist)    
Next 
sleep 'Wait for key

Should output:

+ Code Snippet

1 IMPORT
1 '__imp__ExitProcess@4'
1 AS
1 ExitProcess
1 FROM
1 'kernel32.lib'
1 End Of Line
3 PUSH
3 0
3 End Of Line
4 CALL
4 ExitProcess
4 End Of Line

Neophyte

Posted: 30th Oct 2006 13:23

Creating a Compiler - Tutorial 4 - Symbol Table

Before we move on to Syntax Checking we must first learn one of the most pivotal data structures in a compiler other than a linked list: a symbol table. A symbol table is a data structure that allows you to store information by a "key." This is similar to how an array allows you to store information by an integer index number. The "key" for our symbol tables will be a string that holds an identifier. Each symbol table holds a different kind of identifier and its associated information.

Let's start with the initialization of the symbol table. Our symbol table will use a linked list to hold the necessary information. This is a simple and straightforward implementation of a symbol table. However, it is not very efficient, since every lookup walks the list from the start. Right now we'll focus on simplicity and functionality and get to performance later.

Top of the symbol table file:

+ Code Snippet

#include once "linked list.bas"

Initialization function:

+ Code Snippet

Type Table
    Offset as uinteger
    SymbolList as List ptr
End Type

Function MakeSymbolTable(Size as uinteger, KeyOffset as uinteger) as Table ptr
    Dim MyTable as Table ptr
    MyTable = callocate(Len(Table))
    'We store our key in the datatype that the symbol table holds
    MyTable->Offset = KeyOffset
    MyTable->SymbolList = MakeList(Size)

    MakeSymbolTable = MyTable
End Function

A brief explanation of the Table type is in order. Offset is the offset into the symbol datatype at which our key is stored. SymbolList is a pointer to the list header for all of our symbols. Every item in that symbol list shares a single datatype designed for the particular identifier that our symbol table will hold. For example, say we want to have a symbol table to hold all of our imported functions, which we will store by name. So we create a type called ImportedFunc which includes a field called FuncName that will hold our imported function's name. When we create our symbol table we pass two pieces of information. The first is the size of our type, which in our example is the size of ImportedFunc. The second is the offset into that type where our symbol key will be stored, which in this case is the offset of FuncName.

So you would see something like the following line of code when initializing your symbol tables:

+ Code Snippet

ImportedFunctions = MakeSymbolTable(SizeOf(ImportedFunc), OffsetOf(ImportedFunc, FuncName))
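
To see what the offset buys us, here is a small stand-alone sketch. It assumes a hypothetical ImportedFuncSketch type (the tutorial has not shown the fields of its real ImportedFunc type) and writes a key into a record through a raw byte pointer, which is how the symbol table sees its items.

```freebasic
'Hypothetical symbol type, used only for this illustration.
Type ImportedFuncSketch
    Ordinal as uinteger
    FuncName as zstring * 64
End Type

Dim Sym as ImportedFuncSketch
Dim pData as ubyte ptr
Dim KeyOffset as uinteger

'Number of bytes from the start of the record to the key field
KeyOffset = OffsetOf(ImportedFuncSketch, FuncName)

'Treat the record as raw bytes, the way the symbol table does
pData = cptr(ubyte ptr, @Sym)
'Step forward to the key field and copy a key into it
*cptr(zstring ptr, pData + KeyOffset) = "ExitProcess"

Print "Key offset: "; KeyOffset
Print "Key read back through the field: "; Sym.FuncName
```

Because the table only ever works with a size and an offset, the same code can serve any record type that embeds its key as a zstring field.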

Adding symbols to your symbol tables is an easy process. Just pass in the pointer to your symbol table header and the new key to your symbol.

All that the AddSymbol function does internally is add another item to the SymbolList and fill in part of the item data with the key.

+ Code Snippet

SUB AddSymbol(MyTable as Table ptr, Key as zstring ptr)
    AddItem(MyTable->SymbolList)
    'Get symbol data pointer and point it to a place within our stored datatype that will hold
    'the symbol's key.
    Dim pData as ubyte ptr
    pData = GetItemData(MyTable->SymbolList)
    pData += MyTable->Offset
    'Insert Key into item's data type.
    *cptr(zstring ptr, pData) = *Key

End SUB

Finding a symbol in a symbol table is very straightforward. You merely pass the pointer to your symbol table header and the key you are searching for.

Internally, all that the FindSymbol does is cycle through the linked list trying to match the key to any symbols in the list.

+ Code Snippet

Function FindSymbol(MyTable as Table ptr, Key as zstring ptr) as uinteger
    Dim i as uinteger, pData as ubyte ptr

    FirstItem(MyTable->SymbolList)
    For i = 1 to MyTable->SymbolList->Number
        pData = GetItemData(MyTable->SymbolList)

        pData += MyTable->Offset

        If *Key = *cptr(zstring ptr, pData) then
            FindSymbol = 1
            Exit Function
        End If

        NextItem(MyTable->SymbolList)
    Next    
    FindSymbol = 0
End Function

Getting the symbol data is just like getting the item data from an item in a linked list. In fact, the GetSymbolData function is just a wrapper for the GetItemData function.

+ Code Snippet

Function GetSymbolData(MyTable as Table ptr) as any ptr
   GetSymbolData = GetItemData(MyTable->SymbolList)
End Function

That covers the basics of our symbol table functions. Below is a small demo program to demonstrate our new symbol table functions.

Symbol Table Demo program:

+ Code Snippet

#include once "Symbol Table.bas"

Type t1
    a as uinteger
    key as zstring * 255
    b as uinteger
End Type

Dim MyTable as Table ptr

MyTable = MakeSymbolTable(Len(t1), OffsetOf(t1, key))

AddSymbol(MyTable, "Hello")
AddSymbol(MyTable, "World")

Dim Success as uinteger
Success = FindSymbol(MyTable, "Hello")

Print Success
sleep

MikeS

Posted: 1st Jan 2007 00:43

Wow Neophyte, although I'm months late in finding this, this is excellent to see all the information you've put up. I'm going to spend the rest of my holiday break, breaking this down.

A book? I hate book. Book is stupid.
(Formerly Yellow)

TKF15H

Posted: 1st Jan 2007 01:00

Whoa! I hadn't seen this either! o_O
*copies*

My Blog

Three Score

Posted: 1st Jan 2007 08:50

Me neither!!

/me compiles it into a nice text file I can access offline

JouleOS and friends
FREE HOSTING! FREE virtual serial ports!

Neophyte

Posted: 22nd Jan 2007 05:32 Edited at: 22nd Jan 2007 05:34

Creating a Compiler - Tutorial 5 - Syntax Checking

Syntax is the form or structure of a language. It is a series of rules that determine what is and is not a valid sequence of tokens. Semantics is the meaning of a language. It is a series of rules that determine what exactly the meaning of the token sequence is and can be. Together they form what is called the grammar of the computer language.

In this compiler tutorial we will be looking at the former.

Before we begin writing our syntax checking functions it will help to first review the parsing expression grammar we will be using. Our parsing expression grammar is essentially a hierarchy of rules. Each rule has a one-to-one correspondence with a syntax checking function. For example, consider the following rule: PUSHOP -> 'PUSH' LIT_NUM. On the left side of the arrow is our function and rule name, PUSHOP. To the right of the arrow is the series of rules that are contained within our PUSHOP rule. The first rule encountered is the word "PUSH" enclosed in apostrophes. Whenever you see a word in apostrophes in our grammar, it means that the token our function is examining must match that word exactly (ignoring case, since the code later in this tutorial compares against the upper-cased token). Afterwards, there is LIT_NUM. Since this isn't surrounded by apostrophes, it is the name of another rule used by our PUSHOP rule. In this case, it refers to the literal number function, which checks to see if the token is a literal number or not. Putting it all together, our token sequence will conform to the PUSHOP rule if it starts with a token containing 'PUSH' followed by a literal number.

Now let's try another. Our next rule is called CALLOP. It consists of the following: CALLOP -> 'CALL' ID. It is quite similar to PUSHOP except for one crucial difference. Instead of LIT_NUM we have ID. ID is a function like LIT_NUM except it checks to see if the token is a valid identifier. You will recall from our lexing tutorial that an identifier is a token that begins with either a letter of any case or the symbol "_".

Observant readers will note that by now we have just defined the syntax for the two assembly instructions in our source program that we have been using for our tutorials so far. However, we haven't really demonstrated how they fit into the hierarchy of the grammar. So without further ado I present to you a more complete grammar:

UNARYOP -> PUSHOP | CALLOP
PUSHOP -> 'PUSH' LIT_NUM
CALLOP -> 'CALL' ID

The first rule is called the UNARYOP rule. Both arguments of the UNARYOP rule happen to be other syntax rules. When a new syntax rule is encountered it is usually defined immediately below where it was encountered. The lower the rule in the hierarchy, the less it relies on other syntax rules.

Of particular interest is the occurrence of the '|' operator. The '|' operator means that either the argument on the left of '|' or the argument on the right of '|' can match in order for the rule to hold. So in order for a sequence of tokens to count as a UNARYOP it must be either a PUSHOP or a CALLOP.

There are many different kinds of operators that can occur in our grammar. They typically occur directly after an argument or in between two arguments. Here is a brief list of some of the most common operators:

Operator: ?
Occurrence: Right after an argument.
Meaning: The preceding argument can occur in a rule or it can be omitted. It is optional.

Operator: *
Occurrence: Right after an argument.
Meaning: There can be 0 or more occurrences of the preceding argument.

Operator: +
Occurrence: Right after an argument.
Meaning: There can be 1 or more occurrences of the preceding argument.

The last operator in the above list is the only other operator we will be using in our grammar for now. The complete grammar for our assembly program is as follows:

Program -> Line+ EOF
Line -> (IMPORTDECL | UNARYOP) EOL
IMPORTDECL -> 'IMPORT' STR_LIT 'AS' ID 'FROM' STR_LIT
UNARYOP -> PUSHOP | CALLOP
PUSHOP -> 'PUSH' LIT_NUM
CALLOP -> 'CALL' ID

The Line and IMPORTDECL rules are fairly self-explanatory; note only that every line, whether it is an import declaration or a unary operation, must finish with an EOL token. So let us focus on the first rule, the Program rule. Program consists of a Line rule and an EOF rule. EOF is rather simple: it stands for end of file. Line is defined below it, but it is followed by something we haven't used before: the + operator. As explained previously, the + operator means 1 or more instances of the preceding argument can occur. In our case, this means that 1 or more lines can occur in a program. Simple, no?
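
In code, a "+" usually turns into a loop: match the argument once, then keep matching until it stops matching. The sketch below shows this under some loud assumptions. The tokens are plain strings in an array, "<EOL>" marks an end of line token, running off the end of the array stands in for EOF, and MatchLine and MatchProgram are hypothetical stand-ins rather than the tutorial's functions.

```freebasic
'Hypothetical stand-in for the Line rule: accepts any non-empty
'run of tokens that ends in "<EOL>" and advances Position past it.
Function MatchLine(Tokens() as string, ByRef Position as integer) as uinteger
    Dim Start as integer
    Start = Position
    Do While Position <= UBound(Tokens)
        If Tokens(Position) = "<EOL>" then
            If Position = Start then Exit Do 'Empty line: no match
            Position += 1 'Step past the marker
            Function = 1
            Exit Function
        End If
        Position += 1
    Loop
    Position = Start 'No match: put the position back
    Function = 0
End Function

'Program -> Line+ EOF
'Returns the number of lines matched, or 0 if the rule fails.
Function MatchProgram(Tokens() as string) as uinteger
    Dim Position as integer
    Dim LineCount as uinteger

    Position = LBound(Tokens)
    'Line+ : keep going for as long as the Line rule matches
    Do While MatchLine(Tokens(), Position) = 1
        LineCount += 1
    Loop
    'We need at least one line and nothing left over
    If LineCount >= 1 and Position > UBound(Tokens) then
        Function = LineCount
    Else
        Function = 0
    End If
End Function

Dim Demo(0 To 5) as string
Demo(0) = "PUSH" : Demo(1) = "0" : Demo(2) = "<EOL>"
Demo(3) = "CALL" : Demo(4) = "ExitProcess" : Demo(5) = "<EOL>"

Print "Lines matched: "; MatchProgram(Demo())
```

The demo array holds two complete lines, so two lines are matched. A "*" would use the same loop without the "at least one" test.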

Computer language design can get quite complicated so having a parsing expression grammar around can help to organize your thoughts and prevent you from making language design errors.

Now that the theory is out of the way we can get into the practical side of our tutorial: the coding.

There are two primary tasks that our syntax checking code must accomplish. The first is to gather any type or declaration information for the semantic checking code which will be explained in a future tutorial. The second is to check the list of tokens to make sure that they are syntactically correct according to our parsing expression grammar. In this tutorial we will deal only with the syntax checking task. The information gathering task will be explained in the next tutorial dealing with semantic checking.

Let's start with our main syntax checking function appropriately titled SyntaxCheck:

+ Code Snippet

SUB SyntaxCheck(TokenList as List ptr)
    Dim N as uinteger

    'If we don't have any tokens in our token list
    'we don't have anything to syntax check. So exit.
    N = TokenList->Number
    If N = 0 then Exit SUB

    'Start at the first token in our list.
    FirstItem(TokenList)
    'FOR EACH LOOP
    do
        'Syntax check the list of tokens for our current line.
        'This leaves us on the line's end of line token.
        LineSyntaxCheck(TokenList)

        'Stop once that end of line token is the last item.
        'Otherwise step to the first token of the next line.
        If IsLastItem(TokenList) = 1 then Exit Do
        NextItem(TokenList)
    loop

End SUB

SyntaxCheck takes a list header as an argument and checks all tokens in that list. It starts by first doing a safety check to see if there are any tokens actually in the list. Once that is established it moves on to a basic for-each loop, iterating through each line of tokens in the list. The bulk of the work is done in the LineSyntaxCheck function. Very little will change about the SyntaxCheck function because it is quite basic.

Of note is a new linked list function called 'IsLastItem'. It checks to see if the item in the list is the last item. The code for it is rather simple and should be placed in the Linked List.bas file. It is as follows:

+ Code Snippet

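Function IsLastItem(MyList as List ptr) as uinteger
    'An empty list has no current item, so treat it as finished
    If MyList->pCurrent = 0 then
        IsLastItem = 1
    ElseIf MyList->pCurrent->pNext = 0 then
        IsLastItem = 1
    Else
        IsLastItem = 0
    End If
End Function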

With that out of the way, we can move on to our LineSyntaxCheck function:

+ Code Snippet

SUB LineSyntaxCheck(TokenList as List ptr)
    Dim T as Token ptr
    
    If ImportDecl(TokenList) = 1 then
        'Success
    ElseIf UnaryOp(TokenList) = 1 then
        'Success
    Else
        'SYNTAX ERROR!
        T = GetItemData(TokenList)
        Print "SYNTAX ERROR! Invalid Statement. First Token: " + T->Token
        Print "Line Number: " + Str(T->LineNumber)
        sleep
        End
    End If

    'It's possible that there is a valid statement but
    'it is followed by some form of additional tokens.
    'Check to see if there is junk at the end of the line.
    NextItem(TokenList)
    T = GetItemData(TokenList)
    If T->Tag.EOL <> 1 then
        'Error garbage at the end of the line!
        Print "Syntax Error: Garbage at the end of the line.";
        Print " Line Number: " + Str(T->LineNumber)
        sleep
        end
    End If    
End SUB

The LineSyntaxCheck is structured in a very straightforward way. For each argument in our Line rule we have a function that corresponds to the rule name. Recall that our Line rule in our grammar currently looks like this:

Line -> (IMPORTDECL | UNARYOP) EOL

In English this would read something like this:

Check to see if our ImportDecl rule passes. If it fails check to see if our UnaryOp rule passes. If both fail then we have a failure of the Line rule. Otherwise check to see if our next token is an End Of Line token. If it is then our Line rule passes. This description closely matches the above code.

Before we get into the code for the ImportDecl and UnaryOp, we should first cover two basic utility functions that will be used in both rules, but aren't defined elsewhere. The first is called "IsID" and basically checks to see if a token is classified as an Identifier or not. The second utility function is called "IsNumber" and is similar to "IsID" only it checks a token to see if it is a number or not.

Utility Functions:

+ Code Snippet

'Is the text a valid identifier? I.E. does it
'start with a letter or an underscore?
Function IsID(text as zstring ptr) as uinteger
    If text[0] = 95 then '_
        IsID = 1
        exit function
    ElseIf text[0] >= 65 and text[0] <= 90 then 'A-Z
        IsID = 1
        exit function
    ElseIf text[0] >= 97 and text[0] <= 122 then 'a - z
        IsID = 1
        exit function
    Else
        IsID = 0
        exit function
    End If
End Function

Function IsNumber(text as zstring ptr) as uinteger
    If text[0] >= 48 and text[0] <= 57 then '0-9
        IsNumber = 1
        exit function
    Else
        IsNumber = 0
    End If
End Function
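
Note that both utility functions look only at the first character, so a token such as 12ab would pass IsNumber. If that ever becomes a problem, a stricter check is easy to write. The following is a sketch of one possible variant; IsAllDigits is a hypothetical name and is not used elsewhere in this tutorial.

```freebasic
'Hypothetical stricter variant of IsNumber: every character
'must be a digit, not just the first one.
Function IsAllDigits(text as zstring ptr) as uinteger
    Dim i as integer

    'An empty token is not a number
    If text[0] = 0 then
        Function = 0
        Exit Function
    End If

    'Walk the token until we reach its null terminator
    i = 0
    Do While text[i] <> 0
        If text[i] < 48 or text[i] > 57 then 'Outside 0-9
            Function = 0
            Exit Function
        End If
        i += 1
    Loop
    Function = 1
End Function

Print IsAllDigits("0")    'Accepted
Print IsAllDigits("12ab") 'Rejected, although its first character is a digit
```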

The code for the ImportDecl rule's function is quite long so we will break it down into parts. The first part is the function declaration and a variable declaration:

+ Code Snippet

Function ImportDecl(MyList as List ptr) as uinteger
    'IMPORTDECL -> 'IMPORT' STR_LIT 'AS' ID 'FROM' STR_LIT
    Dim T as Token ptr

ImportDecl takes a list header as an argument. This list header is for our token list. The function also returns an unsigned integer. This is used for signaling whether the rule succeeded or failed (True or False, 1 or 0). The T variable that is declared is a temporary variable that holds a pointer to the current token being checked.

Now there are two kinds of checks that will occur within this function: terminal and non-terminal. (The terms are used here in the sense of fatal and non-fatal; they are unrelated to the terminal and non-terminal symbols of formal grammar theory.) Non-terminal checks occur at the beginning, before any keywords are encountered. It is possible for these checks to fail but for the program to still be valid. Case in point: say we have a statement that reads PUSH 0, and we run the ImportDecl function on it. Obviously, since the statement is not an import declaration, the ImportDecl function should fail. But we want to continue on with our syntax checking, so it shouldn't halt the compiler. Thus the check can be labeled non-terminal, since it allows other syntax checking functions to look over the tokens to see if they make sense to them.

Terminal checks, however, must halt the compiler because there can be no alternative. The sequence is guaranteed to be invalid, so there is no need to run any further checks on it. A good example of this would be an import declaration that is missing an 'AS' keyword. After we encounter an 'IMPORT' keyword we can be sure that what follows must be an import declaration. So any failure to hold to the grammar should be viewed as a syntax error and not just a mismatched rule. Halting the compiler and reporting the error then becomes necessary.
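
The difference between the two kinds of checks can be shown in miniature. This is only a sketch: the three result codes and the CheckPush function are assumptions made for the illustration, so that the fatal case can be returned and printed instead of halting the program. The functions in this tutorial return 0 or 1 and stop the compiler on a terminal failure.

```freebasic
'Result codes are an assumption of this sketch.
Const RULE_NO_MATCH = 0
Const RULE_MATCH = 1
Const RULE_FATAL = 2

'PUSHOP -> 'PUSH' LIT_NUM, applied to two token strings
Function CheckPush(First as string, Second as string) as uinteger
    'Non-terminal check: no keyword has been seen yet, so a
    'mismatch just means another rule should get a turn.
    If UCase(First) <> "PUSH" then
        Function = RULE_NO_MATCH
        Exit Function
    End If

    'Terminal checks: we are committed to PUSH now, so anything
    'other than a leading digit is a genuine syntax error.
    If Len(Second) = 0 then
        Function = RULE_FATAL
        Exit Function
    End If
    If Second[0] < 48 or Second[0] > 57 then 'Outside 0-9
        Function = RULE_FATAL
        Exit Function
    End If

    Function = RULE_MATCH
End Function

Print CheckPush("CALL", "ExitProcess") 'No match: try CALLOP next
Print CheckPush("push", "0")           'Match
Print CheckPush("PUSH", "AS")          'Fatal: the compiler would halt here
```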

Our first non-terminal check looks like this:

+ Code Snippet

    'Fetch our token
    T = GetItemData(MyList)

    'Are we at the end of a line?
    'If so we should exit.
    If T->Tag.EOL = 1 Then
        ImportDecl = 0
        Exit Function
    End If

    'First keyword check.
    If UCASE(T->Token) <> "IMPORT" Then
        ImportDecl = 0
        Exit Function
    End If

The first thing our code does is fetch a pointer to the current token so we can read its contents. Then it checks the token tag to make sure that we aren't at the end of a line for some reason. After that comes our first and only non-terminal check. The first thing that should be encountered in an Import Declaration is the 'IMPORT' keyword. If we don't have that then we can't have an 'IMPORT' declaration and we should return false.

After our only non-terminal check we move on to a series of terminal checks for the rest of the statement:

+ Code Snippet

    NextItem(MyList)

    T = GetItemData(MyList)

    If T->Tag.StringLiteral = 0 then
        'Check failed.
        If T->Tag.EOL = 1 then
            Print "Syntax Error! Import Declaration abruptly ended."
        Else
            Print "Syntax Error! Expecting a string literal."
            Print "Found this instead: " + T->Token
        End If
        Print "Line Number: " + Str(T->LineNumber)
        sleep 'Wait for user input
        End
    End If

The above piece of code follows a pattern that will soon become familiar. First it advances to the next token. Each terminal check is responsible for placing itself on the correct token from where the preceding check left off. Next it will of course fetch a pointer to the data for the token. Afterwards we arrive at the terminal check. For the above check we see if we have a string literal or not. Recall that after our Import keyword there is a string literal that represents the symbol name of the function we want to import.

The question of what to do when an error is encountered is a sticky one. Advanced compilers have the capacity to report the encountered error and all other errors in the program. Accomplishing this is quite complicated and beyond the scope of this tutorial series. A more basic approach is to simply report the first error that is encountered and exit the compiler. We will be taking the simpler approach for this tutorial. It is left up to the reader as an exercise to implement a more advanced approach if they think it is necessary.

The next check is for the keyword 'AS'. It follows the pattern as well.

+ Code Snippet

    NextItem(MyList)

    T = GetItemData(MyList)

    If UCASE(T->Token) <> "AS" Then
        'Check failed.
        If T->Tag.EOL = 1 then
            Print "Syntax Error! Import Declaration abruptly ended."
        Else
            Print "Syntax Error! Expecting the keyword 'AS'."
            Print "Found this instead: " + T->Token
        End If
        Print "Line Number: " + Str(T->LineNumber)
        sleep 'Wait for user input
        End
    End If

The following check is for an identifier. This identifier will be the ID of the symbol we will use to call the imported function.

+ Code Snippet

    NextItem(MyList)

    T = GetItemData(MyList)

    If IsID(T->Token) = 0 then
        'Check failed.
        If T->Tag.EOL = 1 then
            Print "Syntax Error! Import Declaration abruptly ended."
        Else
            Print "Syntax Error! Expecting a name for the imported function."
            Print "Found this instead: " + T->Token
        End If
        Print "Line Number: " + Str(T->LineNumber)
        sleep 'Wait for user input
        End
    End If

After our identifier we require the keyword 'FROM'.

+ Code Snippet

    NextItem(MyList)
    
    T = GetItemData(MyList)
    
    If UCASE(T->Token) <> "FROM" Then
        'Check failed.
        If T->Tag.EOL = 1 then
            Print "Syntax Error! Import Declaration abruptly ended."
        Else
            Print "Syntax Error! Expecting the keyword 'FROM'."
            Print "Found this instead: " + T->Token
        End If
        Print "Line Number: " + Str(T->LineNumber)
        sleep 'Wait for user input
        End
    End If

And finally, we require a string literal that will identify the library that the import declaration is importing from. If this check succeeds then we can exit the import declaration function with success.

+ Code Snippet

    
    NextItem(MyList)
    
    T = GetItemData(MyList)
    
    If T->Tag.StringLiteral = 0 Then
        'Check failed.
        If T->Tag.EOL = 1 then
            Print "Syntax Error! Import Declaration abruptly ended."
        Else
            Print "Syntax Error! Expecting a string literal."
            Print "Found this instead: " + T->Token
        End If
        Print "Line Number: " + Str(T->LineNumber)
        sleep 'Wait for user input
        End
    End If

    'Success!
    ImportDecl = 1
    
End Function
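
The terminal checks in ImportDecl all print nearly the same messages. One way to trim that repetition, offered here only as a sketch, is to build the message in a single place. SyntaxErrorText is a hypothetical helper and its parameters are assumptions; the caller would still Print the result, wait for a key, and End as before.

```freebasic
'Hypothetical helper: builds the message that each terminal
'check in ImportDecl currently prints by hand.
Function SyntaxErrorText(Expected as string, Found as string, AtEOL as uinteger, LineNumber as uinteger) as string
    Dim Msg as string

    If AtEOL = 1 then
        'The line ended before the declaration was complete
        Msg = "Syntax Error! Import Declaration abruptly ended."
    Else
        Msg = "Syntax Error! Expecting " + Expected + "." + Chr(10)
        Msg += "Found this instead: " + Found
    End If
    Msg += Chr(10) + "Line Number: " + Str(LineNumber)

    Function = Msg
End Function

'Example: the 'AS' check found the token FROM on line 1
Print SyntaxErrorText("the keyword 'AS'", "FROM", 0, 1)
```

With a helper like this, each terminal check shrinks to the test itself plus one call that supplies the expected item.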

With our import declaration code out of the way we can move on to the last syntax rule for this tutorial: the UnaryOp rule. Recall that our grammar for the UnaryOp rule looked like this:
UNARYOP -> PUSHOP | CALLOP

The UnaryOp rule consists of two rules lower down the hierarchy: PushOp and CallOp. So the rule should look like the following in code:

+ Code Snippet

Function UnaryOp(MyList as List ptr) as uinteger
    'UNARYOP -> PUSHOP | CALLOP
    If PushOP(MyList) = 1 then
        UnaryOp = 1
    ElseIf Callop(MyList) = 1 then
        UnaryOp = 1
    Else
        UnaryOp = 0
    End If
End Function

The grammar of the PushOp and CallOp rules consists of a keyword then an argument. For the PUSHOP rule the code will look like the following:

+ Code Snippet

Function PushOp(MyList as List ptr) as uinteger
    'PUSHOP -> 'PUSH' LIT_NUM
    Dim T as Token ptr

    'Fetch pointer to token
    T = GetItemData(MyList)
    
    'Are we at the end of the line?
    'If so we should exit.
    If T->Tag.EOL = 1 then
        PushOp = 0
        Exit Function
    End if

    'Non-terminal check. If token doesn't match exit to move on to different rule.
    T = GetItemData(MyList)
    
    If UCASE(T->Token) <> "PUSH" then
        PushOp = 0
        exit function
    End If

    'Terminal Check. Next token must match. Found PUSH keyword.
    NextItem(MyList)

    T = GetItemData(MyList)
    
    If IsNumber(T->Token) <> 1 then
        'Check failed.
        Print "Syntax Error! Was expecting a Literal Number after PUSH"
        Print "Found this instead: " + T->Token
        Print "Line Number: " + Str(T->LineNumber)
        sleep 'Wait for input
        end
    End If
    
    'Success
    PushOp = 1
End Function

For the CallOp rule, the code will look like this:

+ Code Snippet

Function CallOP(MyList as List ptr) as uinteger
    'CALLOP -> 'CALL' ID
    Dim T as Token ptr

    'Fetch pointer to token
    T = GetItemData(MyList)
    
    'Are we at the end of the line?
    'If so we should exit.
    If T->Tag.EOL = 1 then
        CallOp = 0
        Exit Function
    End if
    
    'Non-terminal check. If token doesn't match exit to move on to different rule.
    T = GetItemData(MyList)

    If UCASE(T->Token) <> "CALL" then
        Callop = 0
        exit function
    End If
    
    'Terminal Check. Next token must match. Found CALL keyword.    
    NextItem(MyList)

    T = GetItemData(MyList)
    
    If IsID(T->Token) <> 1 then
        'Check failed
        Print "Syntax Error! Was expecting an identifier after CALL"
        Print "Found this instead: " + T->Token
        Print "Line Number: " + Str(T->LineNumber)
        sleep 'Wait for input
        end
    End If
    
    'Success
    Callop = 1
End Function

This concludes Tutorial 5 of our Creating a Compiler series. As always, please feel free to post about any non-working code, or helpful tips for improving the series. Next up, the Parsing Semantics phase and the data structures that support it.

Neophyte

Posted: 22nd Jan 2007 05:40

Sorry for the delay between posts in the series. I've been busy with work and this tutorial was proving harder to organize and write than I anticipated. On the plus side, I've reevaluated some of the early code that I wrote and improved it for this tutorial. It will be making its way into my compiler eventually.

The next few tutorials should be much quicker to write up until I get to the part about explaining how to write a COFF object file. I'm dreading having to explain that part of my code.

Also, let me know if some of this code isn't compiling. It is somewhat different than what I have in my current working version. I've made quite a few minor changes and I'm not 100% sure that it is in sync with what I have working on file.

TKF15H

Posted: 22nd Jan 2007 15:05

Really good work!
I'm waiting for what comes next, my compiler already gets code and turns it into an ARM program, I'd like to see the x86 equivalent. ^_^
Keep it up!

PowerSoft

Posted: 25th Mar 2007 19:21 Edited at: 25th Mar 2007 19:22

So, a few years on... how are everyone's efforts going? I've started to get some design docs written.

The Innuendo's, 4 Piece Indie Rock Band
http://theinnuendos.tk:::http://myspace.com/theinnuendosrock

Kevin Picone

Posted: 25th Mar 2007 20:45 Edited at: 11th Aug 2010 22:13

Pretty good...

New PlayBasic Learning Edition Released 24th April 2010

PowerSoft

Posted: 25th Mar 2007 21:12

Ah yes, PlayBASIC...

MikeS

Posted: 25th Mar 2007 22:25

I've been extremely busy with school, so progress is pretty much at a halt. I'll probably have more time during spring break as it approaches. No progress has been lost at least. I have all the resources I need; it's really just a matter of building it now.

Kentaree

Posted: 26th Mar 2007 13:42

I wrote a simple compiler that integrates with GCC for my college project about a year and a half ago, and I'm designing another language at the moment which will hopefully be independent.

Nephin Games
Devhat Developer Network and IRC Channel

MikeS

Posted: 12th Jun 2007 06:49

Still adding some links to the beast. Finishing this compiler and doing some other 3D work are my main summer projects as far as coding goes. Sooner or later, it will be finished.

http://www.avhohlov.narod.ru/p9800en.htm
Pretty good site with lots of compilers/interpreters with source. A few of them are under 1000 lines as well, so a good starting point for beginners.

http://www.freetechbooks.com/forum-14.html
Many many books on compiler construction

Three Score

Posted: 13th Jun 2007 08:44

wow...this thread is still alive!? lol

I've wanted to build a compiler quite a few times..though I just don't see the point when I know C and C++..

Open86 --My Emulator (now with it's first super alpha release
I'm addicted to placebo's...I would quit but it wouldn't mean anything! lol

MikeS

Posted: 13th Jun 2007 19:50

Yup, this thread will remain alive as long as I'm an active forum user(which has been, roughly 5 years, so it's not going to die soon). I'm convinced that this thread alone is one of the best modern resources on the internet for compiler building.

Quote: "though I just don't see the point when I know C and C++.."

The chances of us making a compiler more optimized than either the C or C++ compilers on the market are slim. That's not necessarily the point though. Think of DBP. Its compiler has been built specifically for game creation, making game creation a very simple process for users like us.

The most important part of compiler design, however, would have to be parsing. It is a skill that applies to many, many things other than compilers.
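As a small illustration of parsing outside of a compiler, here is a minimal sketch in C that parses a "key = value" settings line. The names parse_setting and trim are made up for this example and do not come from anyone's project in this thread; a real settings parser would also need to handle comments, quoting, and so on.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: strips leading and trailing whitespace in place
   and returns a pointer to the first non-space character. */
static char *trim(char *s) {
    while (isspace((unsigned char)*s)) s++;
    char *end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1])) end--;
    *end = '\0';
    return s;
}

/* Hypothetical parser for one "key = value" line.
   The line is modified in place: the '=' is overwritten with a
   terminator so that key and value can point into the same buffer.
   Returns 1 on success, 0 if there is no '=' or the key is empty. */
int parse_setting(char *line, char **key, char **value) {
    assert(line != NULL && key != NULL && value != NULL);
    char *eq = strchr(line, '=');
    if (eq == NULL) return 0;   /* not a setting line */
    *eq = '\0';                 /* split the buffer at the '=' */
    *key = trim(line);
    *value = trim(eq + 1);
    return (*key)[0] != '\0';
}
```

The same idea, splitting input into pieces and checking that they arrive in the expected order, is what a compiler's parser does on a larger scale.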


PowerSoft

20

Years of Service

User Offline

Joined: 10th Oct 2004

Location: United Kingdom

Posted: 13th Jun 2007 21:18

Link

Also it's good to know what makes things tick to enable you to use them in an even better way.

The Innuendo's, 4 Piece Indie Rock Band
http://theinnuendos.tk:::http://myspace.com/theinnuendosrock


PowerSoft

20

Years of Service

User Offline

Joined: 10th Oct 2004

Location: United Kingdom

Posted: 27th Aug 2007 12:04

Link

Sorry, finger slipped.


MikeS

Retired Moderator

22

Years of Service

User Offline

Joined: 2nd Dec 2002

Location: United States

Posted: 1st Apr 2009 06:13

Link

Just a small update. I haven't given up on this very, very long-term goal. I am taking a compiler-building class at university, and will post what I've learned, and maybe a finished product, after I've completed the course. Keep on working on all of your compilers, and don't give up!


Kevin Picone

22

Years of Service

User Offline

Joined: 27th Aug 2002

Location: Australia

Posted: 1st Apr 2009 06:45 Edited at: 11th Aug 2010 22:13

Link

Wow, talk about a blast from the past..

New PlayBasic Learning Edition Released 24th April 2010


MikeS

Retired Moderator

22

Years of Service

User Offline

Joined: 2nd Dec 2002

Location: United States

Posted: 1st Apr 2009 07:14

Link

Heh, yup, and we're all still here. I'm convinced that this thread is one of the most valuable resources on the internet for compiler design. I hope everyone can post updates on their progress if they're still lurking around.


MIDN90

16

Years of Service

User Offline

Joined: 11th Mar 2009

Location: Colville, Washington

Posted: 2nd Apr 2009 05:42

Link

College?


JoelJ

21

Years of Service

User Offline

Joined: 8th Sep 2003

Location: UTAH

Posted: 2nd Apr 2009 06:06

Link

Quote: "College? "

What?

Your mother has been erased by a mod because it's larger than 600x120


MIDN90

16

Years of Service

User Offline

Joined: 11th Mar 2009

Location: Colville, Washington

Posted: 2nd Apr 2009 06:07

Link

Is MikeS in college?


MikeS

Retired Moderator

22

Years of Service

User Offline

Joined: 2nd Dec 2002

Location: United States

Posted: 4th Jun 2009 05:52

Link

Haha, yes, I am indeed in college. As of today I have completed my compiler for an assembly-based language. It's been quite the journey, and getting started early (4ish years ago) helped me really cruise through this course.

I am going to continue to pursue some compiler design classes in college and develop some more advanced languages.

After I finish this term, I may write a brief tutorial on a simple compiler or interpreter now that I understand all of the pieces.

I haven't any designs on my own compiler, but I may toy around with some ideas, or at least build some tools to assist myself and others in writing their own compilers. I hope everyone else's projects are still coming along. It took me longer than expected, but I finally finished!

I would recommend that others looking to get started in compiler design review the materials from any university's compiler-building class, especially if they can find PowerPoint slides or notes online. This is a nice way to guide yourself through building a compiler. For those who want some keywords to Google, search for them in this order: tokenizing, parsing, object file, linker/loader, and executable format. I've listed them in roughly the order you'll need to learn them to end up with a finished product.
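To give a feel for the first two keywords, here is a minimal sketch in C of a tokenizer feeding a recursive-descent parser for integer arithmetic. It is only an illustration under my own assumptions: the tiny grammar and the names (Parser, next_token, evaluate, and so on) are made up for this example and are not the design of the compiler described above. It evaluates the expression directly instead of emitting an object file, so the later keywords are not covered.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Token kinds for a tiny arithmetic language (hypothetical). */
typedef enum { TOK_NUM, TOK_PLUS, TOK_MINUS, TOK_STAR, TOK_SLASH,
               TOK_LPAREN, TOK_RPAREN, TOK_END, TOK_ERROR } TokKind;

typedef struct { TokKind kind; long value; } Token; /* value: TOK_NUM only */

typedef struct {
    const char *src;  /* remaining input text */
    Token cur;        /* current lookahead token */
    int failed;       /* set to 1 on any error */
} Parser;

/* Tokenizing: skip whitespace, then read one token into p->cur. */
static void next_token(Parser *p) {
    while (isspace((unsigned char)*p->src)) p->src++;
    char c = *p->src;
    if (c == '\0') { p->cur.kind = TOK_END; return; }
    if (isdigit((unsigned char)c)) {
        char *end;
        p->cur.kind = TOK_NUM;
        p->cur.value = strtol(p->src, &end, 10);
        p->src = end;
        return;
    }
    p->src++;
    switch (c) {
        case '+': p->cur.kind = TOK_PLUS;   break;
        case '-': p->cur.kind = TOK_MINUS;  break;
        case '*': p->cur.kind = TOK_STAR;   break;
        case '/': p->cur.kind = TOK_SLASH;  break;
        case '(': p->cur.kind = TOK_LPAREN; break;
        case ')': p->cur.kind = TOK_RPAREN; break;
        default:  p->cur.kind = TOK_ERROR;  break;
    }
}

static long parse_expr(Parser *p);

/* factor := NUMBER | '-' factor | '(' expr ')' */
static long parse_factor(Parser *p) {
    if (p->cur.kind == TOK_NUM) {
        long v = p->cur.value;
        next_token(p);
        return v;
    }
    if (p->cur.kind == TOK_MINUS) {
        next_token(p);
        return -parse_factor(p);
    }
    if (p->cur.kind == TOK_LPAREN) {
        next_token(p);
        long v = parse_expr(p);
        if (p->cur.kind != TOK_RPAREN) { p->failed = 1; return 0; }
        next_token(p);
        return v;
    }
    p->failed = 1;  /* unexpected token */
    return 0;
}

/* term := factor (('*' | '/') factor)* */
static long parse_term(Parser *p) {
    long v = parse_factor(p);
    while (p->cur.kind == TOK_STAR || p->cur.kind == TOK_SLASH) {
        TokKind op = p->cur.kind;
        next_token(p);
        long rhs = parse_factor(p);
        if (op == TOK_STAR) v *= rhs;
        else if (rhs != 0) v /= rhs;
        else p->failed = 1;  /* division by zero */
    }
    return v;
}

/* expr := term (('+' | '-') term)* */
static long parse_expr(Parser *p) {
    long v = parse_term(p);
    while (p->cur.kind == TOK_PLUS || p->cur.kind == TOK_MINUS) {
        TokKind op = p->cur.kind;
        next_token(p);
        long rhs = parse_term(p);
        v = (op == TOK_PLUS) ? v + rhs : v - rhs;
    }
    return v;
}

/* Returns 1 and stores the value in *result on success, 0 on error. */
int evaluate(const char *text, long *result) {
    assert(text != NULL && result != NULL);
    Parser p = { text, { TOK_END, 0 }, 0 };
    next_token(&p);
    long v = parse_expr(&p);
    if (p.failed || p.cur.kind != TOK_END) return 0;
    *result = v;
    return 1;
}
```

Each grammar rule becomes one function, and operator precedence falls out of which function calls which: parse_expr handles + and -, and it calls parse_term, which handles * and /. A real compiler would build a tree or emit code at the points where this sketch does the arithmetic.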

More to come soon!
