Creating a Compiler - Tutorial 5 - Syntax Checking
Syntax is the form or structure of a language. It is a series of rules that determine what is and is not a valid sequence of tokens. Semantics is the meaning of a language. It is a series of rules that determine what exactly the meaning of the token sequence is and can be. Together they form what is called the grammar of the computer language.
In this compiler tutorial we will be looking at the former.
Before we begin writing our syntax checking functions it will help to first review the parsing expression grammar we will be using. Our parsing expression grammar is essentially a heirarchy of rules. Each rule has a one to one matching with a syntax checking function. For example, consider the following rule: PUSHOP -> 'PUSH' LIT_NUM. On the left side of the arrow is our function and rule name, PUSHOP. To the right of the arrow is the series of rules that our contained with in our PUSHOP rule. The first rule encountered is the word "PUSH" enclosed in apostrophes. Whenever you see a word in apostrophes in our grammar that means that the token our function is examining must match that word exactly. Afterwards, there is LIT_NUM. Since this isn't surrounded by apostrophes it is the name of another rule in our PUSHOP rule. In this case, it refers to the literal number function which checks to see if the token is a literal number or not. Putting it all together, our token sequence will conform to the PUSHOP rule if it starts with a token containing 'PUSH' followed by a literal number.
Now let's try another. Our next rule is called CALLOP. It consists of the following: CALLOP -> 'CALL' ID. It is quite similar to PUSHOP except for one crucial difference. Instead of LIT_NUM we have ID. ID is a function like LIT_NUM except it checks to see if the token is a valid identifier. You will recall from our lexing tutorial that an identifier is a token that begins with either a letter of any case or the symbol "_".
Observant readers will note that by now we have just defined the syntax for the two assembly instructions in our source program that we have been using for our tutorials so far. However, we haven't really demonstrated how they fit into the heirarchy of the grammar. So without further ado I present to you a more complete grammar:
UNARYOP -> PUSHOP | CALLOP
PUSHOP -> 'PUSH' LIT_NUM
CALLOP -> 'CALL' ID
The first rule is called the UNARYOP rule. Both arguments of the UNARYOP rule happen to be other syntax rules. When a new syntax rule is encountered it is usually immediately defined below where it was encountered. The lower the rule in a heirarchy, the less it relies on other syntax rules.
Of paticular interest is the occurance of the '|' operator. The '|' operator means that either the argument on the left of '|' or the argument on the right of '|' can be true in order for the rule to hold. So in order for a sequence of tokens to count as a unaryop it must be either a callop or a pushop.
There are many different kinds of operators that can occur in our grammar. They typically occur directly after an argument or in between two arguments. Here is a brief list of some of the most common operators:
Operator: ?
Occurence: Right after an argument.
Meaning: The preceding argument can occur in a rule or it can be omitted. It is optional.
Operator: *
Occurence: Right after an argument.
Meaning: There can be 0 or more occurences of the rule.
Operator: +
Occurence: Right after an argument.
Meaning: There can be 1 or more occurences of the rule.
The last operator in the above list is the only other operator we will be using in our grammar for now. The complete grammar for our assembly program is as follows:
Program -> Line+ EOF
Line -> IMPORTDECL | UNARYOP EOL
IMPORTDECL -> 'IMPORT' STR_LIT 'AS' ID 'FROM' STR_LIT
UNARYOP -> PUSHOP | CALLOP
PUSHOP -> 'PUSH' LIT_NUM
CALLOP -> 'CALL' ID
The Line and IMPORTDECL rules are fairly self explainatory, so let us focus on the first rule, the Program rule. PROGRAM consists of a Line rule and an EOF rule. EOF is rather simple. It stands for end of file. Line is defined below, but it is followed by something we haven't encountered before. The + operator. As explained previous, the + operator means 1 or more instances of the preceding argument can occur. In our case, this means that 1 or more lines can occur in a program. Simple, no?
Computer language design can get quite complicated so having a parsing expression grammar around can help to organize your thoughts and prevent you from making language design errors.
Now that the theory is out of the way we can get into the practical side of our tutorial: the coding.
There are two primary tasks that our syntax checking code must accomplish. The first is to gather any type or declaration information for the semantic checking code which will be explained in a future tutorial. The second is to check the list of tokens to make sure that they are syntacially correct according to our parsing expression grammar. In this tutorial we will deal only with the syntax checking task. The information gathering task will be explained in the next tutorial dealing with semantic checking.
Let's start with our main syntax checking function appropriatly titled SyntaxCheck:
SUB SyntaxCheck(TokenList as List ptr)
Dim N
Dim T as Token ptr
'If we don't have any tokens in our token list
'we don't have anything to syntax check. So exit.
N = TokenList->Number
If N = 0 then Exit SUB
'Start at the first token in our list.
FirstItem(TokenList)
'FOR EACH LOOP
do
'Syntax check the list of tokens for our current line
LineSyntaxCheck(TokenList)
NextItem(TokenList)
loop until IsLastItem(TokenList) = 1
End SUB
SyntaxCheck takes a list header as an argument and checks all tokens in that list. It starts by first doing a safety check to see if there are any tokens actually in the list. Once that is established it moves on to a basic for-each loop, itinerating through each line of tokens in the list. The bulk of the work is done in the LineSyntaxCheck function. Very little will change about the SyntaxCheck function because it is quite basic.
Of note is a new linked list function called 'IsLastItem'. It checks to see if the item in the list is the last item. The code for it is rather simple and should be placed in the Linked List.bas file. It is as follows:
Function IsLastItem(List as List ptr) as uinteger
If List->pCurrent->pNext = 0 then
IsLastItem = 1
Else
IsLastItem = 0
End If
End Function
With that out of the way, we can move on to our LineSyntaxCheck function:
SUB LineSyntaxCheck(TokenList as List ptr)
Dim T as Token ptr
If ImportDecl(TokenList) = 1 then
'Success
ElseIf UnaryOp(TokenList) = 1 then
'Success
Else
'SYNTAX ERROR!
T = GetItemData(TokenList)
Print "SYNTAX ERROR! Invalid Statement. First Token: " + T->Token
Print "Line Number: " + Str(T->LineNumber)
sleep
End
End If
'It's possible that there is a vaild statement but
'it is followed by some form of additional tokens
'Check to see if there is junk at the end of the line.
NextItem(TokenList)
T = GetItemData(TokenList)
If T->Tag.EOL <> 1 then
'Error garbage at the end of the line!
Print "Syntax Error: Garbage at the end of the line.";
Print " Line Number: " + Str(T->LineNumber)
sleep
end
End If
End SUB
The LineSyntaxCheck is structured in a very straight forward way. For each argument in our Line rule we have a function that corresponds to the rule name. Recall that our Line rule in our grammar currently looks like this:
Line -> IMPORTDECL | UNARYOP EOL
In English this would read something like this:
Check to see if our ImportDecl rule passes. If it fails check to see if our UnaryOp rule passes. If both fail then we have a failure of the Line rule. Otherwise check to see if our next token is an End Of Line token. If it is then our Line rule passes. This description closely matches the above code.
Before we get into the code for the ImportDecl and UnaryOp, we should first cover two basic utlitiy functions that will be used in both rules, but aren't defined elsewhere. The first is called "IsID" and basically checks to see if a token is classified as an Identifier or not. The second utility function is called "IsNumber" and is similar to "IsID" only it checks a token to see if it is a number or not.
Utility Functions:
'Is the text a vaild identifier. I.E. does it
'start with a letter or an underscore?
Function IsID(text as zstring ptr) as uinteger
If text[0] = 95 then '_
IsID = 1
exit function
ElseIf text[0] >= 65 and text[0] <= 90 then 'A-Z
IsID = 1
exit function
ElseIf text[0] >= 97 and text[0] <= 122 then 'a - z
IsID = 1
exit function
Else
IsID = 0
exit function
End If
End Function
Function IsNumber(text as zstring ptr) as uinteger
If text[0] >= 48 and text[0] <= 57 then '0-9
IsNumber = 1
exit function
Else
IsNumber = 0
End If
End Function
The code for the ImportDecl rule's function is quite long so we will break it down into parts. The first part is the function declaration and a variable declaration:
Function ImportDecl(MyList as List ptr) as uinteger
'IMPORTDECL -> 'IMPORT' STR_LIT 'AS' ID 'FROM' STR_LIT
Dim T as Token ptr
ImportDecl takes a list header as an argument. This list header is for our token list. The function also returns an unsigned integer. This is used for signaling whether the rule succeeded or failed(True or False, 1 or 0). The S variable that is declared will be a temporary variable that holds a pointer to our current token that is being checked.
Now there are two kinds of checks that will occur with in this function: terminal and non-terminal. Non-terminal checks occur at the beginning before any keywords are encountered. It is possible for these checks to fail, but the program to still be valid. Case in point, so we have a statement that reads PUSH 0, and we run the ImportDecl function on it. Obviously, since the statement is not an import declaration the ImportDecl function should fail. But we want to continue on with our syntax checking so it shouldn't halt the compiler. Thus it can be labeled non-terminal since it allows future syntax checking functions to look over the code to see if it makes sense to them.
Terminal checks, however, must halt the compiler because there can be no alternative. The sequence is guarenteed to be false so there is no need to run any further checks on it. A good example of this would be an import declaration that was missing an 'AS' keyword. After we encounter an 'IMPORT' keyword we can be sure that what follows must be an import declaration. So any failures to hold to the grammar should be viewed as syntax errors and not just a mismatched rule. Halting the compiler and reporting the error then becomes necessary.
Our first non-terminal check looks like this:
'Fetch our token
T = GetItemData(MyList)
'Are we at the end of a line?
'If so we should exit.
If T->Tag.EOL = 1 Then
ImportDecl = 0
Exit Function
End If
'First keyword check.
If UCASE(T->Token) <> "IMPORT" Then
ImportDecl = 0
Exit Function
End If
The first thing our code does is fetch a pointer to the current token so we can read its contents. Then it checks the token tag to make sure that we aren't at the end of a line for some reason. After that comes our first and only non-terminal check. The first thing that should be encountered in a Import Declaration is the 'IMPORT' keyword. If we don't have that then we can't have an 'IMPORT' declaration and we should return false.
After our only non-terminal check we move on to a series of terminal checks for the rest of the statement:
NextItem(MyList)
T = GetItemData(MyList)
If T->Tag.StringLiteral = 0 then
'Check failed.
If T->Tag.EOL = 1 then
Print "Syntax Error! Import Declaration abruptly ended."
Else
Print "Syntax Error! Expecting a string literal."
Print "Found this instead: " + T->Token
End If
Print "Line Number: " + Str(T->LineNumber)
sleep 'Wait for user input
End
End If
The above piece of code follows a pattern that will soon become familar. First it advances to the next token. Each terminal check is responsible for placing itself on the correct token from where the preceding check left off. Next it will of course fetch a pointer to the data for the token. Afterwards we arrive at the terminal check. For the above check we see if we have a string literal or not. Recall that after our Import keyword there is a string literal that represents the symbol name of the function we want to import.
The question of what to do when an error is encountered is a sticky one. Advanced compilers have the capacity to report the encountered error and all other errors in the program. Accomplishing this is quite complicated and beyond the scope of this tutorial series. A more basic approach is to simply report the first error that is encountered and exit the compiler. We will be taking the simplier approach for this tutorial. It is left up to the reader as an exercise to implement a more advanced approach if they think it is necessary.
The next check is for the keyword 'AS'. It follows the pattern as well.
NextItem(MyList)
T = GetItemData(MyList)
If UCASE(T->Token) <> "AS" Then
'Check failed.
If T->Tag.EOL = 1 then
Print "Syntax Error! Import Declaration abruptly ended."
Else
Print "Syntax Error! Expecting the keyword 'AS'."
Print "Found this instead: " + T->Token
End If
Print "Line Number: " + Str(T->LineNumber)
sleep 'Wait for user input
End
End If
The following check is for an identifier. This identifier will be the ID of the symbol we will use to call the imported function.
NextItem(MyList)
T = GetItemData(MyList)
If IsID(S->Token) = 0 then
'Check failed.
If T->Tag.EOL = 1 then
Print "Syntax Error! Import Declaration abruptly ended."
Else
Print "Syntax Error! Expecting a name for the imported function."
Print "Found this instead: " + T->Token
End If
Print "Line Number: " + Str(T->LineNumber)
sleep 'Wait for user input
End
End If
After our identifier we require the keyword 'FROM'.
NextItem(MyList)
T = GetItemData(MyList)
If UCASE(T->Token) <> "FROM" Then
'Check failed.
If T->Tag.EOL = 1 then
Print "Syntax Error! Import Declaration abruptly ended."
Else
Print "Syntax Error! Expecting the keyword 'FROM'."
Print "Found this instead: " + T->Token
End If
Print "Line Number: " + Str(T->LineNumber)
sleep 'Wait for user input
End
End If
And Finally, we require a string literal that will identify the library that the import declaration is importing from. If this check succeeds then we can exit the import declaration function with success.
NextItem(MyList)
T = GetItemData(MyList)
If T->Tag.StringLiteral = 0 Then
'Check failed.
If T->Tag.EOL = 1 then
Print "Syntax Error! Import Declaration abruptly ended."
Else
Print "Syntax Error! Expecting a string literal."
Print "Found this instead: " + T->Token
End If
Print "Line Number: " + Str(T->LineNumber)
sleep 'Wait for user input
End
End If
'Success!
ImportDecl = 1
End Function
With our import declaration code out of the way we can move on to the last syntax rule for this tutorial: the UnaryOp rule. Recall that our grammar for the UnaryOp rule looked like this:
UNARYOP -> PUSHOP | CALLOP
The UnaryOp rule consists of two rules lower down the heirarchy: PushOp and UnaryOp. So the rule should look like the following in code:
Function UnaryOp(MyList as List ptr) as uinteger
'UNARYOP -> PUSHOP | CALLOP
If PushOP(MyList) = 1 then
UnaryOp = 1
ElseIf Callop(MyList) = 1 then
UnaryOp = 1
Else
UnaryOp = 0
End If
End Function
The grammar of the PushOp and CallOp rules consists of a keyword then an argument. For the PUSHOP rule the code will look like the following:
Function PushOp(MyList as List ptr) as uinteger
'PUSHOP -> 'PUSH' LIT_NUM
Dim T as Token ptr
'Fetch pointer to token
T = GetItemData(MyList)
'Are we at the end of the line?
'If so we should exit.
If T->Tag.EOL = 1 then
PushOp = 0
Exit Function
End if
'Non-terminal check. If token doesn't match exit to move on to different rule.
T = GetItemData(MyList)
If UCASE(T->Token) <> "PUSH" then
PushOp = 0
exit function
End If
'Terminal Check. Next token must match. Found PUSH keyword.
NextItem(MyList)
T = GetItemData(MyList)
If IsNumber(T->Token) <> 1 then
'Check failed.
Print "Syntax Error! Was expecting a Literal Number after PUSH"
Print "Found this instead: " + T->Token
Print "Line Number: " + Str(T->LineNumber)
sleep 'Wait for input
end
End If
'Success
PushOp = 1
End Function
For the CallOp rule, the code will look like this:
Function CallOP(MyList as List ptr) as uinteger
'CALLOP -> 'CALL' ID
Dim T as Token ptr
'Fetch pointer to token
T = GetItemData(MyList)
'Are we at the end of the line?
'If so we should exit.
If T->Tag.EOL = 1 then
CallOp = 0
Exit Function
End if
'Non-terminal check. If token doesn't match exit to move on to different rule.
T = GetItemData(MyList)
If UCASE(T->Token) <> "CALL" then
Callop = 0
exit function
End If
'Terminal Check. Next token must match. Found CALL keyword.
NextItem(MyList)
T = GetItemData(MyList)
If IsID(T->Token) <> 1 then
'Check failed
Print "Syntax Error! Was expecting a identifier after CALL"
Print "Found this instead: " + T->Token
Print "Line Number: " + Str(T->LineNumber)
sleep 'Wait for input
end
End If
'Success
Callop = 1
End Function
This concludes Tutorial 5 of our Creating a Compiler series. As always, please feel free to post about any non-working code, or helpful tips for improving the series. Next up, the Parsing Semantics phase and the data structures that support it.