ThinkGeek - Cool Stuff for Geeks and Technophiles

Saturday, May 16, 2009

SPTL: tokens and the symbol table

In the early stages of Jack Crenshaw's excellent Let's Build a Compiler series, his compiler can only handle single-character keywords. It would seem that this is a limitation of all LL(1) parsers, because the parser is only looking at one source character at a time.

There is a way around this limitation. We can preprocess the source code, replacing each element (no matter how long) with a single character. These characters are known as tokens, and the process is known as lexical analysis. We'll call the tool that handles lexical analysis the lexer. The lexer will scan the source code to determine how many characters make up the next element, then it will look up the element in the symbol table to determine the number to return. I'll discuss the SPTL lexer in more detail in a later post.

If we look through SPTL's grammar, we will find these keywords:


break
false
for
foreach
function
if
new
newline
number
pop
print
push
read
repeat
return
shift
true
until
while


and these operators:

+=
&&
@
=
,
.
--
/
/=
==
"
^
&
>=
>
++
{
[
(
<=
<
-
%
*
*=
!
!=
||
\
+
}
]
)
$
;
-=


Before the compiler even looks at the source code, the keywords and symbols will all be pre-loaded into a symbol table, and assigned unique numeric values. The numeric values are the tokens that will be passed to the parser.

But most programs consist of more than keywords and operators. The code will likely contain user-defined names. These names will need to be loaded into the symbol table, too, but we can't load them until we encounter them. Thus our lexer will need to be able to interact with the symbol table, not only to retrieve tokens but to add to the table.

Our symbol table will look something like this:



break 0
false 1
for 2
foreach 3
function 4
if 5
[et cetera]



I'll use Perl's associative array (hash) data structure to build the table. This way, the lexer can simply look up a symbol directly:


if (defined({'lexer'}->{'symbol'}->{'table'}{$token}))
{
# retrieve the value
}


rather than iterating through the values sequentially:


my $i = 0;
while($i < {'lexer'}->{'symbol'}->{'count'} && !$flag) {
# search sequentially until we find the right
# value; if there are 3000 keywords/operators/
# names, it will takean average of 1500
# iterations to find the right one
if({'lexer'}->{'symbol'}->{'table'}{$i} eq $token)
{
# finally!
$flag = 1;
}
$i++;
}


One more thing: You may have noticed that the symbol table includes a 'newline' keyword, even though it's not part of the language. Its purpose is to let the lexer preserve newlines when passing tokens to the parser. This way, the parser can report the line number if there is an error, and confirm the total number of lines parsed.

You can see the Perl module that builds the symbol table here.

Labels: , , ,