ThinkGeek - Cool Stuff for Geeks and Technophiles

Wednesday, March 18, 2009

SPTL grammar: statements

Now that we've looked at FIRST and FOLLOW sets, and the grammar restrictions imposed by an LL(1) parser, we're ready to look at SPTL statement syntax. My first draft of the grammar featured ambiguities that a recursive descent parser simply cannot handle.

For example, consider these definitions:


statement = addstatement | assignstatement | breakstatement | callstatement
divstatement | foreachstatement | forstatement | ifstatement
multstatement | popstatement | postdecstatement | postincstatement
predecstatement | preincstatement | printstatement |
pushstatement | readstatement | repeatestatement |
shiftstatement | substatement | whilestatement

addstatement = varname '+=' expr ';'

substatement = varname '-=' expr ';'

multstatement = varname '*=' expr ';'

divstatement = varname '/=' expr ';'


An LL(1) parser should be able to determine, as soon as it sees a token, what kind of element it is parsing. Now suppose the parser determines that this element is a statement beginning with a varname. What kind of statement is it? There are four possibilities, and that's three more than are allowed.

But all is not lost. A BNF (or BNF-like) grammar is merely a possible description of a language. Most languages can be described using a different BNF grammar specification, without changing the structure of the language.

Recall the grammar restrictions from my previous post:


A grammar G is LL(1) if and only if:


  • ∀ A, A ≠ ∈,
    ∀ pairs A → u and A → v,
    FIRST(u) ∩ FIRST(v) = ∅

  • If ∃ A, A → ∈,
    FIRST(A) ∩ FOLLOW(A) = ∅




The FIRST and FOLLOW rules will help us determine where the grammar needs to be rewritten for an LL(1) parser.

In the example above, we have:

FIRST(addstatement) = varname
FIRST(substatement) = varname
FIRST(multstatement) = varname
FIRST(divstatement) = varname


Thus:

FIRST(addstatement) ∩ FIRST(substatement) = varname


The same is true for any pair taken from these four kinds of statement.

The solution is to create a special kind of statement:


assignstatement = varname ( addassign | divassign | multassign | subassign ) ';'


Now the parser sees only one kind of statement that begins with a varname: an assignstatement. When the parser finds a varname at the beginning of a statement, it recognizes it as an assignstatement and looks at the next token to determine what kind of assignstatement it is. Now we have a grammar the LL(1) parser can handle.

I've put both the full SPTL grammar and the first and follow rules (.csv file) online.

Labels: , ,

Thursday, January 8, 2009

SPTL grammar: expressions

Now that we've seen how names work in SPTL, we'll take a look at how to build expressions.

We start with a simple declaration: an expr consists of a boolexpr, followed by zero or more sets consisting of a boolop and another boolexpr. But what's a boolexpr? Simple, it's a relexpr followed optionally by a relop and another relexpr. And so on.

Here's the BNF, building all the way to factor, which has several options including (expr), which takes us back to the beginning.

The purpose of this hierarchy is to set precedence rules for the order of evaluation. Due to the nature of a recursive descent parser, items that appear later in this list will be evaluated first.


expr = boolexpr { boolop boolexpr }

boolexpr = relexpr [ relop relexpr ]

boolop = '&&' | '||'

relexpr = [ '-' ] term { addop term }

relop = '<' | '>' | '==' | '<=' | '>=' | '!='

term = factor { multop factor }

addop = '+' | '-'

factor = constant | varname | funcname | '(' expr ')' | negate

multop = '*' | '/' | '%'

negate = '!' factor

constant = boolconstant | numconstant | stringconstant


For example, the expression x + 3 < y * 2 can be parsed like this:

expr
-> boolexpr
-> relexpr relop relexpr
-> term addop term "<" term
-> factor "+" factor "<" factor mulop factor
-> varname "+" constant "<" varname "*" constant
-> "x" "+" 3 "<" "y" "*" "2"

But to really see how this works, we'll need to simulate a tree structure:


<
-------
/ \
+ \
--- *
/ \ ---
x 3 / \
y 2


Atrocious ascii drawing, I know. I need to get a graphical graphing tool. If you have any suggestions, let me know in the comments.

Anyway, the y * 2 will be evaluated first, since it is at the lowest level of the tree. Next to be evaluated will be x + 3. In this example, since these are on opposite sides of the < sign, the relative order between them does not matter. In other expressions, it will. Once the values of x + 3 and y * 2 are known, they can be compared with each other. If x + 3 is less than y * 2 the value of the entire expression is true, otherwise the value is false.

What I don't know at this point is how it will work to have boolean, string, and numeric constants in the same expression hierarchy. Somehow, the parser will have to be able to evaluate, for example, true + 3 >= 7, or 10 + "this", and return a sensible answer.

As I build the compiler I'll have to figure out how that will work. Probably the solution is to assign all string and numeric constants a boolean value for boolean expressions, and all string and boolean constants a numeric value for arithmetic expressions. Any variables, since SPTL is a dynamically typed language, must be able to be evaluated as a boolean, number, or string, depending on the context.

We're getting closer to having a working language.

Labels: ,

Saturday, December 13, 2008

SPTL grammar: names

A Simple Parrot Test Language (SPTL) program consists of an elementblock, which in turn consists of one or more simpleelements. This seeming redundancy results from having elementblocks as elements of some definitions. I don't want to say, for example, that every for, if, or repeat statement contains a program.

A simpleelement is a statement, a vardec, or a functiondef. Today we're going to look at vardecs.


vardec = 'new' ( scalardec | arraydec ) ';'

varname = scalarname | arrayname

scalardec = scalarname [ '=' expr ]

scalarname = '$' { '_' | letter | digit }+ { '[' expr ']' }

arraydec = arrayname [ '=' '(' exprarray ')' ]

arrayname = '@' { '_' | letter | digit }+ { '[' expr ']' }


SPTL, being a simple language, has only two kinds of variables: scalars and arrays. These use the same sigils as in Perl, $ and @ respectively.

A variable name is any combination of letters, digits, and underscores. Because all variable names are prepended with a sigil, there is no need to forbid names from beginning with digits, or indeed being entirely digits. The scanner can immediately see that $123 is a variable and not a number.

A variable declaration begins with the keyword new. A scalar or array can be declared with a null value by simply stating the name, or can be assigned a value (or multiple values if it's an array) by using the equals sign.

SPTL actually has other types of names. The next one we're going to look at is the filehandle:


filehandle = '^' { '_' | letter | digit }+


Filehandles are prepended with the sigil ^. This differs from Perl, where a filehandle can be either a bareword or a normal scalar. I've chosen to use a unique sigil for filehandles to simplify the work of the scanner. By requiring unique sigils for all types of names, the scanner knows from the beginning not only that this is a name, but just what sort of name it is.

The tradeoff is that this makes more work for the programmer. The ^ sigil is an extra thing to remember, and it's non-intuitive for a programmer coming from any other language. For a personal project like SPTL usability is beside the point; if this were a language that people were actually going to use, I'd do some things differently.

Finally we have functions:


functiondef = 'function' funcname '{' elementblock 'return' expr '}'

funcname = '&' { '_' | letter | digit }+


A function name is prepended with a & sigil. Superficially this resembles a Perl function call, which begins with an optional &. The difference can be seen, though, in the function definition. SPTL requires the & sigil here too. In Perl, because the & is not a sigil, the function definition has no & in the function name. In a milder but more noticeable departure from Perl syntax, function definitions use the keyword function rather than sub.

Inside the functiondef you can see the elementblock that I mentioned at the beginning of this post. Again, the definition of elementblock is identical to that of program, but the English meaning of the word program makes it a poor choice to describe an element of a program.

Finally, SPTL requires all functions to return a value, whether or not it will be used.

(Incidentally, I initially had defined program thus:


program = '{' elementblock '}'


but found it too laborious to wrap every test program in an extra set of curly braces for no other benefit than to give program and elementblock different definitions.)

So, by reviewing the types of names, we've covered two of the three types of simpleelements. We're almost done with the grammar. Ha!

Labels: , ,

Wednesday, December 3, 2008

introducing SPTL

There seems to be a long period of initial obscurity for any new language. Then after that comes a long period of semi-obscurity, followed by total obscurity.

Paul Bissex, quoted by Steve Yegge in The Next Big Language



So I've been working on the grammar for my language to target the Parrot Virtual Machine. As it turns out, creating a consistent context-free grammar is a lot harder than it may seem. Or maybe I've got a warped sense of what seems easy and what seems hard.

At any rate, the Simple Parrot Test Language (SPTL) is a small Perl-like language. In fact, at a glance it might appear to be a subset of Perl. However, some elements have different meanings in SPTL than in Perl. I'm going to spend the next few posts going over the details and explaining my design decisions. For now, the full BNF grammar can be found here (text file).

Labels: ,