ThinkGeek - Cool Stuff for Geeks and Technophiles

Sunday, February 22, 2009

FIRST and FOLLOW sets, part 2

In my last post we looked at FIRST sets. That's the hard part. Today's topic, FOLLOW sets, are easy.

In the grammar restrictions post I defined FOLLOW as

the set of possible first tokens for the following sentence


So now that we have the FIRST set for every grammar element, we can get each element's FOLLOW set by finding the FIRST set for every element that follows it.

Let's define:

S, A, B, and C as sentences in the grammar

The FOLLOW rules are:


  1. For S → AB, S → [A]B, S → {A}B

    • If ∀ B, B ≠ ∈
      FOLLOW(A) includes FIRST(B)
    • If ∃ B, B = ∈
      FOLLOW(A) includes FOLLOW(S)


  2. For S → AB, S → A[B], S → A{B}

    • FOLLOW(B) includes FOLLOW(N)


  3. For S → AB | AC

    • FOLLOW(A) includes FIRST(B) ∪ FIRST(C)




From these three simple rules (and the FIRST sets), we can construct the FOLLOW sets for all the elements of the grammar.

The complete set of FOLLOW sets for SPTL is available in gnumeric format or csv format.

Labels: , , ,

Sunday, February 8, 2009

FIRST and FOLLOW sets, part 1

In my last post, we considered the grammar restrictions imposed by an LL(1) parser. We looked at how FIRST sets for post-increment and post-decrement statements could lead to ambiguity for the parser, and how to fix the problem.

But even a simple language like SPTL has dozens of production rules, and for many of them the FIRST set may not be obvious.

For example:


program = elementblock

elementblock = {simpleelement}+

simpleelement = statement | vardec | functiondef


We've gone through three production rules, and not found anything resembling a SPTL token. And it continues:


statement = assignstatement | breakstatement | callstatement |
closestatement | foreachstatement | forstatement |
ifstatement | openstatement | popstatement |
predecstatement | preincstatement | printstatement |
pushstatement | readstatement | repeatstatement |
shiftstatement | whilestatement


At this rate, it will take forever to get anywhere. We need a method for constructing the set of FIRST sets. Fortunately, there are several rules which will make the job a lot easier.

But first, some definitions:

Terminal: The technical term for token. These are the symbols of the language.
Nonterminal: These are the terms that appear in the BNF but not in the language. A nonterminal will always have a production rule to transform it into something else.

Now we'll define:
A and B as nonterminals
w and x as terminals

And, as in the last post, the following symbols mean:

Empty set
Empty sentence
Intersection - the elements common to both sets
Union - the elements appearing in one or both sets


Now, the FIRST rules:



  1. FIRST(∈) = ∅
  2. FIRST(x) = x
  3. FIRST(w|x) = [w, x]
  4. FIRST(∈ x) = x;
  5. FIRST(w x) = w
      but if w can be empty, FIRST(w x) = [w, x]
  6. FIRST(A|B) = FIRST(A) ∪ FIRST(B)
  7. FIRST(A B) = FIRST(A)
      but if A can be empty, FIRST(A B) = FIRST(A) ∪ FIRST(B)



So far, so good. But how do we apply them?

We can see that every FIRST rule for a nonterminal gives us another FIRST rule, while all FIRST rules for terminals gives us a terminal (even if it's an empty terminal). Eventually we need to get a FIRST set of terminals for each rule. By building from the bottom up, we can get a FIRST set for each element of the BNF grammar.

For example, let's look at a portion of the expression rules for SPTL:


expr = boolexpr { boolop boolexpr }
boolexpr = relexpr [ relop relexpr ]
boolop = '&&' | '||'
relexpr = [ '-' ] term { addop term }
relop = '<' | '>' | '==' | '<=' | '>=' | '!='
term = factor { multop factor }
addop = '+' | '-'
factor = constant | varname | funcname | '(' expr ')' | negate
multop = '*' | '/' | '%'
negate = '!' factor
constant = boolconstant | numconstant | stringconstant
boolconstant = 'true' | 'false'
numconstant = digit {digit}
stringconstant = plainstring | escapestring
plainstring = ''' text '''
escapestring = '"' text '"'


If we start with expr = boolexpr { boolop boolexpr }, we get FIRST(expr) = FIRST(boolexpr), FIRST(boolexpr) = FIRST(relexpr), and so on. By starting at the bottom, we see:


FIRST(escapestring) = "
FIRST(plainstring) = '


By starting out at the bottom, with production rules that lead immediately to terminals, we can apply the FIRST rules repeatedly for the other production rules, to get a set of terminals for every production rule. So:


FIRST(stringconstant)
= FIRST(plainstring)FIRST(escapestring)
= [", ']


I've uploaded the complete set of FIRST sets for SPTL in gnumeric format or csv format.

My next post will look at FOLLOW sets.

Labels: , , ,

Thursday, January 29, 2009

LL(1) grammar restrictions

Before I discuss the statement syntax for SPTL, I want to look at grammar restrictions. As I mentioned in a previous post, I'm planning to build a recursive descent (top down) parser for SPTL. Specifically, it will be an LL(1) parser. That is, parsing from Left to right, with a Lookahead of 1 token.

This places restrictions on the grammar. The parser must be able to determine, when it finds a token, what grammar rule to apply. If a language had these rules:


statement : incstatement | decstatement
incstatement : var '++'
decstatement : var '--'


then the parser wouldn't know when it encountered a var whether this was an incstatement or a decstatement.

For a grammar to qualify as LL(1), it must satisfy a set of rules.

Define:

A as a sentence (one or more tokens) in the grammar
u and v as sentences derived from A
w and x as unspecified sentences
FIRST as the set of possible first tokens for the sentence
FOLLOW as the set of possible first tokens for the following sentence


The following symbols mean:

For all
There exists
Empty set
Empty sentence
Intersection (the elements common to both sets)


A grammar G is LL(1) if and only if:


  • ∀ A, A ≠ ∈,
    ∀ pairs A → u and A → v,
    FIRST(u) ∩ FIRST(v) = ∅

  • If ∃ A, A → ∈,
    FIRST(A) ∩ FOLLOW(A) = ∅



A pair is any two elements that can appear in the same part of the sentence (in the BNF-like grammar above, these are denoted by the | symbol).

So, going back to the incstatement/decstatement example above,
statement is A, the sentence we are looking at
incstatement and decstatement are u and v respectively
FIRST(incstatement) is {var}
FIRST(decstatement) is {var}

FIRST(incstatement)FIRST(decstatement) = {var}
{var} ≠ ∅

Therefore, the grammar above is not LL(1).

How can we change it into an LL(1) grammar? Simple:


statement = var (incop | decop)
incop = '++'
decop = '--'


Now the var is part of statement, and in place of incstatement/decstatement we have incop/decop.

FIRST(incop) is {'++'}
FIRST(decop) is {'--'}

FIRST(incop)FIRST(decop) = ∅

Now the grammar (or at least this portion of it) is LL(1).

This is a very simple example. But how do we determine FIRST and FOLLOW for all sentences of a grammar? I'll explain in my next post.

Labels: , ,