ThinkGeek - Cool Stuff for Geeks and Technophiles

Sunday, February 22, 2009

FIRST and FOLLOW sets, part 2

In my last post we looked at FIRST sets. That's the hard part. Today's topic, FOLLOW sets, are easy.

In the grammar restrictions post I defined FOLLOW as

the set of possible first tokens for the following sentence


So now that we have the FIRST set for every grammar element, we can get each element's FOLLOW set by finding the FIRST set for every element that follows it.

Let's define:

S, A, B, and C as sentences in the grammar

The FOLLOW rules are:


  1. For S → AB, S → [A]B, S → {A}B

    • If ∀ B, B ≠ ∈
      FOLLOW(A) includes FIRST(B)
    • If ∃ B, B = ∈
      FOLLOW(A) includes FOLLOW(S)


  2. For S → AB, S → A[B], S → A{B}

    • FOLLOW(B) includes FOLLOW(N)


  3. For S → AB | AC

    • FOLLOW(A) includes FIRST(B) ∪ FIRST(C)




From these three simple rules (and the FIRST sets), we can construct the FOLLOW sets for all the elements of the grammar.

The complete set of FOLLOW sets for SPTL is available in gnumeric format or csv format.

Labels: , , ,

Sunday, February 8, 2009

FIRST and FOLLOW sets, part 1

In my last post, we considered the grammar restrictions imposed by an LL(1) parser. We looked at how FIRST sets for post-increment and post-decrement statements could lead to ambiguity for the parser, and how to fix the problem.

But even a simple language like SPTL has dozens of production rules, and for many of them the FIRST set may not be obvious.

For example:


program = elementblock

elementblock = {simpleelement}+

simpleelement = statement | vardec | functiondef


We've gone through three production rules, and not found anything resembling a SPTL token. And it continues:


statement = assignstatement | breakstatement | callstatement |
closestatement | foreachstatement | forstatement |
ifstatement | openstatement | popstatement |
predecstatement | preincstatement | printstatement |
pushstatement | readstatement | repeatstatement |
shiftstatement | whilestatement


At this rate, it will take forever to get anywhere. We need a method for constructing the set of FIRST sets. Fortunately, there are several rules which will make the job a lot easier.

But first, some definitions:

Terminal: The technical term for token. These are the symbols of the language.
Nonterminal: These are the terms that appear in the BNF but not in the language. A nonterminal will always have a production rule to transform it into something else.

Now we'll define:
A and B as nonterminals
w and x as terminals

And, as in the last post, the following symbols mean:

Empty set
Empty sentence
Intersection - the elements common to both sets
Union - the elements appearing in one or both sets


Now, the FIRST rules:



  1. FIRST(∈) = ∅
  2. FIRST(x) = x
  3. FIRST(w|x) = [w, x]
  4. FIRST(∈ x) = x;
  5. FIRST(w x) = w
      but if w can be empty, FIRST(w x) = [w, x]
  6. FIRST(A|B) = FIRST(A) ∪ FIRST(B)
  7. FIRST(A B) = FIRST(A)
      but if A can be empty, FIRST(A B) = FIRST(A) ∪ FIRST(B)



So far, so good. But how do we apply them?

We can see that every FIRST rule for a nonterminal gives us another FIRST rule, while all FIRST rules for terminals gives us a terminal (even if it's an empty terminal). Eventually we need to get a FIRST set of terminals for each rule. By building from the bottom up, we can get a FIRST set for each element of the BNF grammar.

For example, let's look at a portion of the expression rules for SPTL:


expr = boolexpr { boolop boolexpr }
boolexpr = relexpr [ relop relexpr ]
boolop = '&&' | '||'
relexpr = [ '-' ] term { addop term }
relop = '<' | '>' | '==' | '<=' | '>=' | '!='
term = factor { multop factor }
addop = '+' | '-'
factor = constant | varname | funcname | '(' expr ')' | negate
multop = '*' | '/' | '%'
negate = '!' factor
constant = boolconstant | numconstant | stringconstant
boolconstant = 'true' | 'false'
numconstant = digit {digit}
stringconstant = plainstring | escapestring
plainstring = ''' text '''
escapestring = '"' text '"'


If we start with expr = boolexpr { boolop boolexpr }, we get FIRST(expr) = FIRST(boolexpr), FIRST(boolexpr) = FIRST(relexpr), and so on. By starting at the bottom, we see:


FIRST(escapestring) = "
FIRST(plainstring) = '


By starting out at the bottom, with production rules that lead immediately to terminals, we can apply the FIRST rules repeatedly for the other production rules, to get a set of terminals for every production rule. So:


FIRST(stringconstant)
= FIRST(plainstring)FIRST(escapestring)
= [", ']


I've uploaded the complete set of FIRST sets for SPTL in gnumeric format or csv format.

My next post will look at FOLLOW sets.

Labels: , , ,