ThinkGeek - Cool Stuff for Geeks and Technophiles

Thursday, May 21, 2009

programming experiments

Bill the Lizard, in a post titled Programming and Experimentation, writes about his experience as a tutor of first-year CS students in college:

I would frequently have students bring code to me and ask me what I thought of it. Some would even go so far as to ask me if I thought their code would compile. I would never answer this question directly (despite Head First Java repeatedly urging me to "Be the Compiler"), but would instead patiently show them how to answer it for themselves. These students weren't lazy, they were scared. In many cases it seemed like they were more scared of being wrong than they were of not knowing the answer. [emphasis in original]


He then notes that curiosity is a trait shared by most or all of the best programmers, that curiosity leads to experimentation, and that a good teacher or mentor can impart that curiosity to others. It's a good post, well worth the reading.

But what jumped out at me is how this dovetails with something I read this week in Clay Shirky's Here Comes Everybody, on exactly how open source software threatens the proprietary model:

Open source is a profound threat, not because the open source ecosystem is outsucceeding commercial efforts but because it is outfailing them.


Shirky took a look through the project tree at SourceForge, and found that the vast majority of projects could not be considered successes in any sense of the word. But here's the key: Someone made the effort to start a project. Someone tried an experiment.

And a few, a small but significant number, built something worth using. Of these, most are niche software, useful to a few people. But a few have attracted a wide audience.

It's impossible to know at the outset which projects will be successful. What SourceForge and other open source project hosting sites provide is a platform for programmers to try something without a large up-front financial commitment. The more experiments, the greater the likelihood that someone will succeed.

That's why inspiring beginning programmers to try their own experiments is such a valuable gift. Bill the Lizard says it well:

If you show someone how experimentation works in programming, and you're enthusiastic about learning with them, they might catch the bug from you. I had very few students who were disappointed that I wouldn't just tell them the answers to their programming assignments. Most of them wanted to learn how to do it for themselves once they were convinced that they could.

Labels: , , ,

Saturday, May 16, 2009

SPTL: tokens and the symbol table

In the early stages of Jack Crenshaw's excellent Let's Build a Compiler series, his compiler can only handle single-character keywords. It would seem that this is a limitation of all LL(1) parsers, because the parser is only looking at one source character at a time.

There is a way around this limitation. We can preprocess the source code, replacing each element (no matter how long) with a single character. These characters are known as tokens, and the process is known as lexical analysis. We'll call the tool that handles lexical analysis the lexer. The lexer will scan the source code to determine how many characters make up the next element, then it will look up the element in the symbol table to determine the number to return. I'll discuss the SPTL lexer in more detail in a later post.

If we look through SPTL's grammar, we will find these keywords:


break
false
for
foreach
function
if
new
newline
number
pop
print
push
read
repeat
return
shift
true
until
while


and these operators:

+=
&&
@
=
,
.
--
/
/=
==
"
^
&
>=
>
++
{
[
(
<=
<
-
%
*
*=
!
!=
||
\
+
}
]
)
$
;
-=


Before the compiler even looks at the source code, the keywords and symbols will all be pre-loaded into a symbol table, and assigned unique numeric values. The numeric values are the tokens that will be passed to the parser.

But most programs consist of more than keywords and operators. The code will likely contain user-defined names. These names will need to be loaded into the symbol table, too, but we can't load them until we encounter them. Thus our lexer will need to be able to interact with the symbol table, not only to retrieve tokens but to add to the table.

Our symbol table will look something like this:



break 0
false 1
for 2
foreach 3
function 4
if 5
[et cetera]



I'll use Perl's associative array (hash) data structure to build the table. This way, the lexer can simply look up a symbol directly:


if (defined({'lexer'}->{'symbol'}->{'table'}{$token}))
{
# retrieve the value
}


rather than iterating through the values sequentially:


my $i = 0;
while($i < {'lexer'}->{'symbol'}->{'count'} && !$flag) {
# search sequentially until we find the right
# value; if there are 3000 keywords/operators/
# names, it will takean average of 1500
# iterations to find the right one
if({'lexer'}->{'symbol'}->{'table'}{$i} eq $token)
{
# finally!
$flag = 1;
}
$i++;
}


One more thing: You may have noticed that the symbol table includes a 'newline' keyword, even though it's not part of the language. Its purpose is to let the lexer preserve newlines when passing tokens to the parser. This way, the parser can report the line number if there is an error, and confirm the total number of lines parsed.

You can see the Perl module that builds the symbol table here.

Labels: , , ,

Monday, May 11, 2009

name that month: adventures in refactoring

One day, at the place I was working at the time, I was debugging some legacy PHP code from a predecessor when I found this:



function get_month_name ($month)
{
$month = $month + 1;
if ($month == 1) {
$name = 'Jan';
}
if ($month == 2) {
$name = 'Feb';
}
if ($month == 3) {
$name = 'Mar';
}
if ($month == 4) {
$name = 'Apr';
}
if ($month == 5) {
$name = 'May';
}
if ($month == 6) {
$name = 'Jun';
}
if ($month == 7) {
$name = 'Jul';
}
if ($month == 8) {
$name = 'Aug';
}
if ($month == 9) {
$name = 'Sep';
}
if ($month == 10) {
$name = 'Oct';
}
if ($month == 11) {
$name = 'Nov';
}
if ($month == 12) {
$name = 'Dec';
}
return $name;
}



I've changed a few details, but the function really did contain twelve if() statements, and no else() clause. The function checked for all twelve months, regardless of which month was passed to it. The variable $month is evidently coming from something that is zero-based, and I'm guessing the poor programmer just couldn't wrap his/her brain around this:



if($month == 2) {
$name = 'Mar';
}



Now this wasn't the function I was tasked with debugging. The correct month abbreviation actually printed on the web page, so the code wasn't absolutely wrong. Still, I couldn't muster the will to let it go.

The most trivial improvement would be to change each if() except the first one to elseif(). That way, at least, the function could stop checking the value of $month when it found the correct month. But we can do even better than that.

A marginally better solution would be to replace the if() with a switch() statement:



function get_month_name ($month)
{
$month++;
switch ($month) {
case 1:
$name = 'Jan';
break;
case 2:
$name = 'Feb';
break;
case 3:
$name = 'Mar';
break;
case 4:
$name = 'Apr';
break;
//et cetera
}
}



This suffers from the same limitations of checking every value sequentially, but it removes the twelve == operators, and the possibility of accidentally typing one as a single = and returning the wrong month. On the other hand, it opens the possibility of omitting a break; and returning the wrong month.

A better solution is this:



function get_month_name ($month)
{
$names = array ('Jan', 'Feb', 'Mar',
'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep',
'Oct', 'Nov', 'Dec');
return $names[$month];
}



With one stroke we've removed the unnecessary incrementing of $month and the multiple comparisons.

On top of that, the $names array was already declared as a global variable at the top of the file. I didn't even need to make a local variable, although it's better to do so. Defining it locally ensures that changes to the global variable won't adversely affect this function.

So that's the solution I used, but could I have done even better? Many languages have a library function to return the month name or abbreviation, given the month number. PHP is not my primary language, but I'd be willing to bet that somewhere among PHP's 4376* built-in functions is one to handle this.



* estimated

Labels: , ,