Parsing indented text
"How can I parse indented text with a grammar?" has turned into a frequently-asked question recently. People want to parse Python and CoffeScript.
My fix is double. First, here's Text::Indented, a module that does it.
Secondly, I'll now recreate my steps in creating this module. Each section will have a description of what needs to be done, a failing test, and then the appropriate implementation code to pass the test.
Quite a simple indent
We want to be able to handle indentation at all.
my $input = q:to/EOF/;
Level 1
Level 2
EOF
parses_correctly($input, 'single indent');
Well, that's easy. This grammar will do that:
regex TOP { .* }
(Kent Beck told me I can cheat, so I cheat!)
Too much indent for our own good
But there are some indent jumps that we're not allowed to make. Anything that indents more than one step at a time, basically. Let's check for that.
my $input = q:to/EOF/;
Level 1
Level 3!
EOF
fails_with($input, Text::Indented::TooMuchIndent);
This takes a little more code to fix. We declare an exception, start parsing lines, and separate each line into indent, extra whitespace, and the rest of the line. Finally we check the line's indent against the current indent — mediated by the contextual variable @*SUITES
. You'll see where I'm going with this in a minute.
class TooMuchIndent is Exception {}
constant TABSTOP = 4;
regex TOP {
:my @*SUITES = "root";
<line>*
}
sub indent { @*SUITES.end }
regex line {
^^ (<{ "\\x20" x TABSTOP }>*) (\h*) (\N*) $$ \n?
{
my $new_indent = $0.chars div TABSTOP;
die TooMuchIndent.new
if $new_indent > indent() + 1;
}
}
(The <{ "\\x20" x TABSTOP }>
is a bit of a workaround. In Wonderful Perl 6 we would be able to write just [\x20 ** {TABSTOP}]
.)
Actual content
Having laid the groundworks, let's get our hands dirty. We want the content to end up, line by line, on the right scoping level.
my $input = q:to/EOF/;
Level 1
Level 2
EOF
my $root = parse($input);
isa_ok $root, Text::Indented::Suite;
is $root.items.elems, 2, 'two things were parsed:';
isa_ok $root.items[0], Str, 'a string';
isa_ok $root.items[1], Text::Indented::Suite, 'and a suite';
We need a Suite
(term borrowed from Python) to contain the indented lines:
class Suite {
has @.items;
}
This requires a slight amending of TOP
:
regex TOP {
:my @*SUITES = Suite.new;
<line>*
{ make root_suite }
}
The logic in line
to create new suites with new indents:
# ^^ (<{ "\\x20" x TABSTOP }>*) (\h*) (\N*) $$ \n?
my $line = ~$2;
if $new_indent > indent() {
my $new_suite = Suite.new;
add_to_current_suite($new_suite);
increase_indent($new_suite);
}
add_to_current_suite($line);
For all this, I had to define some convenience routines:
sub root_suite { @*SUITES[0] }
sub current_suite { @*SUITES[indent] }
sub add_to_current_suite($item) { current_suite.items.push($item) }
sub increase_indent($new_suite) { @*SUITES.push($new_suite) }
But what about de-indenting?
We've handled indenting and creating new suites nicely, but what about de-indenting?
my $input = q:to/EOF/;
Level 1
Level 2
Level 1 again
EOF
my $root = parse($input);
is $root.items.elems, 3, 'three things were parsed:';
isa_ok $root.items[0], Str, 'a string';
isa_ok $root.items[1], Text::Indented::Suite, 'a suite';
isa_ok $root.items[2], Str, 'and a string';
Easily fixed with an elsif
case in our line
regex:
elsif $new_indent < indent() {
decrease_indent;
}
And a convenience routine:
sub decrease_indent { pop @*SUITES }
Hah, you missed multi-step de-indents!
Indenting multiple steps at a time isn't allowed... but de-indenting multiple steps is. (This may actually be the strongest point of this kind of syntax. It corresponds to the } } }
or end end end
case of languages with explicit block delimiters, and is arguably neater.)
my $input = q:to/EOF/;
Level 1
Level 2
Level 3
Level 3
Level 1 again
EOF
my $root = parse($input);
is $root.items.elems, 3, 'three things on the top level';
is $root.items[1].items[1].items.elems, 2, 'two lines on indent level 3';
Oh, but we only need to change one line in the implementation to support this:
decrease_indent until indent() == $new_indent;
And a half!
Now for some random sins. You're not supposed to indent partially, a non-multiple of the indent size.
my $input = q:to/EOF/;
Level 1
Level 2 and a half!
EOF
fails_with($input, Text::Indented::PartialIndent);
So we introduce a new exception.
class PartialIndent is Exception {}
And a condition that checks for this:
# ^^ (<{ "\\x20" x TABSTOP }>*) (\h*) (\N*) $$ \n?
my $partial_indent = ~$1;
die PartialIndent.new
if $partial_indent;
What do you mean, "jumped the gun"?
Secondly, you're not meant to indent the first line; it has to be at indentation level 0.
my $input = q:to/EOF/;
Level 2 already on the first line!
EOF
fails_with($input, Text::Indented::InitialIndent);
We introduce another exception for that.
class InitialIndent is Exception {}
And a condition that matches our test case.
die InitialIndent.new
if !root_suite.items && $new_indent > 0;
The importance of handles
As a final clean-up refactor, let's change @.items
in Suite
to this:
class Suite {
has @.items handles <push at_pos Numeric Bool>;
}
It makes Suite
more Array
-like. Piece by piece:
push
allows us to push directly into aSuite
object, instead of into its.items
attribute.at_pos
allows us to indexSuites
directly. Things like$root.items[1]
in the tests turn into$root[1]
.Numeric
gets rid of the.elems
calls for us in the tests, and we can write$root.items.elems
as just+$root
instead.- Finally,
Bool
allows us to write!root_suite.items
as just!root_suite()
.
Somehow I liked doing this refactor last, after all the dust around the implementation had settled. It makes the API much more enjoyable to use, and hides a bunch of unnecessary steps along the way. I really like the way handles
saves a bunch of boring code.
Enjoy
Anyway, that's parsing of indented code. Not as tricky as I thought.
Now I fear I've damned myself to contribute this solution to arnsholt++'s budding py3k implementation. 哈哈