Strangely Consistent

Musings about programming, Perl 6, and programming Perl 6

Parsing indented text

"How can I parse indented text with a grammar?" has turned into a frequently-asked question recently. People want to parse Python and CoffeScript.

My fix is double. First, here's Text::Indented, a module that does it.

Secondly, I'll now recreate my steps in creating this module. Each section will have a description of what needs to be done, a failing test, and then the appropriate implementation code to pass the test.

Quite a simple indent

We want to be able to handle indentation at all.

    my $input = q:to/EOF/;
    Level 1
        Level 2
    EOF

    parses_correctly($input, 'single indent');

Well, that's easy. This grammar will do that:

regex TOP { .* }

(Kent Beck told me I can cheat, so I cheat!)

Too much indent for our own good

But there are some indent jumps that we're not allowed to make. Anything that indents more than one step at a time, basically. Let's check for that.

    my $input = q:to/EOF/;
    Level 1
            Level 3!
    EOF

    fails_with($input, Text::Indented::TooMuchIndent);

This takes a little more code to fix. We declare an exception, start parsing lines, and separate each line into indent, extra whitespace, and the rest of the line. Finally we check the line's indent against the current indent — mediated by the contextual variable @*SUITES. You'll see where I'm going with this in a minute.

class TooMuchIndent is Exception {}

constant TABSTOP = 4;

regex TOP {
    :my @*SUITES = "root";

    <line>*
}

sub indent { @*SUITES.end }

regex line {
    ^^ (<{ "\\x20" x TABSTOP }>*) (\h*) (\N*) $$ \n?

    {
        my $new_indent = $0.chars div TABSTOP;

        die TooMuchIndent.new
            if $new_indent > indent() + 1;
    }
}

(The <{ "\\x20" x TABSTOP }> is a bit of a workaround. In Wonderful Perl 6 we would be able to write just [\x20 ** {TABSTOP}].)

Actual content

Having laid the groundworks, let's get our hands dirty. We want the content to end up, line by line, on the right scoping level.

    my $input = q:to/EOF/;
    Level 1
        Level 2
    EOF

    my $root = parse($input);

    isa_ok $root, Text::Indented::Suite;
    is $root.items.elems, 2, 'two things were parsed:';
    isa_ok $root.items[0], Str, 'a string';
    isa_ok $root.items[1], Text::Indented::Suite, 'and a suite';

We need a Suite (term borrowed from Python) to contain the indented lines:

class Suite {
    has @.items;
}

This requires a slight amending of TOP:

regex TOP {
    :my @*SUITES = Suite.new;

    <line>*

    { make root_suite }
}

The logic in line to create new suites with new indents:

# ^^ (<{ "\\x20" x TABSTOP }>*) (\h*) (\N*) $$ \n?

my $line = ~$2;

if $new_indent > indent() {
    my $new_suite = Suite.new;
    add_to_current_suite($new_suite);
    increase_indent($new_suite);
}

add_to_current_suite($line);

For all this, I had to define some convenience routines:

sub root_suite { @*SUITES[0] }
sub current_suite { @*SUITES[indent] }
sub add_to_current_suite($item) { current_suite.items.push($item) }
sub increase_indent($new_suite) { @*SUITES.push($new_suite) }

But what about de-indenting?

We've handled indenting and creating new suites nicely, but what about de-indenting?

    my $input = q:to/EOF/;
    Level 1
        Level 2
    Level 1 again
    EOF

    my $root = parse($input);

    is $root.items.elems, 3, 'three things were parsed:';
    isa_ok $root.items[0], Str, 'a string';
    isa_ok $root.items[1], Text::Indented::Suite, 'a suite';
    isa_ok $root.items[2], Str, 'and a string';

Easily fixed with an elsif case in our line regex:

elsif $new_indent < indent() {
     decrease_indent;
}

And a convenience routine:

sub decrease_indent { pop @*SUITES }

Hah, you missed multi-step de-indents!

Indenting multiple steps at a time isn't allowed... but de-indenting multiple steps is. (This may actually be the strongest point of this kind of syntax. It corresponds to the } } } or end end end case of languages with explicit block delimiters, and is arguably neater.)

    my $input = q:to/EOF/;
    Level 1
        Level 2
            Level 3
            Level 3
    Level 1 again
    EOF

    my $root = parse($input);

    is $root.items.elems, 3, 'three things on the top level';
    is $root.items[1].items[1].items.elems, 2, 'two lines on indent level 3';

Oh, but we only need to change one line in the implementation to support this:

decrease_indent until indent() == $new_indent;

And a half!

Now for some random sins. You're not supposed to indent partially, a non-multiple of the indent size.

    my $input = q:to/EOF/;
    Level 1
          Level 2 and a half!
    EOF

    fails_with($input, Text::Indented::PartialIndent);

So we introduce a new exception.

class PartialIndent is Exception {}

And a condition that checks for this:

# ^^ (<{ "\\x20" x TABSTOP }>*) (\h*) (\N*) $$ \n?

my $partial_indent = ~$1;

die PartialIndent.new
    if $partial_indent;

What do you mean, "jumped the gun"?

Secondly, you're not meant to indent the first line; it has to be at indentation level 0.

    my $input = q:to/EOF/;
        Level 2 already on the first line!
    EOF

    fails_with($input, Text::Indented::InitialIndent);

We introduce another exception for that.

class InitialIndent is Exception {}

And a condition that matches our test case.

die InitialIndent.new
    if !root_suite.items && $new_indent > 0;

The importance of handles

As a final clean-up refactor, let's change @.items in Suite to this:

class Suite {
    has @.items handles <push at_pos Numeric Bool>;
}

It makes Suite more Array-like. Piece by piece:

Somehow I liked doing this refactor last, after all the dust around the implementation had settled. It makes the API much more enjoyable to use, and hides a bunch of unnecessary steps along the way. I really like the way handles saves a bunch of boring code.

Enjoy

Anyway, that's parsing of indented code. Not as tricky as I thought.

Now I fear I've damned myself to contribute this solution to arnsholt++'s budding py3k implementation. 哈哈

Lexpads and why roles need fixups

We need a solution that makes us need less vodka. — jnthn

There are many extremely simple and elegant software solutions out there. But there are also those special moments, when you realize that something is more complex than you thought, and that the complexity is most likely essential.

Character encodings are the prototypical example for me. Certainly datetime handling qualifies as well.

Reaching the realization that there is that extra essential complexity, comes (at least for me) with a sinking feeling as I get used to the idea of living with that complexity forever.

With me so far? Something seemed quite easy, wrapped up, ready to go home for the day, but then all this extra complexity rears its head. And it's never going away.

I started writing this blog post because I realized that a certain snag in role handling in Rakudo doesn't have a URL, and it really should. So, without fanfare, here's the situation:

my $x;

role R {
    method foo {
        say $x;
    }
}

class C does R {
}

$x = "OH HAI";
C.new.foo;

I think we all agree that this should print OH HAI. Good? Good. Nothing up my sleeve, no hidden mirrors or escape hatches — it does print OH HAI. Relax. Take a deep breath.

Ready? Because after you learn this, there's no going back. The world will forever be more complicated and, with some luck, you'll be having that sinking feeling.

Ok, so. Just a few simple facts:

Let's recap what we know by applying it to the code. There's the variable $x. We know we will find it in the static lexpad of the mainline, because it's declared on the top level and everything has a static lexpad. Does it also have a runtime lexpad? Yes, it does, because the mainline starts running after compilation is over. Will we find $x in several runtime lexpads? No, only the one.

Now, we ask ourselves the question: which lexpad is C.foo referring to?

"Of course, it's the runtime lexpad", we reply, innocent to the fact that the trap has already shut around us and there's no way out. See, it has to be the runtime lexpad, because the sane thing for the program to do is to print OH HAI, and that value is certainly stored in the runtime lexpad.

But no. It's not possible. It can't. There's no way. Because roles are created at compile time, before there is a runtime lexpad! The role method has no choice: it's bound to the static lexpad, because at that point, that's all there is.

And there we are. The trap has now closed. There's no way to both (a) do what the user expects, and (b) keep the internal model nice and free of weird exceptions.

Since we like (a), we ditch (b) and create an exception in Rakudo. It's called a fixup, it's installed during role creation, and it makes sure that whenever the block surrounding the role is entered, the role rebinds its OUTER to that block's fresh lexpad.

Simple it ain't. Nor is it pretty. But it makes the user happy.

The reason I started thinking about this is that we run into the same kind of problem with macros, and the same kind of fixup will probably be needed there.

More to the point, at the point where this need-for-a-fixup starts showing up in different parts of the architecture, it's time to give it a name and perhaps think of a uniform way to address this. That's where jnthn's quote from the start of the post originates — we need a solution that isn't worse than the problem, and that we can reason about without having to scale the Ballmer peak.

t3: Wire crossings

Apparently I solemnly swore in the last p6cc blog post that this blog post would appear sooner than after a two-month break. Apparently I suck at living up to what I solemnly swear. Anyway, kudos to $dayjob for keeping me healthily occupied with things, and a big thanks to all people related to Perl 6 who gently reminded me to keep up with the reviewing.

(Or maybe I didn't specifically mean "the next p6cc blog post", but just "the next blog post"? I wish I could make myself believe that. No, the real answer is that reviewing stuff takes time and effort, and my time and effort have been elsewhere lately.)

Let's talk about crossing wires in elegant ways! Here, I'll let the description refresh your mind:

## Arrange wire crossings to rearrange wires

Ten wires are come in from the left. Ten wires leave to the right. Write a
program that, given a permutation of the digits 0..9, arranges crossings on a
grid that places the wires in the order designated by the permutation.

The output consists of a grid whose elements connect the left and right sides.
Each cell of the grid is either *empty* (in that it just lets the wires
through), or a *crossing* (in that it lets the wires trade places). Two
crossings can never be placed vertically adjacent to each other. (This is
equivalent to saying that the wires never split or merge, they only permute.)

Often, solutions require crossings to be spread out not just vertically but
also horizontally. It is considered elegant not to make the grid wider than it
has to be.

Examples:

    Input: 1032547698

    Output:

    0 _  _ 1
       \/
    1 _/\_ 0

    2 _  _ 3
       \/
    3 _/\_ 2

    4 _  _ 5
       \/
    5 _/\_ 4

    6 _  _ 7
       \/
    7 _/\_ 6

    8 _  _ 9
       \/
    9 _/\_ 8

(This permutation is simply flipping wires pairwise.)

    Input: 1234567890

    Output:

    0 _  _________________ 1
       \/
    1 _/\  _______________ 2
         \/
    2 ___/\  _____________ 3
           \/
    3 _____/\  ___________ 4
             \/
    4 _______/\  _________ 5
               \/
    5 _________/\  _______ 6
                 \/
    6 ___________/\  _____ 7
                   \/
    7 _____________/\  ___ 8
                     \/
    8 _______________/\  _ 9
                       \/
    9 _________________/\_ 0

(The simplest way to bubble 0 to the bottom.)

    Input: 5012367894

    0 _________  _ 5
               \/
    1 _______  /\_ 0
             \/
    2 _____  /\___ 1
           \/
    3 ___  /\_____ 2
         \/
    4 _  /\_______ 3
       \/
    5 _/\  _______ 6
         \/
    6 ___/\  _____ 7
           \/
    7 _____/\  ___ 8
             \/
    8 _______/\  _ 9
               \/
    9 _________/\_ 4

(The simplest way to bubble 4 and 5 simultaneously.)

The reviews are in. To get the full enjoyment out of this blog post, I highly recommend that you read the reviews as well as this post. The solutions are a varied bunch, and there's lots of nice code in there.

How would a program find a nice, short solution to the write-crossing problem? Wait, can we even be sure there always is a solution? If the fundamental operation is crossing two adjacent wires, can we really generate the full space of permutations? (As opposed to, say, only half the space, like in the 15 puzzle.)

We can generate the full space of permutations. The quickest way to convince ourselves of that is to think of sorting algorithms, many of which use "flip two adjacent values" as its fundamental operation. Sorting algorithms can sort anything, hence the wire-crossing problem always has a solution.

(Wouldn't it be weird to live in a world where sometimes you'd pass in a list to be sorted, and the computer would come back and say "sorry, this is one of those unsortable lists of values". What a bummer!)

In fact, many of the solutions took the sorting analogy to heart, producing something like a bubble sort with slightly modified rules. In bubble sort, the same value can be transposed several times during one run, something that isn't possible in the wire universe: you flip two writes, and you then have to wait until the next column to flip either of those wires again. But with that little restriction added, bubble sort seems to solve this problem just fine.

As always, it's nice to see how people's styles differ broadly. I never see two identical solutions, but this time around, it felt especially varied. Maybe because the problem is relatively small, and one wouldn't think there were that much to vary. But just watch as people apply dynamic variables, feed operators, enums, junctions, sequence operators, metaoperators, and many other things to solve the same problem. There Is indeed More Than One Way To Do It.

As of last review post, we were down to five finalists. Now we're down to four.

Next post: pouring water on a block world!