matatu blog

Brian writes about computing

markatu

Oct 11, 2018

Inventing a lightweight markup language.

So, I started writing this blog using markdown. But, I soon found that markdown wasn't able to generate the kind of HTML that I wanted. In this article, I talk about the techniques I used to invent my own lightweight markup language. I took inspiration from markdown's brevity and slim's flexibility, and threw in some constructs from high-level programming languages.

If you're more interested in the final product than the journey, you can check out the final git repository, which has a command line tool for turning things like this into HTML:

h2#title: markatu

small.w3-right: Oct 11, 2018

h3#subtitle: Inventing a lightweight markup language.

So, I started writing this blog using markdown.  But, I
soon found that markdown wasn't able to generate the kind
of HTML that I wanted.  In this article, I talk about
the techniques I used to invent my own lightweight markup
language.  I took inspiration from markdown's brevity
and <slim:slim-lang.org>'s flexibility, and threw in
some constructs from high-level programming languages.

If you're more interested in
the final product than the journey, you can check out
the final git <repository:https://github.com/bduggan/markatu>,
which has a command line tool for turning things like this
into HTML:

example=div.w3-panel,w3-card,w3-light-grey,w3-code {
+INCLUDE index.mt 1-25
}

Some features of the final language:

* Uses punctuation for things like bold, bullets and inline code (like markdown).
* Can generate arbitrary nested tags with attributes, including ids and classes (like slim).
* Uses blank lines to separate paragraphs (like markdown).
* Supports aliases (like example above).
* Supports including other files, as well as running them and capturing their output.

Anyway, here are the techniques I used to make a parser and generate HTML. By the way, if you prefer examples, the source code for this blog entry is at the bottom of this page.

Let's start with paragraphs: blocks of text separated by blank lines.

The grammar on the right parses paragraphs.

The % is a shortcut for "separated by".

So, % "\n\n" matches paragraphs which are separated by two newlines in a row.

Similarly, a paragraph is a sequence of lines separated by single newlines. \N matches anything except a \n.
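To see `%` and `\N` in isolation, here is a small sketch that uses plain regexes rather than the grammar. The sample strings are made up for illustration, and `%%` is the variant of `%` that also accepts a trailing separator.

```raku
# `%` means "separated by": one or more words with a comma between each pair.
say so 'one,two,three' ~~ /^ [\w+]+ % ',' $/;   # True
say so 'one,two,'      ~~ /^ [\w+]+ % ',' $/;   # False: `%` rejects a trailing separator
say so 'one,two,'      ~~ /^ [\w+]+ %% ',' $/;  # True:  `%%` accepts it

# \N matches any character except a newline, so \N+ stops at the end of a line.
say ~("first line\nsecond line" ~~ / \N+ /);    # first line
```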

Note that we have a regex, a rule, and a token. A token is a regex without backtracking (like a lexer). A rule is a token but spaces in the rule match whitespace in the input.
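The difference between the three declarators is easiest to see when the same pattern is declared more than one way. This is a standalone sketch; the names `with-backtracking`, `without-backtracking`, `tight` and `loose` are invented for the example and are not part of markatu.

```raku
# \w+ first grabs the whole word, including the final 'b'.
my regex with-backtracking    { \w+ 'b' }
my token without-backtracking { \w+ 'b' }

say so 'aab' ~~ /^ <with-backtracking> $/;     # True:  \w+ gives the last 'b' back
say so 'aab' ~~ /^ <without-backtracking> $/;  # False: \w+ keeps 'aab', so 'b' cannot match

# Whitespace inside a token is ignored; inside a rule it matches whitespace.
my token tight { 'a' 'b' }
my rule  loose { 'a' 'b' }

say so 'ab'  ~~ /^ <tight> $/;   # True
say so 'a b' ~~ /^ <tight> $/;   # False
say so 'a b' ~~ /^ <loose> $/;   # True
say so 'ab'  ~~ /^ <loose> $/;   # False
```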

Here's the output →

When we print the value returned by parse using say, we get a nice little tree of matches.

grammar Markatu::Grammar {
  rule TOP {
    <p>+ % "\n\n"
  } 
  regex p {
    <line>+ % "\n"
  }
  token line {
      \N+
  }
}

say Markatu::Grammar.parse: q:to/END/;
A paragraph.

A paragraph that
has two lines.
END

｢A paragraph.

A paragraph that
has two lines.
｣
 p => ｢A paragraph.｣
  line => ｢A paragraph.｣
 p => ｢A paragraph that
has two lines.｣
  line => ｢A paragraph that｣
  line => ｢has two lines.｣

You are probably saying, okay, I could have just called split("\n\n") to get all the paragraphs, and you are right, but stick with me, it gets better.
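One thing the match tree gives us that `split` does not is named structure. As a rough sketch, the grammar from above is repeated here so the block runs on its own, and the loop at the end (my own illustration, not code from the repository) walks the tree by subrule name.

```raku
grammar Markatu::Grammar {
  rule TOP {
    <p>+ % "\n\n"
  }
  regex p {
    <line>+ % "\n"
  }
  token line {
      \N+
  }
}

my $match = Markatu::Grammar.parse: q:to/END/;
A paragraph.

A paragraph that
has two lines.
END

# Each named subrule becomes a key in the match object.
# Quantified subrules (<p>+, <line>+) give a list of matches.
for $match<p>.kv -> $i, $p {
    say "paragraph $i has { $p<line>.elems } line(s)";
    say "  line: $_" for $p<line>.map(*.Str);
}
```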

We have a tree, but we want HTML, so let's make a quick Node class to represent a DOM node.

A node has a tag, maybe some attributes (a hash), maybe some text (a scalar), and maybe some children (an array). Rendering is recursive.

We could use typing and declare the types of things too (e.g. all the children are Nodes) but for now I want to be lazy and quick.

And anyhow we use sigils to at least indicate the container type of the attributes: $ is a scalar, % is a Hash of attributes, and @ makes an Array of children.

By the way, we make the tag optional, so that we can have elements of the DOM tree that just group other elements together.

Okay, here's the output of the code on the right.

<div><p>hello</p>
<pre id="earth">world</pre>
</div>

class Node {
  has $.tag;
  has $.text = '';
  has %.attrs;
  has @.children;

  method open-tag {
      return "" unless $.tag;
      "<$.tag"
      ~ (
         %.attrs.kv.map: { qq[ $^key="$^value"] }
        ).join
      ~ ">"
  }
  method close-tag {
      return "" unless $.tag;
      "</$.tag>\n"
  }
  method render {
      self.open-tag
      ~ @.children.map({ .render // '' }).join
      ~ $.text
      ~ self.close-tag;
  }
}

say Node.new(
      :tag<div>,
      children => (
        Node.new(:tag<p>, :text<hello>),
        Node.new(:tag<pre>, :text<world>,
          :attrs( %(id => 'earth') ))
      )
    ).render;
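For comparison, here is roughly what the typed version mentioned above might look like. This is a sketch, not the class from the repository: the name `TypedNode` is made up, and the `render` method is condensed from the three methods above.

```raku
class TypedNode {
  has Str $.tag;                # optional: a node without a tag just groups children
  has Str $.text = '';
  has Str %.attrs;              # attribute values must be strings
  has TypedNode @.children;     # children must themselves be nodes

  method render {
      my $open  = $.tag
          ?? "<$.tag" ~ %.attrs.kv.map({ qq[ $^key="$^value"] }).join ~ ">"
          !! "";
      my $close = $.tag ?? "</$.tag>\n" !! "";
      $open ~ @.children.map(*.render).join ~ $.text ~ $close;
  }
}

my $html = TypedNode.new(
      :tag<div>,
      children => (
        TypedNode.new(:tag<p>, :text<hello>),
      )
    ).render;
say $html;

# A plain string is not a TypedNode, so this construction should fail
# with a type check error instead of silently building a broken tree.
try TypedNode.new(:tag<div>, children => ('not a node',));
say "rejected: { $!.^name }" if $!;
```

The trade-off is the one described above: the untyped class is quicker to write, while the typed one reports a bad child when the tree is built rather than when it is rendered.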

Let's put these two together and generate some HTML.

To generate something, we make an "actions" class -- a class whose methods have the same names as the rules in the grammar. When the grammar matches a rule against some portion of the input, the corresponding method in the actions class is called.

The argument that comes in, $/, is the match object -- which references the current text that was matched. It has a little stash that can be accessed by calling .make (to set a value) or .made (to get a value). In our case we are making a DOM tree, so we will be sending Node objects to .make and retrieving them with .made.

Again -- the output is below the code on the right.

So, okay, maybe that was more work than writing HTML. But the input was this:

A paragraph.

A paragraph that
has two lines.

And now we can have some fun and make our language a bit better.

class Markatu::Actions {
    method TOP($/) {
      $/.make:
        Node.new:
          children => $<p>.map: *.made
    }
    method p($/) {
      $/.make:
        Node.new:
          :tag<p>,
          :text($<line>.map(*.made).join("\n"))
    }
    method line($/) {
      $/.make: "$/"
    }
}

my $actions = Markatu::Actions.new;
my $match = Markatu::Grammar.parse: q:to/END/, :$actions;
A paragraph.

A paragraph that
has two lines.
END
say $match.made.render

<p>A paragraph.</p>
<p>A paragraph that
has two lines.</p>
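If the `.make` / `.made` stash still feels abstract, it may help to see it on a grammar that computes something tiny. This is a toy example unrelated to markatu: the `Sum` grammar and the `SumActions` class are invented for illustration.

```raku
grammar Sum {
  token TOP { <num>+ % '+' }
  token num { \d+ }
}

class SumActions {
    # Leaves stash a plain number ...
    method num($/) {
      $/.make: +$/
    }
    # ... and the parent combines whatever its children stashed.
    method TOP($/) {
      $/.make: [+] $<num>.map(*.made)
    }
}

say Sum.parse('1+2+39', actions => SumActions.new).made;   # 42
```

The pattern is the same as in the paragraph example: each method only looks at the `.made` values of its direct children, so the result is built bottom-up.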

First let's do some basics like bold or monospace.

We break up our line into phrases, and our phrases into characters.

｢Some *bold*, some `code`.｣
 phrase => ｢Some ｣
 phrase => ｢*bold*｣
  bold => ｢bold｣
 phrase => ｢, some ｣
 phrase => ｢`code`｣
  code => ｢code｣
 phrase => ｢.｣
｢Some `code with a * in it`.｣
 phrase => ｢Some ｣
 phrase => ｢`code with a * in it`｣
  code => ｢code with a * in it｣
 phrase => ｢.｣

And use these new definitions to construct DOM nodes.

grammar G {
  token line {
     <phrase>+
  }
  token phrase {
    <bold> | <code> | <-[`*]>+
  }
  token bold {
    '*' <( <-[*]>+ )> '*'
  }
  token code {
    '`' <( <-[`]>+ )> '`'
  }
}

say G.parse: :rule<line>, 'Some *bold*, some `code`.';
say G.parse: :rule<line>, 'Some `code with a * in it`.';
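As a rough idea of what constructing output from these definitions can look like, here is a simplified sketch. It assumes plain strings instead of `Node` objects, and it assumes `<b>` as the tag for bold; the `InlineActions` class is hypothetical and the real actions in the repository may differ.

```raku
grammar G {
  token line {
     <phrase>+
  }
  token phrase {
    <bold> | <code> | <-[`*]>+
  }
  token bold {
    '*' <( <-[*]>+ )> '*'
  }
  token code {
    '`' <( <-[`]>+ )> '`'
  }
}

class InlineActions {
    method line($/) {
      $/.make: $<phrase>.map(*.made).join
    }
    # A phrase is either bold, code, or plain text that passes through unchanged.
    method phrase($/) {
      $/.make: $<bold> ?? $<bold>.made
            !! $<code> ?? $<code>.made
            !! ~$/
    }
    # Thanks to <( ... )>, $/ here is only the text between the markers.
    method bold($/) {
      $/.make: '<b>' ~ $/ ~ '</b>'
    }
    method code($/) {
      $/.make: '<code>' ~ $/ ~ '</code>'
    }
}

my $actions = InlineActions.new;
say G.parse('Some *bold*, some `code`.', :rule<line>, :$actions).made;
# Some <b>bold</b>, some <code>code</code>.
```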

I'm going to skip some of the boring stuff of building new nodes etc, and instead fast forward to an interesting part: making nested tags with lists of classes and ids, and assigning them to aliases.

Here's the source.

h1.title: An h1 whose class is "title".

div.w3-col,s6 {
  Inside a div with two classes.

  Still inside a div.
}

half=div.w3-col,s6 {
  I am tired of typing names of
  classes.  Let's make "half" an alias.
}

half {
  This is the same as `div.w3-col,s6`.
}

Here's the rendered HTML.

<h1 class="title">
  An h1 whose class is "title".
</h1>
<div class="w3-col s6">
  <p>
    Inside a div with two classes.
  </p>
  <p>
    Still inside a div.
  </p>
</div>
<div class="w3-col s6">
  <p>
    I am tired of typing names of
    classes.  Let's make "half" an alias.
  </p>
</div>
<div class="w3-col s6">
  <p>
    This is the same as <code>div.w3-col,s6</code>.
  </p>
</div>

And here's the relevant part of the parser.

token label {
  [$<declare-variable>=\w+ '=']?
  $<tag>=[\w+]
  ['#' $<id>=\w+ ]?
  ['.' <class-list>]?
}

rule tag {
  <label>
  [
   | ':' $<text>=\V*
   | '{' "\n"?
      [ <blocks> "\n"? ]+ % "\n"
     '}'
  ]
}
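To try the `label` token on its own, it can be wrapped in a small grammar. The snippet above does not show `class-list`, so the `class-list` and `class` tokens below are my guess at a minimal definition, and the `LabelOnly` wrapper is only there to make the sketch runnable.

```raku
grammar LabelOnly {
  token TOP { <label> }

  token label {
    [$<declare-variable>=\w+ '=']?
    $<tag>=[\w+]
    ['#' $<id>=\w+ ]?
    ['.' <class-list>]?
  }

  # Assumed definitions: comma-separated class names made of word characters and hyphens.
  token class-list { <class>+ % ',' }
  token class      { [ \w | '-' ]+ }
}

my $alias = LabelOnly.parse('half=div.w3-col,s6')<label>;
say $alias<declare-variable>.Str;                       # half
say $alias<tag>.Str;                                    # div
say $alias<class-list><class>.map(*.Str).join(' ');     # w3-col s6

my $title = LabelOnly.parse('h2#title')<label>;
say $title<tag>.Str;                                    # h2
say $title<id>.Str;                                     # title
```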

Well, for more details, head on over to the GitHub repository. There you can find the source, a test suite with lots of examples, as well as mt -- a command line tool for converting files from markatu into HTML.

Conclusions

* Parsing and inventing languages can be fun.
* Perl 6 Grammars provide nice building blocks for experimenting with languages.
* Lightweight markup languages can be programmer-friendly.

Here is the source for this blog entry:

h2#title: markatu

small.w3-right: Oct 11, 2018

h3#subtitle: Inventing a lightweight markup language.

So, I started writing this blog using markdown. But, I
soon found that markdown wasn't able to generate the kind
of HTML that I wanted. In this article, I talk about
the techniques I used to invent my own lightweight markup
language. I took inspiration from markdown's brevity
and <slim:slim-lang.org>'s flexibility, and threw in
some constructs from high-level programming languages.

If you're more interested in
the final product than the journey, you can check out
the final git <repository:https://github.com/bduggan/markatu>,
which has a command line tool for turning things like this
into HTML:

example=div.w3-panel,w3-card,w3-light-grey,w3-code {
+INCLUDE index.mt 1-25
}

Some features of the final language:

ul {

* Uses punctuation for things like bold, bullets and inline code (like markdown).

* Can generate arbitrary nested tags with attributes, including ids and classes (like <slim:slim-lang.org>).

* Uses blank lines to separate paragraphs (like markdown).

* Supports aliases (like `example` above).

* Supports including other files, as well as running them, and capturing their output.

}

Anyway, here are the techniques I used to make a parser and generate
HTML. By the way, if you prefer examples, the source code
for this blog entry is at the <bottom:^quine> of this page.

div.w3-row-padding(r) {
div.w3-col,m6,s12(c) {
Let's start with paragraphs: blocks of text separated by blank lines.

The grammar on the right parses paragraphs.

The `%` is a shortcut for "separated by".

So, `% "\n\n"` matches paragraphs which are separated by
two newlines in a row.

Similarly, a paragraph is a sequence of lines separated by single
newlines. `\N` matches anything except a `\n`.

Note that we have a `regex`, a `rule`, and a `token`.
A `token` is a `regex` without backtracking (like a lexer).
A `rule` is a `token` but spaces in the rule match whitespace
in the input.

Here's the output →

When we print the value returned by `parse` using `say`, we get
a nice little tree of matches.
}

c {
+CODE first.p6

+OUTPUT first.p6
}
}

You are probably saying, okay, I could have just called `split("\n\n")`
to get all the paragraphs, and you are right, but stick with me, it gets better.

r {
c {
We have a tree, but we want HTML, so let's
make a quick `Node` class to represent a DOM node.

A node has a tag, maybe some attributes (a hash), maybe
some text (a scalar), and maybe some children (an array).
Rendering is recursive.

We could use typing and declare the types of things too (e.g.
all the children are `Node`s) but for now I want to be lazy
and quick.

And anyhow we use sigils to at least indicate the container type
of the attributes: `$` is a scalar, `%` is a `Hash` of attributes,
and `@` makes an `Array` of children.

By the way, we make the tag optional, so that we can have
elements of the DOM tree that just group other elements
together.

Okay, here's the output of the code on the right.

+OUTPUT dom.p6

}

c {
+CODE dom.p6
}
}

Let's put these two together and generate some HTML.

r {
c {
To generate something, we make an "actions" class -- a
class whose methods have the same names as
the rules in the grammar. When the grammar matches a rule
against some portion of the input, the corresponding method
in the actions class is called.

The argument that comes in, `$/`, is the match object --
which references the current text that was matched. It has a little stash
that can be accessed by calling `.make` (to set a value)
or `.made` (to get a value). In our case we are making a
DOM tree, so we will be sending `Node` objects to `.make`
and retrieving them with `.made`.

Again -- the output is below the code on the right.

So, okay, maybe that was more work than writing HTML.
But the input was this:

```
A paragraph.

A paragraph that
has two lines.
```

And now we can have some fun and make our language a bit
better.
}

c {
+CODE firstactions.p6

+OUTPUT firstactions.p6
}
}

First let's do some basics like *bold* or `monospace`.

r {
c {
We break up our line into phrases, and our phrases into
characters.

+OUTPUT cute.p6

And use these new definitions to construct DOM nodes.
}

c {
+CODE cute.p6
}
}

I'm going to skip some of the boring stuff of building new nodes etc,
and instead fast forward to an interesting part: making nested tags with
lists of classes and ids, and assigning them to aliases.

r {
third=div.w3-col,m4,s12 {
Here's the source.

div.w3-black,w3-padding,w3-margin {
+INCLUDE how.mt
}

}

third {
Here's the rendered HTML.

+OUTPUT mt how.mt
}

third {
And here's the relevant part of the parser.

+CODE snippet.p6
}
}

Well, for more details, head on over to the
GitHub <repository:https://github.com/bduggan/markatu>.
There you can find the source, a test suite with lots
of examples, as well as `mt` -- a command line tool
for converting files from markatu into HTML.

h3: Conclusions

ul.w3-padding {

* Parsing and inventing languages can be fun.

* Perl 6 Grammars provide nice building blocks for experimenting with languages.

* Lightweight markup languages can be programmer-friendly.

}

^quine

Here is the source for this blog entry:

example {
+INCLUDE index.mt
}