Need help with zebu?
Click the “chat” button below for chat support from the developer who created it, or find similar developers for support.

About the developer

modernserf
422 Stars 4 Forks ISC License 192 Commits 1 Opened issues

Description

A compiler for little languages in tagged template strings

Services available

!
?

Need anything else?

Contributors list

No Data

Zebu

What is this?

Zebu is a JavaScript library for building little languages with tagged template literals.

Why would I want to do that?

When we work with code, we don't just care about performance; we also care about human-centric concerns like convenience, elegance, and readability. In most cases, the human-centric concerns take priority: any time you have chosen to use a framework over "vanilla JS", or a high-level language over a low-level one, you have implicitly chosen developer experience over performance. Everyone has different thresholds for when to make this trade-off, but in general we recognize that the programer's time is more valuable than the computer's.

While most programmers are comfortable writing code for other programmers to use, whether that's via the open source ecosystem or just in your project's

utils.js
file, very few are comfortable designing or contributing to a programming language itself. Most programmers have opinions about programming languages, and may even have ideas for features they'd like in a programming language, but few even consider that they could make these ideas a reality. Implementing a programming language seems like it belongs to the category of software that's beyond the reach of ordinary programmers, alongside databases and operating systems.

Implementing a general-purpose, high-performance programming language is, indeed, a lot of work. But there is a huge spectrum of possibility between "library that is pleasant to use" and "industrial strengtth programming language". Many interesting and useful languages are (relatively) simple and are more like features of a language than a language in and of themselves -- examples include regular expressions, DOM selectors and date format strings. (You may have invented a "little language" like this without even realizing it -- any function that takes a string and does something based on the contents of that string is, in some sense, an interpreter for a programming language.) These languages don't need to implement many of the features we take for granted in programming languages -- variables, functions, etc -- because all of that is already implemented by the "host" language; they can focus on doing a single specialized task using an appropriately specialized syntax.

Zebu is a toolkit for building these little languages that handles the boring and error-prone parts (ie. turning a string into structured data) so you can focus on providing a great developer experience.

How is Zebu different from similar tools?

Zebu is a parser generator that broadly resembles tools like Yacc, Bison or ANTLR. It works with LL(k) grammars, though like ANTLR it can also handle direct left recursion. Like the aforementioned tools, but unlike PEG and parser combinator libraries, Zebu has separate lexing and parsing phases.

The major difference between Zebu and other parser generators is that Zebu applies the principle of "convention over configuration" to parsing. Specifically, all languagees created with Zebu use these lexing rules:

  • whitespace (including newlines) and JavaScript-styled comments (
    //
    and
    /* */
    ) are ignored
  • numbers, strings, and identifiers are tokenized with the same syntactic rules as JavaScript

In other words: Zebu is not a general-purpose parser generator; languages created with Zebu will necessarily have a strong family resemblance to JavaScript, even if they are structured very differently. However, we believe what you give up in terms of expressivity are more than made up for by improved developer and user experience -- so many irrelevant but potentially confusing differences can be avoided completely,

How does Zebu work?

Let's walk through a simple example first, to cover the core principles, and then we can get to more useful examples. Here's how you create a language:

import { lang } from 'zebu';

const add = lang Expr = Expr "+" value : ${(left, _, right) => left + right} | value;;

And then you use the lanugage like this:

const result = add`1 + ${2} + 3`;
console.log(result); // 6

So what is actually happening here?

lang
is a function being used as a tagged template literal.
lang`
  Expr = Expr "+" value : ${(left, _, right) => left + right}
       | value;
`;
// is equivalent to
lang(
  ['\n  Expr = Expr "+" value : ', '\n       | value;\n'],
  (left, _, right) => left + right
);
`

Likewise:

add`1 + ${2} + 3`;
// is equivalent to
add(['1 + ', ' + 3'], 2);

Tokenizing

Now, what is happening inside

add
? (We'll cover what's happening in
lang
in the next section.) First, the strings and interpolations passed into
add
are turned into an array of tokens. This process interleaves the strings and interpolations back together, and strips out any whitespace or comments:
[
  { type: 'value', value: 1 },
  { type: 'literal', value: '+' },
  { type: 'value', value: 2 },
  { type: 'literal', value: '+' },
  { type: 'value', value: 3 },
];
  • value
    - Numbers (decimal, hexidecimal, octal or binary), quoted strings (single or double quote), and interpolated values (of any type) are wrapped in
    value
    tokens.
  • literal
    - Values that are explicitly enumerated in the definition of a language will be wrapped in
    literal
    tokens. For example, the definition of
    add
    includes a
    "+"
    , so
    +
    is matched as a
    literal
    token. These are typically be used for keywords, operators, or other 'punctuation' in programming languages.
  • identifier
    - Values that match the identifier rules for JavaScript, and which are not explicitly enumerated in the language's definition, are matched as
    identifier
    tokens.

Any text that is not ignored as whitespace or comments and does not match any of the above token types will throw an error. For example, in

add`1 - 2`
,
"-"
is not in
add
's definition and therefore is not identified as a literal, so this will throw.

Parsing

Now that we've converted the input into tokens, let's match the tokens to the grammar in our language definition. Let's look at that again:

lang`
  Expr = Expr "+" value : ${(left, _, right) => left + right}
       | value;
`;

In this grammar, we have a single rule, labelled

Expr
; more complex grammars can have many rules, separated by semicolons. The
Expr
rule has two branches, separated by the
|
operator. The first branch matches a sequence -- first, it recursively matches itself, then it matches the token
{ type: 'literal', value: '+' }
, and then it matches any token of the type
value
. The results of each of these are passed into the function to the right, the result of which becomes the overall result of the expression. The second branch matches a single
value
token, and the result is the value in that token.

When we apply this grammar to

add`1 + ${2} + 3`
, it parses as:
1     +   2     +    3
value                       : 1
Expr  "+" value             : 1 + 2
Expr            "+"  value  : (1 + 2) + 3

And returns the result

6
.

Unlike in regular expressions, parsing must match the whole input;

add`1 +`
or
add` + 2`
would both fail, as would
add`1 + + 2 3`
.

A more complicated example

TODO: something with a couple of rules, more language features, notes on technique (e.g. precedence climbing)

Operator grammars

TODO

Tag helper

Zebu language reference

Literals

A quoted string in a Zebu grammar matches the value of that string and returns that value.

const lit = lang`Main = "foo"`;

equal(litfoo, 'foo');

throws(() => litbar); throws(() => lit"foo");

Rules

Zebu grammars are composed from a list of rules, separated by semicolons. Rules do not have to be defined in any particular order (a rule can reference rules either above or below it), but the top rule is the rule for the whole grammar.

Identifiers in a grammar will match the rule with that name. Zebu will raise an error if there is no rule by that name.

const math = lang`
  Neg   = "-" Expr : ${(_, value) => -value}
        | Expr;
  Expr  = #( Neg )
        | value;
`;

equal(math123, 123); equal(math-123, -123); equal(math(123), 123); equal(math-(-(123)), 123);

value

The

value
keyword matches a number, a quoted string, or an interpolation, and returns that value.
const val = lang`Main = value`;

equal(val"hello, world!", 'hello, world!'); equal(val'string with \'escaped quotes\'', "string with 'escaped quotes'"); equal(val-123.45e6, -123.45e6); equal(val0xDEADBEEF, 0xdeadbeef);

equal(val${1}, 1); equal(val${'hello, world!'}, 'hello, world!'); const object = {}; equal(val${object}, object);

equal(val"foo${'bar'}baz", 'foobarbaz');

identifier

The

identifier
keyword matches a JavaScript identifier which is not used as a literal in the grammar, and returns that value.
const id = lang`
  Main = identifier;
  Reserved = "class" | "function" | "if" | "else";
`;

equal(idfoo, 'foo'); equal(id$bar, 'bar'); equal(id_0, '_0');

throws(() => idclass);

keyword

The

keyword
keyword (!) matches a literal value that is in the format of a JavaScript identifier and returns that value.
const kw = lang`
  Main = identifier "." (identifier | keyword) 
          : ${(left, _, right) => [left, right]};
  Reserved = "class" | "function" | "if" | "else";
`;

equal(idfoo.bar, ['foo', 'bar']); equal(idfoo.class, ['foo', 'class']);

throws(() => idclass.foo);

operator

The

operator
keyword matches a literal value that is not in the format of a JavaScript identifier. I have no idea why you would use this, but it is included for symmetry with
keyword
.

nil

The

nil
keyword matches nothing and returns
null
.
const opt = lang`Main = identifier | nil`;

equal(optfoo, 'foo'); equal(opt``, null);

include

The

include
keyword allows you to embed one grammar in another. If you use this, it will probably be for embedding an operator grammar.
const prog = lang`
  Program = Statement ** ";";
  Statement = "print" Expr : ${(_, expr) => console.log(expr)}
            | Expr;
  Expr = include ${op`
    left  "+"   : ${(l, r) => l + r}
          "-"   : ${(l, r) => l - r}
    left  "*"   : ${(l, r) => l * r}
          "/"   : ${(l, r) => l / r}
          "%"   : ${(l, r) => l % r}
    right "**"  : ${(l, r) => l ** r}
    pre   "-"   : ${x => -x}
    root BaseExpr
  `};
  BaseExpr = value;
`;

Sequence

A sequence of expressions followed by a colon and an interpolated function matches that sequence and passes the results of each expression into that function, returning the result.

const seq = lang`identifier "=" value : ${(name, _, value) => ({
  [name]: value,
})}`;
equal(seq`foo = 1`, { foo: 1 });
equal(seq`bar = "bar"`, { bar: 'bar' });

throws(() => seqfoo); throws(() => seqfoo = 1 bar); throws(() => seq1 = foo);

Alternation

The pipe character,

|
, like in regular expressions, is the alternation operator, and (Foo | Bar) matches either of the rules Foo or Bar.
const alts = lang`
  Main = "foo" | "bar" | value;
`;

equal(altsfoo, 'foo'); equal(altsbar, 'bar'); equal(alts123.45, 123.45);

A note about parsing strategy

The behavior of this operator happens to be one of the major differences between traditional CFG parsers (including Zebu) and PEG parsers. In Zebu, the order of branches doesn't matter -- the parser looks ahead at the next tokens and chooses the branch based on that. However, this means that each branch must not overlap; if they do, Zebu will raise a "first/first conflict" error. On the other hand, PEG parsers try each branch in order, and backtrack if one doesn't succeed. This means that branches can overlap, though more often than not overlapping branches are an indication of a bug, not a desirable feature.

Repetition

The operators

*
,
+
and
?
work similarly to how they work in regular expressions:
  • Expr*
    matches a sequence of 0 or more
    Expr
    s and returns an array,
  • Expr+
    matches a sequence of 1 or more
    Expr
    s and returns an array,
  • Expr?
    optionally matches
    Expr
    , and returns null if it doesn't match.
    Expr?
    is equivalent to
    (Expr | nil)
    .
const repeat0 = lang`value*`;
equal(repeat0``, []);
equal(repeat0`"foo"`, ['foo']);
equal(repeat0`"foo" "bar"`, ['foo', 'bar']);

const repeat1 = langvalue+; throws(() => repeat1``); equal(repeat1"foo", ['foo']); equal(repeat1"foo" "bar", ['foo', 'bar']);

const maybe = langvalue?; equal(maybe``, null); equal(maybe"foo", 'foo');

Parentheses, brackets and braces

Zebu includes syntactic sugar for matching expressions wrapped in punctuation:

  • #{ Expr }
    matches
    Expr
    wrapped in curly braces and returns the result of
    Expr
    , and is equivalent to
    ("{" Expr "}" : ${(_, result) => result})
  • #[ Expr ]
    as above, for square brackets
  • #( Expr )
    as above, for parentheses.

Separated sequences

The operators

++
and
**
are used for matching sequences with separators, e.g. function arguments separated by commas, or statements separated by semicolons, and return an array of the matched expression (ignoring the separators). The
++
operator matches one or more elements, while the
**
operator matches zero or more. Both allow optional trailing separators.
const sepBy0 = lang`value ** ","`;
equal(sepBy0``, []);
equal(sepBy0`1`, [1]);
equal(sepBy0`1,`, [1]);
equal(sepBy0`1, 2`, [1, 2]);

const sepBy1 = langvalue ++ ","; throws(() => sepBy1``); equal(sepBy11, [1]); equal(sepBy11,, [1]); equal(sepBy11, 2, [1, 2]);

If you explicitly do not want trailing separators to be valid, use something like:

const noTrailing1 = lang`
  Main = value Rest* : ${(first, rest) => [first, ...rest]};
  Rest = "," value : ${(_, value) => value};
`;

throws(() => noTrailing1``); equal(noTrailing11, [1]); throws(() => noTrailing11,); equal(noTrailing11, 2, [1, 2]);

We use cookies. If you continue to browse the site, you agree to the use of cookies. For more information on our use of cookies please see our Privacy Policy.