PPXLib is not just for transforming code

Authors
  • Chris Armstrong

If you're familiar with OCaml development, you will have encountered PPXs: little plugins you can add to your project to generate or transform your code in some way, usually by adding attributes next to the target type or module. They're an essential part of OCaml development, commonly used when you need to generate serialisers or pretty-printers[1].

Most PPXs are written with ppxlib, which simplifies the targeting of attributes in the AST[2] and smooths over differences between compiler versions.

I personally hadn't thought of using ppxlib, as I thought ppx creation was meant for advanced library authors. I've been working on smaws, an OCaml-based SDK for AWS which involves a lot of code generation.

smaws's code generation pipeline is:

  1. read AWS service definitions from Smithy JSON files
  2. build up an internal representation
  3. emit the OCaml code as strings and write them to a file

However, someone recently pointed out to me that I could use ppxlib for the code generation part. The idea intrigued me, but I was focused on other aspects at the time and considered it a bit of a distraction. Circumstances later brought me back to the codebase, and I thought I'd give it a go.

Why use ppxlib for code generation instead of just emitting strings?

Emitting code as strings is relatively easy to get started with and works well: you can review the output by hand, and validate that it is syntactically correct with the compiler. The OCaml compiler is fast too, so the feedback loop is quick.

However, as you start to break your code generation down into multiple functions and combine them, it's easy to lose track of what you're generating and introduce bugs. As with many problems in untyped languages, little mistakes go unnoticed until runtime (here, until the generated file is compiled), and you can waste a lot of time debugging.
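To make that failure mode concrete, here is a small sketch of a string-based emitter. The function names and the sample input are hypothetical and are not taken from smaws. Both functions have the same type, so the compiler cannot tell the buggy one from the correct one; the mistake only shows up when the generated file is compiled.

```ocaml
(* Hypothetical string-based emitters, for illustration only *)

(* Naive: pastes the value between quotes without escaping it *)
let emit_string_binding_naive name value =
  Printf.sprintf "let %s = \"%s\"" name value

(* Safer: %S prints the value as an escaped OCaml string literal *)
let emit_string_binding name value =
  Printf.sprintf "let %s = %S" name value

let () =
  let doc = "the \"default\" region" in
  (* the embedded quotes end the literal early, so this line is not valid OCaml *)
  print_endline (emit_string_binding_naive "description" doc);
  (* the embedded quotes are escaped, so this line is valid OCaml *)
  print_endline (emit_string_binding "description" doc)
```

Every value pasted into a template needs this kind of care (escaping, parenthesising, keyword clashes), and each omission is a latent bug of the same shape.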

With ppxlib, you can generate an AST for OCaml code directly. This means you can use the OCaml compiler to validate not just your final output, but also each portion of code you emit, because the AST itself is strongly typed and limits you (mostly) to generating valid OCaml code.

It saves you from needing to write a lot of tests to validate the strings you emit, and you can be more confident as you refactor the code that you're not breaking anything.

Setting up a project for code generation

Like when emitting strings, there are three major parts to your generation pipeline:

  1. Building the program that generates the AST and writes it to a file
  2. Running the AST generator
  3. Compiling the generated code with the rest of your project

Setting this up is much easier with dune, the predominant OCaml build system, so we'll use that.

Firstly, a rule for compiling your generator:

(executable
 (name generate_code)
 (libraries ppxlib)
 (preprocess (pps ppxlib.metaquot)))

Then, a rule for running your generator:

(rule
 (mode promote)
 ; :gen is aliased to the name of the generator executable
 (deps
  (:gen generate_code.exe))
 (targets
  output_file.ml)
 (action
  (run
   %{gen}
   -output
   output_file.ml)))

The above rule runs generate_code.exe with -output output_file.ml to specify the target.

Note that we use (mode promote) - this ensures that output_file.ml will be copied into our source tree. This is helpful for us to inspect the file output. It's also useful if you want to distribute your generated code as a library, as it will be included in the release tarball by default.

Finally, a rule for compiling the generated code:

(library
 (name my_generated_lib)
 (modules
  output_file))

Writing the generator

The generator itself needs to be able to create an AST and write it to a file. Here's a simple example that we'll break down:

open Ppxlib

(* Alias the Ast_builder module for less typing *)
module B = Ast_builder.Make(struct let loc = Location.none end)

(* Generate a simple let binding (the formatter argument is unused here,
   hence the underscore) *)
let generate_code _fmt =
  (* an expression of a constant integer "6" *)
  let expr = B.pexp_constant (Pconst_integer ("6", None)) in
  (* a pattern (the left hand side of a let binding) with a variable called "my_var" *)
  let pattern = B.ppat_var (B.Located.mk "my_var") in
  (* the let value binding i.e. my_var = 6 *)
  let binding = B.value_binding ~pat:pattern ~expr in
  (* the full let structure item i.e. let my_var = 6 *)
  let let_expr = B.pstr_value Nonrecursive [ binding ] in
  [let_expr]

let () =
  (* the dune rule runs this program as: generate_code.exe -output <file> *)
  let filename =
    match Sys.argv with
    | [| _; "-output"; filename |] -> filename
    | _ ->
      prerr_endline "usage: generate_code -output <file>";
      exit 2
  in
  let oc = open_out filename in
  let fmt = Format.formatter_of_out_channel oc in
  let structure_items = generate_code fmt in
  Ppxlib_ast.Pprintast.structure fmt structure_items;
  Format.pp_print_flush fmt ();
  close_out oc

This code generates a simple let binding let my_var = 6 and writes it to a file specified on the command line.

The first key take-away from the above is the idea of a structure_item, which is basically any top-level construct such as a let-binding, type definition, module definition, class definition, etc. A list of structure_items is a structure, which is what is expected by the Pprintast.structure function for printing the elements of your program.
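As a minimal sketch of that idea, the snippet below builds a structure from two structure items and prints it to standard output. The helper int_binding and the constant names are hypothetical; B.pvar and B.eint are Ast_builder shorthands for a variable pattern and an integer constant expression.

```ocaml
open Ppxlib

(* Same Ast_builder alias as in the generator above *)
module B = Ast_builder.Make (struct
  let loc = Location.none
end)

(* Hypothetical helper: one top-level `let <name> = <value>` binding *)
let int_binding name value : structure_item =
  B.pstr_value Nonrecursive
    [ B.value_binding ~pat:(B.pvar name) ~expr:(B.eint value) ]

(* A structure is just a list of structure items *)
let generated : structure =
  [ int_binding "min_retries" 1; int_binding "max_retries" 5 ]

(* Print the structure as OCaml source *)
let () = Format.printf "%a@." Pprintast.structure generated
```

Because a structure is an ordinary list, the usual list functions (List.map, List.concat and so on) are enough to assemble the output of several smaller generators.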

Simplifying AST construction with metaquot

This was a fairly verbose way of generating a simple let binding. Thankfully, for simple cases like this we can use a PPX in our generator called ppxlib.metaquot, which lets us write the OCaml code we want to generate directly in the generator's source.

Here's the same code using ppxlib.metaquot:

(* declare a code location variable (`loc` needs to be in scope for metaquot,
   but because we are not writing a transforming ppx, we don't need meaningful
   values for it) *)
let loc = Location.none

(* the formatter argument is unused here, hence the underscore *)
let generate_code _fmt =
  (* a single structure_item for the let binding *)
  let let_expr = [%stri
    let my_var = 6
  ] in
  [let_expr]

The [%stri ...] is used to generate a single structure item. There are other metaquot variants as well:

  • [%str ] for generating a list of structure items (a 'structure')
  • [%expr ] for generating an expression e.g. [%expr List.map (fun x -> x + 1) [1;2;3]]
  • [%pat? ] for generating a pattern (i.e. something on the left-hand side of a let binding or match case) e.g. [%pat? Some x] (a destructured option) or [%pat? my_var] (a variable name). Note the ? before the pattern, which OCaml's extension syntax requires for pattern payloads.
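The following sketch uses each of these quotations once and prints the results. It assumes the file is preprocessed with ppxlib.metaquot, as in the executable stanza shown earlier. It also uses [%type: ...], the metaquot quotation for type expressions, which is not in the list above.

```ocaml
open Ppxlib

(* metaquot expects a `loc` value to be in scope *)
let loc = Location.none

(* [%expr ...] quotes an expression *)
let mapped : expression = [%expr List.map (fun x -> x + 1) [ 1; 2; 3 ]]

(* [%pat? ...] quotes a pattern; note the `?` before the payload *)
let some_x : pattern = [%pat? Some x]

(* [%type: ...] quotes a type expression; note the `:` before the payload *)
let int_list : core_type = [%type: int list]

(* [%str ...] quotes several structure items at once *)
let items : structure =
  [%str
    let my_var = 6
    let my_other_var = my_var + 1]

(* Print each quoted fragment as OCaml source *)
let () =
  Format.printf "%a@.%a@.%a@.%a@." Pprintast.expression mapped
    Pprintast.pattern some_x Pprintast.core_type int_list
    Pprintast.structure items
```

The type annotations are optional; they are there to show which Parsetree type each quotation produces.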

Combining metaquot with Ast_builder

Usually we use a combination of ppxlib.metaquot and the Ast_builder module to generate code, as metaquot can really only work with static expressions. However, we can combine the two so that we can insert Ast_builder generated nodes into our metaquot expressions.

Let's say our constant from above needs to be determined by some input to our generator. We don't have to generate the whole let binding in Ast_builder if we use [%e <var_name>] to insert the dynamic <var_name> into the metaquot expression:


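(* int_value is the integer literal as a string, e.g. "6";
   the formatter argument is unused here, hence the underscore *)
let generate_code _fmt int_value =
  let int_expr = B.pexp_constant (Pconst_integer (int_value, None)) in
  let let_expr = [%stri
    let my_var = [%e int_expr]
  ] in
  [let_expr]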

[%e ] is used to insert an expression as an anti-quotation. There are also other anti-quotations:

  • [%p ...] for patterns
  • [%t ...] for types
  • [%m ...] for modules
  • [%%i ...] for structure items themselves (note double-%)

(see Anti-Quotations in the ppxlib documentation for more details and examples).
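Here is a sketch that combines three of these anti-quotations to generate one typed constant per entry in a list. The helper typed_binding and the constants list are hypothetical stand-ins for whatever data drives your generator, and the file is assumed to be preprocessed with ppxlib.metaquot.

```ocaml
open Ppxlib

let loc = Location.none

module B = Ast_builder.Make (struct
  let loc = loc
end)

(* Hypothetical helper: emit `let (<name> : <typ>) = <value>` *)
let typed_binding ~name ~(typ : core_type) ~(value : expression) :
    structure_item =
  [%stri let ([%p B.pvar name] : [%t typ]) = [%e value]]

(* Hypothetical input data: constant names and values *)
let constants = [ ("default_port", 443); ("max_retries", 5) ]

(* One structure item per input entry *)
let generated : structure =
  List.map
    (fun (name, value) ->
      typed_binding ~name ~typ:[%type: int] ~value:(B.eint value))
    constants

let () = Format.printf "%a@." Pprintast.structure generated
```

The static shape of the binding stays readable as quoted OCaml, while only the parts that vary (name, type and value) are built with Ast_builder.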

Working out what AST to generate

OCaml's AST (internally called Parsetree) is quite dense, with lots of nested nodes and metadata. Although the Ast_builder has helper functions for common constructions, much of it assumes familiarity with the AST itself.

AST basics

It's first worth learning the AST's terminology and naming conventions, because there are numerous abbreviations and terms which aren't always the same as what is normally used when describing OCaml[3]:

| Part | Description | Examples | Parsetree constructor |
|---|---|---|---|
| pattern | the left-hand side of a let binding or match clause, or a function argument | my_var in let my_var = 6, or Some (x) in match something with Some (x) -> ... | Ppat_* e.g. Ppat_var for variables or Ppat_tuple for tuples |
| expression | the right-hand side of a let binding | 6 + 7 in let x = 6 + 7 | Pexp_* e.g. Pexp_function for a function or Pexp_try for a try ... with expression |
| structure item | a top-level component of the AST in an implementation (.ml) file | top-level let binding, module declaration, class declaration, ... | Pstr_* e.g. Pstr_module for a module definition, Pstr_attribute for an attribute |
| signature item | a top-level component of the AST in a module signature or interface (.mli) file | function signature, module signature | Psig_* e.g. Psig_class for a class |
| type declaration | a type declaration (specifically the left-hand side); the type expression is added as the "manifest" of the type declaration | type x = ... | Ptype_* |
| type expression | the expression describing a type (the right-hand side of a type declaration, added as the manifest) | int * string, a -> b, < field1: int; field2: string; ... > | Ptyp_* e.g. Ptyp_object for an object type, Ptyp_constr for a named type such as int or string list |
| module expression | a module expression (i.e. the anonymous struct ... end bit) | struct ... end | Pmod_* e.g. Pmod_structure |
| constant | a string or integer constant expression | 1 or "hello" | Pconst_* |
| function parameter | the parameters of a function implementation | arg1 and arg2 in fun arg1 arg2 -> ... | Pparam_val |
| function body | the expressions or case matches that make up a function body | let x = 6 in x + 2, or function Some x -> x \| None -> 2 | Pfunction_* |

You can find the Ast_builder function for a node from its Parsetree constructor: it usually has the same name in lowercase, e.g. Ast_builder.pexp_constant generates an expression with Parsetree.Pexp_constant (an expression that holds a constant value).

All of the Parsetree types are documented alongside their equivalent syntax in parsetree.mli.
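As a sketch of that naming convention, the snippet below builds the declaration type bucket_name = string (a made-up example type). Each Ast_builder call is named after the Parsetree constructor it produces, and the type expression is attached as the manifest of the type declaration.

```ocaml
open Ppxlib

module B = Ast_builder.Make (struct
  let loc = Location.none
end)

(* Ptyp_constr -> B.ptyp_constr: a named type with no arguments, here `string` *)
let string_type : core_type = B.ptyp_constr (B.Located.lident "string") []

(* The declaration `type bucket_name = string`; the right-hand side is the
   manifest, and the kind is Ptype_abstract because there is no variant or
   record definition *)
let bucket_name_decl : type_declaration =
  B.type_declaration
    ~name:(B.Located.mk "bucket_name")
    ~params:[] ~cstrs:[] ~kind:Ptype_abstract ~private_:Public
    ~manifest:(Some string_type)

(* Pstr_type -> B.pstr_type: wrap the declaration in a structure item *)
let item : structure_item = B.pstr_type Recursive [ bucket_name_decl ]

let () = Format.printf "%a@." Pprintast.structure [ item ]
```

The labelled arguments of B.type_declaration mirror the fields of the type_declaration record in parsetree.mli, so the record definition there doubles as documentation for the builder.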

Using the compiler to generate the AST

If you're unsure what the Parsetree equivalent is, you can get the compiler to generate it with the -dparsetree option. For example:

my_test_file.ml


module Test = struct
  let x = 9
  type yes_no = Yes | No of string
end

let result = Test.x + 8

$ ocamlc -dparsetree my_test_file.ml
[
  structure_item (my_test_file.ml[3,2+0]..[6,70+3])
    Pstr_module
    "Test" (my_test_file.ml[3,2+7]..[3,2+11])
      module_expr (my_test_file.ml[3,2+14]..[6,70+3])
        Pmod_structure
        [
  ...

(the output can be rather verbose for even short constructions, so consider breaking it up into smaller components)
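If you would rather stay inside OCaml, a similar check can be sketched with ppxlib's own parser: Parse.implementation turns source text into a structure, which you can then pattern match on. The helper constructor_name is hypothetical and only names a few cases.

```ocaml
open Ppxlib

(* Parse a source snippet into a structure (a list of structure items) *)
let parsed : structure =
  Parse.implementation
    (Lexing.from_string "let x = 9\ntype yes_no = Yes | No of string")

(* Name the Parsetree constructor of a structure item (a few cases only) *)
let constructor_name (item : structure_item) =
  match item.pstr_desc with
  | Pstr_value _ -> "Pstr_value"
  | Pstr_type _ -> "Pstr_type"
  | Pstr_module _ -> "Pstr_module"
  | _ -> "other"

(* Print one constructor name per top-level item *)
let () = List.iter (fun item -> print_endline (constructor_name item)) parsed
```

This is far less detailed than -dparsetree, but it can be handy in a test that checks the shape of what a generator produces.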

There is also the dumpast tool in the ppx_tools package, but it appears to be compatible only with OCaml < 5.1.0.

Using LLMs to write AST

Once you have a good understanding of the terminology, it becomes much easier to prompt LLMs for advice on common node constructions. Although OCaml is less common in training data (so hallucinations are more likely), ppxlib has been around for a long time and there are numerous libraries that use it.

Footnotes

  1. PPXes emit code in the context of, and with reference to, other OCaml code, mostly using the type information to generate related functions such as serialisers / deserialisers for JSON or XML, or pretty-printers (as OCaml lacks enough runtime type information to do this dynamically).

  2. Attributes are used to target a particular PPX at a particular type or module instead of scanning and transforming the whole AST.

  3. I've omitted several of the Parsetree types relating to classes, objects and extension points; refer to parsetree.mli for all of the parse tree options.