MinCaml compiler project | Ouassim Boulkamh

A compiler project in Java with parsing, type checking, normalization, optimizations, ASML, and ARM-oriented backend work.

A compiler is the clearest possible demonstration that software has precise semantics. Every transformation either preserves what the program means or it does not, and there is no ambiguity in telling the difference — run the tests. The MinCaml compiler project forced that discipline across ten or so pipeline stages, implemented in Java 17 using JFlex for lexing and CUP for parsing, built and tested with Maven.

MinCaml is a minimal ML-family functional language. The goal was not to produce a competitive compiler, but to implement the full pipeline from source text to ARM assembly, understanding each stage well enough to explain what invariant each pass maintains.

The pipeline as a series of transformations

The compiler is invoked as ./mincamlc [options] <source file>, where each option activates one additional pass:

-p   print the AST
-t   type checking
-k   K-normalization
-a   alpha conversion
-b   beta reduction
-l   let reduction
-c   constant folding
-e   eliminate unnecessary definitions
-cc  closure conversion
-asml  generate ASML output

Each flag is a checkpoint. Running -p prints the parse tree and stops. Running -t runs the type checker and reports type errors. Running all flags together produces ARM-targeted ASML. This incremental design made debugging tractable — when something broke, the flag that introduced the problem identified which pass was responsible.

Parsing: JFlex and CUP

JFlex generates the lexer from a regular grammar specification. It produces a Java class that tokenizes the input stream, consuming characters and emitting tokens like LET, IN, IF, FUN, ADD, integer literals, and identifiers.

CUP generates the parser from a context-free grammar. The grammar describes the syntax of MinCaml expressions: let bindings, function application, arithmetic, conditionals, and tuples. CUP produces an LALR(1) parser, which means it uses one token of lookahead to decide which production rule to apply. Shift/reduce and reduce/reduce conflicts in the grammar are a sign that the grammar is ambiguous in ways LALR(1) cannot handle — resolving them required careful attention to operator precedence and associativity declarations.

The output of parsing is an AST. Every node in the tree corresponds to a syntactic construct: LetNode, IfNode, AppNode (function application), BinOpNode, and so on. The -p flag dumps this tree.

Type checking: Hindley-Milner

Type checking in MinCaml uses Hindley-Milner type inference, which allows the compiler to infer the types of expressions without explicit type annotations. The implementation splits into two classes: GenEquations and Unifier.

GenEquations walks the AST and generates type equations. For a let x = e1 in e2 expression, it generates the equation type(x) = type(e1) and adds the binding to the type environment before inferring the type of e2. For a function application f a, it generates type(f) = type(a) -> τ for a fresh type variable τ, which is the return type of the application. This produces a set of constraints, each asserting that two type terms must be equal.

Unifier solves those constraints by unification. The algorithm processes constraints one at a time. If two type terms are both type constructors (like int or bool), they must be identical. If one side is a type variable, substitute it throughout the remaining constraints. If two function types a -> b and c -> d must be equal, add constraints a = c and b = d and continue. The algorithm fails — reporting a type error — if it ever needs to make a type variable equal to a term that contains it (occurs check), which would produce an infinite type.

The result is a substitution mapping type variables to their inferred types. Applying this substitution to any node’s inferred type gives the concrete type for that expression.

K-normalization

K-normalization converts the AST into a form where every subexpression that computes a value is explicitly named. The pass in KNormalizer.java ensures that function arguments and operator operands are always variable references, never inline expressions. An expression like f (g x + h y) becomes:

let t1 = g x in
let t2 = h y in
let t3 = t1 + t2 in
f t3

This is called K-normal form (after continuation-passing style’s K combinator). It makes subsequent passes simpler because every operation’s operands are already in a simple form — no need to recursively handle nested expressions inside optimizations.

Alpha conversion

AlphaConversionVisitor.java gives every variable binding a globally unique name. In MinCaml, the same identifier can be rebound in nested scopes. Alpha conversion distinguishes them:

let x = 1 in
let x = x + 1 in  (* same name, different scope *)
x

becomes:

let x_0 = 1 in
let x_1 = x_0 + 1 in
x_1

This is essential for subsequent optimizations. Beta reduction, for example, needs to substitute variables safely without worrying that a substituted name will accidentally capture a different binding in the target scope.

Optimizations: beta reduction, let reduction, constant folding, dead code elimination

Beta reduction (BetaReducer.java) inlines function calls where the function is immediately applied to arguments. (fun x -> x + 1) 5 becomes 5 + 1. This eliminates function call overhead and creates new opportunities for constant folding.

Let reduction (LetReduction.java) simplifies let-bindings where the bound variable is used only once or where its value is a trivial expression. let x = y in x becomes just y.

Constant folding (ConstFold.java) evaluates expressions whose operands are compile-time constants. 3 + 4 becomes 7. if true then e1 else e2 becomes e1. This runs after beta reduction and let reduction because those passes often create new constant expressions that were not visible in the original code.

Unnecessary definition elimination (ElimUnnecessaryDef.java) removes let-bindings whose bound variable is never used. After the previous optimizations, some bindings become dead. This pass finds them by traversing the expression and noting which variables appear in the body, then removing bindings for variables that do not.

These four passes interact. Constant folding after beta reduction is more effective than either alone. The order matters, and running multiple rounds would potentially find more opportunities — the implementation runs each pass once.

Closure conversion

ClosureConversion.java is the most structurally significant transformation. MinCaml allows functions to capture variables from their enclosing scope. The closure conversion pass makes those captured variables explicit so that functions can be represented as simple function pointers at the assembly level.

Before closure conversion, a function like:

let add_n = fun x -> x + n in
add_n 5

captures n implicitly. After closure conversion, the function is paired with an explicit record of its free variables:

let add_n = <fun x env -> x + env.n, {n = n}> in
add_n 5

The function pointer now takes an explicit environment argument alongside its regular parameters. At the call site, the environment is extracted from the closure pair and passed through.

ASML generation

ASMLGeneratorVisitor.java converts the closure-converted intermediate representation to ASML (Abstract Stack Machine Language), an intermediate representation targeting a register machine with explicit stack operations. The ASML node hierarchy mirrors the structure of the IR: ASMLLet, ASMLLetFun, ASMLIf/ASMLCond, ASMLCall, ASMLCallClosure, ASMLArith, ASMLNeg, ASMLVar, ASMLInt, ASMLAssign, ASMLNew, ASMLMem, ASMLStore. Each node type captures one kind of ASML instruction.

The known bugs here are honest: Array, Get, and Put for array operations were not implemented. The ASML generator also becomes less reliable for complex programs with many nested closures. These are real gaps, documented in the README:

Implémentation de Array, Get (pour les array), Put (pour les array). Amélioration de la génération de l’ASML pour des programmes plus complexes.

Register allocation

The backend package contains a full implementation of the Iterated Register Coalescing algorithm from George and Appel (1996). This is one of the most algorithmically substantial parts of the project.

The input to register allocation is a set of ASML Instruction objects, each annotated with def (registers written) and use (registers read) sets. CFGBuilder.java constructs the control flow graph of BasicBlock nodes, each with a successor set.

Liveness analysis runs backward over the CFG until fixed point:

do {
    changed = false;
    for (int i = nodes.size() - 1; i >= 0; i--) {
        BasicBlock node = nodes.get(i);
        Set<Node> newOut = new HashSet<>();
        for (BasicBlock successor : node.successors) newOut.addAll(successor.in);
        Set<Node> newIn = new HashSet<>(node.use);
        Set<Node> tempOut = new HashSet<>(newOut);
        tempOut.removeAll(node.def);
        newIn.addAll(tempOut);
        if (!newOut.equals(node.out) || !newIn.equals(node.in)) { changed = true; ... }
    }
} while (changed);

The equations are standard: out[b] = ∪ in[s] for all successors s; in[b] = use[b] ∪ (out[b] - def[b]). The iteration stops when neither set changes for any block.

RegisterAllocator.java then implements the full Chaitin-Briggs pipeline. The allocator is parameterized by K, the number of available physical registers. After the Build phase constructs the interference graph (adding an edge between every pair of simultaneously live variables), MakeWorklist partitions nodes into three categories: spill (degree ≥ K), freeze (move-related, degree < K), and simplify (not move-related, degree < K).

The main loop iterates: simplify low-degree nodes onto the select stack; coalesce move-related nodes if they pass the Conservative test; freeze move-related nodes that cannot be coalesced; and select a spill candidate from high-degree nodes. Coalescing uses the Briggs criterion: two non-precolored nodes can be merged if the combined node would have fewer than K neighbors with degree ≥ K. For precolored nodes, George’s criterion applies instead: every neighbor of the non-precolored node must be either low-degree, precolored, or already adjacent to the precolored node.

} else if ((u.status == NodeStatus.PRECOLORED && AdjOK(v, u))
    || (u.status != NodeStatus.PRECOLORED
        && Conservative(union(Adjacent(u), Adjacent(v))))) {
    coalescedMoves.add(m); Combine(u, v); AddWorkList(u);

AssignColors then processes the select stack in reverse order. A node gets any color not used by its already-colored neighbors. If no color is available, it is a real spill and must be allocated to memory. The allocator then replays from the top with spilled nodes rewritten as load/store pairs. TestRegisterAllocator.java provides unit tests for the liveness and allocation logic.

The ARM assembly generator (ARMAssemblyGenerator.java) takes the colored interference graph and emits ARM instructions, using the register assignment from the allocator to map virtual registers to physical ARM registers.

Why this matters for security work

Compilers are relevant to security in two directions. The first is static analysis. The techniques used here — AST traversal, constraint generation, data flow reasoning, fixed-point iteration over control flow graphs — are the same techniques used in static analysis tools for finding bugs in code. Understanding how a compiler transforms programs makes it easier to understand what a static analyzer sees and why it produces false positives or misses real bugs.

The second direction is secure development. Implementing a multi-pass system where each pass must preserve program semantics is a strong exercise in invariant thinking. What does this function guarantee about its output? What does the caller of this function get to assume? Those questions are the same ones you ask when reviewing code for security properties, and the habit of asking them precisely does not come from reading about it.

The test suite matters here too. Some modules had full test coverage that caught regressions when a later optimization broke an earlier invariant. Others had minimal tests, which made those passes a source of uncertainty. The lesson was not that tests are important — that is obvious — but that the absence of tests for a specific transformation is a specific and quantifiable risk, not a vague concern.