Perl 5's compiled optree serves simultaneously as AST, IR, and execution format. This coupling prevents global optimization -- the peephole optimizer can only see adjacent ops, data flow is implicit in stack operations, and there is no opportunity for dead code elimination, common subexpression elimination, or loop-invariant code motion across basic blocks.
The Chalk compiler project has demonstrated that Perl semantics can be represented in a Sea of Nodes intermediate representation with explicit data flow edges, proper SSA-style control flow (If, Region, Loop, Phi), and hash consing for structural deduplication.
This design describes SoN, a CPAN distribution that provides:
- A standalone Sea of Nodes IR library suitable for representing Perl computations
- An optree-to-SoN translator that walks perl5's compiled optree and produces SoN graphs
- Text rendering and structural comparison tools for testing
The immediate use case is validating Chalk's compiler output against perl5's own compilation of the same source code. The future use case is feeding SoN graphs into optimization passes and an LLVM backend for native code generation.
Following Graal and libFIRM, each Perl operation (add, subtract, concat, method call, etc.) gets its own node class. Type information is carried as metadata ("stamps") on node edges, not baked into node class names.
This was chosen over:
-
Per-category nodes (e.g., one BinaryOp class with an
opfield) -- production SoN implementations don't do this. Per-operation classes enable optimization passes to pattern-match on node type directly. -
Per-operation-x-type nodes (e.g., IntAdd, FloatAdd as separate classes, like HotSpot C2) -- Perl's type lattice is semantic (Int, Num, Str, Scalar), not machine types (i32, i64, f32, f64). Baking the lattice into class names would be awkward and would require changing the class hierarchy as the type system evolves.
The type lattice on SoN edges follows the formal definition from https://pvm.tools/papers/perl-types-formal.html -- a characterization of Perl's latent dynamic type system derived from operational semantics.
Key properties:
- A value belongs to a type if it survives coercion without losing information (syntactic preservation) AND satisfies all operational contracts (semantic fulfillment)
- Subtyping: Int < Num < Str < Scalar
- DualVars are in Scalar but not in Str or Num (correctly captures that optimization is unsafe for them)
- This describes what Perl values already are, not an imposed external type system
Stamps are initially Unknown and refined by type inference passes.
SoN and Chalk are parallel implementations of the same IR design. Chalk does not depend on the SoN dist. The shared contract is the node vocabulary and graph structure, not Perl code.
Chalk produces SoN graphs by parsing Perl source through its Earley parser. SoN::FromOptree produces SoN graphs by walking perl5's compiled optree. Comparing the two validates that Chalk's frontend produces correct graphs.
Future optimization passes and the LLVM target will consume SoN graphs regardless of which frontend produced them.
The distribution ships on CPAN with no core modifications required. The
optree walker uses B:: for most introspection. A thin XS component
exposes PadnameFIELDINFO (field index, field stash, param name) which
B:: does not currently provide. This is needed for feature class field
access nodes.
Every node has:
id-- stable content-based identifier (for hash-consed data nodes) or sequential unique ID (for CFG nodes)inputs-- ordered list of producer nodes (data flow edges)consumers-- reverse edges (use-def chains), maintained automaticallystamp-- type lattice position on the node's output
- Start -- entry point of the computation graph
- Return -- exit point, carries the return value
- If -- conditional branch, produces two control outputs
- Proj -- projects one output from a multi-output node (true/false from If)
- Region -- control flow merge point
- Loop -- loop header with entry and backedge control inputs
- Constant -- compile-time constant with value and stamp
- Phi -- value selection at Region/Loop merge points
Arithmetic: Add, Subtract, Multiply, Divide, Modulo, Power, Negate String: Concat, Repeat, Substr, Index, Length Comparison: NumEq, NumLt, NumGt, NumLe, NumGe, NumNe, NumCmp, StrEq, StrLt, StrGt, StrLe, StrGe, StrNe, StrCmp Logical: And, Or, Not, Defined Bitwise: BitAnd, BitOr, BitXor, LeftShift, RightShift, Complement Assignment: Assign, CompoundAssign Call: Call (with metadata for dispatch kind: direct, method, builtin)
- PadAccess -- lexical variable (pad index, statically known)
- FieldAccess -- class field (field index + class stash, from PadnameFIELDINFO)
- StashAccess -- package variable (stash name + GV name, dynamic hash lookup)
Singleton factory implementing hash consing:
- Data nodes cached by content hash:
"Operation|input_id_1|...|attr=val" - CFG nodes bypass cache, receive sequential unique IDs
- On creation, registers bidirectional use-def edges (producer.consumers += consumer, consumer.inputs += producer)
Walks the optree in execution order (following op_next), maintaining:
virtual_stack-- array of SoN node referencesmark_stack-- PUSHMARK positions for LISTOP argument collectionscope-- immutable variable bindings (pad index -> SoN node), copy-on-write for branchingcontrol-- current CFG nodevisited-- op addresses already processed (merge point detection)
For each op:
- Consult OpMap for stack effect and SoN node type
- Pop inputs from virtual_stack
- Create SoN node via NodeFactory
- Push output onto virtual_stack (if any)
Special handling:
- LOGOP (and, or, cond_expr): snapshot state, walk both paths (op_next and op_other), detect merge point (op address in visited), create Region + Phi for any scope variables that differ
- LOOP (enterloop): create Loop CFG node, install Phi sentinels for in-scope variables, walk body, resolve sentinels with backedge values
- Skip ops: OP_NULL, OP_PUSHMARK (record mark position only), OP_ENTER, OP_LEAVE, OP_NEXTSTATE -- interpreter bookkeeping not represented in SoN
Mapping table from perl opcodes to SoN behavior:
- Stack pop count (fixed or mark-relative)
- SoN node type to create
- Stack push count (0 or 1)
- CFG flag (requires branch/loop handling)
- Skip flag (interpreter bookkeeping, not represented)
Generated from regen/opcodes or hand-maintained. Starts with the
subset Chalk handles, expands incrementally.
use SoN::FromOptree;
my $graph = SoN::FromOptree->translate(\&Some::function);Takes a code reference, obtains the B::CV via B::svref_2object,
walks from $cv->START, returns a SoN::IR::Graph.
Deterministic text format for testing and debugging:
%0 = Start
%1 = Constant(42) [Int]
%2 = PadAccess(targ: 1, name: '$x') [Unknown]
%3 = Add(%1, %2) [Unknown]
%4 = Return(%0, %3)
Node IDs assigned by topological sort for deterministic output. Hash-consed nodes produce identical IDs regardless of frontend.
Takes two SoN::IR::Graph objects, produces a diff:
- Nodes present in one graph but not the other
- Nodes with same operation but different inputs
- Stamp disagreements
Initially strict structural identity. Will loosen to semantic equivalence as optimization passes cause the two frontends to diverge.
Test usage:
my $chalk_graph = Chalk::compile($source);
my $perl_graph = SoN::FromOptree->translate(\&compiled_sub);
my $diff = SoN::Compare->diff($chalk_graph, $perl_graph);
ok($diff->is_empty, 'Chalk matches perl5') or diag($diff->to_text);Exposes two functions that B:: does not provide:
SoN::FieldInfo::is_field($padname)-- returns whether this padname is a class fieldSoN::FieldInfo::field_info($padname)-- returns (fieldix, fieldstash, paramname, default_flags) for class fields
Implemented against struct padname_fieldinfo from pad.h. Only XS in
the distribution.
SoN/
lib/
SoN/
IR/
Node.pm
NodeFactory.pm
Graph.pm
Stamp.pm
Node/
Start.pm, Return.pm, Region.pm, If.pm, Proj.pm, Loop.pm
Phi.pm, Constant.pm
Add.pm, Subtract.pm, Multiply.pm, ... (~50-70 operation nodes)
PadAccess.pm, FieldAccess.pm, StashAccess.pm
Call.pm
FromOptree.pm
FromOptree/
StackSim.pm
OpMap.pm
Render/
Text.pm
Compare.pm
xs/
FieldInfo.xs
t/
cpanfile
Requires perl 5.42.0+.
This design establishes the SoN IR as the shared contract between frontends (Chalk's parser, perl5's optree) and backends (optimization passes, LLVM target). The immediate deliverable is the optree translator and comparison tool. Future work:
- Optimization passes operating on SoN::IR graphs -- DCE, CSE, LICM, strength reduction. Shared between Chalk and perl5 pipelines.
- Type inference using the formal lattice to refine stamps from Unknown to specific types, enabling coercion elimination.
- LLVM target lowering typed SoN graphs to LLVM IR for native code generation. Shared between both pipelines.
The pipeline:
Chalk parser ------> SoN::IR --> Optimization --> LLVM --> machine code
^
perl5 optree -----> SoN::FromOptree
(toke.c/perly.y)