SSA is an intermediate program representation in which each variable could only be defined once and become read-only afterwards. It is invented to ease many program analysis techniques, such as dataflow analysis. Construction of def-use chain and DAG for variables in SSA form is simple and straight forward.
x = 1
y = 2
z = x + y + 1
w = 2 * x
if (condition) z = 2 * w // forbidden in SSA-form
Spark RDD follows SSA form internally, as each RDD is an immutable representation of a distributed dataset.
x = RDD() // some constant
y = RDD() // some constant
z = x join y
w = x map func
if (condition) z = w map func // non-SSA assignment
Current Spark implementation in Scala/Java has no restriction on variable mutability, neither in its source or byte-code. If we intend to perform caching analysis in SSA form, it would be required to transform either source or byte-code into SSA form before analysis, and reversely transform the SSA-form into source or byte-code afterwards.
Primarily, converting to SSA form requires replacing the target variable of each assignment to a new variable, similar to recording its version. Then, each use of the variable will be replaced with the versioned variable that is able to reach the use site.
x_0 = RDD() // some constant
y_0 = RDD() // some constant
z_0 = x_0 join y_0
w_0 = x_0 map func
if (condition) jump to true else jump to false
true:
z_1 = w_0 map func // z_1 is a new version of z if condition holds
jump to end
false:
jump to end
However, for control flow constructs that involves branching and looping, the referred version of a variable use could depend on the branching condition. A special function φ is introduced to handle this situation, similar to the conditional ternary assignment operator ?:
in C and many other languages. To convert arbitrary control flow into SSA containing φ node, one can use dominant frontier algorithm that detects the immediate site that the variable version cannot be sufficiently determined for a variable use. The intermediate representation of Lower Level Virtual Machine (LLVM) project follows the SSA form, and it supplies a common transformation pass mem2reg
that converts read/write of stack variables into register values which are assigned only once.
end:
z_2 = φ(z_0, z_1) // z_2 takes value depending on which branch we traversed