Skip to content

Instantly share code, notes, and snippets.

@byronyi
Created December 13, 2016 11:11
Show Gist options
  • Save byronyi/c91ffc7e5776639f046d92595cde0250 to your computer and use it in GitHub Desktop.
Save byronyi/c91ffc7e5776639f046d92595cde0250 to your computer and use it in GitHub Desktop.

RDD and Single Static Assignment (SSA) Form

SSA is an intermediate program representation in which each variable could only be defined once and become read-only afterwards. It is invented to ease many program analysis techniques, such as dataflow analysis. Construction of def-use chain and DAG for variables in SSA form is simple and straight forward.

x = 1
y = 2
z = x + y + 1
w = 2 * x
if (condition) z = 2 * w  // forbidden in SSA-form

Spark RDD follows SSA form internally, as each RDD is an immutable representation of a distributed dataset.

x = RDD()  // some constant
y = RDD()  // some constant
z = x join y
w = x map func
if (condition) z = w map func  // non-SSA assignment

Current Spark implementation in Scala/Java has no restriction on variable mutability, neither in its source or byte-code. If we intend to perform caching analysis in SSA form, it would be required to transform either source or byte-code into SSA form before analysis, and reversely transform the SSA-form into source or byte-code afterwards.

Primarily, converting to SSA form requires replacing the target variable of each assignment to a new variable, similar to recording its version. Then, each use of the variable will be replaced with the versioned variable that is able to reach the use site.

x_0 = RDD()  // some constant
y_0 = RDD()  // some constant
z_0 = x_0 join y_0
w_0 = x_0 map func
if (condition) jump to true else jump to false
true:
z_1 = w_0 map func  // z_1 is a new version of z if condition holds
jump to end
false:
jump to end

However, for control flow constructs that involves branching and looping, the referred version of a variable use could depend on the branching condition. A special function φ is introduced to handle this situation, similar to the conditional ternary assignment operator ?: in C and many other languages. To convert arbitrary control flow into SSA containing φ node, one can use dominant frontier algorithm that detects the immediate site that the variable version cannot be sufficiently determined for a variable use. The intermediate representation of Lower Level Virtual Machine (LLVM) project follows the SSA form, and it supplies a common transformation pass mem2reg that converts read/write of stack variables into register values which are assigned only once.

end:
z_2 = φ(z_0, z_1)  // z_2 takes value depending on which branch we traversed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment