STALLED: Boxing and wrappers

Background

An unboxed representation is a non-uniform representation that coincides with the “best” machine representation, but it cannot be operated over polymorphically without generating code for every representation. The following function has an unboxed signature:

void f(Tensor input, float x);

A boxed representation is a uniform representation which can be operated over polymorphically with a single piece of code, without any code generation. The following function has a boxed signature:

void f(Context ctx) {
  Tensor input = ctx.input<Tensor>(0);
  float x = ctx.input<float>(1);
  ...
}

// Another way to implement this is with a stack and polymorphic operators
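The stack-based alternative mentioned above could look roughly like the following. This is a minimal sketch, not either framework's actual API: `Boxed`, `Stack`, `scale_boxed`, `call_scale`, and the one-field `Tensor` are hypothetical names introduced for illustration, and `std::variant` (C++17) is assumed as the uniform value type.

```cpp
#include <cassert>
#include <variant>
#include <vector>

// Hypothetical stand-in; a real Tensor type is far richer.
struct Tensor { float value; };

// A boxed value: one uniform type that can hold any supported argument.
using Boxed = std::variant<Tensor, float, int>;
using Stack = std::vector<Boxed>;

// Boxed signature: every operator looks the same from the outside.
// Arguments are popped off the stack (last argument on top) and the
// result is pushed back on.
void scale_boxed(Stack& stack) {
  // std::get throws std::bad_variant_access if the slot holds another type.
  float x = std::get<float>(stack.back());
  stack.pop_back();
  Tensor input = std::get<Tensor>(stack.back());
  stack.pop_back();
  stack.push_back(Tensor{input.value * x});
}

// Example call: push the arguments, invoke the operator, read the result.
float call_scale(float input, float x) {
  Stack stack;
  stack.push_back(Tensor{input});
  stack.push_back(x);
  scale_boxed(stack);
  assert(stack.size() == 1);
  return std::get<Tensor>(stack.back()).value;
}
```

Because every operator has the signature `void(Stack&)`, code that stores, forwards, or wraps operators does not need to know their argument lists; the type checks move to run time.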

A wrapper is some code which wraps an existing operation, giving it extra functionality in the process. Both PyTorch and Caffe2 make extensive use of wrappers:

In Caffe2, the “wrapper” is the non-virtual Run() function defined in the base Operator class, which calls a virtual RunOnDevice() function. This wrapper is responsible for:
- Starting/stopping observers (Caffe2's general-purpose application of the observer pattern; various implementations do things like turn performance counters on and off, count runs, etc.)
- Switching the current device (cudaSetDevice) to the one appropriate to run the computation
- Managing event triggers (in RunAsync())
In PyTorch, the “wrapper” is generated code that sits between the public-facing API and the actual operator implementation. The generated code includes:
- Starting the per function profiler (profiler::RecordFunction) [implemented on Variables only]
- Unpacking variables (which record gradients) to their tensors (which do not), so we can call the underlying implementation on them, and then rewrapping the result into a gradient-recording variable with the appropriate history added
- Tracing execution, used by the ONNX exporter and the JIT tracer [implemented on Variables only]
- Incrementing “version counters” upon mutation, to conservatively track whether a tensor has been mutated out from under you
- (Legacy) Selection of which Tensor argument to perform dispatch on (ATen is single dispatch)
- (Legacy) Performing checked casts of at::Tensor into a refined type (e.g., CUDAIntTensor) which we subsequently pass to the TH backend (which is not polymorphic)
Though neither framework does this at the moment, there is also a reasonable argument that wrappers should be used to implement error checking and handling in a uniform way (making it impossible to forget to test for an error condition).
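The Caffe2-style arrangement described above, a non-virtual Run() that calls a virtual RunOnDevice(), can be sketched as follows. This is only an illustration of the pattern, not Caffe2's actual class: the `Operator` base here, the `AddOne` operator, the string log, and `run_add_one` are all hypothetical, and the log entries merely stand in for the real observer and device-switching work.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical base class, loosely modelled on the Run()/RunOnDevice() split.
class Operator {
 public:
  virtual ~Operator() = default;

  // The wrapper: non-virtual, so every operator gets the same pre/post steps.
  bool Run() {
    log_.push_back("observers: start");  // placeholder for starting observers
    log_.push_back("device: switch");    // placeholder for switching device
    bool ok = RunOnDevice();
    log_.push_back("observers: stop");   // placeholder for stopping observers
    return ok;
  }

  const std::vector<std::string>& log() const { return log_; }

 protected:
  // The operator author only implements this.
  virtual bool RunOnDevice() = 0;
  std::vector<std::string> log_;
};

// A hypothetical operator that just increments a counter.
class AddOne : public Operator {
 public:
  int value = 0;

 protected:
  bool RunOnDevice() override {
    log_.push_back("AddOne: compute");
    ++value;
    return true;
  }
};

// Usage: the caller only ever sees Run().
int run_add_one() {
  AddOne op;
  bool ok = op.Run();
  assert(ok && op.log().size() == 4);
  return op.value;
}
```

Since callers can only reach RunOnDevice() through Run(), the extra functionality cannot be skipped by accident, which is also what would make uniform error handling in the wrapper attractive.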

Challenge

There are two primary benefits to unboxed representation:

- It allows for a more natural style when writing operators. Indeed, even when you write an operator implementation in a boxed world, the very first thing you do is unbox the arguments (“Tensor input = in->input(0);”) so that you can work with them in unboxed C++ land.
- It is marginally more efficient, since doing less work (no type testing, unboxing, etc.) is faster than doing more.

However, code that works with unboxed representations has a fatal flaw: it must be generated separately for every argument specification (e.g., f(int), f(int, float), f(float, int)...) that you might want to work with. This requirement is a deal breaker for extensible use cases, as one needs to exhaustively list all function signatures that may be encountered in any extension (or recompile for each extension). It also leads to very unpleasant compile times when the amount of generated code (whether via codegen or templates) explodes.

Proposal

We propose the following:

- ALL invocations of the operator library, whether they come from the graph frontend or directly from the C++ frontend, will pass through a boxed representation. This means that we can rely on being able to operate polymorphically over all operators in wrapper code, letting us avoid code generation in certain cases.
- Operator definition sites continue to accept arguments in unboxed form, to insulate us against implementation details of the boxing implementation, and to give a more pleasant end-user interface.
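One way the two points above could fit together is a small template adapter that turns an unboxed function into a boxed one, so that wrappers are written once against the boxed signature. The sketch below is an assumption about how this might look, not the proposal's actual design: `Boxed`, `Stack`, `BoxedFn`, `box`, `call_unboxed`, `with_call_counter`, `scale`, and `demo` are hypothetical names, `std::variant` (C++17) stands in for the boxed value type, and only by-value arguments are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-ins, introduced only for this sketch.
struct Tensor { float value; };
using Boxed = std::variant<Tensor, float, int>;  // assumed uniform value type
using Stack = std::vector<Boxed>;
using BoxedFn = std::function<void(Stack&)>;     // the single boxed signature

// Operator definition site: ordinary unboxed C++.
Tensor scale(Tensor input, float x) { return Tensor{input.value * x}; }

// Unbox the top sizeof...(Args) stack slots, call fn, push the boxed result.
// Assumes the caller pushed enough arguments of the right types.
template <typename R, typename... Args, std::size_t... I>
void call_unboxed(R (*fn)(Args...), Stack& stack, std::index_sequence<I...>) {
  const std::size_t base = stack.size() - sizeof...(Args);
  R result = fn(std::get<Args>(stack[base + I])...);
  stack.erase(stack.begin() + static_cast<std::ptrdiff_t>(base), stack.end());
  stack.push_back(std::move(result));
}

// Adapter from an unboxed function pointer to the boxed signature.
template <typename R, typename... Args>
BoxedFn box(R (*fn)(Args...)) {
  return [fn](Stack& stack) {
    call_unboxed(fn, stack, std::index_sequence_for<Args...>{});
  };
}

// A wrapper written once, applicable to any boxed operator.
BoxedFn with_call_counter(BoxedFn inner, int& counter) {
  return [inner = std::move(inner), &counter](Stack& stack) {
    ++counter;
    inner(stack);
  };
}

// Usage: box the operator, wrap it, and call it through the stack.
float demo() {
  int calls = 0;
  BoxedFn op = with_call_counter(box(&scale), calls);
  Stack stack;
  stack.push_back(Tensor{2.0f});
  stack.push_back(3.0f);
  op(stack);
  assert(calls == 1 && stack.size() == 1);
  return std::get<Tensor>(stack.back()).value;
}
```

In this sketch the per-signature code is confined to the small `box` adapter at the definition site, while `with_call_counter` is compiled once and never needs to know an operator's argument list.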

Threats

- There is now a miniature stack-machine runtime that all C++ function calls pass through before they reach the final C++ function (as opposed to a more transparent scheme in which a bunch of code simply executes before the operator definition runs).
- There is a performance cost to boxing in this way.

ezyang/boxing.md
