
@beam2d
Last active December 22, 2016 06:37
Candidate API changes for Chainer v2

In this gist, I list possible changes to existing APIs for Chainer v2. It also includes some ideas for new APIs related to existing ones. Note that this is not a complete list of future changes for v2.

Core features

Variable

  • __len__ should return the length of the first axis.
  • Remove volatile flag?
    • It can be replaced by no_backprop_mode, which is easier to use.
    • We could then also remove the Flag class
  • Remove zerograd
  • Support uninitialized Variable
    • It enables better support of uninitialized parameters (used as "parameter shape placeholders")
  • Support optimizer for Variable
    • It enables us to choose an optimizer for each parameter variable.
    • We need a design document and discussions on it.
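The no_backprop_mode idea above can be sketched as a thread-local flag toggled by a context manager. This is only an illustration of the mechanism under assumed names (`no_backprop_mode`, `backprop_enabled`), not Chainer's actual implementation:

```python
import threading
from contextlib import contextmanager

# Thread-local state so concurrent threads can toggle independently.
_state = threading.local()

def backprop_enabled():
    return getattr(_state, 'enable_backprop', True)

@contextmanager
def no_backprop_mode():
    # Temporarily disable graph construction, restoring the old value
    # even if the body raises.
    prev = backprop_enabled()
    _state.enable_backprop = False
    try:
        yield
    finally:
        _state.enable_backprop = prev
```

A function implementation would consult `backprop_enabled()` instead of a per-variable volatile flag when deciding whether to record the computational graph.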

Function

  • type_check_enable: Make it a global/thread-local flag
  • Enable type checking only in debug mode by default
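A sketch of how the two points above could combine: a thread-local debug flag gates type checking, and checks are skipped by default. The names `set_debug`/`is_debug` and the check itself are hypothetical:

```python
import threading

_config = threading.local()

def set_debug(enable):
    _config.debug = enable

def is_debug():
    return getattr(_config, 'debug', False)

def forward(x):
    # Run (potentially expensive) input checks only in debug mode.
    if is_debug():
        if not hasattr(x, '__mul__'):
            raise TypeError('input does not support multiplication')
    return x * 2
```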

Link

  • add_param and add_persistent do not work well with PyCharm
    • An added attribute is not recognized by the IDE
    • It is better to design a new API to avoid this issue
  • add_uninitialized_param should be redesigned to work with uninitialized Variable (see above)
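One possible replacement API, sketched purely as an illustration (all class names below are stand-ins, not Chainer's API): register parameters by plain attribute assignment, intercepted in `__setattr__`, so the attribute is an ordinary one that IDEs such as PyCharm can see:

```python
class Parameter:
    """Minimal stand-in for a parameter object."""
    def __init__(self, shape=None):
        self.shape = shape

class Link:
    def __init__(self):
        self._params = []

    def __setattr__(self, name, value):
        # Registration happens as a side effect of normal assignment,
        # so the attribute itself is visible to static analysis.
        if isinstance(value, Parameter):
            self._params.append(name)
        super().__setattr__(name, value)

class Linear(Link):
    def __init__(self):
        super().__init__()
        self.W = Parameter((3, 4))  # an ordinary attribute, IDE-friendly
```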

Chain/ChainList

  • add_link: see the above discussions on add_param
  • to_gpu should also apply to links added after it is called
  • We want to support duplicated parents of a link (it is currently prohibited)

Optimizer

  • Remove deprecated methods (most of which can be replaced by optimizer hooks)
  • Support per-parameter configurations (see notes on Variable)

Serializer/Deserializer

  • Stop using abstract base classes
  • Support a non-strict mode that allows the parameter set to differ from the set of loaded parameters
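The non-strict mode could behave like the following sketch (a hypothetical `load_params` helper, with plain dicts standing in for serializers): copy the entries present in the saved file and report, rather than reject, the missing ones:

```python
def load_params(target, saved, strict=True):
    """Copy values from ``saved`` into ``target``.

    In strict mode, raise if ``target`` has parameters missing from
    ``saved``; in non-strict mode, skip them and return their names.
    """
    missing = [key for key in target if key not in saved]
    if strict and missing:
        raise KeyError('missing parameters: %r' % missing)
    for key in target:
        if key in saved:
            target[key] = saved[key]
    return missing
```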

Updater/StandardUpdater

  • The interface should be updated to support the updated optimizer APIs.

Reporter

  • Support non-scalar observations.

FunctionSet

  • Remove it.

cuda module

  • Remove deprecated APIs.
  • Deprecate get_device() and add alternatives: get_device_from_id, get_device_from_object, etc.
  • to_cpu / to_gpu: support Variable as an input.
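The Variable support in to_cpu could amount to unwrapping and rewrapping, roughly as below. The Variable class here is a minimal stand-in, and this is not the actual chainer.cuda code:

```python
import numpy as np

class Variable:
    """Minimal stand-in for chainer.Variable."""
    def __init__(self, data):
        self.data = data

def to_cpu(x):
    # Accept a Variable by converting its underlying array and
    # wrapping the result again; otherwise convert the array itself.
    if isinstance(x, Variable):
        return Variable(to_cpu(x.data))
    return np.asarray(x)
```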

Collection of functions, links, and extensions

extensions

  • snapshot/snapshot_object: Remove the trigger option, which is redundant.
  • LogReport: Find a better name for the trigger option.

functions

  • Flags: make them global/thread-local and remove them from arguments
    • use_cudnn
    • train/test
    • deterministic
  • batch_normalization: Think about a better interface
  • softmax_cross_entropy: Rename normalize option
  • softmax_cross_entropy: Make ignore_label configurable via an __init__ argument
  • split_axis: Make force_tuple True by default
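The split_axis change can be illustrated with a NumPy sketch (not Chainer's implementation): with force_tuple defaulting to True, even a single section comes back as a one-element tuple, so callers can always iterate over the result:

```python
import numpy as np

def split_axis(x, indices_or_sections, axis=0, force_tuple=True):
    ys = np.split(x, indices_or_sections, axis)
    if len(ys) == 1 and not force_tuple:
        return ys[0]  # old behavior: a bare array
    return tuple(ys)

x = np.arange(6).reshape(2, 3)
parts = split_axis(x, 3, axis=1)   # three sections -> 3-tuple
single = split_axis(x, 1, axis=0)  # one section -> still a 1-tuple
```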

links

  • initialW, initial_bias, ...: Unify the naming convention of these arguments.
  • input size, input channels, ...: Make them optional (this may require changing the overall APIs)
  • wscale: Remove it.
  • set_state / reset_state of RNN units: Unify the interface.
  • BatchNormalization: Think about a better interface
  • ConvolutionND: Make the bias enabled by default
  • Linear: Make the number of axes for batch dimensions configurable
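The Linear item could look like the following NumPy sketch, where n_batch_axes is a hypothetical argument controlling how many leading axes are treated as batch dimensions before the input is flattened:

```python
import numpy as np

def linear(x, W, b, n_batch_axes=1):
    # Flatten all batch axes into one, apply x W^T + b, then restore
    # the batch shape on the output.
    batch_shape = x.shape[:n_batch_axes]
    x2 = x.reshape(int(np.prod(batch_shape)), -1)
    y = x2.dot(W.T) + b
    return y.reshape(batch_shape + (W.shape[0],))

x = np.ones((2, 3, 4))  # two batch axes: (2, 3)
W = np.ones((5, 4))
b = np.zeros(5)
y = linear(x, W, b, n_batch_axes=2)
```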

jekbradbury commented Dec 16, 2016

Thanks so much! I'm excited about version 2. I will send a dataset system PR soon.
As to the second point, I thought it was important for certain kinds of RL applications, but we were probably just coding them in an unusual way. You're right; I can't think of good reasons to accumulate to non-terminal nodes.


jekbradbury commented Dec 16, 2016

Wait, I did have a reason. It’s kind of complicated, but it’s basically the following:
Say you have a sequence-to-sequence decoder. The memory usage of the variable that goes into the softmax would be quite large if the forward computation for the softmax linear layer were run for all timesteps at once (batch size times sequence length times vocabulary size elements, in practice multiple gigabytes and the main memory bottleneck in problems like machine translation). So instead I do this:
I’ve built a context manager where you can say:

# for every timestep:
h = self.rnn(x)
with backprop_until(h):
    o = self.out(h)
    loss = F.softmax_cross_entropy(o, target[t])
    loss.backward()
hooks.append(h)

# later, after all timesteps:
hook = F.concat(hooks)
hook.grad = np.concatenate([h.grad for h in hooks], axis=1)
hook.backward()

That's fine, but it's much less intuitive than what you'd be able to do if backward didn't zero out intermediate gradients. That version is the same as above up to loss.backward(), and then goes:

losses.append(loss)
#later, after all timesteps
sum(losses).backward()

What Chainer currently does is fine, though it relegates this "early backprop" idea to a "weird trick" that few people will realize they can use. If the backward scheduler were changed to not zero intermediate gradients between calls, early backprop would be less weird and more easily applied, but that's a significant change and it may not be worth it.
The context manager itself is as follows:

from contextlib import contextmanager

@contextmanager
def backprop_until(var):
    """Context manager for early-backprop subchains.

    Makes ``var`` temporarily a root, yields control to the user to run
    the subchain forward+backward, then relinks ``var`` to the main chain.

    """
    old_creator, var.creator = var.creator, None
    try:
        yield
    finally:
        if old_creator is not None:
            var.set_creator(old_creator)


beam2d commented Dec 22, 2016

(Sorry for the late reply; I came down with a cold last weekend.) Thanks for writing up the use case of gradient accumulation at intermediate nodes. It now makes sense to me. I think it's better to provide a new backprop interface that allows multiple starting points (i.e., the "root nodes" of the backward path) instead of accumulating gradients at intermediate nodes, since the latter approach would break code that actually runs backprop through the same subgraph multiple times (I don't think such code is reasonable in terms of performance, but it should still work for trial-and-error coding where performance does not matter). The current Variable.backward() can be seen as a special case of this method in which only one root node is passed as the starting point of the backward path.
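The multi-root interface described above can be illustrated with a toy reverse-mode pass (an illustration of the idea, not Chainer's code): every root is seeded with gradient 1, the graph reachable from all roots is walked once, and gradients from all roots accumulate through shared subgraphs:

```python
class Node:
    """Toy graph node: inputs plus local gradients d(self)/d(input)."""
    def __init__(self, inputs=(), local_grads=()):
        self.inputs = list(inputs)
        self.local_grads = list(local_grads)
        self.grad = 0.0

def backward(roots):
    # Collect a topological order reachable from all roots at once.
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for m in node.inputs:
            visit(m)
        order.append(node)
    for root in roots:
        root.grad += 1.0  # seed every starting point
        visit(root)
    # Propagate once through the shared subgraph.
    for node in reversed(order):
        for m, g in zip(node.inputs, node.local_grads):
            m.grad += node.grad * g

x = Node()
loss1 = Node([x], [2.0])  # d(loss1)/dx = 2
loss2 = Node([x], [3.0])  # d(loss2)/dx = 3
backward([loss1, loss2])  # x.grad accumulates 2 + 3
```

Passing a single root reproduces the behavior of the current Variable.backward().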
