In this gist, I list possible changes to existing APIs for Chainer v2. It also includes some ideas for new APIs related to existing ones. Note that this is not a complete list of future changes for v2.
- `__len__` should return the length of the first axis.
- Remove `volatile` flag?
  - It can be replaced by `no_backprop_mode`, which is easier to use (see the sketch after this list).
  - We can also remove `Flag`
    - It can be replaced by
- Remove `zerograd`
- Support uninitialized Variable
  - It can be used for better support of uninitialized parameters (used as a "parameter shape placeholder")
- Support optimizer for Variable
  - It enables us to choose an optimizer for each parameter variable.
  - We need a design document and discussions on it.
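For the `volatile` item above, a minimal sketch of how `no_backprop_mode` covers the same use case (`L.Linear` is used here only as a stand-in model):

```python
import numpy as np
import chainer
import chainer.links as L

model = L.Linear(3, 2)
x = np.zeros((4, 3), dtype=np.float32)

# v1 style, to be removed: y = model(chainer.Variable(x, volatile='on'))

# Proposed v2 style: turn off graph construction for a whole block instead.
with chainer.no_backprop_mode():
    y = model(x)  # no graph is built here, so gradients cannot flow to the parameters
```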
- `type_check_enable`: Make it a global/thread-local flag
  - Make type checking enabled only in debug mode by default
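For context, this is the kind of error type checking currently produces, and it is toggled by the `CHAINER_TYPE_CHECK` environment variable rather than by debug mode (`softmax_cross_entropy` is used here only as an example):

```python
import numpy as np
import chainer
import chainer.functions as F

x = np.zeros((2, 3), dtype=np.float32)
t = np.zeros((2,), dtype=np.float32)   # labels should be int32, so the check fails

try:
    F.softmax_cross_entropy(x, t)
except chainer.utils.type_check.InvalidType as e:
    print(e)   # explains which constraint on the input types was violated

# Setting CHAINER_TYPE_CHECK=0 before importing chainer skips this check entirely;
# the item above proposes folding it into a single debug flag instead.
```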
- `add_param` and `add_persistent` do not work well with PyCharm (see the snippet after this list)
  - An added attribute is not recognized by the IDE
  - It is better to design a new API to avoid this issue
- `add_uninitialized_param` should be redesigned to work with uninitialized Variable (see above)
- `add_link`: see the above discussions on `add_param`
- `to_gpu` should be applied to links added in the future
- We want to support duplicated parents of a link (it is currently prohibited)
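To make the PyCharm point concrete, a tiny example of the current API (the link itself is just an illustration):

```python
import chainer

class MyLink(chainer.Link):
    def __init__(self):
        super(MyLink, self).__init__()
        # The parameter only appears as an attribute at runtime...
        self.add_param('W', (3, 3))

link = MyLink()
print(link.W.data.shape)  # ...so static analysis (PyCharm, linters) cannot see `W`
```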
- Remove deprecated methods (most of which can be replaced by optimizer hooks)
- Support per-parameter configurations (see notes on Variable)
- Stop using Abstract Base Class
- Support a non-strict mode that allows the model's parameter set to differ from the set of loaded parameters
- The interface should be updated to support the updated optimizer APIs.
- Support non-scalar observations.
- Remove it.
- Remove deprecated APIs.
- Deprecate `get_device()` and add alternatives: `get_device_from_id`, `get_device_from_object`, etc.
- to_cpu / to_gpu: support Variable as an input.
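A sketch of the current, overloaded entry point versus the proposed alternatives (the `get_device_from_*` names are the ones listed above and are still under discussion):

```python
import numpy as np
from chainer import cuda

x = np.arange(4, dtype=np.float32)

# v1: a single entry point accepts device IDs, arrays, None, ..., which is easy
# to misuse (here it silently returns a dummy device for a NumPy array).
device = cuda.get_device(x)

# Proposed alternatives (names taken from the item above, not final):
#   cuda.get_device_from_id(0)
#   cuda.get_device_from_object(x)
# plus to_cpu / to_gpu accepting a Variable directly:
#   arr = cuda.to_cpu(some_variable)
```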
- snapshot/snapshot_object: Remove the trigger option, which is redundant.
- LogReport: Think of a better name for the trigger option.
- Flags: make them global/thread-local and remove them from arguments
  - use_cudnn
  - train/test
  - deterministic
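One possible shape for such flags, sketched in plain Python; `config` and `using_config` are hypothetical names used only for illustration, not existing Chainer APIs:

```python
import contextlib
import threading

# Hypothetical global, thread-local configuration object (illustration only;
# a real implementation would also need per-thread default values).
config = threading.local()
config.train = True          # would replace per-call train/test arguments
config.use_cudnn = True
config.type_check = True     # could subsume CHAINER_TYPE_CHECK / debug mode

@contextlib.contextmanager
def using_config(name, value):
    """Temporarily override a flag within a `with` block."""
    old = getattr(config, name)
    setattr(config, name, value)
    try:
        yield
    finally:
        setattr(config, name, old)

# Functions would read config.train instead of taking a train=/test= argument:
with using_config('train', False):
    pass   # e.g. y = model(x) would run dropout/BN in test mode here
```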
- batch_normalization: Think about a better interface
- softmax_cross_entropy: Rename the `normalize` option
- softmax_cross_entropy: Allow `ignore_label` to be configured by an init argument
- split_axis: Make force_tuple True by default
- initialW, initial_bias, ...: Unify the naming convention of the arguments.
- input size, input channels, ...: Make them optional (we may need to change the overall APIs)
- wscale: Remove it.
- set_state / reset_state of RNN units: Unify the interface.
- BatchNormalization: Think about a better interface
- ConvolutionND: Make the bias enabled by default
- Linear: Make the number of axes for batch dimensions configurable
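To illustrate the last item: `L.Linear` currently flattens everything after the first axis into channels, so sequence-shaped inputs need a manual reshape today (a sketch with made-up sizes):

```python
import numpy as np
import chainer.functions as F
import chainer.links as L

batch, seq_len, in_units, out_units = 2, 5, 4, 3
x = np.zeros((batch, seq_len, in_units), dtype=np.float32)

layer = L.Linear(seq_len * in_units, out_units)
y = layer(x)            # current behaviour: everything after axis 0 is
print(y.shape)          # flattened into channels -> (2, 3)

# What sequence models usually want: apply the same projection to every step,
# i.e. treat the first two axes as batch dimensions. Today this needs a reshape:
layer2 = L.Linear(in_units, out_units)
y2 = F.reshape(layer2(F.reshape(x, (batch * seq_len, in_units))),
               (batch, seq_len, out_units))
print(y2.shape)         # (2, 5, 3)
```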
I've been the only Chainer user at my company since last fall, but about two months ago the rest of our (fairly small) research team started to switch, and now everyone is using it 😃. Over the past several weeks of helping the team get up to speed with the framework, I've noticed a handful of unexpected behaviors that often trip people up. Many of them would be good things to revisit in a major-version update. In no particular order:
- The `split_axis` function defaults to `force_tuple=False`; this allows for very subtle bugs when the batch size or sequence length is 1 and the variable isn't actually split: since variables can be sliced and iterated, and behave somewhat like sequences, the error often doesn't show itself until several steps later in the computation graph. `force_tuple` should probably default to `True` and allow the user to turn it off if desired (see the snippet at the end of this comment).
- `Variable.backward()` overwrites gradients, even if `zerograds()` or `cleargrads()` isn't run first. Sometimes we want to accumulate some gradients, then accumulate some more, then make an update; this makes that effectively impossible. It looks like this is because the `needs_copy` status of a particular gradient array isn't preserved between calls to `backward()`, but there might be a way to account for this.
- `L.Linear` works when applied to 3D tensors, but assumes that the last two dimensions are both channel dimensions and only the first one is a batch dimension. It is much more common for us to want the first two dimensions to be treated as batch dimensions and only the last as a channel; I imagine this might be a difference between computer vision and NLP work? Maybe we can provide both behaviors and a parameter to choose which one.
- `CHAINER_TYPE_CHECK` is separate from `debug_mode`, even though they're both for debugging, and the former defaults to on while the latter defaults to off.
- `F.batch_normalization` and `F.dropout` both have different modes for training and testing. However, one of them takes a flag that is `True` for training and the other takes a flag that is `True` for testing; these flags are required even though in most cases the functions should be able to figure out the correct behavior from the volatility flag of their input. I think this one would be completely fixed by the move to global train/test flags, so I'm definitely +1 on that.
- The built-in converter only handles batches made of `ndarray`, `tuple`, and `dict`. This whole issue can be avoided by merging this PR chainer/chainer#1654 and simply passing the input batch unmodified if it's neither a `tuple` nor a `dict`.

Another thing to consider in an update is a more powerful data/dataset system. I've built something that lets me write declarative dataset definitions, and I'd be happy to turn it into a PR if there's interest.
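As a concrete version of the `split_axis` point in the first bullet above, a tiny reproduction:

```python
import numpy as np
import chainer.functions as F

x = np.arange(6, dtype=np.float32).reshape(2, 3)

ys = F.split_axis(x, 2, axis=0)   # tuple of two (1, 3) variables, as expected
y = F.split_axis(x, 1, axis=0)    # a single (2, 3) Variable, NOT a 1-tuple,
                                  # because force_tuple defaults to False; code
                                  # that indexes/iterates "the pieces" then
                                  # silently operates on rows of the variable

ys = F.split_axis(x, 1, axis=0, force_tuple=True)   # the proposed default
print(len(ys), ys[0].shape)       # 1 (2, 3)
```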