@kor01
Last active September 19, 2017 13:35
[neural_nets_mind_bugs] #neural_nets #mind_bugs
in conventional notation:
1. layer input = the tensor before activation
2. layer output = the tensor after activation
3. next layer input = this layer's output transformed by the edges (weights)
4. delta_layer = dLoss / d(layer input) (see the sketch after this list)
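A minimal numpy sketch of these naming conventions, assuming a two-layer network with sigmoid activations and a mean-squared-error loss (illustrative choices, not specified in the note):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # batch of 4 examples
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)
y_true = rng.normal(size=(4, 2))

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

z1 = x @ W1 + b1        # 1. layer input: tensor before activation
a1 = sigmoid(z1)        # 2. layer output: tensor after activation
z2 = a1 @ W2 + b2       # 3. next layer input: this layer's output through the edges
a2 = sigmoid(z2)
loss = 0.5 * np.mean(np.sum((a2 - y_true) ** 2, axis=1))

# 4. delta_layer = dLoss / d(layer input)
delta2 = (a2 - y_true) / x.shape[0] * a2 * (1 - a2)   # dL/dz2
delta1 = (delta2 @ W2.T) * a1 * (1 - a1)              # dL/dz1
```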
in computational-graph notation:
1. variables and tensors are treated equally in derivative computation
2. the loss is already mean-reduced over the batch (no further reduction when applying gradients)
3. functional pattern: given the gradient of the node's output from every subscriber, compute the gradient of every input tensor (see the sketch after this list)
4. shape pattern: the gradient has the same shape as the input tensor
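A sketch of the functional and shape patterns for a single graph node, using a hypothetical MatMul op (the class name and interface are illustrative assumptions, not a specific framework's API):

```python
import numpy as np

class MatMul:
    def forward(self, a, b):
        self.a, self.b = a, b
        return a @ b

    def backward(self, grad_output):
        # functional pattern: grad_output is dL/d(output), accumulated over
        # every subscriber; return one gradient per input tensor
        grad_a = grad_output @ self.b.T   # shape pattern: same shape as a
        grad_b = self.a.T @ grad_output   # shape pattern: same shape as b
        return grad_a, grad_b

node = MatMul()
a, b = np.ones((2, 3)), np.ones((3, 4))
out = node.forward(a, b)
# pretend two subscribers each sent back a gradient of ones
grad_out = np.ones_like(out) + np.ones_like(out)
ga, gb = node.backward(grad_out)
assert ga.shape == a.shape and gb.shape == b.shape
```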
best practice:
1. normalize the data (zero mean, unit variance) before optimizing
2. when things don't work, lower the learning rate first
3. weight initialization: stddev = 1 / sqrt(input_size), truncated at 2 stddev (see the sketch after this list)
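A sketch of the initialization rule in best practice 3, assuming "truncated" means redrawing any sample that falls outside 2 standard deviations (as common truncated-normal initializers do); the function name and shapes are illustrative:

```python
import numpy as np

def truncated_normal_init(input_size, output_size, rng=None):
    # stddev = 1 / sqrt(input_size), redraw entries beyond 2 stddev
    rng = rng or np.random.default_rng(0)
    stddev = 1.0 / np.sqrt(input_size)
    w = rng.normal(0.0, stddev, size=(input_size, output_size))
    out_of_range = np.abs(w) > 2 * stddev
    while out_of_range.any():
        w[out_of_range] = rng.normal(0.0, stddev, size=out_of_range.sum())
        out_of_range = np.abs(w) > 2 * stddev
    return w

W = truncated_normal_init(256, 128)
assert np.abs(W).max() <= 2 / np.sqrt(256)
```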
An operation with a rank-m output and a rank-n input has a derivative (Jacobian tensor) of rank m + n.
The gradient of a scalar loss L w.r.t. a tensor has the same shape as that tensor.
For a tensor x_1 = f(x_0): dL / dx_0 = sum over elements e of x_1 of (dL / de) * (de / dx_0).
Every operation-level derivative computation implements this formula in a specialized way (shape check below).
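A numpy shape check of both facts, using y = W x as the operation (the choice of a matrix-vector product is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)
y = W @ x                      # rank-1 output of a rank-1 input
jac = W                        # dy/dx: a rank-(1 + 1) = rank-2 tensor, shape (4, 3)
dL_dy = rng.normal(size=4)     # gradient of some scalar loss w.r.t. y
# literal sum over elements e of y: (dL/de) * (de/dx)
dL_dx = sum(dL_dy[e] * jac[e] for e in range(4))
assert dL_dx.shape == x.shape                    # same shape as the input
assert np.allclose(dL_dx, W.T @ dL_dy)           # matches the closed form
```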
Linear operation: Y = XW + b, where X has shape (m, n), W has shape (n, k), Y has shape (m, k).
YX = dY / dX is an (m, k, m, n) tensor; YX[i, j, :, :] is an (m, n) matrix whose i-th row is the j-th column of W, i.e. W[:, j]^T, with every other row zero.
Therefore dL / dX is the sum of dY[i, j] / dX over all elements of Y, weighted by dL / dY[i, j]:
dL / dX = (YX * dL/dY[:, :, None, None]).sum(axis=(0, 1))
        = sum_{i,j} dL/dY[i, j] * dY[i, j] / dX
        = sum_{i,j} dL/dY[i, j] * [0, ..., W[:, j]^T in row i, ..., 0]
        = the matrix whose i-th row is sum_j dL/dY[i, j] * W[:, j]^T
        = dL/dY @ W^T
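A numpy check of this result: contracting the full Jacobian dY/dX against an incoming gradient G = dL/dY reproduces G @ W^T (the shapes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 3, 5
X, W, G = rng.normal(size=(m, n)), rng.normal(size=(n, k)), rng.normal(size=(m, k))

# Full Jacobian: YX[i, j, p, q] = dY[i, j] / dX[p, q] = (i == p) * W[q, j]
YX = np.einsum('ip,qj->ijpq', np.eye(m), W)

# weighted sum over all elements of Y, as in the derivation above
dL_dX_full = (YX * G[:, :, None, None]).sum(axis=(0, 1))

assert np.allclose(dL_dX_full, G @ W.T)
```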