@kor01
Last active September 19, 2017 13:35
[neural_nets_mind_bugs] #neural_nets #mind_bugs
in conventional notation:
1. layer input = the tensor before activation
2. layer output = the tensor after activation
3. next layer input = this layer's output transformed by the edges (weights)
4. delta_layer = dLoss / d(layer input) (see the sketch after this list)
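A minimal numpy sketch of these naming conventions, assuming a two-layer network with sigmoid activations and a mean-squared-error loss (illustrative choices, not specified in the note):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # batch of 4 examples
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)
y_true = rng.normal(size=(4, 2))

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

z1 = x @ W1 + b1        # 1. layer input: tensor before activation
a1 = sigmoid(z1)        # 2. layer output: tensor after activation
z2 = a1 @ W2 + b2       # 3. next layer input: this layer's output through the edges
a2 = sigmoid(z2)
loss = 0.5 * np.mean(np.sum((a2 - y_true) ** 2, axis=1))

# 4. delta_layer = dLoss / d(layer input)
delta2 = (a2 - y_true) / x.shape[0] * a2 * (1 - a2)   # dL/dz2
delta1 = (delta2 @ W2.T) * a1 * (1 - a1)              # dL/dz1
```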
in computational-graph notation:
1. variables and tensors are treated equally in derivative computation
2. the loss is already mean-reduced over the batch (no further reduction when applying gradients)
3. functional pattern: given the gradient of the node's output from every subscriber, compute the gradient of every input tensor (see the sketch after this list)
4. shape pattern: the gradient has the same shape as the input tensor
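A sketch of the functional and shape patterns for a single graph node, using a hypothetical MatMul op (the class name and interface are illustrative assumptions, not a specific framework's API):

```python
import numpy as np

class MatMul:
    def forward(self, a, b):
        self.a, self.b = a, b
        return a @ b

    def backward(self, grad_output):
        # functional pattern: grad_output is dL/d(output), accumulated over
        # every subscriber; return one gradient per input tensor
        grad_a = grad_output @ self.b.T   # shape pattern: same shape as a
        grad_b = self.a.T @ grad_output   # shape pattern: same shape as b
        return grad_a, grad_b

node = MatMul()
a, b = np.ones((2, 3)), np.ones((3, 4))
out = node.forward(a, b)
# pretend two subscribers each sent back a gradient of ones
grad_out = np.ones_like(out) + np.ones_like(out)
ga, gb = node.backward(grad_out)
assert ga.shape == a.shape and gb.shape == b.shape
```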
best practice:
1. normalize the data (zero mean, unit variance) before optimizing
2. when things don't work, lower the learning rate first
3. weight initialization: stddev = 1 / sqrt(input_size), truncated at 2 stddev (see the sketch after this list)
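A sketch of the initialization rule in best practice 3, assuming "truncated" means redrawing any sample that falls outside 2 standard deviations (as common truncated-normal initializers do); the function name and shapes are illustrative:

```python
import numpy as np

def truncated_normal_init(input_size, output_size, rng=None):
    # stddev = 1 / sqrt(input_size), redraw entries beyond 2 stddev
    rng = rng or np.random.default_rng(0)
    stddev = 1.0 / np.sqrt(input_size)
    w = rng.normal(0.0, stddev, size=(input_size, output_size))
    out_of_range = np.abs(w) > 2 * stddev
    while out_of_range.any():
        w[out_of_range] = rng.normal(0.0, stddev, size=out_of_range.sum())
        out_of_range = np.abs(w) > 2 * stddev
    return w

W = truncated_normal_init(256, 128)
assert np.abs(W).max() <= 2 / np.sqrt(256)
```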
An operation with a rank-m output and a rank-n input has a derivative (Jacobian tensor) of rank m + n.
The gradient of a scalar loss L w.r.t. a tensor has the same shape as that tensor.
For a tensor x_1 = f(x_0): dL / dx_0 = sum over elements e of x_1 of (dL / de) * (de / dx_0).
Every operation-level derivative computation implements this formula in a specialized way (shape check below).
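A numpy shape check of both facts, using y = W x as the operation (the choice of a matrix-vector product is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)
y = W @ x                      # rank-1 output of a rank-1 input
jac = W                        # dy/dx: a rank-(1 + 1) = rank-2 tensor, shape (4, 3)
dL_dy = rng.normal(size=4)     # gradient of some scalar loss w.r.t. y
# literal sum over elements e of y: (dL/de) * (de/dx)
dL_dx = sum(dL_dy[e] * jac[e] for e in range(4))
assert dL_dx.shape == x.shape                    # same shape as the input
assert np.allclose(dL_dx, W.T @ dL_dy)           # matches the closed form
```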
Linear operation: Y = XW + b, where X has shape (m, n), W has shape (n, k), Y has shape (m, k).
YX = dY / dX is an (m, k, m, n) tensor; YX[i, j, :, :] is an (m, n) matrix whose i-th row is the j-th column of W, i.e. W[:, j]^T, with every other row zero.
Therefore dL / dX is the sum of dY[i, j] / dX over all elements of Y, weighted by dL / dY[i, j]:
dL / dX = (YX * dL/dY[:, :, None, None]).sum(axis=(0, 1))
        = sum_{i,j} dL/dY[i, j] * dY[i, j] / dX
        = sum_{i,j} dL/dY[i, j] * [0, ..., W[:, j]^T in row i, ..., 0]
        = the matrix whose i-th row is sum_j dL/dY[i, j] * W[:, j]^T
        = dL/dY @ W^T
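A numpy check of this result: contracting the full Jacobian dY/dX against an incoming gradient G = dL/dY reproduces G @ W^T (the shapes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 3, 5
X, W, G = rng.normal(size=(m, n)), rng.normal(size=(n, k)), rng.normal(size=(m, k))

# Full Jacobian: YX[i, j, p, q] = dY[i, j] / dX[p, q] = (i == p) * W[q, j]
YX = np.einsum('ip,qj->ijpq', np.eye(m), W)

# weighted sum over all elements of Y, as in the derivation above
dL_dX_full = (YX * G[:, :, None, None]).sum(axis=(0, 1))

assert np.allclose(dL_dX_full, G @ W.T)
```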