- tanh
tanh(x) = sinh(x)/cosh(x) = ( e^x - e^-x )/( e^x + e^-x )
- Sigmoid
S(x) = 1/(1 + e^-x) = e^x/(e^x + 1)
- ReLU
f(x) = max(0,x)
- Noisy ReLU
f(x) = max(0, x + Y), with Y ∼ N(0, σ(x))
where Y is Gaussian noise
- Leaky ReLU
f(x) = x if x>0
0.01x otherwise
or
f(x) = max(x, 0.01x)
- Parametric ReLU
f(x) = x if x>0
ax otherwise
or
f(x) = max(x, ax)
- ELU
f(x) = x if x>0
a(e^x − 1) otherwise
Ref: https://en.wikipedia.org/wiki/Activation_function
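A minimal NumPy sketch of the activations listed above (the function names and the `alpha`/`a` parameters are my own illustrative choices, not tied to any particular library):

```python
import numpy as np

def tanh(x):
    # tanh(x) = (e^x - e^-x) / (e^x + e^-x)
    return np.tanh(x)

def sigmoid(x):
    # S(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # f(x) = x if x > 0 else alpha * x
    # alpha = 0.01 gives Leaky ReLU; a learned alpha gives Parametric ReLU (PReLU)
    return np.where(x > 0, x, alpha * x)

def elu(x, a=1.0):
    # f(x) = x if x > 0 else a * (e^x - 1)
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))
```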
- Gradient Descent
θ = θ − η⋅∇θJ(θ)
- Stochastic Gradient Descent
θ= θ − η⋅∇θJ(θ;x(i);y(i))
- Mini-batch gradient descent
θ = θ − η⋅∇θJ(θ;x(i:i+n);y(i:i+n))
- SGD + Momentum
v_t = γ v_{t−1} + η ∇θJ(θ)
θ = θ − v_t
- Nesterov accelerated gradient
v_t = γ v_{t−1} + η ∇θJ(θ − γ v_{t−1})
θ = θ − v_t
- Adagrad
g_{t,i} = ∇θJ(θ_{t,i})
θ_{t+1,i} = θ_{t,i} − η⋅g_{t,i}
θ_{t+1,i} = θ_{t,i} − (η/√(G_{t,ii} + ϵ))⋅g_{t,i}
θ_{t+1} = θ_t − (η/√(G_t + ϵ)) ⊙ g_t
- Adadelta
E[g^2]_t = γ E[g^2]_{t−1} + (1−γ) g_t^2
Δθ_t = −η⋅g_t
θ_{t+1} = θ_t + Δθ_t
Δθ_t = −(η/√(E[g^2]_t + ϵ))⋅g_t
E[Δθ^2]_t = γ E[Δθ^2]_{t−1} + (1−γ) Δθ_t^2
RMS[Δθ]_t = √(E[Δθ^2]_t + ϵ)
Δθ_t = −(RMS[Δθ]_{t−1}/RMS[g]_t)⋅g_t
θ_{t+1} = θ_t + Δθ_t
- Adam
m_t = β1 m_{t−1} + (1−β1) g_t
v_t = β2 v_{t−1} + (1−β2) g_t^2
m̂_t = m_t/(1−β1^t)
v̂_t = v_t/(1−β2^t)
θ_{t+1} = θ_t − (η/(√v̂_t + ϵ))⋅m̂_t
Ref: http://ruder.io/optimizing-gradient-descent/
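A hedged NumPy sketch of the plain gradient descent, momentum, and Nesterov updates above; the names `grad`, `lr`, `gamma` and the toy quadratic objective are illustrative assumptions, not part of the cited post:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    # theta = theta - eta * grad(theta); with grad computed on one example this is
    # SGD, on a slice of examples it is mini-batch gradient descent
    return theta - lr * grad(theta)

def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    # v_t = gamma * v_{t-1} + eta * grad(theta); theta = theta - v_t
    v = gamma * v + lr * grad(theta)
    return theta - v, v

def nesterov_step(theta, v, grad, lr=0.01, gamma=0.9):
    # same as momentum, but the gradient is taken at the look-ahead point theta - gamma * v
    v = gamma * v + lr * grad(theta - gamma * v)
    return theta - v, v

# toy usage: minimise J(theta) = 0.5 * ||theta||^2, whose gradient is simply theta
grad = lambda theta: theta
theta, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    theta, v = nesterov_step(theta, v, grad)
print(theta)  # approaches the minimiser at the origin
```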
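And a sketch of the per-parameter adaptive rules (Adagrad, Adadelta, Adam) with the optimiser state kept in plain NumPy arrays; the hyperparameter defaults are common conventions, assumed here rather than prescribed:

```python
import numpy as np

def adagrad_step(theta, G, g, lr=0.01, eps=1e-8):
    # G accumulates the sum of squared gradients; update is scaled element-wise
    G = G + g**2
    theta = theta - lr * g / np.sqrt(G + eps)
    return theta, G

def adadelta_step(theta, Eg2, Edx2, g, gamma=0.9, eps=1e-6):
    # decaying averages of squared gradients and squared updates
    Eg2 = gamma * Eg2 + (1 - gamma) * g**2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # -(RMS[dx]_{t-1} / RMS[g]_t) * g
    Edx2 = gamma * Edx2 + (1 - gamma) * dx**2
    return theta + dx, Eg2, Edx2

def adam_step(theta, m, v, g, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # biased first/second moment estimates, followed by bias correction
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)   # t is the step counter, starting at 1
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Each of these would be called once per mini-batch, passing the current gradient g of J(θ).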
- L1 regularization
J(w) = ∑_i (y_i − f(x_i))^2 + λ ∑_j |w_j|
- L2 regularization
J(w) = ∑_i (y_i − f(x_i))^2 + λ ∑_j w_j^2
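A small NumPy illustration of the two penalties added to a squared-error loss; the linear model `X @ w` and the strength `lam` are assumptions for the example:

```python
import numpy as np

def l1_loss(w, X, y, lam=0.1):
    # squared error + lam * sum(|w|)  (lasso-style penalty)
    residual = y - X @ w
    return np.sum(residual**2) + lam * np.sum(np.abs(w))

def l2_loss(w, X, y, lam=0.1):
    # squared error + lam * sum(w^2)  (ridge-style penalty)
    residual = y - X @ w
    return np.sum(residual**2) + lam * np.sum(w**2)
```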
- Quadratic Cost
aka mean squared error, maximum likelihood, and sum squared error.
C_MST(W, B, S^r, E^r) = 0.5 ∑_j (a_j^L − E_j^r)^2
∇_a C_MST = (a^L − E^r)
- Cross-entropy cost
aka Bernoulli negative log-likelihood and Binary Cross-Entropy
C_CE(W, B, S^r, E^r) = −∑_j [E_j^r ln(a_j^L) + (1 − E_j^r) ln(1 − a_j^L)]
∇_a C_CE = (a^L − E^r)/((1 − a^L)(a^L))
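A NumPy sketch of both costs and their gradients with respect to the output activations a^L; the clipping epsilon is an assumed numerical convenience, not part of the formulas above:

```python
import numpy as np

def quadratic_cost(a, e):
    # C = 0.5 * sum_j (a_j - e_j)^2
    return 0.5 * np.sum((a - e)**2)

def quadratic_cost_grad(a, e):
    # dC/da = a - e
    return a - e

def cross_entropy_cost(a, e, eps=1e-12):
    # C = -sum_j [ e_j * ln(a_j) + (1 - e_j) * ln(1 - a_j) ]
    a = np.clip(a, eps, 1 - eps)   # avoid log(0)
    return -np.sum(e * np.log(a) + (1 - e) * np.log(1 - a))

def cross_entropy_cost_grad(a, e, eps=1e-12):
    # dC/da = (a - e) / ((1 - a) * a)
    a = np.clip(a, eps, 1 - eps)
    return (a - e) / ((1 - a) * a)
```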
- Dense
- Conv1D
- Conv2D
- Pooling
- Stride
- Embedding
- Recurrent Neural Network
- LSTM
- GRU
- Locally connected layer
TBC.