""" | |
The MIT License (MIT) | |
Copyright (c) 2015 Alec Radford | |
Permission is hereby granted, free of charge, to any person obtaining a copy | |
of this software and associated documentation files (the "Software"), to deal | |
in the Software without restriction, including without limitation the rights | |
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | |
copies of the Software, and to permit persons to whom the Software is | |
furnished to do so, subject to the following conditions: | |
The above copyright notice and this permission notice shall be included in all | |
copies or substantial portions of the Software. | |
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | |
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | |
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | |
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | |
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | |
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | |
SOFTWARE. | |
""" | |
import numpy as np
import theano
import theano.tensor as T

def floatX(x):
    # Cast to the configured Theano float type. This helper is not shown in
    # the gist itself; it is assumed to match the usual definition from the
    # author's other code.
    return np.asarray(x, dtype=theano.config.floatX)

def Adam(cost, params, lr=0.0002, b1=0.1, b2=0.001, e=1e-8):
    updates = []
    grads = T.grad(cost, params)
    i = theano.shared(floatX(0.))  # shared timestep counter
    i_t = i + 1.
    # Bias-correction terms. Note that b1 here plays the role of 1 - beta1
    # in the paper (b1 = 0.1 corresponds to beta1 = 0.9), and likewise for b2.
    fix1 = 1. - (1. - b1)**i_t
    fix2 = 1. - (1. - b2)**i_t
    lr_t = lr * (T.sqrt(fix2) / fix1)
    for p, g in zip(params, grads):
        # First and second moment estimates, zero-initialized per parameter.
        m = theano.shared(p.get_value() * 0.)
        v = theano.shared(p.get_value() * 0.)
        m_t = (b1 * g) + ((1. - b1) * m)
        v_t = (b2 * T.sqr(g)) + ((1. - b2) * v)
        g_t = m_t / (T.sqrt(v_t) + e)
        p_t = p - (lr_t * g_t)
        updates.append((m, m_t))
        updates.append((v, v_t))
        updates.append((p, p_t))
    updates.append((i, i_t))
    return updates
To your question @stablum: this is how Theano constructs the computation graph. The Adam() function should only be called once, to define the updates in the computational graph; therefore m and v get initialized to 0 only once.
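A minimal sketch of that call-once pattern (the X, y, W, and cost below are hypothetical placeholders, not part of the gist):

# Build the graph once: Adam() is called a single time, so m, v, and the
# step counter i are created (and zero-initialized) exactly once.
X = T.matrix('X')
y = T.vector('y')
W = theano.shared(floatX(np.zeros(5)), name='W')
cost = T.mean((T.dot(X, W) - y) ** 2)   # hypothetical squared-error cost
updates = Adam(cost, [W])

# Compiling bakes the updates into the function; every subsequent call to
# train() then advances m, v, i, and W in place.
train = theano.function([X, y], cost, updates=updates)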
For people who struggle with the provided code and the message "Incompatible broadcastable dimensions.": they may need to modify the theano.shared(p.get_value() ... ) calls by adding the broadcastable=p.broadcastable option. Then the updates will be broadcastable in the same way as the original variables.
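A sketch of that modification inside the loop of Adam():

# Give the moment estimates the same broadcastable pattern as the parameter
# they track, avoiding "Incompatible broadcastable dimensions."
m = theano.shared(p.get_value() * 0., broadcastable=p.broadcastable)
v = theano.shared(p.get_value() * 0., broadcastable=p.broadcastable)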
One more proposed change: the code above doesn't handle scalar parameters correctly, because p.get_value() * 0. will create a float64 even if p.get_value() returns a float32. Instead:

m = theano.shared(np.zeros(p.get_value().shape).astype(dtype=theano.config.floatX))
v = theano.shared(np.zeros(p.get_value().shape).astype(dtype=theano.config.floatX))
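A quick standalone check (not part of the gist) showing the promotion being described, under the NumPy 1.x promotion rules that Theano-era code ran on:

import numpy as np

w_scalar = np.float32(3.0)            # what p.get_value() returns for a scalar parameter
w_array = np.zeros(4, dtype=np.float32)

print((w_scalar * 0.).dtype)          # float64 on NumPy 1.x: the Python float promotes the scalar
print((w_array * 0.).dtype)           # float32: arrays keep their dtype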
That's right @bspeice. However, I'm confused about the values of b1 and b2: they are set to 0.9 and 0.999 respectively in the original paper.
Yes, that is a mistake, I think.
No, it is correct; look at the update here again. The code uses b1 where the paper uses 1 - beta1, and 1 - b1 where the paper uses beta1. So b1 = 0.1 corresponds to beta1 = 1 - 0.1 = 0.9, exactly what the paper says, and b2 = 0.001 corresponds to beta2 = 1 - 0.001 = 0.999, which again matches the paper. They are simply using the paper's beta1 as 1 - b1 (and similarly for beta2); hence the confusion.
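A small check of the equivalence (paper notation versus this gist's):

# Paper:  m_t = beta1 * m + (1 - beta1) * g,   with beta1 = 0.9
# Gist:   m_t = b1 * g   + (1 - b1) * m,       with b1 = 1 - beta1 = 0.1
# The bias-correction terms match the same way:
beta1, b1, t = 0.9, 0.1, 10
assert abs((1 - beta1 ** t) - (1 - (1 - b1) ** t)) < 1e-12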
Please, I have a question: is this the built-in function from TensorFlow, or is it a different function?
Hi,
I guess that the 'm' and 'v' quantities have to be re-used in subsequent iterations, but then why are 'm' and 'v' initialized to 0, since the 'Adam' function has to be called many times for the same parameters?