Notes from a summer of working with the LISA group.
Mostly about what I learned about training supervised GSNs.
Feel free to contact me with any questions about this material:
email: [email protected]
Skype: eric.a.martin
Here are some notes from a summer of work. I'm primarily writing these down
so the knowledge doesn't get lost. Most of this document describes what
did and did not work for training supervised GSNs. I'm also including a couple
of curious things I found this summer.
Supervised GSN training experience
----------------------------------
These supervised GSNs attempted to learn the joint distribution between 2 vectors.
One vector was placed at the bottom of the GSN, and the other vector was placed at
the top of the GSN (as the top layer), with the hidden layers in between.
For my work, I was primarily trying to get results on MNIST. The 784-component image
was the bottom layer of the GSN, and the 10-component prediction vector was the top
layer. I'll refer to the bottom layer (the images) as x and the top layer
(predictions) as y.
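For concreteness, a hypothetical layer layout in this style might look like the
following (the hidden layer sizes are purely illustrative; the experiments varied
depth and layer sizes):

    # Hypothetical layer layout (sizes are illustrative, not the exact configs used).
    layer_sizes = [
        784,   # x: flattened 28x28 MNIST image, bottom layer of the GSN
        1000,  # hidden layer (the number and size of these varied across runs)
        10,    # y: 10-component 1-hot prediction vector, top layer of the GSN
    ]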
All experiments were done with binary cross entropy costs on both the top and
bottom layers. Training was done by SGD with a momentum term. The
MonitorBasedLRAdjuster of pylearn2 was used to adjust the learning rate. Various
combinations of network depths, layer sizes, and noise types and magnitudes were
explored. The cross entropy costs on all layers were normalized by the vector size
(so the cost was averaged over the components of the vector).
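As a reference point, here is a minimal numpy sketch of the per-component-normalized
binary cross entropy and the momentum update. The actual experiments used pylearn2's
implementations of both; the names below are just illustrative.

    import numpy as np

    def normalized_bce(target, output, eps=1e-7):
        # Binary cross entropy averaged over the components of the vector,
        # matching the per-component normalization described above.
        output = np.clip(output, eps, 1 - eps)
        return -np.mean(target * np.log(output)
                        + (1 - target) * np.log(1 - output))

    def sgd_momentum_step(param, grad, velocity, learning_rate=0.1, momentum=0.5):
        # Classical SGD-with-momentum update; in the real runs the learning rate
        # was additionally adjusted by pylearn2's MonitorBasedLRAdjuster.
        velocity = momentum * velocity - learning_rate * grad
        return param + velocity, velocity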
I was primarily looking at the classification accuracy on MNIST, while trying to
maintain models that appeared to both sample (and mix) well. Training was done with
noise, but prediction was done without any noise. Classification was done by running
the GSN, averaging the computed prediction vectors, and then taking the argmax.
Classification performance was best when only the first computed prediction vector
was used. Classification also involved clamping (i.e. fixing at each iteration) the
x vector.
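A rough sketch of that classification procedure. The gsn.* interface here is an
assumption made for illustration, not the actual pylearn2 GSN API:

    import numpy as np

    def classify(gsn, x, n_steps=4, use_first_only=True):
        # `gsn.init_state`, `gsn.step`, and `gsn.clamp_x` are assumed helpers:
        # initialize the walkback chain from x, run one GSN iteration with noise
        # disabled (returning the current y prediction), and re-clamp the x units
        # to the original input.
        preds = []
        state = gsn.init_state(x)
        for _ in range(n_steps):
            state, y_hat = gsn.step(state, noise=False)
            state = gsn.clamp_x(state, x)
            preds.append(y_hat)
        y = preds[0] if use_first_only else np.mean(preds, axis=0)
        return np.argmax(y, axis=-1)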
Note that there are quite a few different ways to train a joint model such as a GSN.
Possibilities include setting both x and y, setting just x and predicting y, and
vice versa.
Key takeaways:
Setting both x and y during training resulted in bad classification performance.
I was never able to get under ~3.8% error with this sort of training. The 3.8% error
happened on a network with a single hidden layer, and the error became much worse
as the network became deeper (50% for a network with 3 hidden layers). I believe this
happened because both x and y learned to autoencode themselves separately and the
network never learned to communicate between layers. Adding large amounts of noise
to the top layer did not help with this problem. I believe this happened because of
the very small amount of information at the top layer (it's just a 10-component
1-hot vector, so log_2(10) ~= 3.3 bits) and the relatively large layers (at least
100 neurons) next to the top layer.
The non-communicating layers hypothesis was also supported by the relative success
of training where x was given and y was predicted (with backprop on both x and y).
This sort of training forced information from the bottom layer to reach the top
layer in order for the top layer to have predictive power. I achieved 1.25% error
(125 errors) on MNIST using this sort of training on a network with a single hidden
layer. Notably, there seemed to be a trade-off between the quantity of noise (and
how well the model mixed) and classification accuracy. Again, networks with just a
single hidden layer performed better than networks with multiple hidden layers, but
the difference wasn't nearly as large in this case. Results varied greatly for
networks with 2 or 3 hidden layers, but I had a trick that got about 1.6% error.
This trick involved training very similarly to Ian Goodfellow's paper on jointly
training deep Boltzmann machines. I would set some subset of the inputs, run the
network, and compute the cost function on the complement of that subset. I would
generally keep about 75% of the x units and only 20% of the y units (the 1-hot
vector). Computing costs on all of the elements rather than just the complement had
no significant impact on performance (loss of .05%, probably not meaningful). All of
these trials were done with no noise anywhere within the network (adding noise hurt
classification). Running with costs evaluated on all elements seems like it should
be identical to just applying dropout to the top and bottom layers, but the results
were considerably worse when I just applied dropout to the top and bottom layers.
The only difference between these 2 approaches is that the standard dropout solution
applies the corruption at every iteration of the model, while my DBM-inspired
approach only applied the corruption before initializing the network.
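A minimal sketch of that corruption scheme, with the keep fractions from above
(function and variable names are just illustrative):

    import numpy as np

    def mask_once(x, y, keep_x=0.75, keep_y=0.20, rng=np.random):
        # Sample a keep-mask once, before initializing the network, and zero out
        # the dropped units. The returned complement masks select which units the
        # reconstruction cost is evaluated on. Unlike standard dropout, this
        # corruption is not re-sampled at every GSN iteration.
        mask_x = (rng.uniform(size=x.shape) < keep_x).astype(x.dtype)
        mask_y = (rng.uniform(size=y.shape) < keep_y).astype(y.dtype)
        x_in, y_in = x * mask_x, y * mask_y
        cost_mask_x, cost_mask_y = 1 - mask_x, 1 - mask_y
        return x_in, y_in, cost_mask_x, cost_mask_y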
Other things of note:
I generally only applied noise on the input, the top hidden layer, and the
prediction layer because this appeared to work best.
To corrupt the y (1-hot) layer, I added Gaussian noise of magnitude .75 and then
took the softmax of the vector (to give a probability vector that sums to 1).
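In code, that corruption looks roughly like this (assuming "magnitude" means the
standard deviation of the noise):

    import numpy as np

    def corrupt_y(y_onehot, sigma=0.75, rng=np.random):
        # Add Gaussian noise to the 1-hot vector, then softmax so the result is
        # again a probability vector whose components sum to 1.
        z = y_onehot + rng.normal(scale=sigma, size=y_onehot.shape)
        z = z - z.max(axis=-1, keepdims=True)   # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)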
Training a model with both x and y set, and then doing further training predicting
y given x, didn't give the absolute best classification results but did produce
models that both sampled well and classified fairly well.
Other things I found throughout the summer
------------------------------------------
One interesting thing happened while I was debugging my GSN. I reduced the case
until I was dealing with a single autoencoder. I was attempting to learn the
identity map with an autoencoder with no noise, linear activations, tied weights, a
mean squared error cost function, and SGD. The input data was uniformly distributed
within the n=10 dimensional unit hypercube, and the hidden and output layers each
had n units. Ideally, this autoencoder should easily learn the identity function.
However, when I attempted to train this autoencoder with bias terms included, the
weights went to 0 and the biases just learned the center of the hypercube (the mean
squared error was equal to the variance of the distribution). The output with bias
is equal to W'(Wx + b_1) + b_2, where W' is the transpose of W. When I ran the same
network without bias terms (or with the biases constrained to 0, same thing), the
network fairly quickly learned a W such that W'Wx = x => W'W = I => W is an
orthogonal matrix. This was an interesting case of adding power to the model (the
bias terms) resulting in worse performance due to the creation of a new local
minimum (at W = 0).
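A small numpy sketch of this setup (hyperparameters here are illustrative, not the
ones actually used):

    import numpy as np

    rng = np.random.RandomState(0)
    n, lr, steps, use_bias = 10, 0.05, 50000, True

    W = 0.1 * rng.randn(n, n)
    b1, b2 = np.zeros(n), np.zeros(n)

    for _ in range(steps):
        x = rng.uniform(size=n)            # one point in the unit hypercube
        h = W.dot(x) + b1                  # linear encoder
        x_hat = W.T.dot(h) + b2            # decoder with tied (transposed) weights
        e = x_hat - x                      # d(0.5 * ||x_hat - x||^2) / d(x_hat)
        gW = np.outer(h, e) + np.outer(W.dot(e), x)  # decoder + encoder terms
        gb1, gb2 = W.dot(e), e
        W -= lr * gW
        if use_bias:
            b1 -= lr * gb1
            b2 -= lr * gb2

    # Per the observation above: with use_bias=True, W tends toward 0 and b2 toward
    # ~0.5 in each component (the hypercube center); with use_bias=False, W'W tends
    # toward the identity.
    print(np.abs(W).max(), b2)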
Also, I experimented with Radford Neal's funnel distribution (code at
https://github.com/lightcatcher/funnel_gsn ). This distribution is known for being
difficult to sample from. My various attempts to sample this distribution were
unsuccessful (I couldn't get the GSN to capture any structure). I could get correct
marginal distributions for some of the variables, but the joint distribution would
be all wrong.
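For reference, exact samples from Neal's funnel are easy to draw by ancestral
sampling in the standard parameterization (the linked repo may use different
parameters); the difficulty is for methods that only see the data or the joint
density, such as MCMC or a GSN trained on those samples:

    import numpy as np

    def sample_funnel(n_samples, dim=9, rng=np.random):
        # Ancestral sampling from Neal's funnel: v ~ N(0, 3^2) and, given v,
        # each remaining coordinate x_i ~ N(0, e^v) (variance e^v).
        v = rng.normal(scale=3.0, size=(n_samples, 1))
        x = rng.normal(scale=np.exp(v / 2.0), size=(n_samples, dim))
        return np.hstack([v, x])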