Notes from a summer of working with the LISA group.
Mostly about what I learned about training supervised GSNs.
Feel free to contact me with any questions about this material:
email: [email protected]
Skype: eric.a.martin
Here are some notes from a summer of work. I'm primarily writing these down
so the knowledge doesn't get lost. Most of this document describes what
did and did not work for training supervised GSNs. I'm also including a couple
of curious things I found this summer.
Supervised GSN training experience
----------------------------------
These supervised GSNs attempted to learn the joint distribution between 2 vectors.
One vector was placed at the bottom of the GSN, and the other vector was placed at
the top of the GSN (as the top layer), with the hidden layers in between.
For my work, I was primarily trying to get results on MNIST. The 784-component image
was the bottom layer of the GSN, and the 10-component prediction vector was the top
layer. I'll refer to the bottom layer (the images) as x and the top layer
(predictions) as y.
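For concreteness, a hypothetical layer layout in this style might look like the
following (the hidden layer sizes are purely illustrative; the experiments varied
depth and layer sizes):

    # Hypothetical layer layout (sizes are illustrative, not the exact configs used).
    layer_sizes = [
        784,   # x: flattened 28x28 MNIST image, bottom layer of the GSN
        1000,  # hidden layer (the number and size of these varied across runs)
        10,    # y: 10-component 1-hot prediction vector, top layer of the GSN
    ]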
All experiments were done with binary cross entropy costs on both the top and
bottom layers. Training was done by SGD with a momentum term. The
MonitorBasedLRAdjuster of pylearn2 was used to adjust the learning rate. Various
combinations of network depths, layer sizes, and noise types and magnitudes were
explored. The cross entropy costs on all layers were normalized by the vector size
(so the cost was averaged over the components of the vector).
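As a reference point, here is a minimal numpy sketch of the per-component-normalized
binary cross entropy and the momentum update. The actual experiments used pylearn2's
implementations of both; the names below are just illustrative.

    import numpy as np

    def normalized_bce(target, output, eps=1e-7):
        # Binary cross entropy averaged over the components of the vector,
        # matching the per-component normalization described above.
        output = np.clip(output, eps, 1 - eps)
        return -np.mean(target * np.log(output)
                        + (1 - target) * np.log(1 - output))

    def sgd_momentum_step(param, grad, velocity, learning_rate=0.1, momentum=0.5):
        # Classical SGD-with-momentum update; in the real runs the learning rate
        # was additionally adjusted by pylearn2's MonitorBasedLRAdjuster.
        velocity = momentum * velocity - learning_rate * grad
        return param + velocity, velocity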
I was primarily looking at the classification accuracy on MNIST, while trying to
maintain models that appeared to both sample (and mix) well. Training was done with
noise, but prediction was done without any noise. Classification was done by running
the GSN, averaging the computed prediction vectors, and then taking the argmax.
Classification performance was best when only the first computed prediction vector
was used. Classification also involved clamping (i.e. fixing at each iteration) the
x vector.
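A rough sketch of that classification procedure. The gsn.* interface here is an
assumption made for illustration, not the actual pylearn2 GSN API:

    import numpy as np

    def classify(gsn, x, n_steps=4, use_first_only=True):
        # `gsn.init_state`, `gsn.step`, and `gsn.clamp_x` are assumed helpers:
        # initialize the walkback chain from x, run one GSN iteration with noise
        # disabled (returning the current y prediction), and re-clamp the x units
        # to the original input.
        preds = []
        state = gsn.init_state(x)
        for _ in range(n_steps):
            state, y_hat = gsn.step(state, noise=False)
            state = gsn.clamp_x(state, x)
            preds.append(y_hat)
        y = preds[0] if use_first_only else np.mean(preds, axis=0)
        return np.argmax(y, axis=-1)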
Note that there are quite a few different ways to train a joint model such as a GSN.
Possibilities include setting both x and y, setting just x and predicting y, and
vice versa.
Key takeaways:
Setting both x and y during training resulted in bad classification performance.
I was never able to get under ~3.8% error with this sort of training. The 3.8% error
happened on a network with a single hidden layer, and the error became much worse
as the network became deeper (50% for a network with 3 hidden layers). I believe this
happened because both x and y learned to autoencode themselves separately and the
network never learned to communicate between layers. Adding large amounts of noise
to the top layer did not help with this problem. I believe this happened because of
the very small amount of information at the top layer (it's just a 10-component
1-hot vector, so log_2(10) ~= 3.3 bits) and the relatively large layers (at least
100 neurons) next to the top layer.
The non-communicating layers hypothesis was also supported by the relative success
of training where x was given and y was predicted (with backprop on both x and y).
This sort of training forced information from the bottom layer to reach the top
layer in order for the top layer to have predictive power. I achieved 1.25% error
(125 errors) on MNIST using this sort of training on a network with a single hidden
layer. Notably, there seemed to be a trade-off between the quantity of noise (and
how well the model mixed) and classification accuracy. Again, networks with just a
single hidden layer performed better than networks with multiple hidden layers, but
the difference wasn't nearly as large in this case. Results varied greatly for
networks with 2 or 3 hidden layers, but I had a trick that got about 1.6% error.
This trick involved training very similarly to Ian Goodfellow's paper on jointly
training deep Boltzmann machines. I would set some subset of the inputs, run the
network, and compute the cost function on the complement of that subset. I would
generally keep about 75% of the x units and only 20% of the y units (the 1-hot
vector). Computing costs on all of the elements rather than just the complement had
no significant impact on performance (loss of .05%, probably not meaningful). All of
these trials were done with no noise anywhere within the network (adding noise hurt
classification). Running with costs evaluated on all elements seems like it should
be identical to just applying dropout to the top and bottom layers, but the results
were considerably worse when I just applied dropout to the top and bottom layers.
The only difference between these 2 approaches is that the standard dropout solution
applies the corruption at every iteration of the model, while my DBM-inspired
approach only applied the corruption before initializing the network.
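A minimal sketch of that corruption scheme, with the keep fractions from above
(function and variable names are just illustrative):

    import numpy as np

    def mask_once(x, y, keep_x=0.75, keep_y=0.20, rng=np.random):
        # Sample a keep-mask once, before initializing the network, and zero out
        # the dropped units. The returned complement masks select which units the
        # reconstruction cost is evaluated on. Unlike standard dropout, this
        # corruption is not re-sampled at every GSN iteration.
        mask_x = (rng.uniform(size=x.shape) < keep_x).astype(x.dtype)
        mask_y = (rng.uniform(size=y.shape) < keep_y).astype(y.dtype)
        x_in, y_in = x * mask_x, y * mask_y
        cost_mask_x, cost_mask_y = 1 - mask_x, 1 - mask_y
        return x_in, y_in, cost_mask_x, cost_mask_y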
Other things of note:
I generally only applied noise on the input, the top hidden layer, and the
prediction layer because this appeared to work best.
To corrupt the y (1-hot) layer, I added Gaussian noise of magnitude .75 and then
took the softmax of the vector (to give a probability vector that sums to 1).
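In code, that corruption looks roughly like this (assuming "magnitude" means the
standard deviation of the noise):

    import numpy as np

    def corrupt_y(y_onehot, sigma=0.75, rng=np.random):
        # Add Gaussian noise to the 1-hot vector, then softmax so the result is
        # again a probability vector whose components sum to 1.
        z = y_onehot + rng.normal(scale=sigma, size=y_onehot.shape)
        z = z - z.max(axis=-1, keepdims=True)   # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)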
Training a model with both x and y set, and then doing further training predicting
y given x, didn't give the absolute best classification results but did produce
models that both sampled well and classified fairly well.
Other things I found throughout the summer
------------------------------------------
One interesting thing happened while I was debugging my GSN. I reduced the case
until I was dealing with a single autoencoder. I was attempting to learn the
identity map with an autoencoder with no noise, linear activations, tied weights, a
mean squared error cost function, and SGD. The input data was uniformly distributed
within the n=10 dimensional unit hypercube, and the hidden and output layers each
had n units. Ideally, this autoencoder should easily learn the identity function.
However, when I attempted to train this autoencoder with bias terms included, the
weights went to 0 and the biases just learned the center of the hypercube (the mean
squared error was equal to the variance of the distribution). The output with bias
is equal to W'(Wx + b_1) + b_2, where W' is the transpose of W. When I ran the same
network without bias terms (or with the biases constrained to 0, same thing), the
network fairly quickly learned a W such that W'Wx = x => W'W = I => W is an
orthogonal matrix. This was an interesting case of adding power to the model (the
bias terms) resulting in worse performance due to the creation of a new local
minimum (at W = 0).
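A small numpy sketch of this setup (hyperparameters here are illustrative, not the
ones actually used):

    import numpy as np

    rng = np.random.RandomState(0)
    n, lr, steps, use_bias = 10, 0.05, 50000, True

    W = 0.1 * rng.randn(n, n)
    b1, b2 = np.zeros(n), np.zeros(n)

    for _ in range(steps):
        x = rng.uniform(size=n)            # one point in the unit hypercube
        h = W.dot(x) + b1                  # linear encoder
        x_hat = W.T.dot(h) + b2            # decoder with tied (transposed) weights
        e = x_hat - x                      # d(0.5 * ||x_hat - x||^2) / d(x_hat)
        gW = np.outer(h, e) + np.outer(W.dot(e), x)  # decoder + encoder terms
        gb1, gb2 = W.dot(e), e
        W -= lr * gW
        if use_bias:
            b1 -= lr * gb1
            b2 -= lr * gb2

    # Per the observation above: with use_bias=True, W tends toward 0 and b2 toward
    # ~0.5 in each component (the hypercube center); with use_bias=False, W'W tends
    # toward the identity.
    print(np.abs(W).max(), b2)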
Also, I experimented with Radford Neal's funnel distribution (code at
https://github.com/lightcatcher/funnel_gsn ). This distribution is known for being
difficult to sample from. My various attempts to sample this distribution were
unsuccessful (I couldn't get the GSN to capture any structure). I could get correct
marginal distributions for some of the variables, but the joint distribution would
be all wrong.
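For reference, exact samples from Neal's funnel are easy to draw by ancestral
sampling in the standard parameterization (the linked repo may use different
parameters); the difficulty is for methods that only see the data or the joint
density, such as MCMC or a GSN trained on those samples:

    import numpy as np

    def sample_funnel(n_samples, dim=9, rng=np.random):
        # Ancestral sampling from Neal's funnel: v ~ N(0, 3^2) and, given v,
        # each remaining coordinate x_i ~ N(0, e^v) (variance e^v).
        v = rng.normal(scale=3.0, size=(n_samples, 1))
        x = rng.normal(scale=np.exp(v / 2.0), size=(n_samples, dim))
        return np.hstack([v, x])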