- Simple Intro by Blei mostly going over review paper of Jordan
- Later introduces SVI (Stochastic VI) as a remedy for making VI tractable on large datasets.
- Review the black box inference -assumption free VI- http://www.jmlr.org/proceedings/papers/v33/ranganath14.pdf
- Key idea is swapping the order of the gradient and the expectation in the VI formulation. Computing the expectation in closed form requires an exponential family assumption. Moving the gradient inside the expectation removes that requirement as long as the overall method is stochastic: samples give unbiased gradient estimates, which is what stochastic optimization under the Robbins-Monro conditions needs. However, the variance is very large, so further tricks are required.
- Final part of the tutorial was about extending VAE. The main idea is removing the uncorrelated noise assumption. A few pointers to the literature:
- Variational Gaussian Process - http://dustintran.com/papers/TranRanganathBlei2016.pdf
- Gradient Flow https://arxiv.org/pdf/1401.4082v3.pdf
- Flow-like work, mostly about reaching any covariate through a sequence of approximations https://arxiv.org/pdf/1602.05473v4.pdf
- Somewhere between diagonal and full covariance matrix for Gaussians http://aivalley.com/Papers/textVAUX_TR.pdf
- Normalizing Flow https://arxiv.org/pdf/1606.04934v1.pdf
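As a hedged illustration of the black-box VI idea noted above (gradient moved inside the expectation, then estimated by sampling), here is a minimal sketch of the score-function estimator. The toy target `log_p`, the unit-variance Gaussian family `log_q`, the sample count, and the `1/t` step size are all assumptions chosen for illustration; none of them come from the tutorial.

```python
import math
import random


def log_p(z):
    # Hypothetical unnormalized target: a Gaussian centered at 3 with unit variance.
    return -0.5 * (z - 3.0) ** 2


def log_q(z, mu):
    # Variational family (assumed): N(mu, 1).
    return -0.5 * (z - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)


def score_function_gradient(mu, num_samples, rng):
    # Monte Carlo estimate of E_q[ d/dmu log q(z) * (log p(z) - log q(z)) ].
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mu, 1.0)
        score = z - mu  # d/dmu log N(z; mu, 1)
        total += score * (log_p(z) - log_q(z, mu))
    return total / num_samples


def fit(num_steps=2000, num_samples=20, seed=0):
    rng = random.Random(seed)
    mu = 0.0
    for t in range(1, num_steps + 1):
        # Step sizes 1/t: their sum diverges while the sum of squares converges.
        step = 1.0 / t
        mu += step * score_function_gradient(mu, num_samples, rng)
    return mu
```

Calling `fit()` should drive `mu` toward 3 in this toy setup. Individual gradient estimates remain noisy, which is the variance problem the notes mention.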
-
Very clear review of adversarial training, relating NCE, GAN, etc. to each other with a clean table.
-
Plug and Play Generative Models
-
Combining multiple different generative models
-
Visually appealing results
-
- Very interesting talk; everyone should watch it when the videos are released. He had many major points, but the most striking ones were:
- The ML/stats community can help ecology a lot. For example, we over-produce electricity and food by more than 30% because we cannot predict usage.
- Intelligence emerged in a natural environment, and thinking that artificial intelligence can emerge without a natural environment sounds pretty wrong. The most striking point for me was when he put natural/artificial and simulated/real as two different axes. This is clearly the correct way to approach the transfer learning problem. We need simulated and natural environments for emergent behavior because emergent behavior is the key point of nature.
- There is nothing called metadata; it was invented by the NSA. All metadata is data for ML purposes.
-
Best Paper Award Value Iteration Networks
- Value iteration is in principle a convnet :) It is matrix multiplication (state transitions) followed by max-pooling (max over future states). So, they simply did this and learned the entire policy directly.
- It works great and generalizes to some other environments.
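To make the "transition step followed by a max" reading of value iteration concrete, here is a plain tabular sketch on a grid. It is not the paper's network: the deterministic moves, the reward-on-entry convention, and the names `value_iteration`, `reward`, and `gamma` are assumptions for illustration.

```python
def value_iteration(reward, gamma=0.9, num_iters=100):
    """Tabular value iteration on a grid with deterministic moves.

    reward[r][c] is the (assumed) reward collected when entering cell (r, c).
    """
    rows, cols = len(reward), len(reward[0])
    # Up, down, left, right, stay: each move plays the role of one "channel".
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
    values = [[0.0] * cols for _ in range(rows)]
    for _ in range(num_iters):
        new_values = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                q_values = []
                for dr, dc in moves:
                    nr, nc = r + dr, c + dc
                    # Moving off the grid leaves the agent where it is.
                    if not (0 <= nr < rows and 0 <= nc < cols):
                        nr, nc = r, c
                    # Local, shift-like lookup: the "convolution" part.
                    q_values.append(reward[nr][nc] + gamma * values[nr][nc])
                # Max over moves: the "max-pooling" part.
                new_values[r][c] = max(q_values)
        values = new_values
    return values
```

For example, `value_iteration([[0.0, 0.0, 1.0]], gamma=0.5)` propagates the reward from the rightmost cell back along the row.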
-
Hierarchical Clustering via …
- Hierarchical clustering does not have a clear cost function, unlike k-means and k-medoids. This paper simply proposes one.
- Let's say T is the tree of the hierarchical clustering s.t. each leaf is a data point. Then the cost is \sum_{i,j} k(i,j) |leaves(T(lca(i,j)))|
- lca(a, b): lowest common ancestor of a and b
- leaves(T(x)): number of leaves having x as ancestor
- k(i, j): similarity metric of data points
- This is NP-hard (Dasgupta 16).
- |leaves(T(lca(i,j)))| - 1 is an ultrametric (metric w/ strong triangle inequality)
- They propose O(n^3) algorithm using ILP and its LP relaxation
- Similar result in SODA17 by Moses
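The cost defined above can be evaluated directly on a small tree. The sketch below assumes a tree encoded as nested tuples with integer leaves and sums over unordered pairs (summing over ordered pairs would only double the value); the encoding and function names are illustrative.

```python
def leaves(tree):
    # A leaf is any non-tuple value; an internal node is a tuple of children.
    if not isinstance(tree, tuple):
        return [tree]
    result = []
    for child in tree:
        result.extend(leaves(child))
    return result


def hierarchical_cost(tree, similarity):
    """Sum over pairs {i, j} of similarity(i, j) * |leaves under lca(i, j)|."""
    if not isinstance(tree, tuple):
        return 0.0
    child_leaves = [leaves(child) for child in tree]
    size = sum(len(group) for group in child_leaves)
    cost = 0.0
    # Pairs split across two different children have this node as their lca.
    for a in range(len(child_leaves)):
        for b in range(a + 1, len(child_leaves)):
            for i in child_leaves[a]:
                for j in child_leaves[b]:
                    cost += similarity(i, j) * size
    # Pairs inside one child are handled recursively.
    for child in tree:
        cost += hierarchical_cost(child, similarity)
    return cost


def toy_similarity(i, j):
    # Hypothetical similarity: points {0, 1} and {2, 3} form two tight groups.
    return 1.0 if i // 2 == j // 2 else 0.1
```

With `toy_similarity`, the tree `((0, 1), (2, 3))` that merges similar points low in the hierarchy gets a lower cost than `((0, 2), (1, 3))`, which is the behavior the cost function is meant to reward.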
-
Self Cluster Query…
- Active setting for clustering. Interaction is the question "is a and b in the same cluster?"
- Theoretical setup: n points, k clusters in d dimensions
- With no extra assumption, the learner needs to ask Gamma(n) questions. So not applicable.
- If there is a margin between clusters, one can make O(knd) algorithm with O(k log n) questions.
-
Time-Contrastive Learning and Nonlinear ICA
- Main idea: take a nonlinear ICA model w/ Gaussians and design a contrastive learning scheme that divides the time axis into k segments and trains a supervised MLP to separate the segments. They prove that the learned representation is indeed a linear ICA.
- Although their setup is Gaussian, the variance of the Gaussian is non-stationary. They also assume the mixing function is smooth and non-linear.
-
Good Seedings for k-Means
- Problem: can we have efficient seeding for k-Means? k-Means++ does produce good seeds but is not so efficient; it requires a linear pass over all points.
- They start with k-means++ and its D^2 sampling, which samples from p(x) \propto d(x, old_cent)^2
- They design a Markov chain with the same stationary distribution, with data points as states
- They require a pass over the dataset for the initial proposal distribution, which is used in MCMC
- Has good guarantees and is super easy to use: simply pip install kmc2
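The seeding idea above can be sketched without the `kmc2` package. This is a from-scratch Metropolis-Hastings approximation of D^2 sampling and not that library's API. The proposal (half D^2 to the first center, half uniform), the chain length, and all function names are assumptions for illustration.

```python
import itertools
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def min_squared_distance(point, centers):
    return min(squared_distance(point, c) for c in centers)


def mcmc_seeding(points, k, chain_length=50, seed=0):
    rng = random.Random(seed)
    n = len(points)
    indices = list(range(n))
    centers = [points[rng.randrange(n)]]
    # Single pass over the data to build the proposal distribution (assumed form).
    d0 = [squared_distance(p, centers[0]) for p in points]
    total = sum(d0)
    if total == 0.0:
        proposal = [1.0 / n] * n
    else:
        proposal = [0.5 * d / total + 0.5 / n for d in d0]
    cumulative = list(itertools.accumulate(proposal))
    for _ in range(k - 1):
        current = rng.choices(indices, cum_weights=cumulative)[0]
        current_dist = min_squared_distance(points[current], centers)
        for _ in range(chain_length - 1):
            candidate = rng.choices(indices, cum_weights=cumulative)[0]
            candidate_dist = min_squared_distance(points[candidate], centers)
            # Independence sampler targeting p(x) proportional to D^2(x).
            if current_dist == 0.0 or rng.random() < (
                candidate_dist * proposal[current]
            ) / (current_dist * proposal[candidate]):
                current, current_dist = candidate, candidate_dist
        centers.append(points[current])
    return centers
```

Each new center costs only `chain_length` distance evaluations against the current centers, instead of a full pass over the data as in exact D^2 sampling.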
-
Using Fast Weights to Attend to the Recent Past
- Main problem is modelling short-term memory, so basically state dynamics + short-term memory + long-term memory.
- Main idea is very simple: adding intermediate steps within the RNN. (Not so) surprisingly, this ends up being Hopfield networks.
-
Sequential Neural Models with Stochastic Layers
- Idea is combining RNN with state-space models.
- RNN is not stochastic friendly but state-space is. However, state space is not easy to learn. Hence, they combine them.
- Most hacky part is that they design an inverse RNN to model and learn the proposal function.
- Mostly a very straightforward usage of VAE and RNN with the aforementioned little hack.
-
Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences
- Try to solve vanishing gradients to learn very long RNNs with no issue.
- They also solve async input sources.
- Key idea is using a khronos gate to keep track of time per neuron. Hence, LSTM sleeps if it is not working and gradients do not vanish.
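A rough sketch of such a per-neuron time gate follows, written from my recollection of the Phased LSTM paper and therefore to be read as an assumption, not a reference implementation. The parameter names `period`, `shift`, `open_ratio`, and `leak` are illustrative.

```python
def time_gate(t, period, shift, open_ratio, leak):
    """Openness of the gate at time t for one neuron (value in [0, 1])."""
    # Phase in [0, 1) within the neuron's own oscillation cycle.
    phase = ((t - shift) % period) / period
    if phase < 0.5 * open_ratio:
        # Opening: ramps linearly from 0 to 1.
        return 2.0 * phase / open_ratio
    if phase < open_ratio:
        # Closing: ramps linearly from 1 back to 0.
        return 2.0 - 2.0 * phase / open_ratio
    # "Sleeping": a small leak keeps some gradient flowing.
    return leak * phase


def gated_update(previous_state, proposed_state, openness):
    # A closed gate (openness near 0) carries the old state forward unchanged.
    return openness * proposed_state + (1.0 - openness) * previous_state
```

Because the state is copied through unchanged while the gate is closed, the gradient has fewer effective steps to traverse. Since `t` can be any real-valued timestamp, asynchronous inputs fit naturally.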
-
Deep Learning w/o Poor Minima
- They show that if the network is linear, there is no saddle point without a Hessian w/ a negative eigenvalue. So, it is easy to optimize with GD.
- If ReLU is the only non-linearity, the network is linear as long as the activation pattern does not change, hence it can be analyzed using linear NNs.
- They have very unrealistic assumptions, similar to LeCun's COLT paper. However, they need only half of the assumptions, so it is a good direction.
-
Poking Paper
- Random actions let you learn relationships between actions and results. Basic idea is a combination of encoder/decoder in a predictive setup. Encoders and decoders can be in terms of images and/or actions.
-
Draw What and Where
- They want a conditional GAN, but they also want to control location, like "draw a bird right here".
- They propose a very hacky form of conditional GAN. They design ad hoc architectures to give location as an input.
-
Weight Normalization
- Batch norm without bias
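For reference, the reparameterization usually associated with weight normalization is w = g * v / ||v||, which decouples the length of a weight vector from its direction. The sketch below assumes that form; the function name and the plain-list representation are illustrative.

```python
import math


def weight_norm(v, g):
    """Return w = g * v / ||v|| for a direction vector v and scalar length g."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]
```

For instance, `weight_norm([3.0, 4.0], 2.0)` rescales the vector so that its Euclidean norm equals 2 while keeping its direction.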
-
Finding features which separate distributions
-
Interpretability of data through examples
- Prototype images do not give full picture while looking at the dataset because of very small clusters.
- Prototype + criticisms give better picture
- github.com/BeenKim/MMD-critic
- Unsupervised learning is about discovering high-level abstractions, and we are failing. The fact that conditional labels help might indicate that sample complexity is an issue. Two important things might help with this:
- Learning representations over different time-scales
- Exploiting causality more
-
Real NVP
- If your generative model is invertible, you can use the change of variables formula to relate encoder and decoder, so you learn a single model.
- However, an invertible transformation means input and output dimensions are the same. So, the input noise is very high dimensional.
- They design a specific architecture to make Jacobian computation tractable.
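One common way to get a tractable Jacobian is an affine coupling layer: half of the input passes through unchanged and parameterizes a scale and shift for the other half, so the Jacobian is triangular. The sketch below assumes that design. `scale_fn` and `shift_fn` are hypothetical fixed stand-ins for learned networks, and the input length is assumed to be even.

```python
import math


def scale_fn(x1):
    # Hypothetical stand-in for a learned log-scale network.
    return [math.tanh(v) for v in x1]


def shift_fn(x1):
    # Hypothetical stand-in for a learned shift network.
    return [0.5 * v for v in x1]


def coupling_forward(x):
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, t = scale_fn(x1), shift_fn(x1)
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    # Triangular Jacobian: log|det| is just the sum of the log-scales.
    log_det = sum(s)
    return x1 + y2, log_det


def coupling_inverse(y):
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    # y1 equals x1, so the same scale and shift can be recomputed exactly.
    s, t = scale_fn(y1), shift_fn(y1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return y1 + x2
```

Neither direction requires inverting `scale_fn` or `shift_fn`, so they can be arbitrarily complex functions while the layer stays invertible.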
-
Conditional PixelCNN
- They try to solve the blind spot in PixelCNN.
- They design two filters one horizontal and one vertical and combine them through gating.
-
Stochastic Length Networks
- What we are doing is playing the game of fixing errors and linear chain is not a good idea.
- If you analyze stochastic network, it gives connection between all layers so key is lots of skip connections
- AdaNet: Dense connectivity. They connect everything with the same filter size.
- If you have pooling, first make dense layers then pool.
-
Counter Factual Inference
-
Causal inference from observational data
-
Not supervised learning because it is not trained to differentiate the influence of the treatment (y) in x,y->success. Main idea is: what if y was y'?
-
y_t|X; interest is in E[Y_1 - Y_0|X], factual vs. counterfactual (the action/treatment is applied once)
-
Key Idea: this is domain adaptation problem
-
If treatments are random, same distribution in train and test
-
Idea is to learn a feature s.t. control and train are similar, and use only such points
-
Implicitly assumes there is a policy
-
- Are we actually learning meaningful representations
- No. We are adding the structure with our eyes. The model does not understand the difference btw mountain and mouse in terms of scale.
- We need to understand action as well. In cake analogy, action is the spoon.
- Necessity of generative models
- Babies first learn all phonemes, then lose them after learning a language. If they are useless, maybe there are generative models of phonemes in our brain.
- Interpretability is not important because it is only an engineering intuition and unnecessary. In a way, we want to be in charge although we do not need to be. For example, if I am in a taxi, I do not care what the driver is thinking etc.
- Yann LeCun: We are missing the basic principles behind human/animal learning. Is the brain minimizing an optimization function or doing something else? What is the equivalent of Bernoulli dynamics for intelligence?
- Interpretability vs Principles
- Emergent behavior is the reason for non-interpretable models. For example, neurons are simple and gas dynamics are simple, however their emergent behavior is not interpretable. Compositionality is related to interpretability in a way: at which level do you want the interpretability?
- Humans do unsupervised learning on-line so why are we not doing it?
- The brain has memory so we try to get that memory filled. So, it should be on-line but with lots of memory.
- What is the correct measure for unsupervised learning
- What we have are simple toy test beds, and we hope that they will lead somewhere else. For example, MNIST was not about digit classification itself. So, basically the idea is to work on actual intelligence ideas but test on simple problems.
- What is the best way to benchmark generative modelling, is log-likelihood useful
- We do not have any good way, and log-likelihood is not a good way. Is the Turing test a good measure for AI? It does not include the concept of being less/more wrong; it should not be binary.
- In the regime of large-scale data, log-prob matters. However, we are not even close to that, so maybe we should not care now and care later. Log-likelihood will always have leaking probability.
- The things we care about most require very few bits (like edges etc.), so log-likelihood is not a good choice.
- You can consider even generating the model
- Visual inspection seems like the only way during training to poke
- DCGAN stable up to 64x64 (issues mode dropping and underfitting)
- Use the hacks from Salimans paper about stability
- HACKS:
- Normalize inputs -1/+1 and use tanh for generator output
- Use max log D instead of min log(1 - D) (flip labels while training the generator)
- Use spherical z from Tom White "Sampling Generative Models"
- Batch Norm: compute it over only reals or only fakes, so never combine real and fake in one batch
- Avoid sparse gradients. Use LeakyReLU instead of ReLU. AveragePooling/strided convolution instead of max-pooling. ConvTranspose2d + stride instead of PixelShuffle
- Label Smoothing and/or make the labels noisy for the discriminator w/ some probability
- Use DC-GAN and if not possible try to use combination like KL+GAN (inpainting stuff) or VAE + GAN
- Pfau & Vinyals 2016 / RL tricks helps w/ Experience Replay etc.
- ADAM params in DCGAN works everywhere :)
- If GAN is working it means loss will have a small variance but still it will vary. If it always decreases, it is fooling w/ garbage
- Can you balance via loss stats? It generally does not work
- Improved GAN paper/code
- Add some noise to input and then decrease it through time
- CGAN: Use an embedding layer of 120 dimension and upsample to match image channel
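Two of the hacks listed above, the flipped-label generator objective and label smoothing, can be illustrated with scalar math. This is a sketch under assumptions: the smoothing target of 0.9 and all function names are illustrative, and the gradients are taken with respect to the discriminator logit.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def smoothed_discriminator_loss(d_real, d_fake, real_target=0.9):
    # One-sided label smoothing: real samples get target 0.9 (assumed value), fakes get 0.
    real_term = -(real_target * math.log(d_real)
                  + (1.0 - real_target) * math.log(1.0 - d_real))
    fake_term = -math.log(1.0 - d_fake)
    return real_term + fake_term


def saturating_generator_grad(logit):
    # d/dlogit of log(1 - sigmoid(logit)): the original "min log(1 - D)" objective.
    return -sigmoid(logit)


def non_saturating_generator_grad(logit):
    # d/dlogit of -log(sigmoid(logit)): the flipped-label "max log D" objective.
    return -(1.0 - sigmoid(logit))
```

When the discriminator confidently rejects a fake (a very negative logit), the saturating gradient is close to zero while the flipped-label gradient stays near 1 in magnitude. That is the usual argument for the flipped-label objective.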
- GAN vs Actor Critic: They are the same thing in TD-error case
- GAN Game is POMDP since Generator never sees the environment
- Main idea: indirect inference/posterior-free inference/adversarial learning are all the same thing, so let's read each other's work and not lose time.
- Same as C-GAN but do not give the class label to the discriminator and make D estimate class as well
- ??? This actually makes sense if you think from NCE angle
- GAN: min_G max_D f
- Instead they do max_D min_G f
- The key is to unroll the max_D through time and take the gradient through multiple iterations of it, so the generator sees the context/algorithm of the discriminator.
- To make the loss structured, they define the adversarial term as l(x,s), where s is simply the sum of class labels * class segments. But the labels are GT.
- tonywu95/eval_gan
- Manifold Opt using Geometry
- In general theoretical analysis of first order gradient descent on manifolds
- Convex algorithm for metric learning
- http://suvrit.de/papers/sra_hosseini_chapter.pdf
- https://arxiv.org/abs/1602.06053
- https://arxiv.org/pdf/1605.07147v1.pdf
- https://arxiv.org/pdf/1602.06053v1.pdf
- https://arxiv.org/pdf/1507.08366v2.pdf
- http://suvrit.de/gopt.html