My third neural network experiment (the second was an FIR filter). Each DFT output bin is just a linear combination of the inputs, so it should be implementable by a single layer with no activation function.
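Roughly what I mean, as a minimal sketch (the transform length, training data, and optimizer here are placeholders, not necessarily what I actually used):

```python
# Minimal sketch: a single linear layer (no activation, no bias) learning the DFT.
import numpy as np
import tensorflow as tf

N = 32  # transform length (placeholder)

# Random real training signals and their DFTs, split into real/imag parts
# so the network only has to deal with real numbers.
x = np.random.randn(10000, N)
X = np.fft.fft(x, axis=-1)
y = np.hstack([X.real, X.imag])  # shape (10000, 2*N)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(N,)),
    tf.keras.layers.Dense(2 * N, activation=None, use_bias=False),
])
model.compile(optimizer='adam', loss='mse')
model.fit(x, y, epochs=50, batch_size=64, verbose=0)

# The learned kernel should approximate the real/imag parts of the DFT basis.
W = model.get_weights()[0]  # shape (N, 2*N)
```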
Animation of weights being trained:
Red weights are positive, blue are negative. The black squares (2336 out of 4096) are unused and could be pruned out to save computation time (if I knew how to do that).
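Presumably something like simple magnitude pruning would work, continuing from the sketch above (the threshold is made up, and this only helps if the framework or hardware can actually exploit the sparsity):

```python
# Hypothetical magnitude pruning: zero out the near-zero ("black") weights.
# W and x come from the sketch above.
import numpy as np

threshold = 1e-3                      # made-up cutoff
mask = np.abs(W) > threshold
W_pruned = W * mask
print(f"pruned {W.size - np.count_nonzero(mask)} of {W.size} weights")

# Inference is still a full dense matmul unless the zeros are exploited,
# e.g. by storing W_pruned in a sparse matrix format.
y_hat = x @ W_pruned
```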
Even with pruning, it would be less efficient than an FFT, so if frequency content is useful, it's probably best to compute the FFT externally and provide its output as separate inputs?
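For example (hypothetical; rfft and this particular feature layout are just one way to do it):

```python
# Hypothetical sketch: compute the FFT outside the network and feed it in
# alongside the raw samples as extra input features.
import numpy as np

def with_fft_features(x):
    """x: (batch, N) real signals -> (batch, N + 2*(N//2 + 1)) features."""
    X = np.fft.rfft(x, axis=-1)
    return np.hstack([x, X.real, X.imag])
```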
Still, the experiment at least demonstrates that a neural network can figure out frequency content on its own if that's useful to the problem.
The loss goes down for a while but then goes up. I don't know why:
@endolith That's one idea they explored in the paper, yes. They showed you can learn the underlying sparse structure of otherwise "fully connected" layers using their method. Used that way, it's a new kind of matrix/model compression, and it contrasts sharply with the other approach that's popular in DL, pruning.
They also combine learning the weights of matrices with butterfly sparsity patterns and learning a permutation out of a discrete set of choices to show that Adam does indeed find solutions that are the usual "fast" linear transforms from signal processing (FFT, fast DCT, fast DST, etc.).
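Rough illustration of what those butterfly-sparsity-pattern factors look like (not the paper's code; real-valued and random just to show the structure):

```python
# Illustration of the butterfly sparsity pattern: log2(N) factors, each with
# only 2 nonzeros per row. With complex weights and a bit-reversal permutation,
# a product of such factors can represent the DFT (that's exactly the FFT).
import numpy as np

def random_butterfly_factor(n, stage):
    """One butterfly factor: row i couples index i with i XOR 2**stage."""
    B = np.zeros((n, n))
    stride = 1 << stage
    for i in range(n):
        B[i, i] = np.random.randn()
        B[i, i ^ stride] = np.random.randn()
    return B

N = 16
factors = [random_butterfly_factor(N, s) for s in range(int(np.log2(N)))]
M = np.linalg.multi_dot(factors)  # dense product of the sparse factors

# 2*N*log2(N) learnable weights instead of N**2 for a dense layer.
print(sum(np.count_nonzero(B) for B in factors), "nonzeros vs", N * N)
```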