One piece of common wisdom in the era of deep learning is that "depth trumps width". After all, we are not in the era of wide learning. Deeper networks seem to allow for more expressivity, given the compositional nature of layers, holding the number of parameters fixed. From a theoretical perspective, we know that depth-2 networks are universal approximators, but the size of the network can be exponential in the input dimension. I'm pretty sure there has been work showing that, for deeper networks, the dependence on the input dimension is no longer exponential.
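To make the "holding the number of parameters fixed" comparison concrete, here is a minimal sketch in plain Python. The layer widths and the rough parameter budget are illustrative choices of mine, not taken from any particular paper; the point is just that the same budget can buy one very wide hidden layer or many narrow ones.

```python
# A minimal sketch comparing how a fixed parameter budget can be spent on
# width vs. depth. All sizes below are illustrative assumptions.

def mlp_param_count(widths):
    """Parameter count (weights + biases) of a fully connected net whose
    layer widths are given as [d_in, h1, h2, ..., d_out]."""
    return sum(w_in * w_out + w_out for w_in, w_out in zip(widths, widths[1:]))

d_in, d_out = 64, 1

# Depth-2: a single wide hidden layer.
wide = [d_in, 1500, d_out]

# Deep and narrow: ten stacked hidden layers of width 100.
deep = [d_in] + [100] * 10 + [d_out]

print("wide (depth-2): ", mlp_param_count(wide))  # 99,001 parameters
print("deep (depth-11):", mlp_param_count(deep))  # 97,501 parameters
```

Both networks land at roughly 10^5 parameters, so any difference in expressivity between them is attributable to how the budget is arranged, not to its size.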
More recently, we've entered a phase of deep learning in which the number of layers is in the hundreds, if not thousands. This feels like an altogether different regime, which I would call really-deep learning. It also gets closer to the question of scaling.
Also, note that a lot of the theoretical results above concern shallow (depth-2 or depth-3) networks, so they say little about this really-deep regime.