Well, because in past years most AI researchers didn't talk about this. The majority were focused on squeezing out 1% more ImageNet accuracy even if it made the model 3x bigger (which has its own advantages). But now we have good accuracy with models that are gigabytes in size and we can't deploy them (even more problematic for edge devices).
Umm.. Yes. While designing models, one thing researchers find particularly interesting is that most of the weights in neural networks are redundant. They don't contribute to accuracy (sometimes they even decrease it).
In pruning, we rank the neurons in the network according to how much they contribute. The ranking can be done by the L1/L2 norm of neuron weights, their mean activations, the number of times a neuron wasn't zero on some validation set, and other creative methods. The simplest method is ranking weights by their absolute values. We then remove the low-ranking neurons from the network, resulting in a smaller and faster network.
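To make that concrete, here is a minimal sketch of magnitude-based pruning in PyTorch (the function name and the 90% sparsity target are just illustrative assumptions, not from any particular paper):

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
    # Number of weights to remove.
    k = int(sparsity * weight.numel())
    # Magnitude of the k-th smallest |weight| becomes the cutoff.
    threshold = weight.abs().flatten().kthvalue(k).values
    # Binary mask: keep only weights strictly above the cutoff.
    mask = (weight.abs() > threshold).float()
    return weight * mask

w = torch.randn(100, 100)
w_pruned = magnitude_prune(w, sparsity=0.9)
print(f"zeroed: {(w_pruned == 0).float().mean():.0%}")  # ~90%
```

In practice this is usually done iteratively: prune a little, fine-tune to recover accuracy, and repeat.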
You might not believe it, but it works exceptionally well.🔥 Even the simplest methods can remove 90% of connections. More careful approaches can remove up to 95% of weights without any significant accuracy loss (sometimes even a gain). That's crazy, I know.
Well, if you look into the implementations of major deep learning frameworks, you will find they make heavy use of GEMM operations through BLAS libraries. These libraries are very efficient at dense matrix multiplication. But when we start removing weights by pruning, it creates sparsity in the matrices. Even with 90% fewer calculations, sparse matrices can take more time for matrix multiplication.
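You can check this yourself with a quick (admittedly rough) benchmark; on most machines a 90%-sparse CSR matrix is no faster than plain dense BLAS, despite doing far fewer FLOPs (the sizes here are arbitrary):

```python
import time
import numpy as np
import scipy.sparse as sp

n = 2048
x = np.random.randn(n, n).astype(np.float32)

# Dense weight matrix vs. the same matrix with 90% of entries pruned to zero.
dense = np.random.randn(n, n).astype(np.float32)
pruned = np.where(np.random.rand(n, n) < 0.10, dense, 0.0).astype(np.float32)
sparse = sp.csr_matrix(pruned)

t0 = time.perf_counter(); _ = dense @ x; t1 = time.perf_counter()
t2 = time.perf_counter(); _ = sparse @ x; t3 = time.perf_counter()
print(f"dense GEMM:      {t1 - t0:.4f} s")
print(f"90%-sparse GEMM: {t3 - t2:.4f} s")  # often no faster, despite 90% fewer FLOPs
```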
To overcome this problem, research has split mainly into two directions:
- Creating efficient sparse algebra libraries.
- Structured Pruning - Pruning whole layers, filters, or channels instead of individual weights.
Actually, I haven't seen much progress on the 1st approach, but a lot of papers are being published improving the 2nd. For the time being, pruning whole filters and channels is better if you want to compress your models.
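As a sketch of what structured pruning looks like, here is one common recipe: rank a convolution's filters by their L1 norm and rebuild the layer with only the strongest ones. The helper names below are hypothetical, and note that in a real network the next layer's input channels must be shrunk to match:

```python
import torch
import torch.nn as nn

def rank_filters_by_l1(conv: nn.Conv2d) -> torch.Tensor:
    # Score each output filter by the L1 norm of its weights.
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    # Indices sorted from least to most important.
    return torch.argsort(scores)

def prune_filters(conv: nn.Conv2d, n_prune: int) -> nn.Conv2d:
    # Keep the highest-scoring filters, preserving their original order.
    keep = torch.sort(rank_filters_by_l1(conv)[n_prune:]).values
    new_conv = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep].clone()
    return new_conv

conv = nn.Conv2d(64, 128, kernel_size=3)
smaller = prune_filters(conv, n_prune=64)  # drop the 64 weakest filters
print(smaller)  # Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
```

Because the result is a genuinely smaller dense layer, it runs fast on ordinary BLAS with no sparse libraries needed.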
Although almost all big players are working on this, I find Tencent's PocketFlow and Intel's Distiller are the only working frameworks (to my knowledge; let me know if you know some better tools).
Here are some good links:
https://jacobgil.github.io/deeplearning/pruning-deep-learning
https://github.com/Eric-mingjie/rethinking-network-pruning
https://www.youtube.com/watch?v=s7DqRZVvRiQ
https://eng.uber.com/deconstructing-lottery-tickets/
https://github.com/he-y/Awesome-Pruning
https://github.com/memoiry/Awesome-model-compression-and-acceleration