Some things, like the "self-operated napkin", are just too complicated!

Do we really need over 100,000 free parameters to build a good MNIST classifier? It turns out that we can eliminate 50-90% of them.

How many parameters are enough?

The fruit fly is to genetics what the MNIST dataset is to deep learning: the ultimate case study. The task is to classify handwritten digits from 0 to 9 using 28x28 pixel images. A few examples of these images are shown below:

Examples taken from the MNIST dataset.
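If you want to poke at the data yourself, here is a minimal sketch of loading it with torchvision (the root path and batch sizes are arbitrary choices of mine, not anything the tutorials below agree on):

```python
import torch
from torchvision import datasets, transforms

# Download MNIST and convert each 28x28 grayscale image to a float tensor in [0, 1].
train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
test_set = datasets.MNIST(root="./data", train=False, download=True,
                          transform=transforms.ToTensor())

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=1000)

images, labels = next(iter(train_loader))
print(images.shape, labels.shape)  # torch.Size([128, 1, 28, 28]) torch.Size([128])
```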

Ughh, not MNIST tutorials again. There is an endless supply of tutorials that describe how to train an MNIST classifier. Looking over them recently, I noticed something: they all use a LOT of parameters. Here are a few examples:

| Author | Package | Type | Free params | Test acc. |
|---|---|---|---|---|
| TensorFlow | TF | FC | 7,850 | 92% |
| PyTorch | PyTorch | CNN | 21,840 | 99.0% |
| Elite Data Science | Keras | CNN | 113,386 | 99\(^+\)% |
| Machine Learning Mastery | Keras | CNN | 149,674 | 98.9% |
| Lasagne | Lasagne | CNN | 160,362 | 99\(^+\)% |
| Caffe | Caffe | CNN | 366,030 | 98.9% |
| Machine Learning Mastery | Keras | FC | 623,290 | 98% |
| Lasagne | Lasagne | FC | 1,276,810 | 98-99% |
| TensorFlow | TF | CNN | 3,274,625 | 99.2% |

It would seem that you have two options: use a small number of weights and get low accuracy (TensorFlow's logistic regression example), or use 100,000\(^+\) weights and get 99\(^+\)% accuracy. The PyTorch example, which reaches 99.0% with only 21,840 weights, is a notable exception. I think this leads to a common misconception that the best way to gain a few percentage points in accuracy is to double the size of your model.
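If you want to sanity-check the parameter counts in the table, a one-line sum over `model.parameters()` is enough. Below is a minimal sketch in PyTorch; the `nn.Linear(784, 10)` model is my stand-in for the TensorFlow softmax-regression example, and it reproduces the 7,850 figure (784·10 weights plus 10 biases).

```python
import torch.nn as nn

def count_params(model):
    """Count the trainable (free) parameters of a PyTorch model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# A stand-in for the TensorFlow softmax-regression example in the table above.
logreg = nn.Linear(784, 10)   # 784 * 10 weights + 10 biases
print(count_params(logreg))   # 7850
```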

MNIST lite. I was interested in how slightly smaller models would perform on the MNIST task, so I built and trained a few:

| Package | Structure | Type | Free params | Test accuracy |
|---|---|---|---|---|
| PyTorch | 784 \(\mapsto\) 16 \(\mapsto\) 10 | FC | 12,730 | 94.5% |
| PyTorch | 784 \(\mapsto\) 32 \(\mapsto\) 10 | FC | 25,450 | 96.5% |
| PyTorch | 784 \(\mapsto\) (6,4x4) \(\mapsto\) (6,4x4) \(\mapsto\) 25 \(\mapsto\) 10 | CNN | 3,369 | 95.6% |
| PyTorch | 784 \(\mapsto\) (8,4x4) \(\mapsto\) (16,4x4) \(\mapsto\) 32 \(\mapsto\) 10 | CNN | 10,754 | 97.6% |

You can find the code on my GitHub. As you can see, it’s still possible to obtain models that get 95\(^+\)% accuracy with fewer than 10\(^4\) parameters. That said, we can do even better…
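For concreteness, here is a minimal sketch of the smallest fully connected model in the table (784 \(\mapsto\) 16 \(\mapsto\) 10). The class name and details are mine rather than the repo's, but the parameter count works out to 784·16 + 16 + 16·10 + 10 = 12,730, matching the first row.

```python
import torch
import torch.nn as nn

class MnistLite(nn.Module):
    """A deliberately small fully connected classifier: 784 -> 16 -> 10."""
    def __init__(self, hidden=16):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, hidden)
        self.fc2 = nn.Linear(hidden, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)          # flatten 28x28 images to 784-vectors
        x = torch.relu(self.fc1(x))
        return self.fc2(x)                 # raw logits; pair with nn.CrossEntropyLoss

model = MnistLite()
print(sum(p.numel() for p in model.parameters()))  # 12730
```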

Optimizing subspaces

“Make everything as simple as possible, but not simpler.” – Albert Einstein

The trick. When I interviewed with anonymous at anonymous earlier this year, we discussed the idea of optimizing subspaces. In other words, if the vector \(\theta\) contains all the parameters of a deep network, you might define \(\theta = P \omega\), where the vector \(\omega\) lives in some lower-dimensional space and \(P\) is a fixed projection matrix. Then, instead of optimizing \(\theta\), you could optimize \(\omega\). With this trick, we can choose an arbitrary number of free parameters to optimize without changing the model’s architecture. In math, the training objective becomes:

\[\begin{align} \omega^* &= \arg\min_{\omega} \; -\frac{1}{n} \sum_X \big( y\ln \hat y + (1-y)\ln (1-\hat y) \big)\\ &\mathrm{where} \quad \hat y = f_{NN}(\theta, X) \quad \mathrm{and} \quad \theta = P \omega \end{align}\]

If you want to see this idea in code, check out my subspace-nn repo on GitHub.
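To make the trick concrete, here is a minimal sketch of a fully connected MNIST model that is optimized entirely inside a \(d\)-dimensional subspace. It is my own illustration, not the exact code in the repo: the network’s full weight vector is rebuilt as \(\theta = P\omega\) on every forward pass, \(P\) is a fixed random matrix stored as a buffer, and \(\omega\) is the only trainable tensor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubspaceFC(nn.Module):
    """A 784 -> 64 -> 10 net whose full weight vector is theta = P @ omega.

    Only omega (subspace_dim numbers) is trained; the projection P stays fixed.
    """
    def __init__(self, subspace_dim, hidden=64):
        super().__init__()
        # Shapes of the "real" parameters theta: (W1, b1, W2, b2).
        self.shapes = [(hidden, 28 * 28), (hidden,), (10, hidden), (10,)]
        total = sum(torch.Size(s).numel() for s in self.shapes)
        # Fixed random projection P (total x d); note P has total * d entries,
        # so very large subspaces get memory-hungry.
        self.register_buffer("P", torch.randn(total, subspace_dim) / subspace_dim ** 0.5)
        # omega is the only trainable tensor; small init keeps theta small at the start.
        self.omega = nn.Parameter(0.01 * torch.randn(subspace_dim))

    def forward(self, x):
        theta = self.P @ self.omega                  # theta = P * omega
        # Slice the flat vector theta back into each layer's weights and biases.
        params, offset = [], 0
        for shape in self.shapes:
            n = torch.Size(shape).numel()
            params.append(theta[offset:offset + n].view(shape))
            offset += n
        w1, b1, w2, b2 = params
        x = x.view(x.size(0), -1)                    # flatten 28x28 images
        x = F.relu(F.linear(x, w1, b1))
        return F.linear(x, w2, b2)                   # logits; use nn.CrossEntropyLoss

# The optimizer only ever sees omega, so the free-parameter count is whatever we pick.
model = SubspaceFC(subspace_dim=1000)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-3)
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # 1000
```

Training then proceeds exactly as usual: compute the cross-entropy loss on the logits and step the optimizer. Only the \(d\) entries of \(\omega\) ever move; \(P\) and the network architecture never change.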

Results. I tested this idea on the model from the PyTorch tutorial because it was the smallest model that achieved 99\(^+\)% test accuracy. The results were fascinating.

As shown below, cutting the number of free parameters in half (down to 10,000) causes the test accuracy to drop by only 0.2%. This is substantially better than the “MNIST lite” model I trained with 10,754 free parameters. Cutting the subspace down to 3,000 free parameters produces a test accuracy of 97%, which is still pretty good.

| Framework | Free parameters | Test accuracy |
|---|---|---|
| PyTorch | 21,840 | 99.0% |
| PyTorch | 10,000 | 98.8% |
| PyTorch | 3,000 | 97.0% |
| PyTorch | 1,000 | 93.5% |

Ultra small subspaces. We can take this idea even further and optimize our model in subspaces with well below 1,000 parameters. Shown below are test accuracies for subspaces of size 3, 10, 30, 100, 300, and 1,000. Interestingly, I tried the same thing with a fully connected model and obtained nearly identical curves.

Performing optimization in very small subspaces. Some of these training trajectories have not reached convergence; the 1,000-parameter model, for example, will converge to 93.5% accuracy after several more epochs.
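Sweeping the subspace size is then just a loop over \(d\). The sketch below reuses the hypothetical `SubspaceFC` class from above and only shows the bookkeeping; the training loop itself is elided.

```python
# Sweep the subspace dimension; only omega (d numbers) is trainable in each run.
for d in [3, 10, 30, 100, 300, 1000]:
    model = SubspaceFC(subspace_dim=d)
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-3)
    # ... train on MNIST as usual, then measure test accuracy ...
    n_free = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"subspace dim {d}: {n_free} free parameters")
```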

Takeaways

Update. My friend (anonymous) just finished a paper about this. It’s much more #rigorous than this post and definitely worth checking out: “solving the cart-pole RL problem is in a sense 100 times easier than classifying digits from MNIST.”

Literature. The idea that MNIST classifiers are dramatically overparameterized is not new. The most common way to manage this issue is to add a regularization term, such as weight decay or an L1 sparsity penalty, to the loss function. At the end of the day, though, this doesn’t place a hard limit on the number of free parameters. One interesting approach, from Convolution by Evolution\(^\dagger\), is to evolve a neural network with just 200 parameters. The authors used this technique to train a denoising autoencoder, so it’s difficult to directly compare their results to ours.

There are several other papers that try to minimize the number of free parameters. However, I couldn’t find any papers that used the subspace optimization trick. Perhaps it is a new and interesting tool.

Bloat. The immediate takeaway is that most MNIST classifiers have parameter bloat. In other words, they have many more free parameters than they really need. You should be asking yourself: is it worth doubling the parameters in a model to gain an additional 0.2% accuracy? I imagine that the more complex deep learning models (ResNets, VGG, etc.) also have this issue.

Training a model with more parameters than you’ll need is fine as long as you know what you’re doing. However, you should have a general intuition for how much parameter bloat you’ve introduced. I hope this post helps with that intuition :)

\(^\dagger\) Parts of this paper are a little questionable… for example, the ‘filters’ they claim to evolve in Figure 5 are just circles.