Bias and Variance


Ah yes, the obligatory xkcd comic and this time it’s showing an example of high variance in everyday life.

When building a model, one of the first things we look at is the prediction error on the dev and test sets. This error can be decomposed into two components known as bias and variance. Bias is the error that results from incorrect assumptions the model makes about the training data, and variance is the error that arises when the model is overly sensitive to small changes in the training data.

If you build a model and observe that it has a hard time predicting data it’s already seen (i.e. it has a low training accuracy), then your model doesn’t fit the data well and we say it has high bias. On the other hand, if your model is too sensitive to changes in the training data, then it will fit the random noise rather than the intended outputs and thus overfit your data. This usually results in a very high training accuracy alongside a much lower dev accuracy, and we say the model has high variance.

Since we want a model that minimizes both bias and variance, a trade-off arises. We could have a model that has a high training accuracy but performs poorly on the dev and test sets. In this case, it’s better to sacrifice some of that training accuracy in exchange for better performance on the dev and test sets. This trade-off is perhaps more intuitively understood by the image below.
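To make the two failure modes concrete, here is a minimal sketch that fits polynomials of increasing degree to noisy samples. The sine target, the noise level, the sample sizes, and the degrees are all assumptions chosen for illustration; they are not from any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical underlying function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def make_data(n, noise_std=0.3):
    # Draw n noisy samples y = f(x) + noise.
    x = rng.uniform(0.0, 1.0, size=n)
    y = true_f(x) + rng.normal(0.0, noise_std, size=n)
    return x, y

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(20)
x_dev, y_dev = make_data(200)

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_dev, y_dev))
    print(f"degree={degree}  train MSE={results[degree][0]:.3f}  dev MSE={results[degree][1]:.3f}")
```

A straight line (degree 1) typically shows a large error on both sets, which is the high-bias symptom. A high-degree polynomial drives the training error down, and a widening gap between its training and dev errors is the high-variance symptom.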

Bias-Variance Decomposition

Although the bias-variance trade-off might feel familiar from experiments, it’s worth verifying mathematically that prediction error can be decomposed into bias and variance (plus an unavoidable error term).

The data we have comes from some underlying function, $f$, mixed with some noise, $\epsilon$. Let’s represent this as $y = f + \epsilon$, where we assume $\epsilon$ is normally distributed with a mean of $0$ and variance $\sigma^2$ and is independent of $\hat{f}$ (the derivation only relies on the mean, the variance, and this independence). The underlying function is what we’re trying to approximate with some model, $\hat{f}$. To show that the error of $\hat{f}$ in predicting $y$ (i.e. the mean squared error) can be seen as a combination of bias, variance, and an irreducible error term (which is an inevitable result of the noise, $\epsilon$, in the data), we need to show that

$$\E\big[(y - \hat{f})^2\big] = \bias\big[\hat{f}\big]^2 + \var\big[\hat{f}\big] + \sigma^2$$

where

$$\bias\big[\hat{f}\big] = \E\big[\hat{f} - f\big]$$

and

$$\var\big[\hat{f}\big] = \E\Big[\hat{f}^2\Big] - \E\big[\hat{f}\big]^2$$

So,

\begin{align} \E\big[(y - \hat{f})^2\big] &= \E\Big[y^2 - 2y\hat{f} + \hat{f}^2\Big] \\ &= \E\big[y^2\big] + \E\Big[\hat{f}^2\Big] - \E\big[2y\hat{f}\big] \end{align}

And since $\var[y] = \E\big[y^2\big] - \E[y]^2$ and $\var\big[\hat{f}\big] = \E\Big[\hat{f}^2\Big] - \E\big[\hat{f}\big]^2$, then with a little rearranging we have,

$$\E\big[y^2\big] = \var[y] + \E[y]^2$$

$$\E\Big[\hat{f}^2\Big] = \var\big[\hat{f}\big] + \E\big[\hat{f}\big]^2$$

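Substituting these into the expansion above, and using $\E[y] = f$ together with the fact that the noise in $y$ is independent of $\hat{f}$, we get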

\begin{align} \E\big[(y - \hat{f})^2\big] &= \var[y] + \E[y]^2 + \var\big[\hat{f}\big] + \E\big[\hat{f}\big]^2 - 2\E[y]\E[\hat{f}] \\ &= \var[y] + \E[y]^2 + \var\big[\hat{f}\big] + \E\big[\hat{f}\big]^2 - 2f\E[\hat{f}] \\ &= \var[y] + \var\big[\hat{f}\big] + \Big(f^2 - 2f\E[\hat{f}] + \E[\hat{f}]^2\Big) \\ &= \var[y] + \var[\hat{f}] + (f - \E[\hat{f}])^2 \end{align}
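The first two lines of this chain lean on two facts that are easy to skip over. As a short worked step, assuming $f$ is deterministic and $\epsilon$ has mean $0$ and is independent of $\hat{f}$, the cross term factors as

\begin{align} \E\big[y\hat{f}\big] &= \E\big[(f + \epsilon)\hat{f}\big] \\ &= f\E\big[\hat{f}\big] + \E[\epsilon]\E\big[\hat{f}\big] \\ &= f\E\big[\hat{f}\big] \end{align}

and, by the same assumptions,

$$\E[y] = \E[f + \epsilon] = f + \E[\epsilon] = f$$

which is why $\E[y]^2$ can be replaced with $f^2$ in the third line.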

Since $\bias[\hat{f}]^2 = \big(\E\big[\hat{f}\big] - f\big)^2 = \big(f - \E\big[\hat{f}\big]\big)^2$, we now have

$$\E\big[(y - \hat{f})^2\big] = \var[y] + \var[\hat{f}] + \bias\big[\hat{f}\big]^2$$

And since $f$ is deterministic, $\var[y] = \var[\epsilon] = \sigma^2$, so we finally have

$$\E\big[(y - \hat{f})^2\big] = \bias\big[\hat{f}\big]^2 + \var\big[\hat{f}\big] + \sigma^2$$
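The identity can also be checked numerically. The sketch below is a Monte Carlo estimate under assumed settings (a sine target, a quadratic model, a single query point `x0`, and `sigma = 0.5`, all hypothetical): it repeatedly draws a fresh training set, refits the model, and compares the average squared error at `x0` against the sum of the three terms.

```python
import numpy as np

rng = np.random.default_rng(42)

sigma = 0.5       # noise standard deviation (assumed)
x0 = 0.3          # query point where the error is evaluated (assumed)
degree = 2        # polynomial degree of the model f_hat (assumed)
n_train = 30
n_trials = 5000

def f(x):
    # Hypothetical underlying function.
    return np.sin(2 * np.pi * x)

preds = np.empty(n_trials)
ys = np.empty(n_trials)
for t in range(n_trials):
    # Each trial draws a new training set, so f_hat(x0) is a random variable.
    x = rng.uniform(0.0, 1.0, size=n_train)
    y = f(x) + rng.normal(0.0, sigma, size=n_train)
    coeffs = np.polyfit(x, y, deg=degree)
    preds[t] = np.polyval(coeffs, x0)
    # A fresh noisy observation at x0, independent of the training set.
    ys[t] = f(x0) + rng.normal(0.0, sigma)

mse = np.mean((ys - preds) ** 2)
bias_sq = (np.mean(preds) - f(x0)) ** 2
variance = np.var(preds)
total = bias_sq + variance + sigma ** 2

print(f"MSE                         = {mse:.4f}")
print(f"bias^2 + variance + sigma^2 = {total:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, and the gap shrinks as `n_trials` grows.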

For more justification in each of these steps, refer to the derivation procedure posted on Berkeley’s machine learning blog.

Addressing High Bias and High Variance

To minimize bias and variance, we need a game plan for addressing each of them. In the case of high bias, we can:

• Choose a more complex model that can learn from the data
• Verify that the model’s assumptions about the data actually hold
• Train the model for a greater amount of time

To address high variance, we have a few more options at our disposal:

• Get more data! Theoretically, variance approaches zero as the number of samples approaches infinity. However, collecting more labeled data can be costly so it’s easier in most cases, like for images, to just generate more data via augmentation using techniques like:
    • Mirroring
    • Random cropping
    • Rotation
    • Color shifting (using PCA augmentation as explained in the AlexNet paper)
• Normalize the data, which makes the cost function more symmetric so that gradient descent converges faster, since you can use larger learning rates
• Handicap your model so that it doesn’t become too complex via regularization techniques. In the case of neural networks, this has the effect of encouraging the weights to get close to zero which essentially prunes the network. Some regularization techniques for neural networks include:
    • Dropout, so that the neural network doesn’t rely too much on any one feature and is forced to spread out the weights
    • Stopping gradient descent early, with the downside that you can no longer decouple optimizing the cost function from reducing overfitting
• Initialize your weights appropriately using methods like Xavier or He initialization, which help to avoid vanishing or exploding gradients
• Ensemble multiple models
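As a rough illustration of two of the techniques listed above, the sketch below implements a few image augmentations (mirroring, random cropping, 90-degree rotation) and inverted dropout with plain NumPy. The function names, the 32x32 image size, and `keep_prob = 0.8` are assumptions made for this example; in practice a deep learning framework would normally provide these operations.

```python
import numpy as np

def mirror(image):
    # Flip an (H, W, C) image left to right.
    return image[:, ::-1]

def random_crop(image, crop_h, crop_w, rng):
    # Cut a random (crop_h, crop_w) window out of an (H, W, C) image.
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

def rotate90(image, k=1):
    # Rotate by k * 90 degrees in the image plane.
    return np.rot90(image, k=k, axes=(0, 1))

def inverted_dropout(activations, keep_prob, rng):
    # Zero each unit with probability 1 - keep_prob, then rescale so the
    # expected activation is unchanged (training time only).
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)

# A random stand-in for a real 32x32 RGB image.
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = [mirror(image), random_crop(image, 28, 28, rng), rotate90(image)]

# A stand-in for one layer's activations.
activations = np.ones((1000, 100))
dropped = inverted_dropout(activations, keep_prob=0.8, rng=rng)
```

Each augmented copy keeps the original label, so the training set grows without any new labeling effort. Dropout is applied only during training and is switched off at prediction time.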

In a separate post, we’ll cover cross-validation which can have a desirable effect on both bias and variance.

Conclusion

As you can see, the model(s) you choose, the parameters you settle for, and the data you collect all have an effect on bias and variance. Ideally, there’s a perfect balance out there for any given situation but to find it we’ll have to rely on a bit of intuition, more experimentation, and robust methods.