Learning Curves
A learning curve is a plot of the training and cross-validation error as a function of the number of training points. Note that when we train on a small subset of the training data, the training error is computed using this subset, not the full training set. These plots can give a quantitative view into how beneficial it will be to add training samples.
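To make this concrete, here is a minimal sketch of how a learning curve can be computed by hand with scikit-learn-style estimators. The data, the fixed hold-out split, and the grid of training sizes are arbitrary choices for illustration, not the setup behind the figures below.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 3, size=(200, 1))
    y = np.sin(2 * X.ravel()) + rng.normal(0, 1.0, size=200)  # noisy 1-d data

    # hold out a fixed cross-validation set
    X_train, X_cv, y_train, y_cv = X[:150], X[150:], y[:150], y[150:]

    sizes = np.arange(10, 151, 10)
    train_err, cv_err = [], []
    for n in sizes:
        # fit on the first n training points only
        model = LinearRegression().fit(X_train[:n], y_train[:n])
        # training error is measured on that same subset of n points...
        train_err.append(mean_squared_error(y_train[:n], model.predict(X_train[:n])))
        # ...while cross-validation error is measured on the fixed held-out set
        cv_err.append(mean_squared_error(y_cv, model.predict(X_cv)))
    # plotting train_err and cv_err against sizes gives the learning curve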
Learning Curves for a case of high bias (left, d = 1) and high variance (right, d = 20)
On the left plot, we have the learning curve for d = 1. From the above discussion, we know that d = 1 is a high-bias estimator which under-fits the data. This is indicated by the fact that both the training and cross-validation errors are very high. If this is the case, adding more training data will not help matters: both lines have converged to a relatively high error.
In the right plot, we have the learning curve for d = 20. From the above discussion, we know that d = 20 is a high-variance estimator which over-fits the data. This is indicated by the fact that the training error is much lower than the cross-validation error. As we add more samples to the training set, the training error will continue to climb, while the cross-validation error will continue to decrease, until they meet in the middle. In this case, our intrinsic error is 1.0 (again, this is set artificially in the code: click on the image to browse the source code), and we can see that adding more data will allow the estimator to very closely match the best possible cross-validation error.
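Curves like the two above could be generated along the following lines. This is only a sketch: it assumes scikit-learn's learning_curve utility and a PolynomialFeatures + LinearRegression pipeline, and the data-generating function and noise level (standard deviation 1, giving an intrinsic mean-squared error of about 1.0) are assumptions patterned on the description, not the actual code behind the figures.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import learning_curve

    rng = np.random.RandomState(42)
    X = rng.uniform(0, 3, size=(200, 1))
    y = np.sqrt(X.ravel()) * np.sin(4 * X.ravel()) + rng.normal(0, 1.0, size=200)

    fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
    for ax, degree in zip(axes, [1, 20]):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        sizes, train_scores, cv_scores = learning_curve(
            model, X, y, train_sizes=np.linspace(0.15, 1.0, 12),
            cv=5, scoring="neg_mean_squared_error")
        # scores are negative MSE; flip the sign and average over the CV folds
        ax.plot(sizes, -train_scores.mean(axis=1), label="training error")
        ax.plot(sizes, -cv_scores.mean(axis=1), label="cross-validation error")
        ax.axhline(1.0, linestyle=":", label="intrinsic error")
        ax.set(title="d = %d" % degree, xlabel="training set size", ylabel="MSE")
        ax.legend()
    plt.show()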
Note
With a degree-20 polynomial, which has 21 free parameters, we'd expect the training error to be identically zero for any training set of size N ≤ 21.
Why is this?
It is because when there are at least as many degrees of freedom as constraints, the problem can be solved exactly: a curve can be found which passes through every point. (For example, imagine fitting a line to a single point: you'd be very surprised if you got anything but a perfect fit!) In the right-hand plot we see that this (correct) intuition fails in practice. The reason is floating-point precision: to perfectly fit these data points, the degree-20 polynomial must oscillate to extreme values in the space between the points (compare to the degree-6 polynomial above). The nature of our dataset means that representing this oscillation exactly is beyond machine precision, so the resulting fit has a small but nonzero residual.
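To see this numerically, here is a small check using numpy.polyfit as a stand-in for the polynomial fits above; the data points are arbitrary, and the exact size of the residual will depend on them.

    import warnings
    import numpy as np

    rng = np.random.RandomState(0)
    x = np.sort(rng.uniform(0, 3, 15))          # 15 points: fewer constraints than
    y = np.sin(2 * x) + rng.normal(0, 1.0, 15)  # the 21 coefficients of a degree-20 fit

    with warnings.catch_warnings():
        warnings.simplefilter("ignore")         # polyfit warns that the fit is ill-conditioned
        coeffs = np.polyfit(x, y, deg=20)

    residual = y - np.polyval(coeffs, x)
    print(np.abs(residual).max())               # small, but generally not exactly zero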