#TechToolboxThursday: Overfitting vs Underfitting
Day 14/30.
It took me such a long time to figure out what I wanted to write about in this section, just because this field (data science / data-driven decision making) is so vast that pinning down one topic is exhausting. However, having taught myself a lot of these topics and being fairly new to the “tech industry”, if there is one thing I have learned, it is that your knowledge of the basics can take you a very long way! There truly are no complex ideas, just many simple building blocks combined together. If you have been in this field for a while, this post might be a repetition for you, but it will hopefully help the people just starting out.
The difference between overfitting and underfitting is not only a very common interview question; it also determines how you tune your model so that it best fits the data. For the purposes of this post, I will assume that you understand what the following terms mean: training data, test data and model. If not, here is a quick primer -
A machine learning model is a function that learns the relationship between a given input and output. The input usually consists of features, which the model maps to a prediction (a value or a label). The input-output pairs on which the model is trained are called the training data. The trained model is then evaluated on a test set, where features are fed into the model to make predictions.
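If you prefer to see this in code, here is a minimal sketch of that workflow (the data is made up, and scikit-learn is just one common choice of library):

```python
# A minimal sketch of the train/test workflow, using scikit-learn.
# The data here is synthetic and purely illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Features (input) and values (output) with a roughly linear relationship
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0 + np.random.default_rng(0).normal(0, 1, 20)

# Hold out part of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)  # learn the input-output mapping
predictions = model.predict(X_test)               # predict on unseen test features
```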
Remember linear regression and fitting a line y = mx + c from high school mathematics, with m being the slope of the line and c the y-intercept? That is the simplest kind of machine learning model. Here is a quick example:
Image 1: a straight line (in red) fitted to the training data.
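If you want to try this yourself, here is a quick sketch with NumPy (the points below are invented for illustration):

```python
import numpy as np

# Toy training data, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Fit y = mx + c: a degree-1 polynomial fit returns the slope m and intercept c
m, c = np.polyfit(x, y, deg=1)
predictions = m * x + c

# A simple error metric: differences between predicted and actual values,
# summed over all the points
error = np.sum(np.abs(predictions - y))
print(f"m = {m:.2f}, c = {c:.2f}, error = {error:.2f}")
```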
As you can see, the red line is the model, in this case the straight line which fits our training data. How well the model fits the data is measured by calculating a metric, usually an error, and as the name suggests, we want to minimize the error. Let's say the error is simply the difference between the predicted value for a point and the actual value, summed over all points. The values m and c are called parameters, and they can be tuned to change how the model fits the data. Say we tune the parameters, and the model now looks something like this:
Image 2: the same line after tuning the parameters; it passes through fewer of the points.
We can intuitively say that the error has increased in this case, simply because the model passes through fewer points than the first one. Sometimes, you can change the function itself: the line is a linear function, and maybe a polynomial can fit the data better.
Image 3: a polynomial that passes through every point in the training data.
As you can see, the polynomial here fits all the points in the training data, and in this particular case, the error is 0.
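Continuing the NumPy sketch from above: with n points, a polynomial of degree n - 1 can pass through every single one of them, driving the training error all the way to zero.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# With 5 points, a degree-4 polynomial can pass through all of them exactly
coeffs = np.polyfit(x, y, deg=len(x) - 1)
predictions = np.polyval(coeffs, x)

error = np.sum(np.abs(predictions - y))
print(f"training error = {error:.6f}")  # ~0, up to floating-point noise
```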
Now we come to the main point: overfitting vs underfitting.
Just because the model's error on the training data is high or low does not make the model bad or good. We want the model to fit not only the training data, but any new data we might feed it to make a prediction. This is generalizability - a very important property of a model. But the model needs to be general in just the right amount - not too general, not too specific. Image 2 above might be too general: it predicts values very loosely, and thus the error is high. This is called underfitting. It is usually a result of the model not being complex enough to fit the data. When a model underfits, it has a high error and high bias.
Intuitively, the converse is true for overfitting. It happens when the model is not general enough: it fits the training data too closely, often picking up unnecessary noise as well. Image 3 is a good example of this. Overfitting usually happens when the model is too complex for the data we have, so adding more data should help the model generalize. When a model overfits, it has a low training error but a high error on unseen data, and high variance.
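Here is a small sketch (again with invented, noisy data) that shows both failure modes at once, by comparing training and test error across model complexities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: a quadratic trend plus noise
x = np.linspace(-1, 1, 30)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.3, size=x.shape)

# Every other point goes to training, the rest to testing
train, test = np.arange(0, 30, 2), np.arange(1, 30, 2)

for degree in (0, 2, 9):
    coeffs = np.polyfit(x[train], y[train], deg=degree)
    train_err = np.mean(np.abs(np.polyval(coeffs, x[train]) - y[train]))
    test_err = np.mean(np.abs(np.polyval(coeffs, x[test]) - y[test]))
    print(f"degree {degree}: train error {train_err:.3f}, test error {test_err:.3f}")

# Degree 0 (a flat line) underfits: the error is high on both sets.
# Degree 2 matches the data's complexity: both errors are low.
# Degree 9 tends to overfit: it chases the noise, so the training error
# drops while the test error climbs.
```

The exact numbers will vary with the noise, but the pattern - high/high, low/low, low/high - is the signature to look for.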
The gist is that we want a model that does not merely memorize the training data, but generalizes to unseen data.
This idea seems simple, but it is one of the key ways to deduce why your model isn't performing well and to tune the hyperparameters accordingly. We want a model with the appropriate complexity, and that may not be the one with the least error on the training data. This is where a validation set comes into play (see the sketch below)! Let me know in the comments below if you would like to read about how to avoid / fix overfitting and underfitting a model.
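As a teaser, here is a minimal sketch of what using a validation set looks like: instead of picking the model with the lowest training error, you pick the one with the lowest error on held-out validation data (the data and degrees here are, again, invented):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 60)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.3, size=x.shape)

# Split: a training set to fit each candidate model,
# a validation set to choose between them
train, val = np.arange(0, 60, 2), np.arange(1, 60, 2)

best_degree, best_err = None, np.inf
for degree in range(10):  # the polynomial degree is a hyperparameter
    coeffs = np.polyfit(x[train], y[train], deg=degree)
    val_err = np.mean(np.abs(np.polyval(coeffs, x[val]) - y[val]))
    if val_err < best_err:
        best_degree, best_err = degree, val_err

print(f"degree chosen by validation error: {best_degree}")
```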