What Does Overfitting Mean?
Overfitting is a condition that occurs when a machine learning (ML) model or deep neural network performs significantly better on training data than it does on new, unseen data.
Overfitting is the result of an ML model placing importance on relatively unimportant information in the training data. When an ML model has been overfit, it can't make accurate predictions about new data because it can't distinguish extraneous (noisy) data from the essential data that forms a pattern.
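In practice, overfitting shows up as a gap between a model's score on its training data and its score on held-out data. Below is a minimal sketch of how that gap can be measured; the synthetic dataset, the scikit-learn tools, and the unconstrained decision tree are illustrative assumptions, not a prescribed setup.

```python
# A minimal sketch of how the train/test gap reveals overfitting.
# The dataset and model choices here are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small, noisy synthetic dataset (conditions that invite overfitting).
X, y = make_classification(n_samples=200, n_features=20, flip_y=0.1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# An unconstrained decision tree has enough capacity to memorize
# the training labels, noise included.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # typically ~1.0
print("test accuracy: ", model.score(X_test, y_test))    # noticeably lower
```

An unconstrained tree is used here precisely because it can memorize its training labels, which makes the gap between the two scores easy to see.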
For example, suppose a computer vision (CV) program's task is to capture license plates, but the training data contains only images of cars and trucks. The model might overfit and conclude that having four wheels is a distinguishing characteristic of vehicles with license plates. When this happens, the CV program is likely to do a good job capturing license plates on vans, but fail to capture license plates on motorcycles.
The most common causes of overfitting include the following (two of them are illustrated in the sketch after this list):
- The data used to train the model is dirty and contains large amounts of noise.
- The training data has high variance, with data points spread far from the statistical mean and from each other.
- The size of the training dataset is too small.
- The model was trained on a subset of data that does not accurately represent the entire dataset.
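As a toy illustration of two of these causes, a small training set and noisy data, consider the sketch below. The sine-wave data and the degree-9 polynomial are assumptions chosen for clarity, not a real project:

```python
# A toy illustration of two causes listed above: a small training
# set and noisy data. The setup is assumed for demonstration only.
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy samples of a simple underlying pattern, y = sin(x).
x_train = np.sort(rng.uniform(0, 3, size=10))
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=10)

# A degree-9 polynomial has enough capacity to pass through every
# training point, chasing the noise instead of the sine pattern.
coeffs = np.polyfit(x_train, y_train, deg=9)

x_new = np.linspace(0, 3, 5)  # fresh inputs the model never saw
print("true values:      ", np.round(np.sin(x_new), 2))
print("overfit estimates:", np.round(np.polyval(coeffs, x_new), 2))
```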
Techopedia Explains Overfitting
Overfitting is one of the most serious mistakes that can be made when machine learning models are used to make predictions.
Reducing the feature space and parameter space, as well as increasing the sample size, can help reduce overfitting. There are a number of other techniques that machine learning researchers can use to mitigate overfitting, including the following (the first two are sketched in code after the list):
- Cross-validation
- Regularization
- Early stopping
- Pruning
- Ensembling
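As a rough sketch of the first two techniques, the snippet below uses scikit-learn's cross_val_score together with Ridge regression, an L2-regularized linear model. The dataset and the candidate alpha values are illustrative assumptions:

```python
# A hedged sketch of cross-validation and regularization using
# scikit-learn; the data and alpha values are assumptions.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=50, noise=10.0,
                       random_state=0)

# Cross-validation scores each candidate on held-out folds, so a
# model that merely memorizes its training folds is penalized.
for alpha in (0.01, 1.0, 100.0):
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5)
    # Larger alpha = stronger L2 regularization (smaller coefficients).
    print(f"alpha={alpha:<6} mean CV R^2: {scores.mean():.3f}")
```

Choosing the alpha with the best cross-validated score, rather than the best training score, is what keeps the selection honest.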
Overfitting vs. Underfitting
When an algorithm is either too complex or too flexible, it can end up overfitting and focus on the noise (irrelevant details) instead of the signal (the desired pattern) in the training data. Because an overfit model has effectively memorized that noise, it will still perform quite well on its training data but quite poorly on new data.
Overfitting can be contrasted with underfitting, a condition that occurs when the ML model is too simple to learn the underlying pattern in the training data. If a predictive model performs poorly even on its training data, underfitting is the most likely reason.
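The contrast can be made concrete by fitting the same noisy data with too little, roughly adequate, and too much model capacity. The sketch below is an assumed setup using polynomial features of increasing degree; an underfit model scores poorly on both sets, while an overfit model scores well only on its training set:

```python
# An assumed, self-contained comparison of underfitting and
# overfitting: one noisy curve, three levels of model capacity.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(0, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)

for degree in (1, 4, 20):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree:>2}: "
          f"train R^2 = {model.score(X_train, y_train):.2f}, "
          f"test R^2 = {model.score(X_test, y_test):.2f}")
```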