# Why is Mean Squared Error (MSE) So Popular?

### Baby’s first loss function

The mean squared error (MSE) is one of many metrics you could use to measure your model’s performance. If you take a machine learning class, chances are you’ll come across it very early in the syllabus — it’s usually baby’s first loss function for continuous data.

(If you’re fuzzy on what any of the bolded words mean, you might like to follow the links for a gentle intro to each concept.)

You’ve seen what the MSE *is* …but why is it so popular? Why does it seem to be everyone’s favorite scoring function?

There are a few reasons, and some of them are even good reasons.

Why might we wish to calculate the MSE?

1. Performance evaluation: how well is our model doing?

2. Model optimization: is this the best possible fit? Can we get the model closer to our datapoints?

Performance evaluation and optimization are two different goals… and there’s no law of the universe that says you *must* use the same function for both. Understanding this subtlety will mitigate a lot of future confusion if you stick around in applied ML/AI.

For this discussion, I’ll assume you understand how and why a function is used for evaluation versus optimization, so if you’re fuzzy on that, now might be a good time to take a small detour.

When it comes to model evaluation, the MSE is rubbish. Seriously. There’s everything wrong with it as a metric, starting with the fact that it’s on the wrong scale (its units are the *square* of your target’s units, a problem often patched by taking the square root to work with RMSE instead) but not ending there. It also overweights outliers, which makes both the MSE and RMSE confusing to interpret. Neither one accurately reflects the quantity most interesting to a person who wants to know how wrong their model is on average. For that, the ideal metric is the MAD (mean absolute deviation): the average of the absolute errors. And there’s no reason not to use the MAD for evaluation — it’s easy to calculate.
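A quick sketch makes the interpretability gap concrete. Suppose (hypothetically) your model’s errors are about 1 unit on almost every point, plus one outlier of 10 units:

```python
import numpy as np

# Hypothetical residuals: off by ~1 unit everywhere, except one outlier.
errors = np.array([1.0, -1.0, 1.0, -1.0, 10.0])

mse = np.mean(errors ** 2)      # 20.8  -- squared units, hard to read
rmse = np.sqrt(mse)             # ~4.56 -- right units, but inflated by the outlier
mad = np.mean(np.abs(errors))   # 2.8   -- the literal average size of an error

print(f"MSE={mse:.2f}, RMSE={rmse:.2f}, MAD={mad:.2f}")
```

The MAD (2.8) is the honest answer to “how wrong is the model, on average?”, while the RMSE (about 4.56) is dragged well above any typical error by that single squared outlier.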

So why is everyone so obsessed with the MSE? Why is it the first model scoring function you learn? Because it’s really great for a different purpose: optimization, not evaluation.

If you want to use an optimization algorithm (or calculus) to quickly find the ideal parameter settings that give you the best — most optimal! — performance, it’s nice to have a convenient function to work with. And it’s hard to beat the MSE for that. There’s a good reason that the first function you’re ever taught to differentiate is x² — in calculus, squares make things super easy. The next things you’re taught in calculus 101 are what to do with constants and sums, since those are super easy too. Guess what? Squares, sums, and a constant (1/n) are the whole formula for MSE!
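To see the “super easy” part in action, take the simplest possible model — a single constant prediction $c$ — and differentiate the MSE with respect to it:

$$
\frac{d}{dc}\,\frac{1}{n}\sum_{i=1}^{n}(y_i - c)^2 \;=\; \frac{2}{n}\sum_{i=1}^{n}(c - y_i) \;=\; 0
\quad\Longrightarrow\quad
c \;=\; \frac{1}{n}\sum_{i=1}^{n} y_i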

And that, my friends, is the real reason the MSE is so popular. Pragmatic laziness. It’s literally the easiest vaguely sensible function of the errors to optimize. And that’s why it was the one Legendre and Gauss used at the turn of the 19th century for the first ever regression models… and why we still love it today.

But is it perfect for all your needs? And does it outperform other loss functions in all conditions? Certainly not, especially when you’ve got an infestation of outliers in your data.
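Here’s a tiny (hypothetical) illustration of that outlier infestation. The constant that minimizes MSE is the mean; the constant that minimizes MAD is the median. One outlier drags the first badly and barely touches the second:

```python
import numpy as np

# Hypothetical data: most observations sit near 1, plus one wild outlier.
y = np.array([1.0, 1.1, 0.9, 1.0, 100.0])

mse_optimal = np.mean(y)    # 20.8 -- dragged far from the bulk of the data
mad_optimal = np.median(y)  # 1.0  -- barely notices the outlier

print(f"MSE-optimal constant: {mse_optimal}, MAD-optimal constant: {mad_optimal}")
```

If your data has heavy outliers and you optimize MSE anyway, this is the distortion you’re signing up for — which is exactly why robust alternatives exist.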

In practice, you’ll often have two functions you’re working with: a loss function and a separate performance evaluation metric. Learn more about that here.

Now that you know the reason for liking the MSE, you’re also free to choose other loss functions if they’re available to you, especially if you have a lot of computing resources and/or smaller datasets.
