Jaewoo Song


The most commonly used loss functions in machine learning and deep learning are Mean Squared Error and Cross Entropy loss.



Mean Squared Error (MSE)

Literally the “Mean Squared Error”: it is obtained by summing the squared error between each real label and the model’s prediction, then dividing that sum by the size of the entire dataset.

\[\frac{1}{n} \sum_{i=1}^{n}(\hat{y_{i}}-y_{i})^2\]


Here, $\hat{y_{i}}$ is the value predicted by the model and $y_{i}$ is the real label.

Obviously, if the difference between the two values is large, we can conclude that the model could not determine the answer properly, and the error is comparatively large.

MSE is always greater than or equal to $0$.

Since the loss grows as the difference between the predicted value and the actual value grows, MSE can be used as a loss function.
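As a quick sketch (using NumPy here just for illustration; the example numbers are made up), the formula translates directly into code:

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean Squared Error: mean of the squared differences."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

# Toy values: predictions vs. real labels.
print(mse([2.5, 0.0, 2.1], [3.0, -0.5, 2.0]))  # 0.17
```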



Cross Entropy loss

Cross Entropy loss is expressed with the natural logarithm, whose base is $e$.

It is obtained by applying the natural log to the predicted value, multiplying it by the actual value, and finally multiplying by $-1$.

This calculation is basically NLLLoss (Negative Log Likelihood loss), and in fact Cross Entropy loss is LogSoftmax + NLLLoss.

\[-\sum_{i=1}^{n} y_{i} \log_e \hat{y_{i}}\]


So to use Cross Entropy, the labels should be one-hot encoded and the model’s output should pass through the softmax as an activation function before it is put into the loss.

Of course, most deep learning libraries such as TensorFlow and PyTorch let users call this function without an additional one-hot encoding step, since that part is already handled inside the function.
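As a small PyTorch sketch (the logits and targets below are just made-up toy values), nn.CrossEntropyLoss takes raw logits and integer class indices directly, and it gives the same value as LogSoftmax followed by NLLLoss:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]])  # raw model outputs, no softmax applied
targets = torch.tensor([0, 1])             # class indices, no one-hot encoding

ce = nn.CrossEntropyLoss()(logits, targets)

# The same result, computed as LogSoftmax + NLLLoss.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

print(ce.item(), nll.item())  # both print the same number
```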


The calculation of Cross Entropy loss can be reduced as follows: since only the desired class index is $1$ and the others are $0$, the other indices contribute nothing, and only the term $y_{i} \log_e \hat{y_{i}}$ at the position of the desired class is added.
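For example, with a made-up 3-class label $y = (0, 1, 0)$ and a softmax output $\hat{y} = (0.2, 0.7, 0.1)$, only the correct class’s term survives:

\[-\sum_{i=1}^{3} y_{i} \log_e \hat{y_{i}} = -(0 \cdot \log_e 0.2 + 1 \cdot \log_e 0.7 + 0 \cdot \log_e 0.1) = -\log_e 0.7 \approx 0.357\]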


Anyway, this function is based on $y=-\log_e x$, so the shape of its graph is shown below.

A basic natural log graph for explaining Cross Entropy loss.


After the softmax, the values are normalized into the range of $0$ to $1$, so the function is always positive.

And when the input gets closer to $1$, the function value approaches $0$, and vice versa.

This means that the closer the prediction is to the real value, the smaller the loss becomes, which is why it can be used as a loss function.

But if the input is extremely small, the function goes to infinity.

To prevent this, usually a small value $\delta$ is added to the input.
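A rough sketch of this trick (the choice of $\delta = 10^{-7}$ below is just an assumption, not a fixed rule):

```python
import numpy as np

def cross_entropy(y_pred, y_true, delta=1e-7):
    """Cross Entropy with a small delta added so log(0) never occurs."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return -np.sum(y_true * np.log(y_pred + delta))

# The true class got almost zero probability: the loss is large but still finite.
print(cross_entropy([1e-9, 0.999, 1e-9], [1.0, 0.0, 0.0]))  # about 16.1
```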



Which one is better?

As far as I have found, Mean Squared Error is more useful for regression and Cross Entropy loss is better for classification.


The reason why Cross Entropy is better for classification tasks than MSE is that MSE focuses more on wrongly classified samples.

As we can see from the calculation, MSE adds up the error whenever there is any difference between the prediction and the label.

On the other hand, Cross Entropy zeroes out the terms for the wrong classes and measures how confidently the correct samples are predicted, which leads to more focus on the right predictions.

That is, MSE tries to make the model predict more samples correctly and modifies the parameters in that direction, while Cross Entropy loss looks at how confidently the model chooses the answer and focuses the parameter updates on that.
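A toy comparison (the probabilities below are just made-up numbers) makes this concrete: MSE accumulates squared error over every class position, while Cross Entropy only reads off the probability assigned to the correct class:

```python
import numpy as np

y_true = np.array([1.0, 0.0, 0.0])          # one-hot label: class 0 is correct
p_unsure = np.array([0.6, 0.3, 0.1])        # correct but not confident
p_confident = np.array([0.9, 0.05, 0.05])   # correct and confident

mse = lambda p: np.mean((p - y_true) ** 2)   # uses every class position
ce = lambda p: -np.sum(y_true * np.log(p))   # only the correct class survives

print(mse(p_unsure), ce(p_unsure))          # ~0.0867, ~0.511
print(mse(p_confident), ce(p_confident))    # ~0.0050, ~0.105
```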


Then why do these characteristics separate regression from classification?

This is a simple linear regression task.

A graph of linear regression for explaining why Mean Squared Error is good for a regression task.


Regression is ultimately a procedure for building a function that represents as many of the given samples as possible.

So the function should be as close as possible to the overall data.

This is achieved by considering the error between every label and its prediction, so MSE is more suitable for a regression task.


But in classification, what matters is how many samples the model predicts correctly.

Of course, reducing error is critical in classification too, but, for example, what if the model predicted $2$ answers correctly and $1$ wrongly?

If we modify the parameters to make the model get that one sample right, the confidence in the correctly chosen samples can be degraded.

This is not what we want in classification.

Therefore, Cross Entropy loss is used, which pays more attention to the right predictions.


Besides, some argue that Cross Entropy is better because it reaches the convergence point faster than MSE.

But I think that depends more on how we set the hyperparameters, so in my opinion this is not a criterion for choosing a loss function.