Secondly, the MSE function is non-convex for binary classification. In simple terms, if a binary classification model is trained with the MSE cost function, gradient descent is not guaranteed to minimize the cost function. This is because the MSE function expects real-valued inputs in the range (-∞, ∞), whereas binary classification models output probabilities in the range (0, 1) through the sigmoid/logistic function. Let's visualize:
When the MSE function is passed an unbounded value, the result is a nice U-shaped (convex) curve with a clear minimum at the target value (y). On the other hand, when a bounded value from a sigmoid function is passed to the MSE function, the result is not convex: on one side the function is concave, on the other side it is convex, and there is no clear minimum point. So if, by chance, a binary classification neural network is initialized with weights that are large in magnitude, such that it lands on the concave part of the MSE cost function, gradient descent will not work well and, consequently, the weights may not update or may improve only very slowly (try this out in the coding section). This is one of the reasons why neural networks should be carefully initialized with small values when training.
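The shape described above can be checked numerically. The following is a minimal sketch, not the article's own code: it assumes a single pre-activation value `z`, a target `y = 1`, and estimates slopes and curvature with finite differences. The helper names (`mse_linear`, `mse_sigmoid`, `first_derivative`, `second_derivative`) are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def mse_linear(z, y):
    """Squared error when the raw (unbounded) value z is the prediction."""
    return (z - y) ** 2

def mse_sigmoid(z, y):
    """Squared error when the prediction is sigmoid(z)."""
    return (sigmoid(z) - y) ** 2

def first_derivative(f, z, y, h=1e-3):
    """Central finite-difference estimate of df/dz."""
    return (f(z + h, y) - f(z - h, y)) / (2.0 * h)

def second_derivative(f, z, y, h=1e-3):
    """Central finite-difference estimate of d^2 f / dz^2 (the curvature)."""
    return (f(z + h, y) - 2.0 * f(z, y) + f(z - h, y)) / h ** 2

z = np.linspace(-10.0, 10.0, 401)  # range of pre-activation values (assumed)
y = 1.0                            # target label (assumed)

curv_linear = second_derivative(mse_linear, z, y)
curv_sigmoid = second_derivative(mse_sigmoid, z, y)

# Positive curvature everywhere means convex; a sign change means non-convex.
linear_is_convex = bool(np.all(curv_linear > 0))
sigmoid_has_concave_region = bool(np.any(curv_sigmoid < 0))
sigmoid_has_convex_region = bool(np.any(curv_sigmoid > 0))

# Slope far inside the concave region versus near the middle of the sigmoid.
grad_far = float(first_derivative(mse_sigmoid, -10.0, y))
grad_near = float(first_derivative(mse_sigmoid, 0.0, y))

print("MSE on unbounded input convex everywhere:", linear_is_convex)
print("MSE on sigmoid output has a concave region:", sigmoid_has_concave_region)
print("MSE on sigmoid output has a convex region:", sigmoid_has_convex_region)
print("gradient at z = -10:", grad_far)
print("gradient at z = 0:", grad_near)
```

With `y = 1`, a large negative `z` corresponds to a confidently wrong prediction. That is exactly where the curve is concave and the gradient is close to zero, which is why a poorly initialized network can get stuck there.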
On a final note, MSE is a good choice for a cost function when we are doing linear regression (i.e., fitting a line through data for extrapolation). In the absence of any knowledge of how the data is distributed, assuming a normal/Gaussian distribution is perfectly reasonable.
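The link between the Gaussian assumption and MSE can be made explicit with a short derivation. This is a sketch under stated assumptions: a linear model with weights $w$, inputs $x_i$, targets $y_i$, and independent noise $\varepsilon_i$ with fixed variance $\sigma^2$. These symbols are introduced here for illustration only.

```latex
% Assumed model: linear prediction plus Gaussian noise
y_i = w^\top x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

% Log-likelihood of n independent observations
\log p(y \mid X, w) = -\frac{n}{2}\log(2\pi\sigma^2)
                      - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - w^\top x_i\right)^2

% The first term does not depend on w, so maximizing the likelihood
% is the same as minimizing the mean squared error
\hat{w} = \arg\max_w \log p(y \mid X, w)
        = \arg\min_w \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w^\top x_i\right)^2
```

Under these assumptions, minimizing MSE gives the maximum-likelihood estimate of the weights, and for a linear model the resulting cost is convex in $w$.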