Before training and evaluating a machine learning algorithm such as a neural network, we need to define two main functions:
- A loss function: It is the bread and butter of modern machine learning. We need it to measure the model error (cost) on training batches. It has to be differentiable in order to backpropagate the error through the neural network and update the weights.
- An evaluation function: It should represent the final evaluation metric we really care about. Unlike the loss function, it has to be more intuitive, so that we can understand what the model performance will look like in the real world.
If we want to use the F1-score as an evaluation metric, there are two possible ways to maximize it:
Method 1: Maximize the F1-score by thresholding
When the model generates probability values, we need to set a threshold for each label to get binary predictions. For example, we can predict 1 (a movie is about Action) if the probability for Action is above the threshold of 0.5; otherwise we predict 0 (no Action). The optimal threshold that maximizes the F1-score for each label is somewhere between 0 and 0.5 (there is a mathematical proof of that, but it is not covered in this post). We usually search for that threshold on a held-out validation set.
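The threshold search in Method 1 can be sketched in plain Python as follows. This is a minimal illustration, not the post's own code: the function names `f1_at_threshold` and `best_threshold`, the grid of candidate thresholds, and the toy validation data are all assumptions made for the example.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1-score of one label after binarizing probabilities at `threshold`."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0 when there is nothing to count.
    return 2 * tp / denom if denom > 0 else 0.0


def best_threshold(y_true, y_prob, candidates=None):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    if candidates is None:
        # Hypothetical grid: 0.01, 0.02, ..., 0.99
        candidates = [i / 100 for i in range(1, 100)]
    return max(candidates, key=lambda th: f1_at_threshold(y_true, y_prob, th))


# Toy held-out validation data for a single label (made up for illustration).
val_targets = [1, 0, 1, 1, 0, 0]
val_probs = [0.9, 0.2, 0.4, 0.35, 0.3, 0.1]

threshold = best_threshold(val_targets, val_probs)
print(threshold, f1_at_threshold(val_targets, val_probs, threshold))
```

On this toy data the default threshold of 0.5 misses two of the three positives, while a lower threshold recovers them. In a multi-label setting the same search would be repeated for each label.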
Method 2: Embed the F1-score into the loss function
This would be the straightforward way if you ever want to optimize directly for the F1 metric. The loss function provides not only a measure of model error; it is at the heart of the learning process, defining how to best fit the data to achieve optimal goals. For some reason though, embedding the F1-score in the loss function is not a common practice.
Why is it unusual to have the F1-score in the loss function?
The problem with the F1-score is that it is not differentiable, so we cannot use it as a loss function to compute gradients and update the weights when training the model. Remember that the F1-score needs binary predictions (0/1) to be measured.
Usually we use the binary cross-entropy loss, which represents the negative log likelihood -log(p) of an observation belonging to a specific class when the model predicts a probability p for that class. Typically this loss works well and is widely used to train classifiers with Method 1, but it does not directly align with the F1-score we want to maximize.
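To make the -log(p) term concrete, here is a small plain-Python sketch of the binary cross-entropy averaged over a few observations. The function name and the clipping constant `eps` are assumptions added for numerical safety; they are not taken from the post.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -log(p) for positives and -log(1 - p) for negatives."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        # Clip to avoid log(0); eps is an illustrative choice.
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# A confident correct prediction costs less than a confident wrong one.
print(binary_cross_entropy([1], [0.8]))  # -log(0.8)
print(binary_cross_entropy([1], [0.2]))  # -log(0.2)
```

Note that this loss is computed observation by observation, whereas the F1-score depends on counts aggregated over a whole set of predictions, which is one way to see why the two do not line up directly.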
We can modify the F1-score to make it differentiable. Instead of computing the number of True Positives, False Positives and False Negatives as discrete integer values, we can compute them as a continuous sum of likelihood values by using probabilities without applying any threshold.
Let's see two examples that could help understand this transformation:
Example 1: If the target is 1 for a movie being Action and the model prediction for Action is 0.8, it will count as:
- 0.8 x 1 = 0.8 TP (because the target is 1 and the model predicted 1 with 0.8 probability)
- 0.2 x 1 = 0.2 FN (because the target is 1 and the model predicted 0 with 0.2 probability)
- 0.8 x 0 = 0 FP (because the target is 1 not 0, condition negative is not valid)
- 0.2 x 0 = 0 TN (because the target is 1 not 0, condition negative is not valid)
Example 2: If the target is 0 for a movie being Action and the model prediction for Action is 0.8, it will count as:
- 0.8 x 0 = 0 TP (because the target is 0 not 1, condition positive is not valid)
- 0.2 x 0 = 0 FN (because the target is 0 not 1, condition positive is not valid)
- 0.8 x 1 = 0.8 FP (because the target is 0 and the model predicted 1 with 0.8 probability)
- 0.2 x 1 = 0.2 TN (because the target is 0 and the model predicted 0 with 0.2 probability)
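The two examples follow the same four products, which can be written as a tiny helper. The name `soft_counts` is hypothetical and only serves this illustration.

```python
def soft_counts(target, prob):
    """Soft confusion-matrix contributions of one (target, probability) pair."""
    return {
        "TP": prob * target,              # predicted 1 with `prob`, target is 1
        "FN": (1 - prob) * target,        # predicted 0 with `1 - prob`, target is 1
        "FP": prob * (1 - target),        # predicted 1 with `prob`, target is 0
        "TN": (1 - prob) * (1 - target),  # predicted 0 with `1 - prob`, target is 0
    }


print(soft_counts(1, 0.8))  # Example 1: mass goes to TP and FN
print(soft_counts(0, 0.8))  # Example 2: mass goes to FP and TN
```

The four contributions always sum to 1 for a single observation, so each prediction is split between the cells of the confusion matrix instead of being assigned to exactly one.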
We will call this version the soft-F1 score. It can be implemented on a batch of predictions in TensorFlow.
There are certain things to consider in the implementation:
- The cost for each label is actually 1 - soft-F1 for that label. If you want to maximize soft-F1, you should minimize 1 - soft-F1.
- You can substitute Precision and Recall in the definition of soft-F1 and get a more direct formula in terms of TP, FP and FN. The reason to do this is that the harmonic mean expression for F1 is undefined when TP = 0, whereas the substituted expression is still defined (as long as FP + FN > 0):
F1 = 2 * TP / (2 * TP + FN + FP)
- The total cost in a batch of observations will be the average cost over all labels. We will call it the macro soft-F1 loss.
- You should make sure that the batch size is big enough to see a representative macro soft-F1 loss while training.
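The points above can be put together in a short sketch. It is written in plain Python rather than TensorFlow so that the arithmetic is easy to follow, and it is not the post's original code: the function name `macro_soft_f1_loss`, the nested-list batch layout (one row per observation, one column per label), and the small constant `eps` that keeps the division defined when a label has no soft counts at all are assumptions.

```python
def macro_soft_f1_loss(y_true, y_prob, eps=1e-16):
    """Average of (1 - soft-F1) over labels for one batch.

    y_true: list of rows of 0/1 targets, shape (batch, n_labels)
    y_prob: list of rows of probabilities, same shape
    """
    n_labels = len(y_true[0])
    costs = []
    for j in range(n_labels):
        # Soft counts: sums of probabilities, no threshold applied.
        tp = sum(p[j] * t[j] for t, p in zip(y_true, y_prob))
        fp = sum(p[j] * (1 - t[j]) for t, p in zip(y_true, y_prob))
        fn = sum((1 - p[j]) * t[j] for t, p in zip(y_true, y_prob))
        # Direct formula in TP, FP, FN; eps (an assumption) guards 0 / 0.
        soft_f1 = 2 * tp / (2 * tp + fn + fp + eps)
        costs.append(1 - soft_f1)  # minimize 1 - soft-F1 to maximize soft-F1
    return sum(costs) / n_labels   # macro average over labels


batch_targets = [[1, 0], [0, 1]]
print(macro_soft_f1_loss(batch_targets, [[1.0, 0.0], [0.0, 1.0]]))  # near 0
print(macro_soft_f1_loss(batch_targets, [[0.0, 1.0], [1.0, 0.0]]))  # near 1
```

In a tensor library the per-label sums would typically become reductions over the batch axis, which keeps every step differentiable with respect to the predicted probabilities.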
Learning curves (optimization and evaluation metrics)
Let's assume that, for simplicity, we want to use only one threshold for all labels to convert any probability value into a binary prediction. In other words, instead of maximizing the performance by thresholding each label separately, we will now consider the default threshold of 0.5 and try to maximize the macro F1-score @ threshold 0.5 by optimizing the macro soft-F1 loss that we defined earlier (Method 2).
Unlike the macro soft-F1, which uses likelihood terms, we will use the threshold of 0.5 in the implementation of the macro F1-score.
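For comparison with the soft version, a thresholded macro F1-score could look like the sketch below. As before, this is a plain-Python illustration under assumptions: the name `macro_f1`, the batch layout as nested lists, and the `eps` guard are not taken from the post.

```python
def macro_f1(y_true, y_prob, threshold=0.5, eps=1e-16):
    """Macro F1-score after binarizing probabilities at `threshold`."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        # Hard predictions: 1 if the probability is above the threshold.
        preds = [1 if p[j] > threshold else 0 for p in y_prob]
        targets = [t[j] for t in y_true]
        tp = sum(1 for t, p in zip(targets, preds) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(targets, preds) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(targets, preds) if t == 1 and p == 0)
        scores.append(2 * tp / (2 * tp + fn + fp + eps))
    return sum(scores) / n_labels


targets = [[1, 0], [1, 1], [0, 1]]
probs = [[0.9, 0.6], [0.4, 0.7], [0.2, 0.8]]
print(macro_f1(targets, probs))  # mean of 2/3 (label 0) and 4/5 (label 1)
```

The only difference from the soft version is the thresholding step, which turns the counts into integers and makes this function suitable for evaluation but not for gradient-based training.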
On the following learning curves, we observe the change in the loss metric (macro soft-F1) and the evaluation metric (macro F1-score @ threshold 0.5) throughout the training process.
You may notice on the learning curves that a decrease in the macro soft-F1 loss to a level near 0.69 is associated with an increase in the macro F1-score to a level near 0.33. These two values almost sum to 1. Remember that the macro soft-F1 loss we defined was actually the macro average of 1 - soft-F1, which we needed to minimize. This is a first indicator that the macro soft-F1 loss is directly optimizing for our evaluation metric, the macro F1-score @ threshold 0.5.