In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).[1]

Figure: Hinge loss (blue) and zero-one loss (green) for fixed t = 1. The vertical axis is the value of the loss and the horizontal axis is the value of the prediction y. The hinge loss penalizes predictions y < 1, corresponding to the notion of a margin in a support vector machine.

For an intended output $t = \pm 1$ and a classifier score $y$, the hinge loss of the prediction $y$ is defined as

$$\ell(y) = \max(0, 1 - t \cdot y)$$

Note that $y$ should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, $y = \mathbf{w} \cdot \mathbf{x} + b$, where $(\mathbf{w}, b)$ are the parameters of the hyperplane and $\mathbf{x}$ is the input variable(s).

When $t$ and $y$ have the same sign (meaning $y$ predicts the right class) and $|y| \ge 1$, the hinge loss $\ell(y) = 0$. When they have opposite signs, $\ell(y)$ increases linearly with $|y|$. The loss is also positive when $|y| < 1$, even if $y$ has the same sign as $t$ (correct prediction, but not by enough margin).
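The three cases just described can be checked numerically. The following is a minimal sketch, not a reference implementation; the function name `hinge_loss` and the sample scores are illustrative choices, not part of any library.

```python
def hinge_loss(t: int, y: float) -> float:
    """Hinge loss max(0, 1 - t*y) for a label t in {-1, +1} and a raw score y."""
    if t not in (-1, 1):
        raise ValueError("t must be -1 or +1")
    return max(0.0, 1.0 - t * y)


if __name__ == "__main__":
    # Illustrative raw scores for a positive example (t = +1), from a
    # confident correct prediction down to a confident wrong one.
    for y in (2.0, 1.0, 0.5, 0.0, -1.5):
        print(f"y = {y:+.1f}  loss = {hinge_loss(1, y):.1f}")
```

Scores at or beyond the margin (here 2.0 and 1.0) incur no loss, the correctly signed score 0.5 is still penalized for falling inside the margin, and the wrongly signed score -1.5 is penalized in proportion to its magnitude.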



While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] it is also possible to extend the hinge loss itself for this purpose. Several different variations of multiclass hinge loss have been proposed.[3] For example, Crammer and Singer[4] defined it for a linear classifier as[5]

$$\ell(y) = \max\left(0, 1 + \max_{y \ne t} \mathbf{w}_y \mathbf{x} - \mathbf{w}_t \mathbf{x}\right)$$

where $t$ is the target label, and $\mathbf{w}_t$ and $\mathbf{w}_y$ are the model parameters.
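As a rough illustration of the Crammer and Singer loss, the sketch below scores one input against one weight vector per class and compares the target class with the best-scoring wrong class. The names `crammer_singer_loss`, `W`, and the toy numbers are assumptions made for this example only.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def crammer_singer_loss(W: Sequence[Sequence[float]],
                        x: Sequence[float], t: int) -> float:
    """max(0, 1 + max_{y != t} w_y.x - w_t.x), with one weight row per class."""
    scores = [dot(w, x) for w in W]
    best_wrong = max(s for k, s in enumerate(scores) if k != t)
    return max(0.0, 1.0 + best_wrong - scores[t])


if __name__ == "__main__":
    # Hypothetical 3-class linear model on 2-dimensional inputs.
    W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    x = [2.0, 0.5]  # class scores are 2.0, 0.5 and -2.5
    for t in range(3):
        print(f"target {t}: loss = {crammer_singer_loss(W, x, t)}")
```

Only the strongest competing class contributes: the loss is zero once the target class beats every other class by at least 1.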

Weston and Watkins provided a similar definition, but with a sum rather than a max:[6][3]

$$\ell(y) = \sum_{y \ne t} \max(0, 1 + \mathbf{w}_y \mathbf{x} - \mathbf{w}_t \mathbf{x})$$
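A minimal sketch of the Weston and Watkins variant follows, again assuming one weight vector per class; `weston_watkins_loss` and the toy model are hypothetical names and values chosen for illustration.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weston_watkins_loss(W: Sequence[Sequence[float]],
                        x: Sequence[float], t: int) -> float:
    """Sum over y != t of max(0, 1 + w_y.x - w_t.x), one weight row per class."""
    scores = [dot(w, x) for w in W]
    return sum(max(0.0, 1.0 + s - scores[t])
               for k, s in enumerate(scores) if k != t)


if __name__ == "__main__":
    # Hypothetical 3-class linear model on 2-dimensional inputs.
    W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    x = [2.0, 0.5]  # class scores are 2.0, 0.5 and -2.5
    for t in range(3):
        print(f"target {t}: loss = {weston_watkins_loss(W, x, t)}")
```

Because every class that violates the margin adds its own term, this loss is never smaller than the max-based definition and grows with the number of competing classes that outscore the target.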


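In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where $\mathbf{w}$ denotes the SVM's parameters, $\mathbf{y}$ the SVM's predictions, $\mathbf{x}$ the input, $\mathbf{t}$ the target output, $\phi$ the joint feature function, and $\Delta$ the Hamming loss:

$$\begin{aligned}
\ell(\mathbf{y}) &= \max(0, \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle) \\
&= \max\left(0, \max_{\mathbf{y} \in \mathcal{Y}} \left(\Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle\right) - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle\right)
\end{aligned}$$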
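The margin-rescaled loss can be sketched for a toy sequence-labelling problem by enumerating every candidate output. Everything here is an assumption made for illustration: the joint feature function `joint_features` is a hypothetical choice (per-label sums of the input features), and brute-force enumeration is only feasible for very small output spaces; practical structured SVMs use a problem-specific loss-augmented inference routine instead.

```python
from itertools import product
from typing import List, Sequence


def hamming(y: Sequence[int], t: Sequence[int]) -> int:
    """Number of positions where the two label sequences differ."""
    return sum(a != b for a, b in zip(y, t))


def joint_features(x: Sequence[Sequence[float]], y: Sequence[int],
                   n_labels: int) -> List[float]:
    """Hypothetical phi(x, y): for each label, the sum of the input
    feature vectors at the positions assigned that label."""
    d = len(x[0])
    phi = [0.0] * (n_labels * d)
    for x_i, y_i in zip(x, y):
        for j, v in enumerate(x_i):
            phi[y_i * d + j] += v
    return phi


def structured_hinge(w: Sequence[float], x: Sequence[Sequence[float]],
                     t: Sequence[int], n_labels: int) -> float:
    """Margin-rescaled structured hinge loss by exhaustive search over outputs."""
    def score(y: Sequence[int]) -> float:
        return sum(wi * fi for wi, fi in zip(w, joint_features(x, y, n_labels)))

    # Loss-augmented inference: max over all outputs of Delta(y, t) + <w, phi(x, y)>.
    augmented = max(hamming(y, t) + score(y)
                    for y in product(range(n_labels), repeat=len(t)))
    return max(0.0, augmented - score(t))


if __name__ == "__main__":
    x = [[1.0], [-1.0]]  # two positions, one feature each
    t = (1, 0)           # target label sequence
    print(structured_hinge([-2.0, 2.0], x, t, n_labels=2))
    print(structured_hinge([-0.25, 0.25], x, t, n_labels=2))
```

With the larger weights the target sequence outscores every alternative by at least its Hamming distance, so the loss vanishes; with the smaller weights the target is still the highest-scoring sequence, but not by enough margin, and the loss is positive.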




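The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to the model parameters $\mathbf{w}$ of a linear SVM with score function $y = \mathbf{w} \cdot \mathbf{x}$ that is given by

$$\frac{\partial \ell}{\partial w_i} = \begin{cases} -t \cdot x_i & \text{if } t \cdot y < 1, \\ 0 & \text{otherwise.} \end{cases}$$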
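To show how this subgradient can drive an optimizer, here is a minimal sketch of plain subgradient descent on the unregularized hinge loss. It assumes a score function without a bias term, a fixed step size, and a tiny linearly separable toy dataset; the names `hinge_subgradient` and `train`, the step size, and the data are illustrative, and a real SVM would also include a regularization term.

```python
from typing import List, Sequence, Tuple


def hinge_subgradient(w: Sequence[float], x: Sequence[float], t: int) -> List[float]:
    """Subgradient of max(0, 1 - t * (w . x)) with respect to w."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    if t * y < 1:
        return [-t * xi for xi in x]
    return [0.0] * len(w)


def train(data: Sequence[Tuple[Sequence[float], int]], dim: int,
          lr: float = 0.1, epochs: int = 50) -> List[float]:
    """Per-example subgradient descent with a fixed step size (no regularizer)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, t in data:
            g = hinge_subgradient(w, x, t)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


if __name__ == "__main__":
    # Toy data, separable by a hyperplane through the origin.
    data = [([2.0, 1.0], 1), ([1.5, 2.0], 1),
            ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
    w = train(data, dim=2)
    margins = [t * sum(wi * xi for wi, xi in zip(w, x)) for x, t in data]
    print("weights:", w)
    print("margins t*y:", margins)
```

Updates stop as soon as every example satisfies $t \cdot y \ge 1$, since the subgradient is then zero everywhere on the training set.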

Figure: Three variants of the hinge loss as a function of z = ty: the "ordinary" variant (blue), its square (green), and the piecewise smooth version by Rennie and Srebro (red). The vertical axis is the hinge loss ℓ(y), and the horizontal axis is z = ty.

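However, since the derivative of the hinge loss at $ty = 1$ is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[7]

$$\ell(y) = \begin{cases} \frac{1}{2} - ty & \text{if } ty \le 0, \\ \frac{1}{2}(1 - ty)^2 & \text{if } 0 < ty < 1, \\ 0 & \text{if } 1 \le ty \end{cases}$$

or the quadratically smoothed

$$\ell_\gamma(y) = \begin{cases} \frac{1}{2\gamma} \max(0, 1 - ty)^2 & \text{if } ty \ge 1 - \gamma, \\ 1 - \frac{\gamma}{2} - ty & \text{otherwise} \end{cases}$$

suggested by Zhang.[8] The modified Huber loss $L$ is a special case of this loss function with $\gamma = 2$, specifically $L(t, y) = 4 \ell_2(y)$.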
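The smoothed variants and the stated relation to the modified Huber loss can be checked with a short sketch. The function names are illustrative, and the closed form used for the modified Huber loss ($\max(0, 1 - ty)^2$ for $ty \ge -1$, and $-4ty$ otherwise) is the commonly used definition, assumed here rather than taken from the text above.

```python
import math


def smooth_hinge(t: int, y: float) -> float:
    """Piecewise smooth hinge loss of Rennie and Srebro."""
    z = t * y
    if z <= 0:
        return 0.5 - z
    if z < 1:
        return 0.5 * (1 - z) ** 2
    return 0.0


def quad_smoothed_hinge(t: int, y: float, gamma: float) -> float:
    """Quadratically smoothed hinge loss with smoothing parameter gamma > 0."""
    z = t * y
    if z >= 1 - gamma:
        return max(0.0, 1 - z) ** 2 / (2 * gamma)
    return 1 - gamma / 2 - z


def modified_huber(t: int, y: float) -> float:
    """Modified Huber loss (assumed standard form, see lead-in)."""
    z = t * y
    if z >= -1:
        return max(0.0, 1 - z) ** 2
    return -4 * z


if __name__ == "__main__":
    # Compare the modified Huber loss with 4 * l_2(y) at a few scores.
    for y in (-3.0, -1.0, 0.0, 0.5, 1.0, 2.0):
        lhs = modified_huber(1, y)
        rhs = 4 * quad_smoothed_hinge(1, y, gamma=2.0)
        print(f"y = {y:+.1f}  modified Huber = {lhs:.4f}  4*l_2 = {rhs:.4f}")
```

Both smoothed losses agree with the ordinary hinge loss in being zero for $ty \ge 1$ and linear for strongly wrong predictions; they differ only in replacing the kink with a quadratic piece.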

