Cross-entropy loss measures how badly a model's predicted probability distribution disagrees with the target distribution. In plain language, it becomes small when the model assigns high probability to the correct answer and large when it assigns low probability to the correct answer.
That is why it is one of the most common loss functions in classification and language modeling.
Why Probability-Based Loss Matters
In many machine learning tasks, a model does not just output a label. It outputs probabilities.
Examples:
- the probability a message is spam
- the probability an image contains a cat
- the probability the next token is `model` instead of `system`
We need a loss function that tells us not only whether the model is right, but how good or bad its confidence is.
Cross-entropy does that cleanly.
The Core Intuition
Suppose the correct class is cat.
Compare two models:
- Model A says P(cat) = 0.9
- Model B says P(cat) = 0.2
Both are making a probabilistic statement, but Model A is much better aligned with reality. Cross-entropy assigns a much smaller loss to Model A than to Model B.
Now compare:
- Model C says P(cat) = 0.01
If the true answer is still cat, Model C is not just wrong. It is confidently wrong. Cross-entropy punishes that very heavily.
That strong penalty is one of its most important properties.
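The three models above can be compared numerically. This is a minimal sketch that assumes the per-example loss is the negative log of the probability assigned to the true class (the standard single-example form of cross-entropy); the helper name is illustrative, not from any library:

```python
import math

def cross_entropy(p_true_class: float) -> float:
    """Cross-entropy for one example: the negative log of the
    probability the model assigned to the correct class."""
    return -math.log(p_true_class)

# True class is cat; each model reports its P(cat).
print(cross_entropy(0.9))   # Model A: ~0.105 (small loss)
print(cross_entropy(0.2))   # Model B: ~1.609 (large loss)
print(cross_entropy(0.01))  # Model C: ~4.605 (confidently wrong)
```

Note how Model C's loss is not just larger than Model B's; it is disproportionately larger, which is exactly the "confidently wrong" penalty described above.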
Why the Logarithm Shows Up
Cross-entropy uses the negative logarithm of the predicted probability assigned to the correct class.
You do not need to fear the formula to understand the behavior:
- if the model assigns probability close to 1, the loss is small
- if it assigns probability close to 0, the loss becomes very large
The log transforms probability mistakes into a scale that strongly discourages confident errors.
This is desirable because a model that is confidently wrong is often more dangerous than one that is uncertain.
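You can see this scaling directly by tabulating -log(p) as the probability assigned to the correct class shrinks. A quick sketch:

```python
import math

# How the loss -log(p) grows as the probability assigned
# to the correct class shrinks toward zero.
for p in [0.99, 0.9, 0.5, 0.1, 0.01, 0.001]:
    print(f"p = {p:<6} loss = {-math.log(p):.3f}")
```

Each tenfold drop in probability adds roughly the same amount of loss (about 2.3 nats), so the penalty keeps climbing without bound as p approaches 0.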
Binary Cross-Entropy vs Multiclass Cross-Entropy
In binary classification, the target is typically one of two outcomes, such as:
- spam vs not spam
- fraud vs not fraud
- positive vs negative
In multiclass classification, the model chooses among more than two classes, such as:
- dog
- cat
- horse
- airplane
The same intuition remains:
- reward high probability on the correct class
- punish low probability on the correct class
The exact formula changes slightly, but the conceptual role is the same.
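The two formulas can be sketched side by side. This assumes a one-hot target in the multiclass case, so only the probability on the true class enters the loss; function names are illustrative:

```python
import math

def binary_cross_entropy(y: int, p: float) -> float:
    """Binary case: y is 0 or 1, p is the predicted P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def multiclass_cross_entropy(true_class: int, probs: list[float]) -> float:
    """Multiclass case with a one-hot target: only the probability
    assigned to the true class contributes."""
    return -math.log(probs[true_class])

print(binary_cross_entropy(1, 0.9))                          # spam predicted at 0.9: ~0.105
print(multiclass_cross_entropy(1, [0.1, 0.8, 0.05, 0.05]))   # cat at index 1: ~0.223
```

When there are exactly two classes, the multiclass form reduces to the binary form, which is why the intuition carries over unchanged.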
A Simple Example
Suppose the true class is dog.
The model predicts:
- P(dog) = 0.8
- P(cat) = 0.1
- P(horse) = 0.1
This is a good prediction, so the cross-entropy loss is low.
Now suppose another model predicts:
- P(dog) = 0.2
- P(cat) = 0.7
- P(horse) = 0.1
This is much worse, especially because the model is fairly confident in the wrong answer. Cross-entropy rises sharply.
The loss is therefore sensitive not just to correctness, but to the full probability distribution.
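Running the two example predictions through the single-example loss makes the gap concrete (a sketch, assuming one-hot targets so the loss depends only on P(dog)):

```python
import math

def cross_entropy(probs: dict[str, float], true_class: str) -> float:
    """One-hot target: the loss is -log of P(true class)."""
    return -math.log(probs[true_class])

good = {"dog": 0.8, "cat": 0.1, "horse": 0.1}
bad  = {"dog": 0.2, "cat": 0.7, "horse": 0.1}

print(cross_entropy(good, "dog"))  # ~0.223
print(cross_entropy(bad, "dog"))   # ~1.609
```

The second model's loss is roughly seven times larger, even though both models gave some probability to dog.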
Why Accuracy Alone Is Not Enough
Accuracy only asks whether the top predicted label is correct.
That means these two predictions both count as correct if dog is the true class:
- P(dog) = 0.51
- P(dog) = 0.99
But they are not equally good. The second model is much more certain and usually more useful if that certainty is deserved.
Cross-entropy distinguishes those cases, which is one reason it is a better training signal than accuracy.
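A small sketch makes the contrast explicit: accuracy sees two identical outcomes, while cross-entropy sees a large difference (assuming a two-class setup where "accurate" means P(dog) > 0.5):

```python
import math

# Both predictions put dog on top, so accuracy counts them
# as equally correct, but cross-entropy separates them.
for p_dog in [0.51, 0.99]:
    accurate = p_dog > 0.5
    loss = -math.log(p_dog)
    print(f"P(dog) = {p_dog}: accurate = {accurate}, loss = {loss:.3f}")
```

The losses come out near 0.673 and 0.010: a roughly 67x difference in training signal between two predictions that accuracy cannot tell apart.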
Why Cross-Entropy Works Well with Gradient-Based Learning
Cross-entropy interacts naturally with softmax outputs and differentiable optimization.
That matters because training needs smooth gradient information, not just a binary "right or wrong" signal. Cross-entropy tells the optimizer how strongly to correct the model and in which direction.
So it is not only conceptually meaningful. It is also mathematically convenient for learning.
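One reason the combination is so convenient: the gradient of cross-entropy applied to softmax outputs, taken with respect to the raw logits, collapses to "predicted probabilities minus the one-hot target". A sketch with illustrative helper names (this is the standard derivation, not any particular library's API):

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_wrt_logits(logits: list[float], true_class: int) -> list[float]:
    """Gradient of cross-entropy(softmax(logits)) w.r.t. the logits:
    it simplifies to (softmax probabilities - one-hot target)."""
    probs = softmax(logits)
    return [p - (1.0 if i == true_class else 0.0) for i, p in enumerate(probs)]

print(grad_wrt_logits([2.0, 0.5, -1.0], true_class=0))  # roughly [-0.214, 0.175, 0.039]
```

The gradient is negative on the correct class (push its logit up) and positive on the wrong ones (push them down), with magnitude proportional to how wrong each probability is, which is exactly the smooth corrective signal described above.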
Why It Matters in Language Models
Language models predict the next token as a probability distribution over a vocabulary.
If the correct next token is attention, then:
- assigning high probability to attention gives low loss
- assigning high probability to a wrong token gives high loss
Training repeatedly on this signal across massive corpora is one reason LLMs can improve token prediction quality so effectively.
In that setting, cross-entropy is not a side detail. It is central to the objective.
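The per-token picture can be sketched with a toy vocabulary (the three-word vocabulary and its probabilities are invented for illustration; real models use vocabularies of tens of thousands of tokens):

```python
import math

# A toy model's predicted distribution over the next token.
next_token_probs = {"attention": 0.6, "gradient": 0.25, "banana": 0.15}

def token_loss(probs: dict[str, float], correct_token: str) -> float:
    """Per-token cross-entropy: -log of the probability given
    to the token that actually comes next."""
    return -math.log(probs[correct_token])

print(token_loss(next_token_probs, "attention"))  # ~0.511 if attention is correct
print(token_loss(next_token_probs, "banana"))     # ~1.897 if banana were correct
```

Training averages this loss over every token position in the corpus, so lowering it means systematically shifting probability mass toward the tokens that actually occur.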
Cross-Entropy vs Entropy
The names are similar, which causes confusion.
- entropy measures uncertainty within one distribution
- cross-entropy measures mismatch between a target distribution and a predicted distribution
So cross-entropy becomes especially intuitive once you understand entropy, but they are not identical concepts.
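The distinction is easy to see in code. A sketch of both quantities (for a one-hot classification target, the target's entropy is zero, so all of the cross-entropy reflects mismatch):

```python
import math

def entropy(p: list[float]) -> float:
    """Uncertainty within a single distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p: list[float], q: list[float]) -> float:
    """Mismatch between target distribution p and prediction q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

target = [1.0, 0.0, 0.0]       # one-hot: the true class is the first
prediction = [0.7, 0.2, 0.1]

print(entropy(target))                    # 0.0: a one-hot target has no uncertainty
print(cross_entropy(target, prediction))  # ~0.357, i.e. -log(0.7)
```

Cross-entropy is never smaller than the target's entropy (Gibbs' inequality), and the two coincide exactly when the prediction matches the target distribution.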
Common Misunderstandings
Is cross-entropy only for classification?
It is most famous there, but any task involving predicted probability distributions can use it, including language modeling.
Does low cross-entropy guarantee good calibration?
Not necessarily. A model can have strong predictive performance and still be imperfectly calibrated.
Why are confident mistakes punished so strongly?
Because assigning near-zero probability to the truth is a very serious failure for a probabilistic model.
FAQ
What is the simplest definition of cross-entropy loss?
It is a loss function that measures how much the model's predicted probabilities disagree with the true target distribution.
Why does cross-entropy use logarithms?
Because logarithms strongly penalize assigning very low probability to the correct outcome.
Why is cross-entropy preferred over accuracy for training?
Because it uses the full probability distribution and provides a richer gradient signal.
Why is cross-entropy important in LLMs?
Because next-token prediction is fundamentally a probability-distribution problem, and cross-entropy measures how good those predictions are.