Data Science and Evaluation

What Is Cross-Entropy Loss?

Learn what cross-entropy loss measures, why it punishes confident wrong predictions so strongly, and how it connects probability, classification, and language modeling.
Cross-Entropy · Loss Functions · Classification · Machine Learning

Cross-entropy loss measures how badly a model's predicted probability distribution disagrees with the target distribution. In plain language, it becomes small when the model assigns high probability to the correct answer and large when it assigns low probability to the correct answer.

That is why it is one of the most common loss functions in classification and language modeling.
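For a single example with a one-hot target, this definition reduces to the negative log of the probability the model assigned to the correct class. A minimal sketch (the class names and values are illustrative):

```python
import math

def cross_entropy(predicted_probs, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(predicted_probs[true_class])

# High probability on the correct answer -> small loss (~0.105)
print(cross_entropy({"cat": 0.9, "dog": 0.1}, "cat"))

# Low probability on the correct answer -> large loss (~2.303)
print(cross_entropy({"cat": 0.1, "dog": 0.9}, "cat"))
```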

Why Probability-Based Loss Matters

In many machine learning tasks, a model does not just output a label. It outputs probabilities.

Examples:

  • the probability a message is spam
  • the probability an image contains a cat
  • the probability the next token is model instead of system

We need a loss function that tells us not only whether the model is right, but how good or bad its confidence is.

Cross-entropy does that cleanly.

The Core Intuition

Suppose the correct class is cat.

Compare two models:

  • Model A says P(cat) = 0.9
  • Model B says P(cat) = 0.2

Both are making a probabilistic statement, but Model A is much better aligned with reality. Cross-entropy assigns a much smaller loss to Model A than to Model B.

Now compare:

  • Model C says P(cat) = 0.01

If the true answer is still cat, Model C is not just wrong. It is confidently wrong. Cross-entropy punishes that very heavily.

That strong penalty is one of its most important properties.
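Those three models can be compared directly by taking the negative log of each probability (the 0.9, 0.2, and 0.01 values come from the text above):

```python
import math

# Cross-entropy for a single example when the true class is "cat"
for name, p_cat in [("Model A", 0.9), ("Model B", 0.2), ("Model C", 0.01)]:
    loss = -math.log(p_cat)
    print(f"{name}: P(cat)={p_cat} -> loss {loss:.2f}")
# Model A ~0.11, Model B ~1.61, Model C ~4.61
```

Note how the loss grows far faster than the probability shrinks: Model C's probability is 20 times smaller than Model B's, but its loss is only reached by the log blowing up near zero.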

Why the Logarithm Shows Up

Cross-entropy uses the negative logarithm of the predicted probability assigned to the correct class.

You do not need to fear the formula to understand the behavior:

  • if the model assigns probability close to 1, the loss is small
  • if it assigns probability close to 0, the loss becomes very large

The log transforms probability mistakes into a scale that strongly discourages confident errors.

This is desirable because a model that is confidently wrong is often more dangerous than one that is uncertain.

Binary Cross-Entropy vs Multiclass Cross-Entropy

In binary classification, the target is typically one of two outcomes, such as:

  • spam vs not spam
  • fraud vs not fraud
  • positive vs negative

In multiclass classification, the model chooses among more than two classes, such as:

  • dog
  • cat
  • horse
  • airplane

The same intuition remains:

  • reward high probability on the correct class
  • punish low probability on the correct class

The exact formula changes slightly, but the conceptual role is the same.
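The two variants can be sketched side by side. Assuming a single example with hard labels (the probabilities below are illustrative):

```python
import math

def binary_cross_entropy(y, p):
    """y is 0 or 1; p is the predicted probability of class 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def categorical_cross_entropy(probs, true_index):
    """probs is a distribution over classes; only the true class's
    probability enters the loss when the target is one-hot."""
    return -math.log(probs[true_index])

print(binary_cross_entropy(1, 0.9))                   # spam, predicted 0.9
print(categorical_cross_entropy([0.7, 0.2, 0.1], 0))  # dog among dog/cat/horse
```

With a one-hot target, the binary formula collapses to the same "negative log of the correct class" shape as the multiclass one.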

A Simple Example

Suppose the true class is dog.

The model predicts:

  • P(dog) = 0.8
  • P(cat) = 0.1
  • P(horse) = 0.1

This is a good prediction, so the cross-entropy loss is low.

Now suppose another model predicts:

  • P(dog) = 0.2
  • P(cat) = 0.7
  • P(horse) = 0.1

This is much worse, especially because the model is fairly confident in the wrong answer. Cross-entropy rises sharply.

The loss is therefore sensitive not just to correctness, but to the full probability distribution.
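The two predictions above can be evaluated directly with the negative log of the probability assigned to dog:

```python
import math

classes = ["dog", "cat", "horse"]
good_model = [0.8, 0.1, 0.1]   # confident in the right class
bad_model  = [0.2, 0.7, 0.1]   # confident in the wrong class
true_index = classes.index("dog")

print(-math.log(good_model[true_index]))  # ~0.22
print(-math.log(bad_model[true_index]))   # ~1.61
```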

Why Accuracy Alone Is Not Enough

Accuracy only asks whether the top predicted label is correct.

That means these two predictions both count as correct if dog is the true class:

  • P(dog) = 0.51
  • P(dog) = 0.99

But they are not equally good. The second model is much more certain and usually more useful if that certainty is deserved.

Cross-entropy distinguishes those cases, which is one reason it is a better training signal than accuracy.
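The difference is easy to demonstrate: both predictions pass the accuracy check, but their losses differ by almost two orders of magnitude.

```python
import math

for p_dog in (0.51, 0.99):
    correct = p_dog > 0.5       # accuracy only checks the argmax
    loss = -math.log(p_dog)     # cross-entropy sees the confidence
    print(f"P(dog)={p_dog}: correct={correct}, loss={loss:.3f}")
# both count as correct, but the losses are ~0.673 vs ~0.010
```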

Why Cross-Entropy Works Well with Gradient-Based Learning

Cross-entropy interacts naturally with softmax outputs and differentiable optimization.

That matters because training needs smooth gradient information, not just a binary "right or wrong" signal. Cross-entropy tells the optimizer how strongly to correct the model and in which direction.

So it is not only conceptually meaningful. It is also mathematically convenient for learning.
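One reason the pairing is so convenient: for softmax followed by cross-entropy, the gradient with respect to each logit simplifies to predicted probability minus target. A sketch with made-up logits:

```python
import math

def softmax(logits):
    exps = [math.exp(z - max(logits)) for z in logits]  # stabilized
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0]          # illustrative raw scores
probs = softmax(logits)
true_index = 0
one_hot = [1.0 if i == true_index else 0.0 for i in range(len(logits))]

# For softmax + cross-entropy, d(loss)/d(logit_i) = p_i - y_i
grad = [p - y for p, y in zip(probs, one_hot)]
print(grad)  # negative on the true class, positive elsewhere
```

The gradient pushes the true class's logit up and the others down, in proportion to how wrong each probability is.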

Why It Matters in Language Models

Language models predict the next token as a probability distribution over a vocabulary.

If the correct next token is attention, then:

  • assigning high probability to attention gives low loss
  • assigning high probability to a wrong token gives high loss

Training repeatedly on this signal across massive corpora is one reason LLMs can improve token prediction quality so effectively.

In that setting, cross-entropy is not a side detail. It is central to the objective.
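The per-token loss works exactly like the classification examples above, just averaged across positions. A toy sketch (the vocabulary and probabilities are hypothetical, not from any real model):

```python
import math

# (predicted distribution over a toy vocabulary, true next token) per step
steps = [
    ({"attention": 0.6, "model": 0.3, "the": 0.1}, "attention"),
    ({"attention": 0.2, "model": 0.7, "the": 0.1}, "model"),
]

# Training loss: average negative log-probability of the true tokens
loss = sum(-math.log(dist[target]) for dist, target in steps) / len(steps)
print(round(loss, 3))
```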

Cross-Entropy vs Entropy

The names are similar, which causes confusion.

  • entropy measures uncertainty within one distribution
  • cross-entropy measures mismatch between a target distribution and a predicted distribution

So cross-entropy becomes especially intuitive once you understand entropy, but they are not identical concepts.
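The distinction can be made concrete. With a one-hot target, entropy is zero (no uncertainty), while cross-entropy against a prediction can still be large; in general, cross-entropy is never below the target's entropy, with equality when the two distributions match.

```python
import math

def entropy(p):
    """Uncertainty within one distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Mismatch: expected -log q under the target distribution p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

target = [1.0, 0.0, 0.0]       # one-hot "dog"
predicted = [0.8, 0.1, 0.1]

print(entropy(target))                   # 0.0: no uncertainty in the target
print(cross_entropy(target, predicted))  # -log(0.8), ~0.22
```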

Common Misunderstandings

Is cross-entropy only for classification?

It is most famous there, but any task involving predicted probability distributions can use it, including language modeling.

Does low cross-entropy guarantee good calibration?

Not necessarily. A model can have strong predictive performance and still be imperfectly calibrated.

Why are confident mistakes punished so strongly?

Because assigning near-zero probability to the truth is a very serious failure for a probabilistic model.

FAQ

What is the simplest definition of cross-entropy loss?

It is a loss function that measures how much the model's predicted probabilities disagree with the true target distribution.

Why does cross-entropy use logarithms?

Because logarithms strongly penalize assigning very low probability to the correct outcome.

Why is cross-entropy preferred over accuracy for training?

Because it uses the full probability distribution and provides a richer gradient signal.

Why is cross-entropy important in LLMs?

Because next-token prediction is fundamentally a probability-distribution problem, and cross-entropy measures how good those predictions are.

Start here

Need this level of technical clarity inside the actual product work?

The studio handles the implementation side as seriously as the editorial side: architecture, delivery, and the interfaces people are expected to live with.