Entropy is a measure of uncertainty in a probability distribution. In information theory, it can also be understood as the average amount of surprise or information produced by outcomes sampled from that distribution.
The key idea is simple: a highly predictable distribution has low entropy, and a highly uncertain one has high entropy.
A Quick Intuition
Compare two random systems:
- a coin that always lands heads
- a fair coin
The first system has no uncertainty at all: the outcome is known in advance. The second has much more, because both outcomes remain plausible before the flip.
Entropy captures that difference numerically.
So entropy is not about disorder in a vague poetic sense. It is about how uncertain the next outcome really is.
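The coin comparison can be made concrete with a few lines of Python. This is a minimal sketch using the standard Shannon entropy formula in bits; the function name and example values are illustrative choices, not part of any library.

```python
import math

def entropy(probs):
    # Shannon entropy in bits: -sum(p * log2(p)), skipping zero-probability terms
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A coin that always lands heads: no uncertainty at all.
always_heads = [1.0, 0.0]
# A fair coin: maximum uncertainty for two outcomes.
fair_coin = [0.5, 0.5]

print(entropy(always_heads))  # 0.0 bits
print(entropy(fair_coin))     # 1.0 bit
```

The fair coin hits 1 bit, the maximum possible for two outcomes; the deterministic coin hits zero.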
Surprise and Probability
Rare events are more surprising than common events.
If a system usually outputs one result and suddenly produces something unusual, that event carries more information. Information theory formalizes that intuition by linking surprise to probability.
Entropy then becomes the average surprise across all possible outcomes, weighted by how likely they are: H(p) = -Σ p(x) log p(x), with the sum running over every outcome x.
That is why entropy is fundamentally a probabilistic concept.
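The surprise-to-entropy link above can be sketched directly. Here surprise is the standard self-information -log2(p); the two-outcome distribution is a made-up example for illustration.

```python
import math

def surprise(p):
    # Self-information in bits: -log2(p). Rare events (small p) carry more surprise.
    return -math.log2(p)

dist = {"common": 0.9, "rare": 0.1}

# The rare outcome is far more surprising than the common one.
for outcome, p in dist.items():
    print(outcome, surprise(p))

# Entropy is the expected surprise, weighted by each outcome's probability.
H = sum(p * surprise(p) for p in dist.values())
print(H)  # ~0.469 bits
```

Note that the common outcome contributes little to the average both because it is unsurprising and because its surprise is down-weighted less than the rare outcome's surprise is up-weighted.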
Low Entropy vs High Entropy
Low-entropy distributions are concentrated.
Examples:
- one class has probability 0.99
- one token is overwhelmingly likely
- one branch outcome dominates
High-entropy distributions are more spread out.
Examples:
- several classes have similar probabilities
- the next token is very uncertain
- a system is undecided among many plausible options
This distinction is useful everywhere from decision trees to language modeling.
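The concentrated-versus-spread contrast is easy to verify numerically. The two three-class distributions below are invented examples; the entropy helper is the standard Shannon formula in bits.

```python
import math

def entropy(probs):
    # Shannon entropy in bits, skipping zero-probability terms
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Concentrated (low entropy): one class dominates.
concentrated = [0.99, 0.005, 0.005]
# Spread out (high entropy): all classes nearly equally likely.
spread = [0.34, 0.33, 0.33]

print(entropy(concentrated))  # ~0.09 bits
print(entropy(spread))        # close to the 3-class maximum, log2(3) ≈ 1.585 bits
```

For a distribution over n outcomes, entropy is maximized at log2(n) bits when all outcomes are equally likely.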
Why Entropy Matters in Machine Learning
Machine learning frequently works with probability distributions:
- class probabilities
- next-token probabilities
- uncertainty estimates
- feature splits
Entropy gives us a way to quantify how uncertain those distributions are.
That makes it useful in:
- classification
- decision tree construction
- active learning
- calibration analysis
- information-theoretic interpretations of learning
Entropy in Decision Trees
A classic application is decision tree splitting.
If a dataset is perfectly mixed between labels, entropy is at its maximum because the label is as uncertain as it can be. A good split reduces that uncertainty by separating the classes more clearly.
That is why information gain is built from entropy reduction. A strong split is one that leaves the child nodes more predictable than the parent node.
So entropy helps quantify how much uncertainty a feature removes.
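The information-gain calculation described above can be sketched as follows. The tiny labeled dataset and function names are illustrative; the formula (parent entropy minus the size-weighted entropy of the children) is the standard one used in ID3-style decision trees.

```python
import math
from collections import Counter

def entropy(labels):
    # Empirical entropy of a list of class labels, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    # Parent entropy minus the size-weighted average entropy of the child nodes
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Perfectly mixed parent node: entropy is 1 bit.
parent = ["yes", "yes", "no", "no"]
# A split that separates the classes perfectly removes all uncertainty.
left, right = ["yes", "yes"], ["no", "no"]
print(information_gain(parent, [left, right]))  # 1.0
```

A split that leaves the children just as mixed as the parent would score a gain of zero.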
Entropy vs Cross-Entropy
The names are similar because the concepts are related.
- entropy measures uncertainty within one distribution
- cross-entropy measures how well one distribution predicts another
If entropy asks, "How uncertain is this system?", then cross-entropy asks, "How much penalty do I incur when I use my predicted distribution to encode the true one?"
That is why cross-entropy is so important in classification and language modeling.
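The relationship between the two quantities can be demonstrated numerically. The distributions below are made-up examples; the key fact on display is that cross-entropy is never below the entropy of the true distribution, and the gap grows as the prediction drifts from the truth.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Average bits needed to encode samples from p using a code built for q.
    # Always >= entropy(p); equal only when q matches p exactly.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

true_dist = [0.9, 0.1]
good_pred = [0.85, 0.15]
bad_pred  = [0.5, 0.5]

print(entropy(true_dist))                  # ~0.469 bits
print(cross_entropy(true_dist, good_pred)) # slightly above the entropy
print(cross_entropy(true_dist, bad_pred))  # 1.0 bit: the fair-coin code wastes bits
```

The excess of cross-entropy over entropy is the KL divergence, which is exactly the penalty for predicting with the wrong distribution.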
Entropy in Language Models
Language models output a probability distribution over tokens.
If the model is highly uncertain about what comes next, the distribution has higher entropy. If one token is clearly favored, entropy is lower.
This matters because:
- low-entropy outputs often look more confident
- high-entropy outputs reflect ambiguity or uncertainty
- decoding behavior is shaped by the model's probability distribution
So entropy gives insight into how certain the model is, even before we look at the final sampled token.
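A minimal sketch of this idea, assuming hypothetical logits over a tiny four-token vocabulary (real models produce logits over tens of thousands of tokens, but the mechanics are the same):

```python
import math

def softmax(logits):
    # Convert logits to probabilities; subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token logits over a 4-token vocabulary.
confident_logits = [8.0, 1.0, 0.5, 0.2]   # one token clearly favored
uncertain_logits = [1.1, 1.0, 0.9, 1.0]   # many plausible continuations

print(entropy(softmax(confident_logits)))  # low: close to 0 bits
print(entropy(softmax(uncertain_logits)))  # high: close to log2(4) = 2 bits
```

This entropy is available before any token is sampled, which is why it is sometimes used as a per-step confidence signal.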
Entropy Is Not the Same as Correctness
A low-entropy distribution can still be wrong.
That means a model can be very confident and still assign most of its mass to the wrong outcome. So entropy measures uncertainty, not truth.
This distinction is crucial. Confidence and correctness are related, but not identical.
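The confidence-versus-correctness gap can be shown in a few lines. The predicted distribution below is an invented example of a confidently wrong model: its entropy is low, yet its log loss on the true label is high.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# The model puts 95% of its mass on class 0, but the true label is class 1.
predicted = [0.95, 0.05]
true_label_index = 1

# Low entropy: the model is very confident...
print(entropy(predicted))                       # ~0.29 bits
# ...but the log loss on the true label is large.
print(-math.log2(predicted[true_label_index]))  # ~4.32 bits
```

Entropy only looks at the shape of the predicted distribution; the loss also looks at where the truth actually landed.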
Why This Matters in Modern ML
Entropy appears in many places because machine learning is full of uncertainty:
- uncertain labels
- uncertain predictions
- uncertain futures
- uncertain feature relevance
If you understand entropy, you can better understand why certain objectives, evaluation tools, and model behaviors make sense mathematically.
It is one of the cleanest bridges between probability theory and practical ML.
FAQ
What is the simplest definition of entropy?
Entropy is a measure of uncertainty in a probability distribution.
Does high entropy mean the model is wrong?
No. It means the model is uncertain. It may still be correct or incorrect.
Why is entropy important in decision trees?
Because good feature splits reduce uncertainty, and entropy gives a natural way to measure that reduction.
How is entropy related to cross-entropy?
Entropy measures uncertainty in one distribution, while cross-entropy measures mismatch between a target distribution and a predicted one.