
Softmax Function: Activation for Classification


Published: March 24, 2023 · 6 min read

Graph of activation functions showing the difference between linear and non-linear transformations in neural networks

Table of contents

Key takeaways
The formula and its interpretation
A worked example: from logits to probabilities
Softmax and cross-entropy: the inseparable pair
Applications in image, text, and speech classification
Advantages and limitations
Frequently asked questions
Why does Softmax use the exponential function instead of another transform?
What is the difference between Softmax and the sigmoid function?
Conclusion
Sources

Updated: 2026-07-17

The Softmax function is the standard activation function for the output layer in multi-class classification problems: it transforms a vector of arbitrary real values (called logits) into a probability distribution where all values are positive and sum to 1. It underlies most models that output a distribution over discrete categories, from image classifiers to language models. The same explanation is available in Spanish.


Key takeaways

Softmax converts a vector of K logits into K probabilities that sum to 1.
The function amplifies differences: the highest logit receives disproportionately more probability.
During training it is almost always paired with the cross-entropy loss function.
Redundant for binary classification (a single sigmoid output is equivalent and cheaper) and not meant for regression (use a linear output).
In language models, Softmax over the vocabulary is the final operation that produces the probability distribution over the next token: GPT-3 applies this operation over more than 50,000 possible tokens at every generation step.

The formula and its interpretation

For an input vector z = (z₁, z₂, …, z_K) of K logits, the Softmax function produces the probability vector σ where:

$$\sigma(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \ldots, K$$

Immediate properties:

Positivity: exponentiation always produces positive values, regardless of the logit’s sign.
Normalisation: dividing by the sum guarantees the resulting vector is a valid probability distribution.
Relative monotonicity: if zᵢ > zⱼ, then σ(zᵢ) > σ(zⱼ), meaning the preference order between classes is preserved.
Difference amplification: the ratio between two probabilities is σ(zᵢ)/σ(zⱼ) = e^(zᵢ − zⱼ), so it grows exponentially with the gap between the logits. Only the gaps matter: adding the same constant to every logit leaves the output unchanged.

The function is non-linear: the probability of each class depends on all logits simultaneously through the denominator sum.
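The properties above can be checked with a few lines of NumPy. This is a minimal sketch, not a reference implementation: the function name `softmax` and the logit values are illustrative choices, and subtracting the maximum logit before exponentiating is a common stability trick that the formula itself does not require.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array of logits."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)       # shifting by a constant does not change the result
    exp_z = np.exp(shifted)       # every exponent is <= 0, so no term exceeds 1
    return exp_z / exp_z.sum()

logits = np.array([3.0, -1.0, 0.5, 1.5])   # illustrative values (assumption)
probs = softmax(logits)

print(probs)                                   # positivity: every entry is > 0
print(probs.sum())                             # normalisation: 1 up to rounding
print(np.argsort(logits), np.argsort(probs))   # same ranking before and after
print(np.allclose(softmax(logits + 100.0), probs))  # shift invariance
```

Because only the gaps between logits matter, the shifted and unshifted versions describe the same distribution; the shift simply keeps the exponentials in a safe numeric range.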

A worked example: from logits to probabilities

Take a 3-class classifier (say, “cat”, “dog”, “bird”) that outputs the logit vector z = (2.0, 1.0, 0.1). Applying the formula above to each component gives:

Class	Logit (z)	e^z	Probability σ(z)
Cat	2.0	7.39	0.66 (66%)
Dog	1.0	2.72	0.24 (24%)
Bird	0.1	1.11	0.10 (10%)

The three probabilities sum to 1.00. Notice the amplification: a gap of just 1.0 between the “cat” and “dog” logits translates into “cat” receiving e ≈ 2.7 times the probability of “dog”. If we scale all three logits by 3, giving z = (6.0, 3.0, 0.3), the probability of “cat” jumps to 0.95 even though the ranking of the classes is unchanged, which illustrates why the scale of the logits, not just their order, drives the model’s confidence.
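The table and the scaling experiment can be reproduced as follows. This is a small sketch using the logits from the example; the helper `softmax` and the variable names are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max())   # stable form of e^z / sum(e^z)
    return exp_z / exp_z.sum()

classes = ["cat", "dog", "bird"]
z = np.array([2.0, 1.0, 0.1])     # logits from the worked example

p = softmax(z)
for name, logit, prob in zip(classes, z, p):
    print(f"{name:5s} logit={logit:4.1f}  e^z={np.exp(logit):5.2f}  prob={prob:.2f}")

# Multiply every logit by 3: same ranking, much sharper distribution.
p_scaled = softmax(3 * z)
print(f"cat after scaling: {p_scaled[0]:.2f}")   # 0.95

# The cat/dog probability ratio equals e^(2.0 - 1.0).
print(p[0] / p[1], np.e)
```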

Neural network diagram showing the output layer with the Softmax function transforming logits into class probabilities

Softmax and cross-entropy: the inseparable pair

In practice, Softmax is rarely used on its own when training a classifier: it is almost always paired with the cross-entropy loss function:

$$L = -\sum_{k=1}^{K} y_k \log(\sigma(z_k))$$

Here y is the one-hot vector of the true label (a 1 at the correct class, 0 elsewhere), so the sum reduces to −log σ(z_c), where c is the correct class. The loss is therefore large when the model assigns low probability to the correct class and close to zero when that probability approaches 1. The Softmax + cross-entropy combination has an elegant mathematical property: its gradient with respect to the logits is simply σ(z) − y, which makes backpropagation cheap to compute and well-behaved.
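The gradient identity σ(z) − y can be verified numerically. The sketch below compares it against a central finite-difference approximation; the helper names, the choice of “dog” as the true class, and the step size `eps` are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

def cross_entropy(z, y):
    """L = -sum_k y_k * log(softmax(z)_k) for a one-hot target y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])    # assume "dog" is the true class

analytic = softmax(z) - y        # closed-form gradient dL/dz

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    step = np.zeros_like(z)
    step[k] = eps
    numeric[k] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)

print(analytic)
print(numeric)
```

The gradient components sum to zero, and only the true class has a negative entry: a gradient-descent step raises the correct logit and lowers the others in proportion to the probability they currently receive.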

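Deep learning frameworks usually compute this loss directly from the logits rather than from Softmax outputs. In PyTorch, CrossEntropyLoss takes raw logits and is equivalent to LogSoftmax followed by NLLLoss; Keras cross-entropy losses offer a from_logits option for the same purpose. Working in log space avoids both the overflow that occurs when exponentiating large positive logits and the log(0) that results when a probability underflows to zero. The torch.nn.Softmax reference in the official PyTorch documentation points to LogSoftmax for use with NLLLoss.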
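The numerical issue can be shown without any framework. The sketch below is a plain NumPy illustration of the log-sum-exp idea, not the code that PyTorch or Keras actually run; the function names and the extreme logit values are assumptions chosen to trigger the failure.

```python
import numpy as np

def log_softmax(z):
    """log(softmax(z)) computed with the log-sum-exp trick."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()
    return shifted - np.log(np.sum(np.exp(shifted)))

def cross_entropy_from_logits(z, target_index):
    """Cross-entropy for a single example, taken straight from the logits."""
    return -log_softmax(z)[target_index]

big = np.array([1000.0, 990.0, 0.0])   # extreme logits (assumption)

# Naive formula: exp(1000) overflows to inf, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(big) / np.sum(np.exp(big))

print(naive)                                  # contains nan
print(log_softmax(big))                       # finite everywhere
print(cross_entropy_from_logits(big, 1))      # close to 10, the logit gap
```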

Applications in image, text, and speech classification

Image classification. The final layer of an image classification network (a CNN such as ResNet or EfficientNet, or a Vision Transformer such as ViT) applies Softmax over the K output logits, where K is the number of classes. An ImageNet-1k classifier produces 1,000 probabilities this way.

Language models. Autoregressive models such as GPT apply Softmax over a vocabulary of roughly 30,000 to 100,000 tokens to produce the distribution over the next token; masked models such as BERT do the same for the masked positions. Softmax also appears inside each attention layer, where it normalises attention scores, and techniques like FlashAttention target that attention Softmax rather than the vocabulary one. See NLP advances.

Text and speech classification. Sentiment analysis (positive/negative/neutral), intent classification in dialogue systems, language identification: all use Softmax in the output layer.

Graph of the sigmoid function showing its saturation at the extremes, in contrast with Softmax’s behaviour over vectors

Advantages and limitations

Advantages:

Produces directly interpretable probabilities, not just arbitrary scores.
Its derivative combined with cross-entropy is numerically stable and computationally simple.
Compatible with probabilistic calibration: a well-trained model can produce probabilities that correspond to real frequencies.

Limitations:

Redundant for binary classification: for two classes, a single sigmoid output is more efficient (Softmax over two logits is mathematically equivalent to a sigmoid of their difference).
Exaggeration of differences: if the gaps between logits are very large, Softmax can assign almost all probability to one class, generating high-confidence predictions that do not always match reality. This can be mitigated with temperature scaling.
Not designed for simultaneous multiple classes (multilabel): if an instance can belong to several classes at once, per-class sigmoid is the correct choice, not Softmax.

Temperature scaling. Dividing logits by a temperature T before applying Softmax controls the “confidence” of predictions: T > 1 softens the distribution (more uncertainty); T < 1 sharpens it. This technique is fundamental in generative language models to balance the creativity of generated text against its coherence.
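A minimal sketch of temperature scaling, reusing the logits from the worked example. The function name `softmax_with_temperature` and the three temperature values are illustrative assumptions.

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Softmax of z / T; T must be strictly positive."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = np.asarray(z, dtype=float) / T
    exp_z = np.exp(scaled - scaled.max())
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])

sharp = softmax_with_temperature(z, T=0.5)   # T < 1: more peaked
base = softmax_with_temperature(z, T=1.0)    # T = 1: plain Softmax
soft = softmax_with_temperature(z, T=2.0)    # T > 1: flatter

for label, p in (("T=0.5", sharp), ("T=1.0", base), ("T=2.0", soft)):
    print(label, np.round(p, 3))
```

The ranking of the classes is the same at every temperature; only the spread of the probability mass changes. As T grows the distribution approaches uniform, and as T approaches 0 it approaches a one-hot vector on the largest logit.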

To see Softmax’s role in the broader context of activation functions, see mathematical formulation of artificial neural networks and linear function as an activation function. For practical applications in image classification and vision, see image analysis and computer vision.

Frequently asked questions

Why does Softmax use the exponential function instead of another transform?

Because the exponential guarantees three properties at once: it is always positive (necessary for the result to be a probability), it is monotonically increasing (it preserves the order of the logits), and it amplifies large differences smoothly and in a differentiable way. Other positive, increasing functions exist, but few combine these properties with as simple a gradient once paired with cross-entropy.

What is the difference between Softmax and the sigmoid function?

The sigmoid computes an independent probability for each output, useful in binary or multilabel classification where classes do not compete with each other. Softmax computes a single categorical distribution over K mutually exclusive classes that always sums to 1: the classes compete for the same probability mass.
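The contrast can be made concrete with a short comparison. This is a sketch under simple assumptions: the helper functions and the logit values are illustrative, and the two-class check uses the identity that Softmax over (a, b) gives the first class probability sigmoid(a − b).

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([2.0, 1.0, 0.1])

independent = sigmoid(z)   # one probability per class, no competition
competing = softmax(z)     # one distribution shared by all classes

print(independent, independent.sum())   # the sum is not constrained to 1
print(competing, competing.sum())       # the sum is 1

# Two-class case: Softmax collapses to a sigmoid of the logit difference.
two = np.array([1.3, -0.4])
p_softmax = softmax(two)[0]
p_sigmoid = sigmoid(two[0] - two[1])
print(p_softmax, p_sigmoid)
```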

Conclusion

The Softmax function is the bridge between a neural network’s internal logic and interpretable decision-making: it converts arbitrary numbers into probabilities that sum to exactly 1. Its ubiquity in multi-class classification and language models makes it a function that every deep learning practitioner must understand thoroughly. Knowing its limitations, especially overconfidence with large logits and incompatibility with multilabel tagging, is as important as knowing how to apply it correctly.

Sources: [1] Softmax function (Wikipedia); [2] CS231n: Linear classification, Softmax classifier (Stanford); [3] torch.nn.Softmax (official PyTorch documentation); [4] Deep Learning, ch. 6: Deep Feedforward Networks (Goodfellow, Bengio, and Courville).
