
The Rectified Linear Unit (ReLU): An Essential Tool for Deep Learning

March 24, 2023

Plot of the ReLU function showing zero output for negative values and linear output for positive values

Table of contents

Key takeaways
How ReLU works
Why it surpassed sigmoid and tanh
Applications in deep learning
The dying ReLU problem
Frequently asked questions
Why is it called Rectified Linear Unit, not Uniform Rectified Unit?
When should Leaky ReLU be used instead of standard ReLU?
Is ReLU also used in the output layer?
Conclusion
Sources

Updated: 2026-07-17

ReLU (Rectified Linear Unit) is the activation function f(x) = max(0, x): it returns the input unchanged when positive and zero when negative. This rule, cheaper to compute than sigmoid or tanh and resistant to the vanishing gradient problem, made it the default activation function of modern deep learning after AlexNet popularised it in 2012.

This guide is also available in Spanish: La Función Unidad Rectificada Uniforme (RELU): Una Herramienta Esencial para el Aprendizaje Profundo.

Key takeaways

ReLU is defined as f(x) = max(0, x): one of the cheapest nonlinear activation functions to compute.
Popularised by AlexNet (2012), it became the default activation as deep learning took off.
Mitigates the vanishing gradient problem that affects sigmoid and tanh in deep networks.
Its main weakness is “dying ReLU”: permanently deactivated neurons, a problem that Leaky ReLU mitigates.
Remains a common default for hidden layers, especially in vision architectures; many NLP models use smooth variants such as GELU.

How ReLU works

The ReLU function is mathematically defined as:

f(x) = max(0, x)

Its behaviour:

For x > 0: output is x (identity function).
For x ≤ 0: output is 0 (the neuron transmits no signal).
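
The definition above can be written in a few lines. This is a minimal sketch in plain Python with no framework; the helper names relu and relu_layer are illustrative, not taken from any library.

```python
def relu(x: float) -> float:
    """Return x when it is positive and 0.0 otherwise, i.e. max(0, x)."""
    return max(0.0, x)


def relu_layer(values: list[float]) -> list[float]:
    """Apply ReLU element-wise, as a layer would to its pre-activations."""
    return [relu(v) for v in values]


print(relu_layer([-2.0, -0.5, 0.0, 0.5, 2.0]))
# prints [0.0, 0.0, 0.0, 0.5, 2.0]
```

In a real network the same operation runs on whole tensors at once, but the rule applied to each element is the same.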

ReLU activation function plot showing the null region for negative x and the linear region for positive x

The derivative of ReLU equals 1 for x > 0 and 0 for x < 0. It is undefined at x = 0, where implementations pick a value by convention, commonly 0.
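
A small sketch of the derivative, assuming the value 0 at x = 0 (a convention chosen for this example, not a mathematical fact). The helper numeric_grad is a hypothetical finite-difference check used only to confirm the hand-written derivative away from zero.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of max(0, x): 1 for x > 0, 0 for x < 0.

    At x = 0 the derivative is undefined; returning 0.0 there is a
    convention assumed for this sketch.
    """
    return 1.0 if x > 0 else 0.0


def numeric_grad(f, x: float, h: float = 1e-6) -> float:
    """Central finite difference, used only to sanity-check relu_grad."""
    return (f(x + h) - f(x - h)) / (2 * h)


# The analytic and numeric derivatives agree away from x = 0.
for x in (-3.0, -0.1, 0.1, 3.0):
    assert abs(relu_grad(x) - numeric_grad(relu, x)) < 1e-6
```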

The elegance of ReLU is that it needs no exponentials or divisions, only a comparison, and it does not saturate for positive inputs. On a GPU processing millions of activations per second, that makes it noticeably cheaper than sigmoid.

Why it surpassed sigmoid and tanh

Before ReLU, the sigmoid function and hyperbolic tangent dominated. Both have the same problem in deep networks: saturation and vanishing gradients.

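When a sigmoid receives inputs that are large in magnitude, positive or negative, its derivative approaches zero, and even at its peak the derivative is only 0.25. Backpropagation multiplies these derivatives layer by layer (chain rule), so the gradient arrives practically extinguished at the early layers. ReLU does not saturate in the positive region: its derivative there is exactly 1, so the activation itself does not attenuate the gradient.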
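
To make the chain-rule argument concrete, here is a rough sketch that multiplies the activation derivative across 10 layers. It is a simplification: weights are ignored and every layer is assumed to see the same pre-activation, so gradient_factor is an illustrative helper, not a model of a real network.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0


def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0


def gradient_factor(grad_fn, pre_activation: float, depth: int) -> float:
    """Product of `depth` identical local derivatives (weights ignored)."""
    return grad_fn(pre_activation) ** depth


depth = 10
best_case_sigmoid = gradient_factor(sigmoid_grad, 0.0, depth)  # 0.25 ** 10
saturated_sigmoid = gradient_factor(sigmoid_grad, 5.0, depth)
active_relu = gradient_factor(relu_grad, 5.0, depth)

print(f"sigmoid at x=0: {best_case_sigmoid:.2e}")
print(f"sigmoid at x=5: {saturated_sigmoid:.2e}")
print(f"ReLU at x=5:    {active_relu:.1f}")
```

Even in the sigmoid's best case the factor is 0.25 to the power of 10, below one millionth, while an active ReLU path keeps a factor of 1.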

Three concrete reasons for its massive adoption:

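Training speed: in the AlexNet paper, a four-layer convolutional network with ReLU reached 25% training error on CIFAR-10 about six times faster than the same network with tanh.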
Sparse activations: at any given moment, many neurons return 0, generating more compact representations.
Ease of implementation: a simple max(0, x) is trivial in any framework.
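
The sparsity point can be checked with a quick experiment. This sketch assumes pre-activations drawn from a zero-mean Gaussian, which is an idealisation; in a trained network the fraction of zeros depends on the weights and the data.

```python
import random


def relu(x: float) -> float:
    return max(0.0, x)


rng = random.Random(0)  # fixed seed so the run is repeatable
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
activations = [relu(v) for v in pre_activations]

# Count how many units output exactly zero after ReLU.
zero_fraction = sum(1 for a in activations if a == 0.0) / len(activations)
print(f"fraction of zero activations: {zero_fraction:.2f}")
```

Under this assumption roughly half of the units are silent for any given input.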

Applications in deep learning

ReLU and its variants appear across most areas of deep learning:

Image classification: AlexNet (2012), VGG, ResNet use ReLU throughout their convolutional layers.
Natural language processing: the original Transformer used ReLU in its feed-forward layers; later models such as BERT and GPT use GELU, and others use SiLU, both smooth variants of the ReLU idea.
Speech recognition: deep speech architectures use ReLU in intermediate dense layers.
Generative networks (GAN): the generator typically uses ReLU or Leaky ReLU in hidden layers.

Comparison of several activation functions including ReLU and its variants

In computer vision, ReLU appears in practically every convolutional architecture. The pre-trained models used as a base for transfer learning carry ReLU or its variants in their feature-extraction layers.

For reinforcement learning, policy and value networks also default to ReLU in most reference implementations.

The dying ReLU problem

Its main structural weakness: if a neuron's pre-activation is negative for every training input, its output is always 0 and so is its gradient. With no gradient, its weights never update, and the neuron permanently stops learning.
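
The failure mode can be reproduced with a single neuron. The values below (weight, bias, inputs) are hypothetical and chosen so that the pre-activation is negative for every input; the Leaky ReLU gradient with slope 0.01 is included only for contrast.

```python
def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    return 1.0 if x > 0 else alpha


# Hypothetical single neuron: pre-activation z = w * x + b.
w, b = 0.5, -10.0  # a large negative bias, e.g. after a bad update
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]

# Gradient of the activation with respect to w is act'(z) * x.
relu_grads = [relu_grad(w * x + b) * x for x in inputs]
leaky_grads = [leaky_relu_grad(w * x + b) * x for x in inputs]

print(relu_grads)   # every entry is 0.0: the weight can never move
print(leaky_grads)  # small but non-zero entries whenever x != 0
```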

Established solutions:

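Leaky ReLU: outputs αx instead of 0 for x < 0, with a small α such as 0.01 (see full Leaky ReLU article).
ELU (Exponential Linear Unit): smooth exponential curve for x < 0.
GELU: used in BERT and GPT, multiplies x by the standard Gaussian cumulative distribution function, giving a smooth curve close to ReLU.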
Careful weight initialisation: He initialisation reduces the probability of dying ReLU from the start.
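
For comparison, the three variants listed above can be sketched in plain Python. The default values alpha = 0.01 for Leaky ReLU and alpha = 1.0 for ELU are common choices assumed here, and gelu uses the exact form based on the error function rather than the tanh approximation some implementations use.

```python
import math


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """x for x > 0, alpha * x otherwise."""
    return x if x > 0 else alpha * x


def elu(x: float, alpha: float = 1.0) -> float:
    """x for x > 0, alpha * (exp(x) - 1) otherwise (smooth for x < 0)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, leaky_relu(x), round(elu(x), 4), round(gelu(x), 4))
```

All three keep a non-zero output, and therefore a non-zero gradient, for negative inputs, which is what lets a neuron recover.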

Frequently asked questions

Why is it called Rectified Linear Unit, not Uniform Rectified Unit?

Because the function is not uniform: it is linear only from zero upward (f(x) = x for x > 0) and constant at zero everywhere else. “Rectified” comes from electronics, where a rectifier lets only the positive part of a signal through; ReLU does exactly that with a network’s activations.

When should Leaky ReLU be used instead of standard ReLU?

When the learning rate is high or the network has many layers, the risk of dying ReLU increases. Leaky ReLU replaces zero with a small slope (for example 0.01x) on the negative side, so no neuron ends up completely dead.

Is ReLU also used in the output layer?

Almost never. It is the default choice for hidden layers, but the output layer usually uses softmax for multiclass classification or a linear function for regression, since those tasks need a probability distribution or an unclipped value, not a sparse activation.
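
As an illustration of that split, the sketch below runs a tiny forward pass with ReLU in the hidden layer and softmax at the output. The network size and all weights are made-up values for the example, and dense is a hypothetical helper, not a framework function.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs: list[float], weights: list[list[float]],
          biases: list[float]) -> list[float]:
    """One fully connected layer: weights[j] holds the weights of output j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


# Hypothetical weights for a 2-input, 3-hidden, 2-class network.
w_hidden = [[0.5, -1.0], [1.5, 0.2], [-0.7, 0.9]]
b_hidden = [0.0, -0.5, 0.1]
w_out = [[1.0, -0.5, 0.3], [-1.0, 0.8, 0.6]]
b_out = [0.0, 0.0]

# ReLU in the hidden layer, softmax at the output.
hidden = [relu(z) for z in dense([1.0, 2.0], w_hidden, b_hidden)]
probs = softmax(dense(hidden, w_out, b_out))
print(hidden, probs)
```

The hidden layer can output exact zeros, while the output layer always yields a full probability distribution over the classes.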

Conclusion

ReLU is the activation function that made training deep networks practical. Its minimal computational cost, resistance to vanishing gradients, and compatibility with architectures hundreds of layers deep made it the industry standard. Understanding its strengths and the dying ReLU problem is essential for any professional designing or tuning deep neural networks.

Sources: Glorot, Bordes and Bengio, “Deep Sparse Rectifier Neural Networks” (AISTATS 2011); PyTorch documentation for torch.nn.ReLU; TensorFlow documentation for tf.keras.activations.relu; Stanford CS231n notes on activation functions.
