Mathematical Formulation of Artificial Neural Network Input

Updated: 2026-05-03

Beneath the surface of any neural network lies linear algebra: vectors, matrices, and functions. Understanding the mathematical formulation of the input is not an academic exercise — it is the foundation for debugging models, diagnosing gradient problems, and choosing activation functions with sound reasoning.

Key takeaways

  • Each input to a neural network is represented as a column vector x of n dimensions.
  • The hidden layer applies a linear transformation via a weight matrix W and bias vector b, followed by a non-linear activation function.
  • The activation function introduces non-linearity without which the network would be equivalent to a simple linear regression.
  • Training adjusts weights W and b by minimising a loss function via gradient descent with backpropagation.
  • Vanishing and exploding gradients are the main mathematical obstacles when training deep networks.

The mathematical representation of the input

An input data sample is represented as a column vector of dimension n:

$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \in \mathbb{R}^n$$

Where each xᵢ is a feature of the data point: a pixel in an image, an encoded word in text, a numerical value in a table. For a batch of B samples, the input is organised as a matrix X of dimension B × n, enabling parallel processing of multiple samples via efficient matrix operations on the GPU.
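
As a rough illustration in NumPy (the dimensions and values below are arbitrary, chosen only for the example), a single sample x and a batch X could be represented like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# One sample as a column vector x in R^n (here n = 4 arbitrary features).
x = np.array([[0.2], [1.5], [-0.3], [0.7]])   # shape (n, 1) = (4, 1)

# A batch of B samples stacked as rows: X has shape (B, n).
B, n = 32, 4
X = rng.normal(size=(B, n))                   # shape (32, 4)

print(x.shape, X.shape)                       # (4, 1) (32, 4)
```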

The hidden layer: linear transformation followed by non-linearity

For a hidden layer with M neurons, the operation is:

z = Wx + b

h = f(z)

Where:

  • W is the weight matrix of dimension M × n. Row j of W contains the weights for neuron j.
  • b is the bias vector of dimension M, which allows shifting the activation independently of the input.
  • f is the activation function, applied element-wise.

The bias b is critical: without it, the pre-activation z is zero whenever the input is zero, regardless of the weights, and each neuron’s decision boundary is forced to pass through the origin, limiting the network’s expressive capacity.
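
A minimal sketch of this layer in NumPy, assuming ReLU as the activation f and toy dimensions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n, M = 4, 3                        # input dimension and number of hidden neurons
x = rng.normal(size=(n, 1))        # input as a column vector, shape (n, 1)
W = rng.normal(size=(M, n))        # weight matrix: row j holds the weights of neuron j
b = np.zeros((M, 1))               # bias vector, shape (M, 1)

z = W @ x + b                      # linear transformation, shape (M, 1)
h = np.maximum(0.0, z)             # element-wise ReLU: f(z) = max(0, z)

print(z.shape, h.shape)            # (3, 1) (3, 1)
```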

Activation functions: why they matter

Without a non-linear activation function, the composition of linear layers remains linear. A 100-layer network without activations is mathematically equivalent to a single linear layer. The most commonly used activation functions are:

  • Sigmoid: f(z) = 1 / (1 + e⁻ᶻ) — output in (0,1), useful in the output layer for binary classification, but prone to vanishing gradients in deep networks.
  • ReLU (Rectified Linear Unit): f(z) = max(0, z) — computationally efficient and mitigates the vanishing gradient problem; the default function in hidden layers of deep networks.
  • Tanh: f(z) = (eᶻ – e⁻ᶻ) / (eᶻ + e⁻ᶻ) — zero-centred output, better behaviour than sigmoid in hidden layers.
  • Softmax: used in the output layer for multi-class classification — converts a vector of arbitrary values into a probability distribution. See Softmax function.
  • Linear function: f(z) = z — no transformation, used in the output layer for regression. See linear activation function.

[Figure: logistic (sigmoid) curve mapping real-valued inputs to values between 0 and 1]
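
The functions in the list above are straightforward to write down; the following sketch (plain NumPy, with an arbitrary test vector) shows one common way to implement them, including the usual max-subtraction trick for a numerically stable softmax:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # output in (0, 1)

def relu(z):
    return np.maximum(0.0, z)              # max(0, z), element-wise

def tanh(z):
    return np.tanh(z)                      # zero-centred output in (-1, 1)

def softmax(z):
    e = np.exp(z - np.max(z))              # subtract the max for numerical stability
    return e / e.sum()                     # a probability distribution that sums to 1

z = np.array([-2.0, 0.0, 3.0])
for f in (sigmoid, relu, tanh, softmax):
    print(f.__name__, f(z))
```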

The backpropagation algorithm

Training consists of adjusting W and b to minimise a loss function L (e.g., cross-entropy for classification, MSE for regression). The backpropagation algorithm computes the gradient of L with respect to each parameter using the chain rule:

$$\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \frac{\partial L}{\partial \mathbf{h}^{(l)}} \cdot \frac{\partial \mathbf{h}^{(l)}}{\partial \mathbf{z}^{(l)}} \cdot \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{W}^{(l)}}$$

The process, step by step:

  1. Forward pass: compute the network’s prediction.
  2. Loss calculation: compare the prediction with the true label.
  3. Backward pass: compute gradients layer by layer from output to input.
  4. Parameter update: with gradient descent, W ← W – η · ∂L/∂W, where η is the learning rate.
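
To make the four steps concrete, here is a minimal self-contained NumPy sketch of one training step for a toy network with a single ReLU hidden layer, a linear output and an MSE loss; the data, dimensions and learning rate are arbitrary illustrative choices, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, eta = 4, 3, 0.1                      # input size, hidden neurons, learning rate

# Toy data: a single sample x and a scalar regression target y (arbitrary values).
x = rng.normal(size=(n, 1))
y = np.array([[1.0]])

# Parameters: one ReLU hidden layer, one linear output neuron.
W1, b1 = rng.normal(size=(M, n)), np.zeros((M, 1))
W2, b2 = rng.normal(size=(1, M)), np.zeros((1, 1))

# 1. Forward pass: compute the prediction.
z1 = W1 @ x + b1                           # (M, 1)
h1 = np.maximum(0.0, z1)                   # ReLU
y_hat = W2 @ h1 + b2                       # (1, 1), linear output

# 2. Loss: squared error for this single sample.
loss = 0.5 * ((y_hat - y) ** 2).item()

# 3. Backward pass: chain rule, from the output back to the input.
d_yhat = y_hat - y                         # dL/dy_hat
dW2, db2 = d_yhat @ h1.T, d_yhat           # gradients of the output layer
d_h1 = W2.T @ d_yhat                       # dL/dh1
d_z1 = d_h1 * (z1 > 0)                     # ReLU derivative: 1 where z1 > 0, else 0
dW1, db1 = d_z1 @ x.T, d_z1                # gradients of the hidden layer

# 4. Parameter update: W <- W - eta * dL/dW (one gradient-descent step).
W1 -= eta * dW1; b1 -= eta * db1
W2 -= eta * dW2; b2 -= eta * db2

print(round(loss, 4))
```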

The vanishing gradient problem occurs when gradients become exponentially small as they propagate backwards through many layers — mainly with sigmoid and tanh. ReLU and its variants (Leaky ReLU, ELU) mitigate this problem. The opposite problem — exploding gradients — is addressed with gradient clipping.
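
As an illustration of the clipping idea, the sketch below rescales a set of gradients so that their global L2 norm never exceeds a chosen threshold; the function name, threshold and example values are hypothetical, chosen only for the demonstration:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

# Example: the combined norm of these gradients is sqrt(3**2 + 4**2 + 12**2) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped = clip_by_global_norm(grads, max_norm=5.0)
print([g.tolist() for g in clipped])       # every gradient rescaled by 5/13
```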

For more architectural context, see neural networks and deep learning. For multi-class classification from the output layer perspective, see Softmax function. Practical use of these models in rapid benchmarks is covered in LazyPredict in Python.

Conclusion

The mathematical formulation of a neural network is elegant in its structure: linear algebra in each layer, non-linearity in each activation, and iterative optimisation in training. Understanding these fundamentals is not optional for anyone who wants to go beyond using libraries as black boxes. Diagnosing a model that does not converge, choosing the correct activation function, or designing the right architecture all depend directly on understanding the mathematics underneath.

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.