
ONNX Runtime at the Edge: Portable, Fast Inference


Updated: 2026-05-03

Deploying a machine-learning model outside the notebook where it was trained is usually where the fantasy breaks: you train in PyTorch on a cloud GPU and suddenly need to serve inference on a Linux server, inside an iOS app, on an ARM industrial gateway, and in a customer’s browser tab. Each destination brings its own runtime. ONNX Runtime[1] is what most teams arrive at when that pain becomes chronic.

Key takeaways

  • ONNX Runtime turns the ONNX format into a usable tool: export once from PyTorch or TensorFlow and run almost anywhere with the same artifact.
  • Execution Providers (EPs) separate the graph from hardware; the same code runs on a datacenter GPU in development and an edge CPU in production.
  • Dynamic quantization is the cheap entry: one line, no calibration, 4× size reduction for typical CNNs.
  • The real state of NPU EPs in Q1 2024 lags behind the commercial narrative.
  • If deployment is exclusively large NVIDIA GPU, TensorRT or vLLM will probably win.

What ONNX Actually Solves

The problem is not purely technical; it's organisational. Without a bridge format, each new target adds weeks of re-engineering: convert the graph, validate that outputs match within tolerance, rediscover which operators aren't supported.

ONNX cuts that knot by proposing an open intermediate format — a computational graph with standard operators versioned by opset. ONNX Runtime is the reference implementation: a single .onnx artifact that works for server, mobile, browser and edge without duplicating tooling.

Export: The Step That Looks Easy

Exporting from PyTorch is a single call to torch.onnx.export; the critical element is declaring which axes are dynamic:

```python
import torch

# model: the trained torch.nn.Module, switched to inference mode
# so dropout and batch norm behave deterministically during export
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    # Mark the batch axis as dynamic so the artifact accepts any batch size
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
    opset_version=17,
)
```

Without dynamic_axes, the model is frozen to batch size 1 and breaks in production the first time a different batch size arrives. The non-negotiable step after export is validation: feed the same tensor to both the PyTorch model and the ONNX Runtime session and compare the outputs at a conservative tolerance.
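
A minimal version of that check, reusing the model and model.onnx from the export above; the batch size and tolerances here are illustrative assumptions, not official thresholds:

```python
import numpy as np
import onnxruntime as ort
import torch

# A batch of 4 deliberately exercises the dynamic batch axis.
x = torch.randn(4, 3, 224, 224)

with torch.no_grad():
    expected = model(x).numpy()

# The CPU provider keeps the comparison deterministic across machines.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
(actual,) = session.run(None, {"input": x.numpy()})

np.testing.assert_allclose(actual, expected, rtol=1e-3, atol=1e-5)
```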

Execution Providers: Where the Performance Lives

Configuration is a prioritised list: the runtime tries the first EP, falls back to the next if the hardware isn't available, and lands on CPU as the safety net. The same Python code runs inference on the datacenter GPU in development and on the edge CPU in production without touching a line.
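
A minimal sketch of that list, assuming onnxruntime-gpu is installed on the development box; the provider names are the real ORT identifiers, though the exact behaviour when a requested provider is missing varies slightly across versions:

```python
import onnxruntime as ort

# The runtime walks this list in order; the CPU provider is always
# compiled in, so it closes the chain as the safety net.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print("Active providers:", session.get_providers())
```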

Quantization and Graph Optimisation

At load time, ONNX Runtime applies automatic passes — operator fusion, constant folding, dead-node elimination. The big jump comes from quantization:

  • Dynamic: one line, no calibration, acceptable for most CNNs; roughly 4× size reduction from float32 to int8 weights (sketched below).
  • Static: requires a representative calibration dataset; needed to realise the full 2-4× latency gain.
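
A minimal sketch of both mechanisms, assuming the model.onnx from the export step; the output filename and the QInt8 weight type are illustrative choices rather than ONNX Runtime defaults:

```python
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization: weights are stored as int8 and dequantised on
# the fly. No calibration data required, hence the "one line" entry cost.
quantize_dynamic("model.onnx", "model.int8.onnx", weight_type=QuantType.QInt8)

# The automatic graph passes (fusion, constant folding, dead-node
# elimination) are controlled per session; ORT_ENABLE_ALL is the most
# aggressive level, and setting it explicitly keeps the choice visible.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession(
    "model.int8.onnx", opts, providers=["CPUExecutionProvider"]
)
```

Re-run the output comparison after quantizing: int8 weights shift results, so the same PyTorch-versus-ONNX check applies, typically at a looser tolerance.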

Browser, Mobile, and the Real Edge Case

onnxruntime-web runs ONNX models in the browser using WebGPU when available and WebAssembly as fallback. On embedded edge — Jetson, Raspberry Pi, ARM industrial gateways — the argument is portability: iterate on workstation with CUDA, validate on laptop CPU, deploy on Jetson Orin with the TensorRT EP without rewriting anything.
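
The same pattern written defensively for that heterogeneous fleet; make_session is a hypothetical helper, and filtering against ort.get_available_providers() keeps one script valid on the workstation, the laptop, and the Jetson:

```python
import onnxruntime as ort

# Preference order: TensorRT on Jetson-class devices, CUDA on the
# workstation, CPU everywhere else. Only names the local build actually
# offers survive the filter.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

def make_session(path: str) -> ort.InferenceSession:
    available = ort.get_available_providers()
    providers = [p for p in PREFERRED if p in available]
    return ort.InferenceSession(path, providers=providers)

session = make_session("model.onnx")
print("Running with:", session.get_providers()[0])
```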

Conclusion

ONNX Runtime isn’t the fastest engine on any specific platform and almost no single benchmark will crown it champion. It doesn’t pretend to. Its proposition is to absorb the complexity of heterogeneity: one artifact, one API, many targets, and enough performance margin on each to make portability pay. The honest caveat in Q1 2024 is the state of NPUs: commercial narrative runs ahead of operational reality. For everything else — CPU, consumer GPU, browser, classic mobile — it’s already the sensible choice.

References

  1. ONNX Runtime: https://onnxruntime.ai

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.