ONNX Runtime is the Microsoft-driven, multiplatform inference runtime that turns ONNX (Open Neural Network Exchange) from a spec into a practical tool. A PyTorch or TensorFlow model, once exported to ONNX, runs nearly identically on Linux servers, Android phones, iPhones, in the browser, and on edge devices. This article covers using it well, when it's the right choice, and where it falls short.
Why ONNX Matters
The problem it solves is runtime fragmentation: train in PyTorch, serve with TensorFlow Serving; on mobile, Core ML for iOS and TensorFlow Lite for Android; in the browser, TensorFlow.js. Each step is a different conversion with its own set of bugs.
ONNX offers an open intermediate format. ONNX Runtime executes it. One model → many platforms.
Export from PyTorch
import torch
import torch.onnx

model = MyModel()
model.eval()  # switch to inference mode (disables dropout, freezes batch-norm stats)

dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
    opset_version=17,
)
Key points:
- opset_version: the ONNX operator-set version. Newer opsets support more operators, but the target runtime must support the opset you export.
- dynamic_axes: marks axes as variable (typically the batch dimension). Without it, the export is static-shape.
- Verify the export with onnx.checker.check_model(model).
Basic Inference
import onnxruntime as ort
import numpy as np
session = ort.InferenceSession("model.onnx")
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": input_data})
Simple, no heavy dependencies. ONNX Runtime memory is ~15MB base + model weights.
Execution Providers: The Performance Secret
ONNX Runtime supports Execution Providers (EPs), pluggable backends that accelerate inference on specific hardware:
- CPU: default, optimised.
- CUDA: NVIDIA GPUs.
- TensorRT: even faster on NVIDIA, at the cost of an additional conversion step.
- OpenVINO: Intel CPUs, integrated GPUs, VPUs.
- CoreML: Apple Silicon.
- DirectML: Windows GPUs (including AMD, Intel).
- ROCm: AMD Linux GPUs.
- WebGPU / WebNN: browser.
- NNAPI: Android.
- QNN: Qualcomm (Snapdragon).
- MIGraphX: AMD data center.
Example choosing EP:
session = ort.InferenceSession(
    "model.onnx",
    providers=[
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
The runtime picks the first available provider in the list, so the model falls back safely when the preferred hardware is absent.
Cases Where ONNX Shines
- Mobile apps with ML models: same export runs on iOS and Android without extra work.
- Heterogeneous edge devices: Jetson, Raspberry Pi, industrial gateways.
- Browser inference via onnxruntime-web: models run in tab without server.
- Framework transition: export from PyTorch, serve from any stack.
- Compliance / control: enterprises wanting to avoid ML-cloud lock-in.
Cases Where ONNX Falls Short
- Large LLMs (>7B parameters): ONNX Runtime can run them, but vLLM or TensorRT-LLM are more efficient.
- Non-standard custom ops: if your model uses very custom PyTorch kernels, may not export.
- Training: ONNX Runtime Training exists but niche; PyTorch dominates.
- Very new models: cutting-edge operators may not be in ONNX opset.
Optimisations
ONNX Runtime includes automatic optimisations at load:
- Graph optimisation: op fusion, constant folding.
- EP-specific kernel fusion.
- Optional quantization — INT8, INT4 — with minimal quality degradation.
Quantization in one line:
from onnxruntime.quantization import quantize_dynamic
quantize_dynamic("model.onnx", "model_int8.onnx")
Typical results: ~4x smaller, 2-4x faster, under 1% quality loss, though this varies by model.
ONNX Runtime Web
An underrated feature: onnxruntime-web runs ONNX models directly in browser via WebGPU or WebAssembly.
import * as ort from 'onnxruntime-web';
const session = await ort.InferenceSession.create(
  './model.onnx',
  { executionProviders: ['webgpu', 'wasm'] }
);
const feeds = { input: new ort.Tensor('float32', data, [1, 3, 224, 224]) };
const results = await session.run(feeds);
Uses: image classification, object detection, Whisper for transcription — all client-side, no server.
Mobile: ONNX Runtime Mobile
Mobile-optimised version with small binaries:
- Android: .aar package integrable in Gradle projects.
- iOS: Swift/Objective-C framework.
- React Native: existing bindings.
- Flutter: community plugins.
For 20-100MB models in mobile apps, it’s the simplest option.
Alternatives to Consider
- PyTorch JIT / LibTorch: for stay-in-PyTorch deployment.
- TensorFlow Lite: for TF ecosystem, good on mobile.
- TensorRT (NVIDIA): performance ceiling on NVIDIA GPUs, but lock-in.
- CoreML (Apple): optimal on Apple Silicon only.
- OpenVINO (Intel): excellent on Intel hardware.
ONNX Runtime is the “universal” trade-off: lower top-end than specialised, but portable.
Typical Development Workflow
Pattern that works:
- Train in PyTorch/TF with GPU.
- Export to ONNX with torch.onnx.export.
- Validate that the exported model's outputs match the original.
- Optimise: onnxoptimizer + quantization if applicable.
- Benchmark on each target (server, mobile, browser).
- Deploy with ONNX Runtime on each platform.
Each additional target costs roughly a day of adaptation, versus weeks of rework with per-platform stacks.
Post-Export Validation
Critical: verify exported model produces same outputs as original:
import onnxruntime as ort
import numpy as np

# model and test_input are the original PyTorch module and a sample tensor
torch_out = model(test_input).detach().numpy()

session = ort.InferenceSession("model.onnx")
onnx_out = session.run(None, {"input": test_input.numpy()})[0]

assert np.allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)
Common divergences:
- Custom ops not fully supported.
- Precision differences (float32 vs float16).
- Batch-normalisation subtleties between training and inference mode.
Preconverted Models: ONNX Zoo
The ONNX Model Zoo has dozens of already-converted, verified models: YOLO variants, BERT, ResNet, SSD, MobileNet, and more. If one fits your case, it saves you the export step.
Performance Limitations
In benchmarks:
- CPU: ONNX Runtime generally matches or beats TensorFlow; PyTorch JIT wins in some cases.
- NVIDIA GPU: ONNX Runtime with CUDA reaches roughly 95% of TensorRT performance, without TensorRT's conversion complexity.
- Apple GPU: CoreML beats ONNX Runtime for Apple-optimised models.
- Mobile: competitive with TFLite in most workloads.
If you need absolute top performance on a specific platform, that platform’s native runtime wins. ONNX Runtime wins in portability and simplicity.
Production Operation
Checklist:
- Model versioning: hash + metadata in your registry.
- Monitoring: latency, throughput per session.
- Memory management: sessions may accumulate if not released.
- Warmup: first inference slower from kernel compilation.
- Fallback EP: always keep CPUExecutionProvider last in the provider list.
Conclusion
ONNX Runtime is a powerful tool for teams serving ML on multiple platforms or wanting to avoid runtime lock-in. Portability is its big advantage; its limitation is that it is not the absolute fastest on any single platform. For mobile apps, heterogeneous edge devices, browser inference, and framework transitions, it is almost always the sensible choice. For workloads confined to large NVIDIA servers, slightly faster alternatives exist. Its maturity and Microsoft backing ensure continuity.
Follow us on jacar.es for more on production ML, edge computing, and inference architectures.