ONNX Runtime at the Edge: Portable, Fast Inference

Circuit board with an illuminated central chip, representing ML inference on hardware

ONNX Runtime is the Microsoft-driven, cross-platform inference runtime that turns ONNX (Open Neural Network Exchange) from a spec into a practical tool. Your PyTorch or TensorFlow model, exported to ONNX, runs nearly identically on a Linux server, an Android phone, an iPhone, in the browser, and on an edge device. This article covers how to use it well, when it’s the right choice, and where it falls short.

Why ONNX Matters

The problem it solves is runtime fragmentation. You train in PyTorch but serve with TensorFlow Serving; on mobile it’s Core ML for iOS and TensorFlow Lite for Android; in the browser, TensorFlow.js. Each step is a different conversion with its own set of bugs.

ONNX offers an open intermediate format. ONNX Runtime executes it. One model → many platforms.

Export from PyTorch

import torch
import torch.onnx

model = MyModel()
model.eval()  # switch to inference behaviour (dropout, batch norm) before export

dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    },
    opset_version=17
)

Key points:

  • opset_version: ONNX operator version. Newer = more ops supported, but runtime must match.
  • dynamic_axes: variable axes (batch size typically). Without this, export is static-shape.
  • Verify the exported file with onnx.checker.check_model(onnx.load("model.onnx")).

Basic Inference

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx")

input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": input_data})

Simple, with no heavy dependencies. ONNX Runtime’s memory footprint is ~15MB base plus model weights.
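
The random tensor above is a stand-in; with real data, the (1, 3, 224, 224) input is typically built by normalising an image and reordering its axes. A numpy-only sketch, assuming ImageNet mean/std (use whatever statistics your model was trained with):

```python
import numpy as np

# Hypothetical 224x224 RGB image with values in [0, 255] (HWC uint8 layout).
image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)

# Scale to [0, 1], normalise per channel with ImageNet statistics,
# then reorder HWC -> CHW and add the batch dimension -> NCHW float32.
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
normalised = (image.astype(np.float32) / 255.0 - mean) / std
input_data = np.transpose(normalised, (2, 0, 1))[np.newaxis, :]

# input_data is now ready for session.run(None, {"input": input_data})
```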

Execution Providers: The Performance Secret

ONNX Runtime supports Execution Providers (EPs) accelerating per hardware:

  • CPU: default, optimised.
  • CUDA: NVIDIA GPUs.
  • TensorRT: even better on NVIDIA, with additional conversion.
  • OpenVINO: Intel CPUs, integrated GPUs, VPUs.
  • CoreML: Apple Silicon.
  • DirectML: Windows GPUs (including AMD, Intel).
  • ROCm: AMD Linux GPUs.
  • WebGPU / WebNN: browser.
  • NNAPI: Android.
  • QNN: Qualcomm (Snapdragon).
  • MIGraphX: AMD data center.

Example choosing EP:

session = ort.InferenceSession(
    "model.onnx",
    providers=[
        "CUDAExecutionProvider",
        "CPUExecutionProvider"
    ]
)

The runtime picks the first available EP in the list, so it falls back safely when the hardware is absent.
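
Which EPs your build actually ships can be queried at runtime with ort.get_available_providers(). The fallback logic can be made explicit with a small helper (pick_providers is a hypothetical name, not part of the API):

```python
def pick_providers(preferred, available):
    """Keep preferred EPs that are actually available, in order,
    and always end with the CPU fallback."""
    chosen = [ep for ep in preferred if ep in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen

# With a CUDA build, ort.get_available_providers() would return something
# like the list below; here it is hardcoded for illustration.
available = ["CUDAExecutionProvider", "CPUExecutionProvider"]
providers = pick_providers(
    ["TensorrtExecutionProvider", "CUDAExecutionProvider"], available
)
# providers == ["CUDAExecutionProvider", "CPUExecutionProvider"]
```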

Cases Where ONNX Shines

  • Mobile apps with ML models: same export runs on iOS and Android without extra work.
  • Heterogeneous edge devices: Jetson, Raspberry Pi, industrial gateways.
  • Browser inference via onnxruntime-web: models run in tab without server.
  • Framework transition: export from PyTorch, serve from any stack.
  • Compliance / control: enterprises wanting to avoid ML-cloud lock-in.

Cases Where ONNX Falls Short

  • Large LLMs (>7B parameters): ONNX Runtime supports them, but vLLM or TensorRT-LLM are usually more efficient.
  • Non-standard custom ops: if your model uses very custom PyTorch kernels, it may not export.
  • Training: ONNX Runtime Training exists but is niche; PyTorch dominates.
  • Very new models: cutting-edge operators may not be in the ONNX opset yet.

Optimisations

ONNX Runtime includes automatic optimisations at load:

  • Graph optimisation: op fusion, constant folding.
  • EP-specific kernel fusion.
  • Optional quantization (INT8, INT4) with minimal quality degradation.

Quantization in one line:

from onnxruntime.quantization import quantize_dynamic

quantize_dynamic("model.onnx", "model_int8.onnx")

Typical results: ~4x smaller, 2-4x faster, under 1% quality loss.
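
The 2-4x figure depends on model and hardware, so measure it on your target. A generic timing helper that works on any zero-argument callable, e.g. a lambda wrapping session.run (the session names in the usage sketch are assumptions):

```python
import time

def mean_latency_ms(run_once, warmup=3, iters=20):
    """Average wall-clock latency of run_once() in milliseconds,
    discarding a few warmup calls (first runs include kernel setup)."""
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    return (time.perf_counter() - start) / iters * 1000.0

# Usage sketch comparing the FP32 and INT8 models:
# fp32_ms = mean_latency_ms(lambda: session_fp32.run(None, feeds))
# int8_ms = mean_latency_ms(lambda: session_int8.run(None, feeds))
```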

ONNX Runtime Web

An underrated feature: onnxruntime-web runs ONNX models directly in browser via WebGPU or WebAssembly.

import * as ort from 'onnxruntime-web';

const session = await ort.InferenceSession.create(
  './model.onnx',
  { executionProviders: ['webgpu', 'wasm'] }
);

const feeds = { input: new ort.Tensor('float32', data, [1, 3, 224, 224]) };
const results = await session.run(feeds);

Uses: image classification, object detection, Whisper for transcription — all client-side, no server.

Mobile: ONNX Runtime Mobile

Mobile-optimised version with small binaries:

  • Android: .aar integrable in Gradle projects.
  • iOS: Swift/Objective-C framework.
  • React Native: existing bindings.
  • Flutter: community plugins.

For 20-100MB models in mobile apps, it’s the simplest option.

Alternatives to Consider

  • PyTorch JIT / LibTorch: for stay-in-PyTorch deployment.
  • TensorFlow Lite: for TF ecosystem, good on mobile.
  • TensorRT (NVIDIA): performance ceiling on NVIDIA GPUs, but lock-in.
  • CoreML (Apple): optimal on Apple Silicon only.
  • OpenVINO (Intel): excellent on Intel hardware.

ONNX Runtime is the “universal” trade-off: a lower performance ceiling than the specialised runtimes, but portable.

Typical Development Workflow

Pattern that works:

  1. Train in PyTorch/TF with GPU.
  2. Export to ONNX with torch.onnx.export.
  3. Validate output matches original model.
  4. Optimise: onnxoptimizer + quantization if applicable.
  5. Benchmark on each target (server, mobile, browser).
  6. Deploy with ONNX Runtime on each platform.

Each new target takes roughly a day of adaptation, versus weeks of rework with other stacks.

Post-Export Validation

Critical: verify exported model produces same outputs as original:

import onnxruntime as ort
import numpy as np

torch_out = model(test_input).detach().numpy()

session = ort.InferenceSession("model.onnx")
onnx_out = session.run(None, {"input": test_input.numpy()})[0]

assert np.allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)

Common divergences:

  • Custom ops not fully supported.
  • Precision differences (float32 vs float16).
  • Batch normalisation with training vs inference-mode subtleties.
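
When the allclose assertion fails, the maximum absolute and relative gaps tell you whether you are looking at harmless float noise or a broken op. A small numpy sketch (report_divergence is a hypothetical helper):

```python
import numpy as np

def report_divergence(a, b, eps=1e-12):
    """Max absolute and max relative difference between two arrays."""
    abs_diff = np.abs(a - b)
    rel_diff = abs_diff / (np.abs(b) + eps)
    return float(abs_diff.max()), float(rel_diff.max())

a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.array([1.0, 2.0, 3.001], dtype=np.float32)
max_abs, max_rel = report_divergence(a, b)
# max_abs ≈ 1e-3, max_rel ≈ 3.3e-4 here
```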

Preconverted Models: ONNX Zoo

The ONNX Model Zoo has dozens of already-converted, verified models: YOLOv8, BERT, ResNet, SSD, MobileNet, and more. If your use case fits one of them, it saves you the export step.

Performance Limitations

In benchmarks:

  • CPU: ONNX Runtime ≥ TensorFlow, and < PyTorch JIT in some cases.
  • NVIDIA GPU: ONNX Runtime with CUDA reaches ~95% of TensorRT performance, without the complexity.
  • Apple GPU: CoreML > ONNX Runtime for Apple-optimised models.
  • Mobile: competitive with TFLite in most cases.

If you need absolute top performance on a specific platform, that platform’s native runtime wins. ONNX Runtime wins in portability and simplicity.

Production Operation

Checklist:

  • Model versioning: hash + metadata in your registry.
  • Monitoring: latency, throughput per session.
  • Memory management: sessions may accumulate if not released.
  • Warmup: first inference slower from kernel compilation.
  • Fallback EP: always CPUExecutionProvider at the end.
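
The warmup item can be a few throwaway inferences at startup; a sketch (the input name and shape are assumptions for a vision model like the examples above):

```python
import numpy as np

def warmup(session, input_name, shape, runs=3):
    """Run a few throwaway inferences so kernel compilation and
    lazy allocations happen before real traffic arrives."""
    dummy = np.zeros(shape, dtype=np.float32)
    for _ in range(runs):
        session.run(None, {input_name: dummy})

# Usage sketch: warmup(session, "input", (1, 3, 224, 224))
```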

Conclusion

ONNX Runtime is a powerful tool for teams serving ML on multiple platforms or wanting to avoid runtime lock-in. Portability is its big advantage; its limit is that it is not the absolute top performer on any single platform. For mobile apps, heterogeneous edge, browser inference, and framework transitions, it is almost always the sensible choice. For NVIDIA-only large-server workloads, slightly more optimal alternatives may exist. Its maturity and Microsoft backing ensure continuity.

Follow us on jacar.es for more on production ML, edge computing, and inference architectures.
