ONNX Runtime is the Microsoft-driven, multiplatform inference runtime that turns ONNX (Open Neural Network Exchange) from a spec into a practical tool. A PyTorch or TensorFlow model, once exported to ONNX, runs nearly identically on Linux servers, Android phones, iPhones, in the browser, and on edge devices. This article covers using it well, when it's the right choice, and where it falls short.
Why ONNX Matters
The problem it solves is runtime fragmentation: train in PyTorch, serve with TensorFlow Serving; on mobile, Core ML for iOS and TensorFlow Lite for Android; in the browser, TensorFlow.js. Each step is a different conversion with its own set of bugs.
ONNX offers an open intermediate format. ONNX Runtime executes it. One model → many platforms.
Export from PyTorch
import torch
import torch.onnx

model = MyModel()
model.eval()  # switch to inference mode (disables dropout, freezes batch-norm stats)

dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
    opset_version=17,
)
Key points:
- opset_version: the ONNX operator-set version. Newer opsets support more operators, but the target runtime must support the opset you export.
- dynamic_axes: marks axes as variable (typically the batch dimension). Without it, the export is static-shape.
- Verify the export with onnx.checker.check_model(model).
Basic Inference
import onnxruntime as ort
import numpy as np
session = ort.InferenceSession("model.onnx")
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": input_data})
Simple, no heavy dependencies. ONNX Runtime memory is ~15MB base + model weights.
Execution Providers: The Performance Secret
ONNX Runtime supports Execution Providers (EPs), pluggable backends that accelerate inference on specific hardware:
- CPU: default, optimised.
- CUDA: NVIDIA GPUs.
- TensorRT: even faster on NVIDIA, at the cost of an additional conversion step.
- OpenVINO: Intel CPUs, integrated GPUs, VPUs.
- CoreML: Apple Silicon.
- DirectML: Windows GPUs (including AMD, Intel).
- ROCm: AMD Linux GPUs.
- WebGPU / WebNN: browser.
- NNAPI: Android.
- QNN: Qualcomm (Snapdragon).
- MIGraphX: AMD data center.
Example choosing EP:
session = ort.InferenceSession(
    "model.onnx",
    providers=[
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
The runtime picks the first available provider in the list, so the model falls back safely when the preferred hardware is absent.
Cases Where ONNX Shines
- Mobile apps with ML models: same export runs on iOS and Android without extra work.
- Heterogeneous edge devices: Jetson, Raspberry Pi, industrial gateways.
- Browser inference via onnxruntime-web: models run in tab without server.
- Framework transition: export from PyTorch, serve from any stack.
- Compliance / control: enterprises wanting to avoid ML-cloud lock-in.
Cases Where ONNX Falls Short
- Large LLMs (>7B parameters): ONNX Runtime can run them, but vLLM or TensorRT-LLM are more efficient.
- Non-standard custom ops: if your model uses very custom PyTorch kernels, may not export.
- Training: ONNX Runtime Training exists but niche; PyTorch dominates.
- Very new models: cutting-edge operators may not be in ONNX opset.
Optimisations
ONNX Runtime includes automatic optimisations at load:
- Graph optimisation: op fusion, constant folding.
- EP-specific kernel fusion.
- Optional quantization — INT8, INT4 — with minimal quality degradation.
Quantization in one line:
from onnxruntime.quantization import quantize_dynamic
quantize_dynamic("model.onnx", "model_int8.onnx")
Typical results: ~4x smaller, 2-4x faster, under 1% quality loss, though this varies by model.
ONNX Runtime Web
An underrated feature: onnxruntime-web runs ONNX models directly in browser via WebGPU or WebAssembly.
import * as ort from 'onnxruntime-web';
const session = await ort.InferenceSession.create(
  './model.onnx',
  { executionProviders: ['webgpu', 'wasm'] }
);
const feeds = { input: new ort.Tensor('float32', data, [1, 3, 224, 224]) };
const results = await session.run(feeds);
Uses: image classification, object detection, Whisper for transcription — all client-side, no server.
Mobile: ONNX Runtime Mobile
Mobile-optimised version with small binaries:
- Android: .aar package integrable in Gradle projects.
- iOS: Swift/Objective-C framework.
- React Native: existing bindings.
- Flutter: community plugins.
For 20-100MB models in mobile apps, it’s the simplest option.
Alternatives to Consider
- PyTorch JIT / LibTorch: for stay-in-PyTorch deployment.
- TensorFlow Lite: for TF ecosystem, good on mobile.
- TensorRT (NVIDIA): performance ceiling on NVIDIA GPUs, but lock-in.
- CoreML (Apple): optimal on Apple Silicon only.
- OpenVINO (Intel): excellent on Intel hardware.
ONNX Runtime is the “universal” trade-off: lower top-end than specialised, but portable.
Typical Development Workflow
Pattern that works:
- Train in PyTorch/TF with GPU.
- Export to ONNX with torch.onnx.export.
- Validate that the exported model's outputs match the original.
- Optimise: onnxoptimizer + quantization if applicable.
- Benchmark on each target (server, mobile, browser).
- Deploy with ONNX Runtime on each platform.
Each additional target costs roughly a day of adaptation, versus weeks of rework with per-platform stacks.
Post-Export Validation
Critical: verify exported model produces same outputs as original:
import onnxruntime as ort
import numpy as np

# model and test_input are the original PyTorch module and a sample tensor
torch_out = model(test_input).detach().numpy()

session = ort.InferenceSession("model.onnx")
onnx_out = session.run(None, {"input": test_input.numpy()})[0]

assert np.allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-5)
Common divergences:
- Custom ops not fully supported.
- Precision differences (float32 vs float16).
- Batch-normalisation subtleties between training and inference mode.
Preconverted Models: ONNX Zoo
The ONNX Model Zoo has dozens of already-converted, verified models: YOLO variants, BERT, ResNet, SSD, MobileNet, and more. If one fits your case, it saves you the export step.
Performance Limitations
In benchmarks:
- CPU: ONNX Runtime generally matches or beats TensorFlow; PyTorch JIT wins in some cases.
- NVIDIA GPU: ONNX Runtime with CUDA reaches roughly 95% of TensorRT performance, without TensorRT's conversion complexity.
- Apple GPU: CoreML beats ONNX Runtime for Apple-optimised models.
- Mobile: competitive with TFLite in most workloads.
If you need absolute top performance on a specific platform, that platform’s native runtime wins. ONNX Runtime wins in portability and simplicity.
Production Operation
Checklist:
- Model versioning: hash + metadata in your registry.
- Monitoring: latency, throughput per session.
- Memory management: sessions may accumulate if not released.
- Warmup: first inference slower from kernel compilation.
- Fallback EP: always keep CPUExecutionProvider last in the provider list.
Conclusion
ONNX Runtime is a powerful tool for teams serving ML on multiple platforms or wanting to avoid runtime lock-in. Portability is its big advantage; its limitation is that it is not the absolute fastest on any single platform. For mobile apps, heterogeneous edge devices, browser inference, and framework transitions, it is almost always the sensible choice. For workloads confined to large NVIDIA servers, slightly faster alternatives exist. Its maturity and Microsoft backing ensure continuity.
Follow us on jacar.es for more on production ML, edge computing, and inference architectures.