Just two years ago the idea of running a useful language model on an industrial device was more aspiration than reality. Open large models demanded servers with tens of gigabytes of GPU memory, and small ones were curiosities too limited for practical tasks. In 2025 the equation has changed. Phi-3.5, Gemma 2, Llama 3.2, and the smaller Qwen 2.5 variants have shown that a well-trained model of 2 to 8 billion parameters can solve concrete tasks with production-quality output. That opens new space at the industrial edge, where latency, data sovereignty, and connectivity costs make local inference pay off. What follows is a review of where these models fit on the factory floor, where they don't, and how to integrate them without adding unnecessary complexity.
What has changed for small models
Until 2023 the term SLM (Small Language Model) was almost derogatory. Models of 1 to 3 billion parameters were toys compared to GPT-3.5 or Llama 1. Two things have changed that perception. The first is training corpus quality: the small models of 2024 and 2025 are trained on carefully filtered data, synthetic reasoning data, and high-quality post-training refinement. A 3.8 billion parameter model like Phi-3.5-mini performs comparably on reasoning tasks to the GPT-3.5 of two years ago.
The second is the maturity of the execution environment. Tools like llama.cpp, Ollama, and vLLM have polished quantization, efficient weight loading, and batching. A model that previously required a 24 GB GPU now runs on a decent CPU with 16 GB RAM with acceptable latencies. For non-interactive tasks, this is sufficient. For human-pace interactive tasks, a modest GPU or an integrated NPU is enough.
The fundamental point is that a small model used well on a bounded task outperforms a large model directed poorly at the same task. If what you want is to classify text, extract fields, generate short summaries, or answer questions about specific documents, a well-tuned small model with good prompting performs at the level the use case demands. Bringing a large model to the edge to do the same is, in most cases, waste.
Why the industrial edge suits them
Plants, warehouses, and points of sale are environments with three restrictions that the edge serves better than the cloud. The first is connectivity: there isn’t always a stable internet link, and when there is it’s expensive or slow. A local model removes connectivity dependency for tasks it can solve, leaving only rare cases for the cloud.
The second is latency. For an industrial process depending on a reading to act, a two-second response because the packet had to cross the ocean is a problem. A local model responds in milliseconds. This matters especially for tasks embedded in control loops where inference is part of the production flow.
The third is data sovereignty. Many plants have sensitive information: diagrams, recipes, customer data, production figures. Sending this information to an external provider even for a trivial task has compliance implications. A local model keeps data within the perimeter and simplifies NIS2, GDPR, or sector-specific compliance.
These three restrictions combine so that the industrial edge is probably the best natural fit for SLMs. In a typical office environment, where connectivity is good and sovereignty doesn’t press as hard, the balance tilts more easily toward the cloud.
Tasks that already work well
There’s a set of tasks where SLMs at the edge already work with production quality. The first is structured extraction from free text (invoices, delivery notes, incident reports, OCR from labels): a small model with a well-designed prompt extracts fields with accuracy above 95% on most industrial documents. The second is text classification and routing: an operator writes a report and the model decides whether it’s a critical incident, which team it belongs to, or whether it’s a duplicate. Here edge latency and data sovereignty both work in the edge’s favor.
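The extraction pattern can be sketched in a few lines of Python. Everything here is illustrative: the prompt wording, the field names (`batch_number`, `supplier`, `quantity`), and the tolerant JSON parsing are assumptions, not a real production system.

```python
import json

# Hypothetical prompt template for field extraction from a delivery note.
# Field names are illustrative, not from any real system.
EXTRACTION_PROMPT = """Extract the following fields from the delivery note below
and answer ONLY with a JSON object with keys "batch_number", "supplier" and
"quantity". Use null for any field that is missing.

Document:
{document}"""

def build_prompt(document: str) -> str:
    """Fill the extraction template with the raw document text."""
    return EXTRACTION_PROMPT.format(document=document)

def parse_extraction(raw_reply: str) -> dict:
    """Parse the model reply, tolerating stray text around the JSON object."""
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in model reply")
    fields = json.loads(raw_reply[start : end + 1])
    missing = {"batch_number", "supplier", "quantity"} - fields.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return fields
```

The parser deliberately tolerates the chatty preamble small models sometimes wrap around the JSON object, which in practice is a common failure mode.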
The third is generation of short summaries: a supervisor receives dozens of shift readings and a small model synthesizes them into a daily summary with important findings. The fourth is bounded conversational assistance like an internal chatbot about procedures or manuals, with RAG over a limited document corpus. Quality isn’t GPT-4 level, but for standard operational queries it’s more than enough.
Where SLMs still fall short
SLMs fall short on multi-step reasoning over complex problems: long chains of calculation, debugging complicated code, analyzing legal text with nuance. Here the difference between a 4 billion and a 70 billion parameter model is large and qualitative. They also struggle in long conversations with lots of context, because their windows are smaller and coherence degrades as the history grows.
They also fall short on any task requiring broad, up-to-date knowledge (answering questions about current events, translating cultural slang, interpreting recent references) and on open-ended creative generation: for marketing material, literary text, or original code from open specifications, large models still produce clearly better results.
How they’re typically deployed
The edge deployment pattern seen working in 2025 has three layers. At the hardware layer, a machine with a powerful CPU or a modest GPU (an RTX 4060 or equivalent) and 32 to 64 GB of RAM: between 1,500 and 3,000 euros per edge node. For interactive tasks without heavy concurrency, a recent integrated GPU from AMD or Intel can be enough.
At the execution layer, Ollama or a server based on vLLM. Ollama is convenient for getting started because it handles downloads and quantization and serves an OpenAI-compatible API. vLLM scales better under high concurrency: below 10 requests per second Ollama is enough; above that, something with aggressive batching is a better fit. At the application layer, the model is consumed as a local API with REST calls, exactly as against the cloud; the only difference is the URL, which simplifies migration between the two.
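A minimal Python sketch of that application layer, using only the standard library. The edge URL points at Ollama's OpenAI-compatible endpoint; the cloud URL is a placeholder, and the function names are invented for illustration.

```python
import json
import urllib.request

# Only the base URL changes between edge and cloud. The cloud endpoint
# below is a placeholder, not a real provider URL.
EDGE_URL = "http://localhost:11434/v1/chat/completions"    # Ollama, OpenAI-compatible
CLOUD_URL = "https://api.example.com/v1/chat/completions"  # placeholder

def build_request(prompt: str, model: str, use_edge: bool = True):
    """Build an OpenAI-style chat request; edge vs cloud differs only in URL."""
    url = EDGE_URL if use_edge else CLOUD_URL
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # deterministic output for extraction-style tasks
    }
    return url, json.dumps(payload).encode("utf-8")

def call_model(prompt: str, model: str, use_edge: bool = True) -> str:
    """Send the request and return the assistant's reply text."""
    url, body = build_request(prompt, model, use_edge)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
    return reply["choices"][0]["message"]["content"]
```

Because edge and cloud differ only in the base URL, switching between them is a configuration change rather than a code change.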
From the command line, the typical flow is to download the quantized version with

    ollama pull phi3.5:3.8b-mini-instruct-q4_K_M

and then invoke it with

    curl http://localhost:11434/api/generate -d '{"model": "phi3.5:3.8b-mini-instruct-q4_K_M", "prompt": "Extract batch number"}'
Model selection in 2025
Phi-3.5-mini performs well on reasoning and structured tasks, and its reduced size makes it ideal for CPUs without GPU. Gemma 2 (in its 2 and 9 billion variants) has good general quality with stable instruction format. Llama 3.2 excels at multilingual tasks and has very good support in Ollama. Qwen 2.5 shines in translation nuance and broad knowledge.
For native Spanish, Llama 3.2 and Gemma 2 usually beat Phi-3.5, which is optimized for English. The difference between models in the same size band is small; investment in prompt engineering and evaluation is what truly moves quality. It’s worth building an in-house evaluation set of fifty to a hundred real examples and running it against each candidate: that set is worth more than any public benchmark because it captures the peculiarities of your use case.
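Such an evaluation loop can be sketched as follows. The `evaluate` function and exact-match scoring are illustrative assumptions; a real harness would likely use per-task matching rules rather than strict string equality.

```python
# Minimal sketch of an in-house evaluation loop. `candidates` maps a model
# name to any callable that takes a prompt and returns text (for example a
# wrapper around a local Ollama call). Names and data are illustrative.
def evaluate(candidates, eval_set):
    """Return the exact-match accuracy of each candidate on the eval set."""
    scores = {}
    for name, ask in candidates.items():
        hits = sum(
            1 for example in eval_set
            if ask(example["prompt"]).strip() == example["expected"]
        )
        scores[name] = hits / len(eval_set)
    return scores
```

Representing each candidate as a plain callable keeps the harness independent of the runtime, so the same fifty or hundred examples can be run against Ollama, vLLM, or a cloud API without changes.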
When it pays off
My practical criterion in 2025 is clear: if the task is bounded, high-volume, and latency matters, the edge with SLM is the right option. If the task requires complex reasoning, broad world knowledge, or the volume is low, cloud with a large model is still better. The key is not to think large versus small as a dichotomy, but as different tools with different cases.
The architecture I see working most is hybrid: an SLM at the edge for 90% of routine requests, and a fallback to a large cloud model for the hard cases the local model flags as uncertain. This pattern combines the best of both worlds: low cost, low latency, and sovereignty for most, with an ace up the sleeve for rare cases.
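The hybrid pattern can be sketched as a small router. The confidence signal is an assumption: here the edge model is any callable returning an answer plus a self-reported confidence, however that is obtained in practice (log-probabilities, a verifier prompt, or a separate classifier).

```python
# Sketch of the hybrid edge/cloud pattern: the edge model answers with a
# confidence score, and uncertain cases escalate to a cloud model. Both
# model functions are stand-ins for real API calls.
def route(prompt, edge_model, cloud_model, threshold=0.8):
    """Answer with the edge model unless it reports low confidence."""
    answer, confidence = edge_model(prompt)
    if confidence >= threshold:
        return answer, "edge"
    return cloud_model(prompt), "cloud"
```

The threshold is the knob that trades cloud cost against quality: raising it sends more traffic to the large model, lowering it keeps more on the edge.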
The cultural change that’s missing is to stop seeing the large model as the default answer. Many teams start with GPT-4 or Claude in the cloud for anything, simply because it’s what they know. For a large fraction of industrial cases, a small local model is better: faster, cheaper, more private, and with fewer dependencies. Learning to distinguish which cases fall on each side is a competence worth developing, and one that will weigh more in coming years as SLMs keep improving.