Pre-trained Models and Transfer Learning
Updated: 2026-05-03
Training a deep learning model from scratch for a new task is expensive in data, time, and money. Transfer learning solves this: it takes a model that has already learned useful representations in a large domain and adapts it to a new task with much less effort. This is why today a team with a modest budget can build an accurate text classifier or a functional object detector without its own GPU farm.
Key takeaways
- A pre-trained model is a neural network trained on a large dataset (such as ImageNet or a massive text corpus) that has already learned general representations of its domain.
- Transfer learning takes those representations and adapts them to a new task through fine-tuning, feature extraction, or prompt engineering.
- The most relevant pre-trained models in vision are ResNet, EfficientNet, and the Vision Transformer (ViT) family; in language, BERT, GPT, T5, and their derivatives.
- Transfer works best when the source and target domains have similarities; the more distant they are, the more proprietary data is needed.
- Fine-tuning has risks: overfitting to a small dataset can degrade performance on out-of-distribution data.
Why transfer learning changes the rules
Training a model like BERT or GPT-3 from scratch required anywhere from thousands of accelerator hours (BERT-scale) to orders of magnitude more (GPT-3-scale), along with text corpora of tens to hundreds of gigabytes. Those resources are out of reach for most teams. Transfer learning changes the equation:
- A resource-rich organisation trains the base model on massive data.
- The base model learns general representations: edges and textures in images, semantic relationships between words, grammatical structures.
- A resource-limited team takes that model and adapts it to their specific task with a few thousand examples and a few hours of compute.
The result typically surpasses a model trained from scratch on the same proprietary data, because the base model has already captured the general distribution of the domain and only needs to learn the particularities of the new problem.
How it works: the three main approaches
Feature extraction: the internal layers of the pre-trained model are frozen (their weights are not modified) and used as feature extractors. Only the final layers (the classifier “head”) are trained on the new data. This is the fastest approach and requires the least proprietary data, but it is also the least flexible.
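As a minimal sketch of feature extraction, assuming PyTorch: the small `backbone` below is a hypothetical stand-in for a real pre-trained network (it is randomly initialised here so the example runs without downloading weights), and the data is synthetic. Only the mechanics matter: the backbone is frozen and the optimiser receives the head's parameters alone.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained backbone. In practice this would be
# loaded with pre-trained weights instead of being randomly initialised.
backbone = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 8),
    nn.ReLU(),
)

# Freeze the backbone: no gradients are computed for it, so it is never updated.
for param in backbone.parameters():
    param.requires_grad_(False)
backbone.eval()

# New classifier head for an assumed 3-class target task.
head = nn.Linear(8, 3)

# Only the head's parameters are handed to the optimiser.
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Synthetic data standing in for the small task-specific dataset.
inputs = torch.randn(64, 16)
labels = torch.randint(0, 3, (64,))

# Snapshots used afterwards to check which weights actually moved.
backbone_before = [p.detach().clone() for p in backbone.parameters()]
head_before = [p.detach().clone() for p in head.parameters()]

for _ in range(20):
    with torch.no_grad():
        features = backbone(inputs)  # frozen feature extractor
    loss = loss_fn(head(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

backbone_unchanged = all(
    torch.equal(old, new) for old, new in zip(backbone_before, backbone.parameters())
)
head_changed = any(
    not torch.equal(old, new) for old, new in zip(head_before, head.parameters())
)
print(f"backbone unchanged: {backbone_unchanged}, head changed: {head_changed}")
```

Because the features never change during training, they could also be computed once and cached, which is part of why this approach is so cheap.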
Fine-tuning: some or all of the pre-trained model’s layers are unfrozen and re-trained on the new task’s data with a very low learning rate. The existing weights are adjusted only slightly, which preserves general knowledge while adapting to the particularities of the domain. This is the most common approach in production for vision and NLP.
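A rough sketch of fine-tuning with a low learning rate, again assuming PyTorch and a hypothetical, randomly initialised stand-in for the pre-trained backbone. Giving the backbone a smaller learning rate than the new head is one common variant rather than something the approach requires, and the values `1e-5` and `1e-3` are illustrative assumptions only.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained backbone plus a new task head.
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8), nn.ReLU())
head = nn.Linear(8, 3)
model = nn.Sequential(backbone, head)

# Per-parameter-group learning rates: tiny steps for the pre-trained weights,
# larger steps for the freshly initialised head. Weight decay acts as the
# regulariser here.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ],
    weight_decay=0.01,
)
loss_fn = nn.CrossEntropyLoss()

# Synthetic data standing in for the task-specific dataset.
inputs = torch.randn(64, 16)
labels = torch.randint(0, 3, (64,))

backbone_before = [p.detach().clone() for p in backbone.parameters()]
head_before = [p.detach().clone() for p in head.parameters()]

model.train()
for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()


def max_shift(before, module):
    """Largest absolute change of any single weight in the module."""
    with torch.no_grad():
        return max(
            (old - new).abs().max().item()
            for old, new in zip(before, module.parameters())
        )


backbone_shift = max_shift(backbone_before, backbone)
head_shift = max_shift(head_before, head)
print(f"backbone max shift: {backbone_shift:.2e}, head max shift: {head_shift:.2e}")
```

Every layer is updated, but the pre-trained weights move far less than the head, which is the "slight adjustment" described above.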
Prompt engineering and in-context learning: with large language models (LLMs) like GPT-4 or LLaMA, additional training is sometimes unnecessary. The model is conditioned through natural-language instructions (prompts) and examples placed in the query context. This approach underlies tools like Microsoft 365 Copilot.
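To illustrate in-context learning, here is a small sketch that assembles a few-shot prompt as plain text. The function name `build_few_shot_prompt`, the "Review / Sentiment" layout, and the example reviews are all hypothetical; no particular model or API is assumed, and the resulting string would be sent to whichever LLM is in use.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Build a prompt from an instruction, labelled examples, and a new query.

    examples is a list of (text, label) pairs shown to the model as context.
    """
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final item is left unlabelled so the model completes it.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    instruction="Classify the sentiment of each review as positive or negative.",
    examples=[
        ("Arrived quickly and works perfectly.", "positive"),
        ("Stopped charging within a week.", "negative"),
    ],
    query="The battery died after two days.",
)
print(prompt)
```

No weights change in this approach: the "adaptation" lives entirely in the text passed to the model.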
Reference pre-trained models
In computer vision:
- ResNet (He et al., 2015): the residual architecture that made training very deep networks possible. Available in variants from 18 to 152 layers. Standard starting point for image classification and object detection.
- EfficientNet: scales network width, depth, and input resolution together in a balanced way. Achieves high accuracy for its parameter count.
- CLIP (OpenAI): trained with image-text pairs, enables zero-shot classification and multimodal search without fine-tuning.
In natural language processing:
- BERT (Google, 2018): bidirectional model pre-trained with masked language modelling. Standard for text classification, named entity recognition, and question answering.
- GPT and variants (OpenAI): autoregressive models optimised for text generation. GPT-4 is among the most capable models in the family.
- T5 (Google): encoder-decoder that converts any NLP task into a text-to-text task. Flexible and powerful for translation, summarisation, and QA.
When transfer works and when it doesn’t
Transfer learning is not always the best option. Conditions that favour its use:
- Limited proprietary data: if you have fewer than 10,000 labelled examples, starting from a pre-trained model is almost always better than training from scratch.
- Domain similar to pre-training: a medical X-ray classifier benefits from ResNet pre-trained on ImageNet because low-level image structure (edges, textures) is similar, even if the semantic domain differs.
- Time and compute constraints: fine-tuning a pre-trained model can be achieved in hours rather than weeks.
Conditions that reduce effectiveness:
- Very distant domain: if the data type is radically different from pre-training data (for example, radar signals vs. natural images), transfer may contribute little or nothing.
- Base model has incompatible biases: a language model trained mainly on English text may transfer poorly to morphologically complex languages when no base model pre-trained on the target language exists.
- Abundant, specific proprietary data: if you have millions of examples from the target domain, training from scratch may outperform fine-tuning.
Risks and best practices
Catastrophic forgetting: when a pre-trained model is fine-tuned too aggressively on a small dataset, it can “forget” general knowledge and overfit to the proprietary dataset. Mitigations include low learning rates, L2 regularisation, and staged fine-tuning (gradually unfreezing layers, starting with those closest to the output and moving towards the input).
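Staged fine-tuning can be planned as a simple schedule. The sketch below assumes one common form of gradual unfreezing, in which the layers nearest the output are made trainable first and earlier layers join in later stages. The helper `unfreezing_schedule` and the layer names are hypothetical, chosen only for illustration.

```python
def unfreezing_schedule(layer_names, layers_per_stage=1):
    """Return, for each training stage, the list of trainable layers.

    layer_names must be ordered from the input side to the output side.
    Each stage unfreezes layers_per_stage more layers, working backwards
    from the output, until the whole model is trainable.
    """
    if layers_per_stage < 1:
        raise ValueError("layers_per_stage must be at least 1")
    if not layer_names:
        return []

    total = len(layer_names)
    stages = []
    n_trainable = 0
    while n_trainable < total:
        n_trainable = min(n_trainable + layers_per_stage, total)
        # The last n_trainable layers (closest to the output) are trainable.
        stages.append(layer_names[total - n_trainable:])
    return stages


layers = ["embeddings", "block1", "block2", "block3", "head"]
stages = unfreezing_schedule(layers, layers_per_stage=2)
for number, trainable in enumerate(stages, start=1):
    print(f"stage {number}: train {trainable}")
```

With these inputs the head and `block3` train first, then `block1` and `block2` are added, and only the final stage touches the embeddings, so the most general representations are the last to be disturbed.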
Data leakage in evaluation: if evaluation data overlaps with pre-training data (for example, the test set was part of the base model’s training data), evaluation metrics will be optimistically biased. Truly independent evaluation sets are essential.
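A basic overlap check can catch the most obvious leakage. The sketch below assumes you have access to at least a sample of the text the model was trained on, which is often not the case for third-party base models. It only detects exact duplicates after whitespace and case normalisation, not paraphrases. The names `fingerprint` and `split_leaked` are hypothetical.

```python
import hashlib


def fingerprint(text):
    """Hash a text after lowercasing and collapsing whitespace."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def split_leaked(reference_texts, eval_texts):
    """Separate evaluation texts into (clean, leaked) lists.

    An evaluation text counts as leaked if its fingerprint also appears
    among the reference (training or pre-training) texts.
    """
    seen = {fingerprint(text) for text in reference_texts}
    clean, leaked = [], []
    for text in eval_texts:
        if fingerprint(text) in seen:
            leaked.append(text)
        else:
            clean.append(text)
    return clean, leaked


reference = ["The cat sat on the mat.", "Transfer learning saves compute."]
evaluation = ["the cat  sat on the mat.", "A completely new sentence."]

clean, leaked = split_leaked(reference, evaluation)
print(f"clean: {clean}")
print(f"leaked: {leaked}")
```

Metrics would then be reported on the `clean` subset only. Near-duplicate detection would need fuzzier matching than this exact-hash approach.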
Interpretability: pre-trained models add a layer of complexity to interpretability. XAI techniques, such as those described in “Explaining AI through XAI”, are especially relevant when deploying large-scale models in high-impact decisions.
Conclusion
Transfer learning democratises access to deep learning: teams without access to massive data or compute infrastructure can build high-quality models from pre-trained bases. BERT for NLP, ResNet for vision, and CLIP for multimodal tasks are solid starting points for most projects. Well-executed fine-tuning (with conservative learning rates, independent evaluation sets, and degradation monitoring) is today the standard model development strategy in data-limited environments.