Pre-trained Models and Transfer Learning

March 18, 2023

Table of contents

Key takeaways
Why transfer learning changes the rules
How it works: the three main approaches
Reference pre-trained models
When transfer works and when it doesn’t
Risks and best practices
Conclusion

Updated: 2026-05-03

Training a deep learning model from scratch for a new task is expensive in data, time, and money. Transfer learning addresses this: it takes a model that has already learned useful representations from a large dataset and adapts it to a new task with much less effort. This is why a team with a modest budget can now build an accurate text classifier or a working object detector without its own GPU farm.

Key takeaways

A pre-trained model is a neural network trained on a large dataset (ImageNet, a massive text corpus) that has already learned general domain representations.
Transfer learning takes those representations and adapts them to a new task through fine-tuning, feature extraction, or prompt engineering.
The most relevant pre-trained models in vision are ResNet, EfficientNet, and the Vision Transformer (ViT) family; in language, BERT, GPT, T5, and their derivatives.
Transfer works best when the source and target domains have similarities; the more distant they are, the more proprietary data is needed.
Fine-tuning has risks: overfitting to a small dataset can degrade performance on out-of-distribution data.

Why transfer learning changes the rules

Training a model like BERT or GPT-3 from scratch required thousands of accelerator hours at a minimum (far more for a model the size of GPT-3) and text corpora ranging from tens to hundreds of gigabytes. Those resources are out of reach for most teams. Transfer learning changes the equation:

A resource-rich organisation trains the base model on massive data.
The base model learns general representations: edges and textures in images, semantic relationships between words, grammatical structures.
A resource-limited team takes that model and adapts it to their specific task with a few thousand examples and a few hours of compute.

The result typically surpasses a model trained from scratch on the same proprietary data, because the base model has already learned the general distribution of the domain and only needs to learn the particularities of the new problem.

How it works: the three main approaches

Feature extraction: the internal layers of the pre-trained model are frozen (their weights are not modified) and used as feature extractors. Only the final layers (the classifier “head”) are trained with the new data. This is the fastest approach and requires the least proprietary data, but it is the least flexible.
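A minimal PyTorch sketch of feature extraction may help make this concrete. It rests on assumptions: a small randomly initialised network stands in for the pre-trained backbone (a real project would load pre-trained weights instead), the 3-class task and the synthetic data are hypothetical, and the learning rate is illustrative.

```python
import torch
from torch import nn

# Stand-in for a pre-trained backbone (assumption: a real project would
# load a network with pre-trained weights instead of random ones).
backbone = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 16),
    nn.ReLU(),
)

# Freeze the backbone: its weights receive no gradients and never change.
for param in backbone.parameters():
    param.requires_grad = False
backbone.eval()  # also fixes dropout/batch-norm behaviour, if present

# New classifier head for a hypothetical 3-class task.
head = nn.Linear(16, 3)

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Synthetic data standing in for the small proprietary dataset.
torch.manual_seed(0)
inputs = torch.randn(8, 32)
labels = torch.randint(0, 3, (8,))

# Snapshots used to confirm which weights actually moved.
frozen_before = [p.clone() for p in backbone.parameters()]
head_before = head.weight.detach().clone()

for _ in range(5):
    with torch.no_grad():  # features are computed without gradients
        features = backbone(inputs)
    loss = loss_fn(head(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After the loop the backbone weights are identical to their snapshots, while the head weights have changed. Because the frozen features never change, they can also be computed once and cached, which is part of why this approach is fast.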

Fine-tuning: some or all of the pre-trained model’s layers are unfrozen and re-trained with a very low learning rate on the new task’s data. Existing weights are slightly adjusted, preserving general knowledge while adapting to domain particularities. This is the most common approach in production for vision and NLP.
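As a rough illustration of fine-tuning with a low learning rate, the sketch below keeps every layer trainable but gives the pre-trained part a much smaller step size than the new head. Using two learning rates is a common variant and an assumption here, not something the text prescribes. The tiny network, the synthetic data, and both learning-rate values are placeholders.

```python
import torch
from torch import nn

# Stand-in for a pre-trained backbone plus a freshly initialised head
# (assumption: real code would load pre-trained weights into `backbone`).
backbone = nn.Sequential(nn.Linear(32, 16), nn.ReLU())
head = nn.Linear(16, 3)
model = nn.Sequential(backbone, head)

# All layers stay trainable, but the pre-trained part gets a much lower
# learning rate than the new head. Both values are illustrative.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ]
)
loss_fn = nn.CrossEntropyLoss()

# Synthetic data standing in for the new task's dataset.
torch.manual_seed(0)
inputs = torch.randn(8, 32)
labels = torch.randint(0, 3, (8,))

backbone_before = backbone[0].weight.detach().clone()
head_before = head.weight.detach().clone()

for _ in range(3):
    loss = loss_fn(model(inputs), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Largest absolute change in each part of the model after training.
backbone_shift = (backbone[0].weight.detach() - backbone_before).abs().max().item()
head_shift = (head.weight.detach() - head_before).abs().max().item()
```

Comparing `backbone_shift` with `head_shift` shows the intended effect: the pre-trained weights move only slightly, while the head is free to adapt quickly.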

Prompt engineering and in-context learning: with large language models (LLMs) like GPT-4 or LLaMA, additional training is sometimes unnecessary. The model is conditioned through natural-language instructions (prompts) and examples placed in the query context. This approach underlies tools like Microsoft 365 Copilot.
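In-context learning boils down to how the prompt is assembled. The sketch below builds a few-shot prompt as plain text. The function name, the `Text:`/`Label:` layout, and the support-ticket examples are all hypothetical; sending the prompt to a model is left out because it depends on the provider.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labelled examples, and a new query into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The final entry is left open so the model completes the label.
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    instruction="Classify each support ticket as 'billing' or 'technical'.",
    examples=[
        ("I was charged twice this month.", "billing"),
        ("The app crashes when I open settings.", "technical"),
    ],
    query="My invoice shows the wrong amount.",
)
print(prompt)
```

No weights change at any point: the examples in the context are the only "training" the model receives for this task.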

Figure: multi-layer neural network diagram showing input, hidden, and output layers, the basis of deep learning models.

Reference pre-trained models

In computer vision:

ResNet (He et al., 2015): the residual architecture that made training very deep networks possible. Available in variants from 18 to 152 layers. Standard starting point for image classification and object detection.
EfficientNet: scales network width, depth, and input resolution together in a balanced way. Achieves high accuracy for its parameter count.
CLIP (OpenAI): trained with image-text pairs, enables zero-shot classification and multimodal search without fine-tuning.
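The zero-shot classification that CLIP enables can be sketched as a nearest-neighbour search: the image and each candidate label are embedded in a shared space, and the label with the highest cosine similarity wins. The 3-dimensional vectors below are invented for illustration. In practice they would come from the model's image and text encoders, which are not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the label whose text embedding is closest to the image embedding."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings, invented for illustration only.
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

best_label, scores = zero_shot_classify(image_embedding, text_embeddings)
print(best_label)
```

Adding a new class only requires a new text description, with no labelled images and no retraining.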

In natural language processing:

BERT (Google, 2018): bidirectional model pre-trained with masked language modelling. Standard for text classification, named entity recognition, and question answering.
GPT and variants (OpenAI): autoregressive models optimised for text generation. At the time of writing, GPT-4 was the most capable in the family.
T5 (Google): encoder-decoder that converts any NLP task into a text-to-text task. Flexible and powerful for translation, summarisation, and QA.
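To illustrate the masked language modelling objective mentioned for BERT, here is a simplified sketch of the input corruption step. It assumes whitespace tokens and an independent per-token coin flip, which is a simplification of how masked positions are sampled in practice. The 80/10/10 split between `[MASK]`, a random token, and the unchanged token follows the scheme described for BERT. The function name and the example sentence are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)  # most selected tokens become [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # some become a random token
            else:
                corrupted.append(token)  # the rest are left unchanged
        else:
            labels.append(None)
            corrupted.append(token)
    return corrupted, labels


tokens = "transfer learning adapts a pre-trained model to a new task".split()
# A higher mask_prob than the usual 0.15 so the short example shows some masking.
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.3, rng=random.Random(0))
print(corrupted)
print(labels)
```

Because the model sees context on both sides of each masked position, the representations it learns are bidirectional, which is what makes them useful for classification and entity recognition.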

When transfer works and when it doesn’t

Transfer learning is not always the best option. Conditions that favour its use:

Limited proprietary data: if you have fewer than 10,000 labelled examples, starting from a pre-trained model is almost always better than training from scratch.
Domain similar to pre-training: a medical X-ray classifier can benefit from a ResNet pre-trained on ImageNet because low-level image structure (edges, textures) is shared, even though the semantic domain differs.
Time and compute constraints: fine-tuning a pre-trained model can be achieved in hours rather than weeks.

Conditions that reduce effectiveness:

Very distant domain: if the data type is radically different from pre-training data (for example, radar signals vs. natural images), transfer may contribute little or nothing.
Base model has incompatible biases: a language model trained on English text may transfer poorly to morphologically complex languages if no base model exists in that language.
Abundant, specific proprietary data: if you have millions of examples from the target domain, training from scratch may outperform fine-tuning.

Risks and best practices

Catastrophic forgetting: when a pre-trained model is fine-tuned too aggressively on a small dataset, it can “forget” general knowledge and overfit to the proprietary data. Mitigations: low learning rates, L2 regularisation, and staged fine-tuning (gradually unfreezing layers, starting with those closest to the output and moving toward the input).
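Staged fine-tuning can be expressed as a simple schedule. The sketch below assumes one common variant, gradual unfreezing that starts at the head and works back toward the input one layer per stage. The order and pace are design choices, and the layer names are hypothetical.

```python
def trainable_layers(layer_names, stage):
    """Layers to train at a given stage; layer_names is ordered input -> output.

    Stage 0 trains only the last layer (the head). Each further stage
    unfreezes one more layer, moving from the output toward the input.
    """
    if stage < 0:
        raise ValueError("stage must be non-negative")
    count = min(stage + 1, len(layer_names))
    return layer_names[-count:]


# Hypothetical layer names, ordered from input to output.
layers = ["embeddings", "block_1", "block_2", "block_3", "head"]
schedule = [trainable_layers(layers, stage) for stage in range(len(layers))]

for stage, names in enumerate(schedule):
    print(stage, names)
```

In a training loop, each stage would run for one or more epochs with every layer outside the returned list kept frozen, so the most general early layers are the last to be disturbed.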

Data leakage in evaluation: if evaluation data overlaps with pre-training data (for example, the test set was part of the base model’s training data), evaluation metrics will be optimistic. Truly independent evaluation sets are essential.
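A basic overlap check is easy to sketch when both datasets are available as text. The version below only catches exact duplicates after light normalisation. It does not detect paraphrases, and it cannot help when the base model's pre-training corpus is not accessible. The function names and sample sentences are hypothetical.

```python
import hashlib


def normalise(text):
    """Lower-case and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def fingerprint(text):
    """Stable hash of the normalised text."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()


def find_leaked_examples(train_texts, test_texts):
    """Return test examples whose normalised text also appears in the training data."""
    train_hashes = {fingerprint(text) for text in train_texts}
    return [text for text in test_texts if fingerprint(text) in train_hashes]


train_texts = ["The model converged quickly.", "Loss went down after epoch 3."]
test_texts = ["the model   converged quickly.", "Validation accuracy plateaued."]

leaked = find_leaked_examples(train_texts, test_texts)
print(leaked)
```

Any example returned by the check should be removed from the evaluation set before metrics are reported.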

Interpretability: pre-trained models add a layer of complexity to interpretability. XAI techniques, like those described in “explaining AI through XAI”, are especially relevant when deploying large-scale models for high-impact decisions.

Conclusion

Transfer learning democratises access to deep learning: teams without access to massive data or compute infrastructure can build high-quality models from pre-trained bases. BERT for NLP, ResNet for vision, CLIP for multimodal tasks — these are solid starting points for most projects. Well-executed fine-tuning — with conservative learning rates, independent evaluation sets, and degradation monitoring — is today the standard model development strategy in data-limited environments.

Written by

Javier Cañete

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.
