Artificial Intelligence

#aprendizaje-por-refuerzo #deep-learning #inteligencia artificial #machine learning #rlhf #robotica

Reinforcement Learning: An Autonomous Learning Technique

March 18, 2023 5 min 197 4.3

Diagram of the reinforcement learning cycle: agent, environment, action, reward, and state

Table of contents

Key takeaways
Components of reinforcement learning
The learning process: four stages
Applications of reinforcement learning
Advantages and limitations
Conclusion
Sources

Updated: 2026-07-12

Reinforcement learning (RL) is the artificial intelligence technique that teaches a system to make optimal decisions through accumulated experience, guided by reward and penalty signals. Unlike supervised learning, which requires human-labelled examples, RL learns by interacting directly with an environment: it tries actions, observes results, and adjusts its strategy to maximise long-term reward.

Key takeaways

RL is based on three components: policy (how the agent acts), value function (how good a state is), and reward function (the objective).
The learning process follows four stages: observation, action selection, feedback, and policy update.
Applications range from robotics and video games to industrial system optimisation and language model fine-tuning.
RL does not require labelled data, but does need a well-designed reward function, which is hard to define for real-world problems.
The main limitations are high training sample demand and the risk of unexpected solutions when the reward is poorly specified.

Components of reinforcement learning

RL describes the interaction between an agent and an environment. The three fundamental components governing this interaction are:

Policy (π)

The policy is the strategy the agent follows to select an action given an observed state of the environment. It can be deterministic (always chooses the same action for the same state) or stochastic (chooses with some probability).

Value function (V or Q)

Measures how good it is to be in a given state (V) or to execute an action in a state (Q) in terms of expected future reward. It is the guide that allows the agent to reason about long-term consequences, not just the immediate step.

Reward function (R)

Defines the agent’s objective: assigns a numerical value to each state transition. The agent does not control it (the system designer provides it), and its correct specification is one of the biggest practical challenges in RL, as Sutton and Barto^[1] lay out in the field’s reference text.

Diagram of the reinforcement learning cycle showing the interaction between agent, environment, action, state, and reward

The learning process: four stages

The reinforcement learning cycle repeats iteratively:

Observation: the agent perceives the current state of the environment through sensors or data inputs.
Action selection: based on its current policy, the agent chooses an action. Initially this choice is almost random (exploration); as it learns, it becomes more deliberate (exploitation). Balancing exploration and exploitation is one of RL’s fundamental problems.
Feedback: the environment returns a reward (positive, negative, or zero) and a new state. This signal is the only supervision the agent receives.
Policy update: the agent adjusts its policy and value function using the received feedback. Algorithms like Q-Learning, SARSA, or policy gradient methods (PPO, A3C) implement this step in different ways.

This process connects directly to the principles of development and advances in artificial intelligence: RL is one of the three major families of machine learning, alongside supervised and unsupervised learning.

Applications of reinforcement learning

RL applications cover a broad spectrum:

Robotics:

Autonomous navigation in unstructured environments.
Object manipulation: learning to grasp, sort, or assemble parts without programming each movement.
Locomotion control of bipedal and quadruped robots.

Video games and simulation:

AlphaGo^[2] (DeepMind) beat Lee Sedol, widely considered the world’s strongest Go player, 4-1 in Seoul in March 2016, using RL combined with deep neural networks.
OpenAI Five^[3] learned to play Dota 2 as a team until it beat OG, the reigning International champions, in 2019, then went on to win 99.4% of 42,729 subsequent public games.
Video game environments are popular test benches because the reward function (score) is predefined and the system can train thousands of hours in simulated time.

Industrial optimisation:

Google used RL^[4] to optimise cooling in its data centres, cutting cooling energy use by 40% (a 15% drop in overall PUE overhead).
In telecoms, RL adjusts network parameters in real time to maximise bandwidth.

Language models (RLHF):

Fine-tuning with human feedback (Reinforcement Learning from Human Feedback, described in OpenAI’s InstructGPT paper^[5]) is the technique that allows ChatGPT and similar models to align their responses with human preferences. See ChatGPT 4 for more context.

RL also feeds real-time Big Data analytics tools: when data volumes are so large that manual analysis is impossible, RL agents can learn action policies directly from data streams.

Advantages and limitations

Advantages:

Learns in environments where labelled data does not exist.
Adapts to changing environments: if the environment changes, the agent can relearn.
Can discover non-intuitive strategies that humans would not have considered.

Real limitations:

High sample demand: in complex problems, the agent needs millions of interactions before converging to a reasonable policy. In the physical world, this can be costly or dangerous.
Reward specification: poorly defining the reward function leads to unexpected or harmful behaviours (reward hacking). A robot instructed to maximise points may find exploits the designers did not anticipate.
Local minima: the agent may get trapped in suboptimal solutions if initial exploration is insufficient.
Poor transferability: a policy learned in one environment rarely transfers well to a different one without retraining.

Conclusion

Reinforcement learning is the AI technique closest to how living beings learn: through interaction and consequences, not direct instruction. Its achievements in strategy games, robotics, and system optimisation are remarkable, and with RLHF it has become the key piece in fine-tuning modern language models. The main challenge is not technical but one of design: correctly specifying what the system wants the agent to maximise, because the agent will take it literally.

This article is also available in Spanish: El aprendizaje por refuerzo: una técnica de aprendizaje autónomo.

Reinforcement Learning: An Autonomous Learning Technique

Key takeaways

Components of reinforcement learning

The learning process: four stages

Applications of reinforcement learning

Advantages and limitations

Conclusion

Sources

AI explained without the hype

Share this article

Was this article helpful?

Related posts

OpenRouter: A Gateway for AI Models

browser-use: agents that browse the web

Firecrawl: Web Data for Agents

Composio: Tools and Integrations for Agents