
Reinforcement Learning Introduction

Adrià Serra
4 min read · Aug 19, 2023


Reinforcement learning is learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal.

The problem of reinforcement learning is formalised using ideas from dynamical systems theory, specifically as the optimal control of incompletely known Markov decision processes.

Reinforcement learning cannot be classified as supervised or unsupervised learning: it does not learn from labelled data, and it does not try to discover hidden patterns. It is a third paradigm of machine learning in its own right.

One challenge that arises in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions it has not selected before. The agent has to exploit what it has already experienced to obtain reward, but it also has to explore in order to make better action selections in the future. This is called the exploration-exploitation dilemma.
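To make the dilemma concrete, here is a minimal sketch of an epsilon-greedy rule, one common (though not the only) way to balance the two: with a small probability the agent explores a random action, otherwise it exploits the action with the highest value estimated so far. The actions and value estimates below are made up purely for illustration.

```python
import random

EPSILON = 0.1                                   # exploration rate (hypothetical)
value_estimates = {"left": 0.2, "right": 0.7}   # values the agent has learned so far (made up)

def epsilon_greedy():
    if random.random() < EPSILON:
        return random.choice(list(value_estimates))           # explore: try a random action
    return max(value_estimates, key=value_estimates.get)      # exploit: best-known action

print(epsilon_greedy())   # usually "right", occasionally "left"
```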

The four elements of reinforcement learning

Beyond the agent and the environment, we can identify four main subelements:

Policy

A policy is a critical concept that dictates how an agent should select actions in a given environment to achieve its goals. The policy defines the strategy or rules that the agent uses to decide which actions to take based on its observations of the environment.

Mathematically, a policy is denoted as π(a|s), where:

  • π represents the policy.
  • “a” is the action that the agent selects.
  • “s” is the state or observation of the environment in which the agent finds itself.

In simpler terms, the policy π(a|s) gives the probability that the agent will take action “a” when it’s in state “s”. The policy can be deterministic, where it directly maps each state to a specific action, or stochastic, where it provides a probability distribution over possible actions for each state.
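As a rough illustration, the sketch below contrasts a deterministic policy, which maps each state to exactly one action, with a stochastic policy, which maps each state to a probability distribution over actions and samples from it. The toy states, actions, and probabilities are hypothetical.

```python
import random

def deterministic_policy(state):
    # Maps each state directly to a single action.
    return "right" if state == "start" else "left"

def stochastic_policy(state):
    # Builds pi(a|s) as a distribution over actions, then samples one action from it.
    probs = {"left": 0.2, "right": 0.8} if state == "start" else {"left": 0.5, "right": 0.5}
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights)[0]

print(deterministic_policy("start"))   # always "right"
print(stochastic_policy("start"))      # "right" roughly 80% of the time
```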

Reward signal

The reward signal is a fundamental concept in reinforcement learning (RL) that provides feedback to the learning agent about the quality of its actions in an environment. It is a scalar value that indicates the immediate benefit or desirability of the agent’s action in a given state. The reward signal serves as a way to guide the agent’s learning process towards making better decisions that lead to higher cumulative rewards over time.

In mathematical terms, the reward signal is denoted as “r” and is a function of the current state “s” and the action “a” taken by the agent:

r = R(s, a)
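For instance, in a toy navigation task the reward function might look like the following sketch, where reaching the goal earns +1 and every other step incurs a small penalty that nudges the agent toward short paths. The states, actions, and values are purely illustrative.

```python
# A hypothetical reward function r = R(s, a) for a small navigation task.
def reward(state, action):
    if state == "next_to_goal" and action == "right":
        return 1.0     # goal reached
    return -0.01       # small per-step penalty

print(reward("next_to_goal", "right"))   # 1.0
print(reward("start", "left"))           # -0.01
```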

Value function

The value function is a concept used to assess the expected cumulative reward an agent can obtain from being in a specific state and following a particular policy. The value function helps the agent make decisions by quantifying how good or valuable a given state is, considering the potential future rewards it can lead to.

Mathematically, the value function of a state “s” under a policy π is denoted as “V(s)” and represents the expected cumulative reward the agent can obtain from state “s” onwards while following policy π. It is defined as the expected sum of future rewards, each discounted by a discount factor “γ” so that immediate rewards count more than distant ones:

V(s) = E[ Σ_{t=0}^{∞} γ^t · r_{t+1} | s_0 = s, π ]
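One way to read this formula in code is a rough Monte Carlo estimate: run many episodes from state “s” while following the policy, accumulate the discounted rewards, and average the returns. The sketch below assumes a hypothetical environment step function; the tiny toy chain at the end exists only to make it runnable.

```python
GAMMA = 0.9   # discount factor (hypothetical)

def estimate_value(state, policy, step, episodes=1000, horizon=50):
    """Monte Carlo estimate of V(state): average discounted return over many episodes."""
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = state, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done = step(s, a)   # environment returns next state, reward, done flag
            ret += discount * r
            discount *= GAMMA
            if done:
                break
        total += ret
    return total / episodes

# A tiny made-up chain, used only to exercise the sketch.
def toy_step(state, action):
    if state == "start" and action == "right":
        return "goal", 1.0, True
    return "start", -0.01, False

print(estimate_value("start", lambda s: "right", toy_step))   # 1.0: the toy policy reaches the goal in one step
```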

Optionally, a model

The “model” of the environment refers to an agent’s learned representation of how the environment behaves. This representation allows the agent to simulate and predict the outcomes of its actions without having to interact with the real environment. Models are used in model-based reinforcement learning approaches to simulate possible future states, transitions, and rewards, which in turn helps the agent make informed decisions to maximize its cumulative rewards.
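A minimal sketch of such a model is a lookup table that predicts the next state and reward for each (state, action) pair, which the agent can query instead of acting in the real environment. The entries below are invented for illustration.

```python
# A made-up tabular model: (state, action) -> (predicted next state, predicted reward).
model = {
    ("start", "right"): ("middle", -0.01),
    ("middle", "right"): ("goal", 1.0),
}

def simulate(state, action):
    # Predict the outcome of an action; unknown pairs default to staying put with zero reward.
    return model.get((state, action), (state, 0.0))

print(simulate("start", "right"))   # ('middle', -0.01)
```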

Summary

The introduction provides an overview of reinforcement learning (RL) as a learning approach that maximizes rewards by mapping situations to actions. It is distinct from supervised and unsupervised learning and involves the optimal control of Markov decision processes. One of the challenges unique to RL is the exploration-exploitation trade-off. The post introduces four essential elements in RL: policy, reward signal, value function, and optionally, a model of the environment.

Conclusion

Reinforcement learning stands as a distinctive paradigm in machine learning, focused on learning optimal actions to maximize rewards within dynamic environments. Its uniqueness lies in its emphasis on decision-making rather than pattern discovery. The exploration-exploitation challenge encapsulates the delicate balance agents must strike between venturing into new actions and exploiting known effective strategies. The four core elements — policy, reward signal, value function, and the optional environment model — constitute the framework through which RL agents navigate uncertainty and learn to make informed choices, advancing the field’s capacity to tackle complex real-world problems.

If you liked this post: I usually write about maths and machine learning, and I am starting to publish about data engineering and programming. Do not hesitate to follow my profile to get notified about new posts.

https://medium.com/@crunchyml


Adrià Serra

Data scientist. This account shares my blog posts about statistics, probability, machine learning, and deep learning. #100daysofML