#16: Reinforcement Learning

Created

Dec 9, 2021 07:19 PM

Topics

Characteristics Composition Reward Reward Hypothesis of RL History State Environment State Agent State Information State (Markov State)Environment Fully Observable Environment Partial Observable Environment Agent Policy Value Function Model Algorithm Reinforcement Learning Exploitation & Exploration Current Progress Question Resources

Characteristics

No supervisor, only reward signal

Feedback is delayed, not instantaneous

Time matters (sequential), non IID data

Actions affect the subsequent data

Composition

Reward

Scalar feedback signal

Received after each step

Reward Hypothesis of RL

Maximization of expected cumulative reward = goals of the task

History

Sequence of all observations, actions, rewards

Except the current action

State

A function of history

Compresses the huge history sequence to a single vector

Environment State

Env's private state (not visible to agent)

From which generates next observation & reward

Agent State

Agent's internal representation

Any function of history

Inputs to the RL algorithm

Information State (Markov State)

Contains all useful info about the history

💡

Markov Property

The future is independent of the past given the present

Question: 2 stage modeling?

relates to

But comes from

does not relate to

Environment

Fully Observable Environment

Agent directly observes env state

⇒ Markov Decision Process (MDP)

Partial Observable Environment

Agent indirectly observes env state

e.g. CV, trading bot, poker bot

⇒ Partially Observable Markov Decision Process (POMDP)

Agent must construct own state representation

E.g.

Agent

Policy

Defines the agent's behavior: maps state to action

Deterministic policy:

Stochastic policy:

Value Function

Predicts future reward by state

Evaluates goodness of states to choose an action

Model

Predicts immediate future state & reward by action

Next State:

Next Action:

Algorithm

Reinforcement Learning

Rules are unknown

Learn directly from interactive game play

Perform actions, see scores, make plans

Exploitation & Exploration

Exploitation: perform the best know action

Exploration: do something random

Current Progress

Value function is hard to learn

use DL to model the value function

Question

Is RL a MPS or an alternative method?

Can combine supervised learning & RL?

Resources

Interested in learning more about reinforcement learning? Follow along in this video series as DeepMind Principal Scientist, creator of AlphaZero and 2019 ACM Computing Prize Winner David Silver, gives a comprehensive explanation of everything RL.

This classic 10 part course, taught by Reinforcement Learning (RL) pioneer David Silver, was recorded in 2015 and remains a popular resource for anyone wanting to understand the fundamentals of RL. Reinforcement Learning has emerged as a powerful technique in modern machine learning, allowing a system to learn through a process of trial and error.

https://deepmind.com/learning-resources/-introduction-reinforcement-learning-david-silver

Jan 2, 2022