Page cover image

#16: Reinforcement Learning

    Dec 9, 2021 07:19 PM


    Characteristics of RL

    • No supervisor, only a reward signal
    • Feedback is delayed, not instantaneous
    • Time matters (sequential), non-i.i.d. data
    • Actions affect the subsequent data
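The bullets above describe the agent-environment interaction loop. A minimal sketch of that loop (the environment `env_step`, the walk-right policy, and the goal state are all made up for illustration):

```python
def run_episode(env_step, policy, initial_obs, max_steps=100):
    """Generic agent-environment loop: the agent acts, the environment
    returns a reward and the next observation; actions shape the data seen."""
    obs, total_reward = initial_obs, 0.0
    for t in range(max_steps):
        action = policy(obs)                        # agent picks an action
        obs, reward, done = env_step(obs, action)   # action affects subsequent data
        total_reward += reward                      # scalar feedback accumulates
        if done:
            break
    return total_reward

# Toy environment: walk right from 0; reward arrives only at state 5
# (delayed feedback, not instantaneous).
def env_step(state, action):
    nxt = state + action
    return nxt, (1.0 if nxt == 5 else 0.0), nxt == 5

print(run_episode(env_step, policy=lambda s: 1, initial_obs=0))  # -> 1.0
```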



    Reward

    • A scalar feedback signal
    • Received after each step

    Reward Hypothesis of RL

    All goals of the task can be described as the maximization of expected cumulative reward
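The cumulative reward the hypothesis refers to is the (discounted) return G_t = R_{t+1} + γR_{t+2} + …, which can be computed by a backward recursion (reward sequence and γ chosen arbitrarily):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... : the quantity the agent
    maximizes in expectation under the reward hypothesis."""
    g = 0.0
    for r in reversed(rewards):   # G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([0, 0, 1], gamma=0.5))  # 0 + 0.5*0 + 0.25*1 = 0.25
```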


    History

    • The sequence of all observations, actions, and rewards up to time t
    • Excludes the current action (which has not been chosen yet)


    State

    • A function of history: S_t = f(H_t)
    • Compresses the huge history sequence into a single vector

    Environment State

    The environment's private state (not visible to the agent),
    from which it generates the next observation & reward

    Agent State

    The agent's internal representation
    Any function of history: S^a_t = f(H_t)
    The input to the RL algorithm

    Information State (Markov State)

    Contains all useful information from the history
    Markov Property
    The future is independent of the past given the present:
    P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
    Once the state is known, the history may be thrown away
    Question: two-stage modeling?
    • The state relates to the future
    • but comes from the history
    • and the future does not relate to the past (given the present)


    Fully Observable Environment

    • Agent directly observes env state
      • Markov Decision Process (MDP)

    Partially Observable Environment

    • Agent indirectly observes env state
      • e.g. CV, trading bot, poker bot
      • Partially Observable Markov Decision Process (POMDP)
    • Agent must construct its own state representation
      • E.g. the complete history, beliefs over environment states, or a recurrent network state
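One concrete way an agent can construct its own state in a POMDP is a Bayesian belief over the hidden environment states. A toy sketch (the two weather states and the observation likelihoods are invented for illustration):

```python
def belief_update(belief, obs, obs_prob):
    """Bayesian belief over hidden environment states: multiply each
    prior by the observation likelihood, then renormalize."""
    new = {s: p * obs_prob(obs, s) for s, p in belief.items()}
    z = sum(new.values())
    return {s: p / z for s, p in new.items()}

# Hypothetical hidden states; observation 'hot' is more likely when 'sunny'.
likelihood = lambda o, s: {('hot', 'sunny'): 0.8, ('hot', 'rainy'): 0.3,
                           ('cold', 'sunny'): 0.2, ('cold', 'rainy'): 0.7}[(o, s)]

b = belief_update({'sunny': 0.5, 'rainy': 0.5}, 'hot', likelihood)
print(round(b['sunny'], 3))  # 0.8*0.5 / (0.8*0.5 + 0.3*0.5) = 0.727
```

The belief vector itself then serves as the agent state S^a_t.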




    Policy

    Defines the agent's behavior: maps state to action
    • Deterministic policy: a = π(s)
    • Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
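The two policy types can be sketched directly (the states, actions, and probabilities below are arbitrary toy choices):

```python
import random

def deterministic_policy(state):
    """a = pi(s): the same action every time for a given state."""
    return 1 if state < 5 else -1

def stochastic_policy(state, rng=random):
    """pi(a|s): actions drawn from a state-conditioned distribution
    (here a fixed 80/20 split, independent of the state, for brevity)."""
    return rng.choices([1, -1], weights=[0.8, 0.2])[0]

print(deterministic_policy(3))   # -> 1
print(stochastic_policy(3))      # -> 1 or -1, at random
```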

    Value Function

    Predicts future reward from a state:
    v_π(s) = E_π[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s]
    Evaluates the goodness of states in order to choose an action
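For a deterministic chain of states the value function can be computed exactly by backing up rewards; this toy example (chain and rewards invented here) shows how values rank states:

```python
def state_values(rewards, gamma=0.9):
    """v(s) for a deterministic chain s -> s+1, with rewards[s] received
    on leaving s: v(s) = r(s) + gamma * v(s+1)."""
    v = [0.0] * (len(rewards) + 1)          # terminal state has value 0
    for s in range(len(rewards) - 1, -1, -1):
        v[s] = rewards[s] + gamma * v[s + 1]
    return v[:-1]

# States closer to the rewarding transition are "better":
print(state_values([0, 0, 1], gamma=0.5))  # -> [0.25, 0.5, 1.0]
```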


    Model

    Predicts the immediate next state & reward given an action
    • Next State: P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]
    • Reward: R^a_s = E[R_{t+1} | S_t = s, A_t = a]
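A tabular model is just these two quantities stored per (state, action) pair; the states, actions, and numbers here are a made-up sketch:

```python
# P(s'|s,a) as a next-state distribution, R(s,a) as an expected reward.
transition = {('s0', 'go'): {'s1': 0.9, 's0': 0.1}}
reward = {('s0', 'go'): 1.0}

def predict(state, action):
    """Model-based one-step prediction: next-state distribution & reward."""
    return transition[(state, action)], reward[(state, action)]

dist, r = predict('s0', 'go')
print(dist, r)  # -> {'s1': 0.9, 's0': 0.1} 1.0
```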


    Reinforcement Learning

    The rules (the environment) are initially unknown
    The agent learns directly from interactive game play:
    it performs actions, sees scores, and makes plans

    Exploitation & Exploration

    Exploitation: perform the best-known action
    Exploration: try something random to gather new information
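The standard way to balance the two is ε-greedy action selection: explore with probability ε, exploit otherwise (the Q-value list below is an arbitrary example):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the best-known action (exploitation)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # -> 1 (pure exploitation)
```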

    Current Progress

    The value function is hard to learn exactly,
    so deep learning is used to model (approximate) the value function
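A minimal sketch of learned value approximation, using a linear model instead of a deep network (deep RL keeps the same bootstrapped TD target r + γ·v(s'), just with a neural network in place of the weight vector; all numbers here are illustrative):

```python
def td0_update(w, features, reward, next_features, alpha=0.1, gamma=0.9):
    """One TD(0) step for a linear value function v(s) = w . phi(s):
    move w along the gradient, scaled by the TD error."""
    v = sum(wi * f for wi, f in zip(w, features))
    v_next = sum(wi * f for wi, f in zip(w, next_features))
    td_error = reward + gamma * v_next - v          # bootstrapped target - estimate
    return [wi + alpha * td_error * f for wi, f in zip(w, features)]

w = [0.0, 0.0]
w = td0_update(w, [1.0, 0.0], reward=1.0, next_features=[0.0, 1.0])
print(w)  # -> [0.1, 0.0]
```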


    1. Is RL an MDP-based method or an alternative method?
    2. Can we combine supervised learning & RL?


    Jan 2, 2022