Day 20 - Stanford AI - Markov Decision Processes - Policy Evaluation, Optimal Value/Policy, Reinforcement Learning
Dec 5 2024
(Stanford.ai)
- Markov Decision Processes (part 1)
- Hardwick note: MDP is a Search Algorithm, but it takes into account different probabilities of what happens at each state, whereas the previous search problems were deterministic.
- Example - walk the island and avoid the volcano… what if the slip probability increases?
- Policy Evaluation - given a policy, evaluate how good that policy is on the MDP
- Value Iteration - what is the best policy to take?
- MDP - a graph of States and Actions, with two kinds of nodes:
- States
- Chance Nodes
- MDP: instead of minimizing cost, as in Search (A*, Uniform Cost Search, etc.), we want to maximize the total reward (it is a Reward now, not a cost)
- States Set = the set of all states
- s_start - the starting state
- IsEnd(s) - is s an end state (are we at the end of the game)?
- Actions(s): the set of possible actions available from state s
- Transition probability T(s, a, s’) - the probability of ending up in state s’ if you take action a in state s
- Like Succ(s, a) in Search problems - but transitions are not deterministic here
- For a fixed s and a, the probabilities over all successor states s’ sum to 1 (and each is non-negative, because they are probabilities)
- Reward(s, a, s’) - reward for the Transition (s, a, s’)
- Like Cost(s, a) in Search problems - but things are not deterministic here
- 0 <= gamma (discount) <= 1 (default: 1)
- ~0 - the agent prioritizes immediate rewards
- ~1 - the agent puts greater value on future rewards
- Note: if the graph is cyclic, then gamma needs to be less than one (if acyclic, gamma can be 1)
- Let’s repeat the Tram example, but now with the tram failing with probability 0.5
- See code example - ai-ml/tramMDP.py (a rough sketch of what this could look like is below)
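Since the notes reference that file, here is a minimal sketch of what a tram MDP like this might look like. This is my reconstruction, not the actual contents of ai-ml/tramMDP.py; the walk/tram actions and the 0.5 failure probability follow the lecture example, while the interface names and the reward values are assumptions.

```python
class TramMDP:
    """Blocks are numbered 1..N. 'walk' moves from s to s+1; 'tram' tries to
    jump from s to 2*s but fails (stays at s) with probability 0.5."""

    def __init__(self, N):
        self.N = N

    def startState(self):
        return 1

    def isEnd(self, state):
        return state == self.N

    def actions(self, state):
        acts = []
        if state + 1 <= self.N:
            acts.append('walk')
        if 2 * state <= self.N:
            acts.append('tram')
        return acts

    def succProbReward(self, state, action):
        # Returns a list of (newState, probability, reward) triples.
        # Reward values are illustrative (negative, i.e. "minus the cost").
        results = []
        if action == 'walk':
            results.append((state + 1, 1.0, -1))
        elif action == 'tram':
            results.append((2 * state, 0.5, -2))  # tram succeeds
            results.append((state, 0.5, -2))      # tram fails, stay put
        return results

    def discount(self):
        return 1.0

    def states(self):
        return range(1, self.N + 1)
```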
- What’s the solution? It’s not deterministic like Search problems…so we add a “policy”
- POLICY - a policy (π) is a mapping from each state s (in the set of States) to an action a (in the set of Actions(from s)). But how good is the policy? We need to evaluate it.
- π(S) : Policy
- π: S → A(s)
- Policy evaluation - I am not trying to figure out the policy, I’m just using the policy that you gave me and I’m evaluating how good it is.
- UTILITY - following a policy yields a random path. The utility of a policy is the (discounted) sum of the rewards on the path (this is the random quantity).
- Utility = discounted sum of the rewards ⇒ r1 + gamma*r2 + gamma^2*r3 + … (gamma = discount)
- Value is the expected Utility = I sum them up and average them, and that gives me value. You give me a bunch of random paths, I compute their utilities, and the value is their expected utility, the average.
- Example (paths and their utilities, with gamma = 1 and a reward of 4 for each “stay”):
| Path (random) | Utility |
|---|---|
| [in; stay, 4, end] | 4 |
| [in; stay, 4, in; stay, 4, in; stay, 4, end] | 12 |
| [in; stay, 4, in; stay, 4, end] | 8 |
| [in; stay, 4, in; stay, 4, in; stay, 4, in; stay, 4, end] | 16 |
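As a small sanity check on the utility definition above, here is a hedged snippet that computes the discounted sum of rewards along a path; it assumes gamma = 1 and a reward of 4 per “stay”, matching the table.

```python
def utility(rewards, gamma=1.0):
    """Discounted sum of rewards: r1 + gamma*r2 + gamma^2*r3 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example paths from the table above (each "stay" yields reward 4, gamma = 1):
print(utility([4]))           # 4
print(utility([4, 4, 4]))     # 12
print(utility([4, 4]))        # 8
print(utility([4, 4, 4, 4]))  # 16
```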
- Policy VALUE and EXPECTED UTILITY - the value of a policy is the expected utility
- DISCOUNTING - gamma = 1 means future rewards count fully (“save for the future”); gamma = 0 means only immediate rewards count (“live in the moment”)
- In reality, we are somewhere in between…typically we use 1 or 0.9, depending on the real life situation of the problem domain
- Note: gamma is not a hyperparameter, it is a design choice.
- POLICY EVALUATION - value of a policy.
- Let Vπ(s) be the expected utility received by following policy π from state s.
- Given policy π, from state s I already know that the action I’m going to take is π(s)
- (state s) → take action π(s) → [chance node (s, π(s))]
- But from this chance node, there are still different probabilities T(s, π(s), s’) of which new state s’ you end up in, and each s’ has its own expected utility Vπ(s’)
- Vπ(s) is the expected utility starting from the actual state s
- Qπ(s,a) is the expected utility starting from the chance node (s, a)
- See [Markov Decision Processes 1 - Value Iteration Stanford CS221: AI (Autumn 2019) (time = 47:11)](https://youtu.be/9g32v7bK3Co?si=KfnQRu5hP4Uz5-XD&t=2831) - basis for the next few lectures
- Vπ(s) = 0 if IsEnd(s); otherwise Vπ(s) = Qπ(s, π(s)) – but what is Qπ(s,a)?
- IMPORTANT RECURRENCE
- Qπ(s,a) = Σ over s’ of T(s, a, s’) * [ immediate reward R(s, a, s’) + gamma * Vπ(s’) ], which means that
- Vπ(s) = Σ over s’ of T(s, π(s), s’) * [ immediate reward R(s, π(s), s’) + gamma * Vπ(s’) ] …since I’m not at an end state.
- But this equation has Vπ on both sides, so we can compute Vπ(s) either iteratively or with a closed-form solution; either way, this is what we need in order to EVALUATE the VALUE of policy π.
- In practice, the problems can get complex, so you use the iterative approach, repeating until the difference between successive estimates of Vπ(s) drops below some small tolerance you have set (see the sketch below).
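A rough sketch of iterative policy evaluation under these assumptions: the MDP exposes states(), isEnd(), succProbReward(), and discount() as in the tram sketch above, `policy` is a dict mapping each state to an action, and `epsilon` is the stopping tolerance.

```python
def policy_evaluation(mdp, policy, epsilon=1e-6, max_iters=1000):
    """Iteratively compute V_pi(s) for every state using the recurrence
    V_pi(s) = sum_{s'} T(s, pi(s), s') * [R(s, pi(s), s') + gamma * V_pi(s')]."""
    V = {s: 0.0 for s in mdp.states()}   # initialize V_pi(s) = 0
    gamma = mdp.discount()
    for _ in range(max_iters):
        newV = {}
        for s in mdp.states():
            if mdp.isEnd(s):
                newV[s] = 0.0
            else:
                a = policy[s]
                newV[s] = sum(prob * (reward + gamma * V[sp])
                              for sp, prob, reward in mdp.succProbReward(s, a))
        # stop when the largest change between iterations is below epsilon
        if max(abs(newV[s] - V[s]) for s in mdp.states()) < epsilon:
            V = newV
            break
        V = newV
    return V
```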
- Summary of what we covered so far
- MDP - Graphs with states, chance nodes, transition probabilities, rewards
- Policy - mapping from state to action (solution to MDP)
- Value of Policy - expected utility over random paths
- Policy Evaluation - iterative algorithm to compute value of policy
- Complexity - S = states, A = actions per state, S’ = successors, t_PE = policy evaluation iterations
- Time: O( t_PE * S * S’ )
- Doesn’t depend on the number of actions A because you’ve given me the policy
- OPTIMAL VALUE AND POLICY
- Goal: try to get directly at maximum expected utility
- Definition: Optimal value Vopt(s) is the maximum value attained by any policy
- Qopt(s, a) = Σ over s’ of T(s, a, s’) * [ immediate reward R(s, a, s’) + gamma * Vopt(s’) ]
- Value iteration
- πopt(s) = argmax over a in Actions(s) of Qopt(s, a)
- Similar to the iterative algorithm for policy evaluation, except that on each iteration value iteration sets Vopt(s) = max over a in Actions(s) of Qopt(s, a) instead of following a fixed policy (see the sketch below, after the complexity notes)
- Complexity - S = states, A = actions per state, S’ = successors, t_VI = value iteration iterations
- Time: O( t_VI * S * A * S’ )
- Does depend on the number of actions A because you’re trying to find the best policy
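A corresponding sketch of value iteration under the same assumed MDP interface as the policy evaluation sketch; the only change is the max over actions, and the optimal policy is then read off via argmax.

```python
def value_iteration(mdp, epsilon=1e-6, max_iters=1000):
    """Compute V_opt(s) and the optimal policy pi_opt(s) = argmax_a Q_opt(s, a)."""
    V = {s: 0.0 for s in mdp.states()}
    gamma = mdp.discount()

    def Q(s, a):
        return sum(prob * (reward + gamma * V[sp])
                   for sp, prob, reward in mdp.succProbReward(s, a))

    for _ in range(max_iters):
        newV = {}
        for s in mdp.states():
            if mdp.isEnd(s):
                newV[s] = 0.0
            else:
                newV[s] = max(Q(s, a) for a in mdp.actions(s))
        if max(abs(newV[s] - V[s]) for s in mdp.states()) < epsilon:
            V = newV
            break
        V = newV

    # Read off the optimal policy from the converged values.
    policy = {s: (None if mdp.isEnd(s) else
                  max(mdp.actions(s), key=lambda a: Q(s, a)))
              for s in mdp.states()}
    return V, policy
```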
- CONVERGENCE for MDP
- An algorithm is said to converge when, as the number of iterations increases, its output (e.g., parameters, solutions, predictions) gets closer to a specific desired result or goal.
- Theorem: convergence
- Suppose either
- Discount gamma < 1 or
- MDP graph is acyclic (has no cycles) – then we are effectively doing dynamic programming over the whole graph, including each action’s transition probabilities at each state.
- NON-CONVERGENCE: if the graph has cycles, then you need the discount to be less than one or you will not converge. With gamma (discount) = 1, you can keep cycling between states gathering zero reward, and the values never settle.
- Summary of what we covered so far
- MDP - Markov Decision Processes (MDPs) cope with uncertainty
- Solutions are POLICIES rather than PATHS (like we had in search problems)
- POLICY EVALUATION computes policy value (expected utility) - which is just measuring the strength of that policy
- VALUE ITERATION computes optimal value (maximum expected utility) and, therefore, OPTIMAL POLICY
- Main techniques - write recurrences -> algorithm
- Next time: reinforcement learning - when we don’t know rewards and transition probabilities
- Markov Decision Processes (part 2)
- Opens with a view of the triangle of
- Model
- MDP
- Inference techniques
- Value iteration
- Policy evaluation
- Learning
- RL (reinforcement learning) – which we are covering for the first time in this lecture
- Model
- Review from MDP Part 1 - what is an MDP?
- A graph with states and actions.
- Actions take you to chance nodes with Transition probabilities and Rewards
- And Gamma (discount) - which captures how you think about the reward gained in the future
- Policy - for each state, dictates what action you take. The sequence of states, actions, and rewards generated by following a policy is called an EPISODE
- Vπ(s): Value of a policy - expected utility if you follow the policy (π) from state s.
- Qπ(s,a) - the Q-value from policy evaluation - expected utility if you first take action a from s and then follow policy π
- REINFORCEMENT LEARNING - if we no longer know the Transitions and Rewards, then we have to figure them out a different way – through reinforcement learning!
- (Notation - if they put a hat on the variable, that’s referring to the estimated not the actual.)
- Model-based Monte Carlo approach - using samples to estimate a model
- Try to model the MDP -
- Transitions: T-hat(s, a, s’) = (# of times the transition (s, a, s’) was observed) / (# of times action a was taken from s); Rewards: R-hat(s, a, s’) = whatever reward was observed for that transition, just written down (see the sketch below)
- Note: problems arise if you don’t explore all the states
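A hedged sketch of the model-based Monte Carlo estimates described above; the data format is my assumption (a flat list of observed (s, a, r, s’) transitions), and T-hat / R-hat are built by counting and averaging.

```python
from collections import defaultdict

def estimate_mdp(transitions):
    """Model-based Monte Carlo: estimate T-hat(s, a, s') and R-hat(s, a, s')
    from observed (s, a, r, s') transitions."""
    counts = defaultdict(int)         # how many times (s, a, s') was observed
    totals = defaultdict(int)         # how many times (s, a) was taken
    reward_sums = defaultdict(float)  # sum of rewards observed for (s, a, s')

    for s, a, r, sp in transitions:
        counts[(s, a, sp)] += 1
        totals[(s, a)] += 1
        reward_sums[(s, a, sp)] += r

    T_hat = {key: counts[key] / totals[(key[0], key[1])] for key in counts}
    R_hat = {key: reward_sums[key] / counts[key] for key in counts}
    return T_hat, R_hat
```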
- Model-FREE Monte Carlo -
- Try to estimate the Q-values directly from the data, without building a model (as we’ll see below, this estimates Qπ rather than Qopt)
- On vs Off Policy
- On-policy: estimate the value of the policy we are following (the policy that generated the data); off-policy: estimate the value of a different policy than the one that generated the data
- Model-free Monte Carlo is on-policy; model-based Monte Carlo is off-policy (once you have a model, you can evaluate any policy)
- Variations on Model-Free Monte Carlo
- Averages
- Combinations (Convex combinations)
- Stochastic gradient (implied objective –> least squares )
- Note the low scores for policy value for random approach
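A sketch of the convex-combination form of the model-free Monte Carlo update mentioned above. All names are assumptions; `u` is the utility observed from (s, a) onward in one episode, and the step size eta = 1/(# of updates) makes this equivalent to averaging (it is also a stochastic gradient step on the least-squares objective).

```python
from collections import defaultdict

Q_hat = defaultdict(float)      # Q-hat_pi(s, a), initialized to 0
num_updates = defaultdict(int)  # how many times (s, a) has been updated

def mc_update(s, a, u):
    """Model-free Monte Carlo: nudge Q-hat_pi(s, a) toward the observed
    utility u using a convex combination."""
    num_updates[(s, a)] += 1
    eta = 1.0 / num_updates[(s, a)]                    # step size
    Q_hat[(s, a)] = (1 - eta) * Q_hat[(s, a)] + eta * u
```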
- SARSA (“bootstrapping” - using r + Q-hatπ(s’, a’) as the update target instead of the raw data)
- Why SARSA? → the name comes from the tuple (s, a, r, s’, a’)
- SARSA uses the estimated Q-hatπ instead of the raw data (u). You don’t know Qπ, but you use your current estimate Q-hatπ inside the target, so you are pulling yourself up by your bootstraps instead of relying only on the data provided (see the sketch below).
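A sketch of the SARSA update with the same assumed bookkeeping as the Monte Carlo sketch; the bootstrapped target r + gamma * Q-hatπ(s’, a’) replaces the full observed utility u.

```python
from collections import defaultdict

Q_hat = defaultdict(float)      # Q-hat_pi(s, a)
num_updates = defaultdict(int)

def sarsa_update(s, a, r, sp, ap, gamma=1.0):
    """SARSA: update Q-hat_pi(s, a) toward the bootstrapped target
    r + gamma * Q-hat_pi(s', a'), where a' is the action actually taken next."""
    num_updates[(s, a)] += 1
    eta = 1.0 / num_updates[(s, a)]
    target = r + gamma * Q_hat[(sp, ap)]
    Q_hat[(s, a)] = (1 - eta) * Q_hat[(s, a)] + eta * target
```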
- Comparisons Table

| Monte Carlo | SARSA |
|---|---|
| u | r + Q̂π(s’, a’) |
| Based on one path | Based on estimate |
| Unbiased (+) | Biased (-) |
| Large variance (-) | Small variance (+) |
| Wait until end to update (--) | Can update immediately (++) |

- Which algorithms estimate Qopt?
- Only model-based Monte Carlo gets at Qopt (estimate the MDP, then solve it with value iteration); model-free Monte Carlo and SARSA estimate Q-hatπ
- Well, that’s a problem! Model-free Monte Carlo and SARSA only estimate Qπ, but we want Qopt so we can act optimally.
| Output | MDP | Reinforcement Learning |
|---|---|---|
| Qπ | Policy evaluation | Model-free Monte Carlo, SARSA |
| Qopt | Value iteration | Q-learning (**new**) |
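A sketch of the Q-learning update (names and the `actions` helper are assumptions); the only change from SARSA is taking the max over next actions, which is what makes it target Qopt and work off-policy.

```python
from collections import defaultdict

Q_opt = defaultdict(float)      # Q-hat_opt(s, a)
num_updates = defaultdict(int)

def q_learning_update(s, a, r, sp, actions, gamma=1.0):
    """Q-learning: update Q-hat_opt(s, a) toward the target
    r + gamma * max_{a'} Q-hat_opt(s', a'), regardless of which action
    is actually taken next (off-policy)."""
    num_updates[(s, a)] += 1
    eta = 1.0 / num_updates[(s, a)]
    v_opt_sp = max((Q_opt[(sp, ap)] for ap in actions(sp)), default=0.0)
    Q_opt[(s, a)] = (1 - eta) * Q_opt[(s, a)] + eta * (r + gamma * v_opt_sp)
```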
- Concepts of exploitation and exploration when choosing actions at a given state
- See video at this mark to see the difference between the two.
- Key Idea - balance
- Need to balance exploration and exploitation
- Metaphor for life? In real life, you start out not knowing what’s going on, but you want to get rewards… at the same time, you have to learn how the world works to improve your policy
- See here for that balance
- Utility of 32!!!!!!!!!!!
- However, there is still a problem with large state spaces – they are hard to explore
- Epsilon-greedy: balances exploration with exploitation (with probability ε act randomly, otherwise act greedily with respect to Q̂opt), but on its own it doesn’t work well with large state spaces (see the sketch below)
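A minimal sketch of epsilon-greedy action selection, assuming Q_opt is a dict keyed by (state, action) and `actions(s)` returns the available actions.

```python
import random

def epsilon_greedy(Q_opt, s, actions, epsilon=0.1):
    """With probability epsilon explore (pick a random action);
    otherwise exploit (pick the action with the highest current Q-hat_opt)."""
    if random.random() < epsilon:
        return random.choice(actions(s))                          # explore
    return max(actions(s), key=lambda a: Q_opt.get((s, a), 0.0))  # exploit
```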
- Function approximation (basically, machine learning - e.g., a neural network or a linear model over state-action features) - helps generalize to unseen states (see the sketch below)
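And a sketch of Q-learning with linear function approximation, one common way to realize this idea (the feature function phi(s, a) and all other names are assumptions); states that share features share weights, which is what lets the estimate generalize to unseen states.

```python
from collections import defaultdict

w = defaultdict(float)   # weights over features

def q_approx(phi, s, a):
    """Q-hat_opt(s, a; w) = w . phi(s, a), where phi returns a sparse dict
    mapping feature names to values."""
    return sum(w[f] * v for f, v in phi(s, a).items())

def q_learning_fa_update(phi, s, a, r, sp, actions, gamma=1.0, eta=0.05):
    """Semi-gradient step on the squared error between the prediction and the
    target r + gamma * max_{a'} Q-hat_opt(s', a'; w):
    w <- w - eta * (prediction - target) * phi(s, a)."""
    prediction = q_approx(phi, s, a)
    v_opt_sp = max((q_approx(phi, sp, ap) for ap in actions(sp)), default=0.0)
    target = r + gamma * v_opt_sp
    for f, v in phi(s, a).items():
        w[f] -= eta * (prediction - target) * v
```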
### Summary
- Online setting: learn and take actions in the real world; balance the exploration/exploitation tradeoff
- Monte Carlo: estimate transitions, rewards, Q-values from data
- Bootstrapping: update towards a target that depends on the current estimate rather than just the raw data.
#### Additional reading that I explored