Day 20 - Stanford AI - Markov Decision Processes - Policy Evaluation, Optimal Value/Policy, Reinforcement Learning
Dec 5 2024
(Stanford.ai)
- Markov Decision Processes (part 1)
- Hardwick note: MDP is a Search Algorithm, but it takes into account different probabilities of what happens at each state, whereas the previous search problems were deterministic.
- Example - walk the island and avoid the volcano… what if the slip probability increases?
- Policy Evaluation - given a policy, evaluate how good that policy is on the MDP
- Value Iteration - what is the best policy to take?
- MDP - a graph of States and Actions, with two kinds of nodes:
- States
- Chance Nodes
- MDP: instead of minimizing cost, as in Search (A*, Uniform Cost Search, etc.), we want to maximize the total reward (it is a Reward now, not a cost)
- States Set = the set of all states
- s_start - the starting state
- IsEnd(s) - is s an end state (are we at the end of the game)?
- Actions(s): the set of possible actions available from state s
- Transition probability T(s, a, s’) - the probability of ending up in state s’ if you take action a in state s
- Like Succ(s, a) in Search problems - but transitions are not deterministic here
- For a fixed s and a, the probabilities over all successor states s’ sum to 1 (and each is non-negative, because they are probabilities)
- Reward(s, a, s’) - reward for the Transition (s, a, s’)
- Like Cost(s, a) in Search problems - but things are not deterministic here
- 0 <= gamma (discount) <= 1 (default: 1)
- ~0 - the agent prioritizes immediate rewards
- ~1 - the agent puts greater value on future rewards
- Note: if the graph is cyclic, then gamma needs to be less than one (if acyclic, gamma can be 1)
- Let’s repeat the Tram example, but now with the tram failing with probability 0.5
- See code example - ai-ml/tramMDP.py (a rough sketch of what this could look like is below)
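Since the notes reference that file, here is a minimal sketch of what a tram MDP like this might look like. This is my reconstruction, not the actual contents of ai-ml/tramMDP.py; the walk/tram actions and the 0.5 failure probability follow the lecture example, while the interface names and the reward values are assumptions.

```python
class TramMDP:
    """Blocks are numbered 1..N. 'walk' moves from s to s+1; 'tram' tries to
    jump from s to 2*s but fails (stays at s) with probability 0.5."""

    def __init__(self, N):
        self.N = N

    def startState(self):
        return 1

    def isEnd(self, state):
        return state == self.N

    def actions(self, state):
        acts = []
        if state + 1 <= self.N:
            acts.append('walk')
        if 2 * state <= self.N:
            acts.append('tram')
        return acts

    def succProbReward(self, state, action):
        # Returns a list of (newState, probability, reward) triples.
        # Reward values are illustrative (negative, i.e. "minus the cost").
        results = []
        if action == 'walk':
            results.append((state + 1, 1.0, -1))
        elif action == 'tram':
            results.append((2 * state, 0.5, -2))  # tram succeeds
            results.append((state, 0.5, -2))      # tram fails, stay put
        return results

    def discount(self):
        return 1.0

    def states(self):
        return range(1, self.N + 1)
```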
- What’s the solution? It’s not deterministic like Search problems…so we add a “policy”
- POLICY - a policy (π) is a mapping from each state s (in the set of States) to an action a (in the set of Actions(from s)). But how good is the policy? We need to evaluate it.
- π(S) : Policy
- π: S → A(s)
- Policy evaluation - I am not trying to figure out the policy, I’m just using the policy that you gave me and I’m evaluating how good it is.
- UTILITY - following a policy yields a random path. The utility of a policy is the (discounted) sum of the rewards on the path (this is the random quantity).
- Utility = discounted sum of the rewards ⇒ r1 + gamma*r2 + gamma^2*r3 + … (gamma = discount)
- Value is the expected Utility = I sum them up and average them, and that gives me value. You give me a bunch of random paths, I compute their utilities, and the value is their expected utility, the average.
- Example (paths and their utilities, with gamma = 1 and a reward of 4 for each “stay”):
| Path (random) | Utility |
|---|---|
| [in; stay, 4, end] | 4 |
| [in; stay, 4, in; stay, 4, in; stay, 4, end] | 12 |
| [in; stay, 4, in; stay, 4, end] | 8 |
| [in; stay, 4, in; stay, 4, in; stay, 4, in; stay, 4, end] | 16 |
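As a small sanity check on the utility definition above, here is a hedged snippet that computes the discounted sum of rewards along a path; it assumes gamma = 1 and a reward of 4 per “stay”, matching the table.

```python
def utility(rewards, gamma=1.0):
    """Discounted sum of rewards: r1 + gamma*r2 + gamma^2*r3 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example paths from the table above (each "stay" yields reward 4, gamma = 1):
print(utility([4]))           # 4
print(utility([4, 4, 4]))     # 12
print(utility([4, 4]))        # 8
print(utility([4, 4, 4, 4]))  # 16
```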
- Policy VALUE and EXPECTED UTILITY - the value of a policy is the expected utility
- DISCOUNTING - gamma = 1 means future rewards count fully (“save for the future”); gamma = 0 means only immediate rewards count (“live in the moment”)
- In reality, we are somewhere in between…typically we use 1 or 0.9, depending on the real life situation of the problem domain
- Note: gamma is not a hyperparameter, it is a design choice.
- POLICY EVALUATION - value of a policy.
- Let Vπ(s) be the expected utility received by following policy π from state s.
- Given policy π, from state s I already know that the action I’m going to take is π(s)
- (state s) → take action π(s) → [chance node (s, π(s))]
- But from this chance node, there are still different probabilities T(s, π(s), s’) of which new state s’ you end up in, and each s’ has its own expected utility Vπ(s’)
- Vπ(s) is the expected utility starting from the actual state s
- Qπ(s,a) is the expected utility starting from the chance node (s, a)
- See [Markov Decision Processes 1 - Value Iteration Stanford CS221: AI (Autumn 2019) (time = 47:11)](https://youtu.be/9g32v7bK3Co?si=KfnQRu5hP4Uz5-XD&t=2831) - basis for the next few lectures
- Vπ(s) = 0 if IsEnd(s); otherwise Vπ(s) = Qπ(s, π(s)) – but what is Qπ(s,a)?
- IMPORTANT RECURRENCE
- Qπ(s,a) = Σ over s’ of T(s, a, s’) * [ immediate reward R(s, a, s’) + gamma * Vπ(s’) ], which means that
- Vπ(s) = Σ over s’ of T(s, π(s), s’) * [ immediate reward R(s, π(s), s’) + gamma * Vπ(s’) ] …since I’m not at an end state.
- But this equation has Vπ on both sides, so we can compute Vπ(s) either iteratively or with a closed-form solution; either way, this is what we need in order to EVALUATE the VALUE of policy π.
- In practice, the problems can get complex, so you use the iterative approach, repeating until the difference between successive estimates of Vπ(s) drops below some small tolerance you have set (see the sketch below).
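A rough sketch of iterative policy evaluation under these assumptions: the MDP exposes states(), isEnd(), succProbReward(), and discount() as in the tram sketch above, `policy` is a dict mapping each state to an action, and `epsilon` is the stopping tolerance.

```python
def policy_evaluation(mdp, policy, epsilon=1e-6, max_iters=1000):
    """Iteratively compute V_pi(s) for every state using the recurrence
    V_pi(s) = sum_{s'} T(s, pi(s), s') * [R(s, pi(s), s') + gamma * V_pi(s')]."""
    V = {s: 0.0 for s in mdp.states()}   # initialize V_pi(s) = 0
    gamma = mdp.discount()
    for _ in range(max_iters):
        newV = {}
        for s in mdp.states():
            if mdp.isEnd(s):
                newV[s] = 0.0
            else:
                a = policy[s]
                newV[s] = sum(prob * (reward + gamma * V[sp])
                              for sp, prob, reward in mdp.succProbReward(s, a))
        # stop when the largest change between iterations is below epsilon
        if max(abs(newV[s] - V[s]) for s in mdp.states()) < epsilon:
            V = newV
            break
        V = newV
    return V
```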
- Summary of what we covered so far
- MDP - Graphs with states, chance nodes, transition probabilities, rewards
- Policy - mapping from state to action (solution to MDP)
- Value of Policy - expected utility over random paths
- Policy Evaluation - iterative algorithm to compute value of policy
- Complexity - S = states, A = actions per state, S’ = successors, t_PE = policy evaluation iterations
- Time: O( t_PE * S * S’ )
- Doesn’t depend on the number of actions A because you’ve given me the policy
- OPTIMAL VALUE AND POLICY
- Goal: try to get directly at maximum expected utility
- Definition: Optimal value Vopt(s) is the maximum value attained by any policy
- Qopt(s, a) = Σ over s’ of T(s, a, s’) * [ immediate reward R(s, a, s’) + gamma * Vopt(s’) ]
- Value iteration
- πopt(s) = argmax over a in Actions(s) of Qopt(s, a)
- Similar to the iterative algorithm for policy evaluation, except that on each iteration value iteration sets Vopt(s) = max over a in Actions(s) of Qopt(s, a) instead of following a fixed policy (see the sketch below, after the complexity notes)
- Complexity - S = states, A = actions per state, S’ = successors, t_VI = value iteration iterations
- Time: O( t_VI * S * A * S’ )
- Does depend on the number of actions A because you’re trying to find the best policy
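A corresponding sketch of value iteration under the same assumed MDP interface as the policy evaluation sketch; the only change is the max over actions, and the optimal policy is then read off via argmax.

```python
def value_iteration(mdp, epsilon=1e-6, max_iters=1000):
    """Compute V_opt(s) and the optimal policy pi_opt(s) = argmax_a Q_opt(s, a)."""
    V = {s: 0.0 for s in mdp.states()}
    gamma = mdp.discount()

    def Q(s, a):
        return sum(prob * (reward + gamma * V[sp])
                   for sp, prob, reward in mdp.succProbReward(s, a))

    for _ in range(max_iters):
        newV = {}
        for s in mdp.states():
            if mdp.isEnd(s):
                newV[s] = 0.0
            else:
                newV[s] = max(Q(s, a) for a in mdp.actions(s))
        if max(abs(newV[s] - V[s]) for s in mdp.states()) < epsilon:
            V = newV
            break
        V = newV

    # Read off the optimal policy from the converged values.
    policy = {s: (None if mdp.isEnd(s) else
                  max(mdp.actions(s), key=lambda a: Q(s, a)))
              for s in mdp.states()}
    return V, policy
```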
- CONVERGENCE for MDP
- An algorithm is said to converge when, as the number of iterations increases, its output (e.g., parameters, solutions, predictions) gets closer to a specific desired result or goal.
- Theorem: convergence
- Suppose either
- Discount gamma < 1 or
- MDP graph is acyclic (has no cycles) – then we are effectively doing dynamic programming over the whole graph, including each action’s transition probabilities at each state.
- NON-CONVERGENCE: if the graph has cycles, then you need the discount to be less than one or you will not converge. With gamma (discount) = 1, you can keep cycling between states gathering zero reward, and the values never settle.
- Summary of what we covered so far
- MDP - Markov Decision Processes (MDPs) cope with uncertainty
- Solutions are POLICIES rather than PATHS (like we had in search problems)
- POLICY EVALUATION computes policy value (expected utility) - which is just measuring the strength of that policy
- VALUE ITERATION computes optimal value (maximum expected utility) and, therefore, OPTIMAL POLICY
- Main techniques - write recurrences -> algorithm
- Next time: reinforcement learning - when we don’t know rewards and transition probabilities
- Markov Decision Processes (part 2)
- Opens with a view of the triangle of
- Model
- MDP
- Inference techniques
- Value iteration
- Policy evaluation
- Learning
- RL (reinforcement learning) – which we are covering for the first time in this lecture
- Model
- Review from MDP Part 1 - what is an MDP?
- A graph with states and actions.
- Actions take you to chance nodes with Transition probabilities and Rewards
- And Gamma (discount) - which captures how you think about the reward gained in the future
- Policy - for each state, dictates what action you take. The sequence of states, actions, and rewards generated by following a policy is called an EPISODE
- Vπ(s): Value of a policy - expected utility if you follow the policy (π) from state s.
- Qπ(s,a) - the Q-value from policy evaluation - expected utility if you first take action a from s and then follow policy π
- REINFORCEMENT LEARNING - if we no longer know the Transitions and Rewards, then we have to figure them out a different way – through reinforcement learning!
- (Notation - if they put a hat on the variable, that’s referring to the estimated not the actual.)
- Model-based Monte Carlo approach - using samples to estimate a model
- Try to model the MDP -
- Transitions: T-hat(s, a, s’) = (# of times the transition (s, a, s’) was observed) / (# of times action a was taken from s); Rewards: R-hat(s, a, s’) = whatever reward was observed for that transition, just written down (see the sketch below)
- Note: problems arise if you don’t explore all the states
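A hedged sketch of the model-based Monte Carlo estimates described above; the data format is my assumption (a flat list of observed (s, a, r, s’) transitions), and T-hat / R-hat are built by counting and averaging.

```python
from collections import defaultdict

def estimate_mdp(transitions):
    """Model-based Monte Carlo: estimate T-hat(s, a, s') and R-hat(s, a, s')
    from observed (s, a, r, s') transitions."""
    counts = defaultdict(int)         # how many times (s, a, s') was observed
    totals = defaultdict(int)         # how many times (s, a) was taken
    reward_sums = defaultdict(float)  # sum of rewards observed for (s, a, s')

    for s, a, r, sp in transitions:
        counts[(s, a, sp)] += 1
        totals[(s, a)] += 1
        reward_sums[(s, a, sp)] += r

    T_hat = {key: counts[key] / totals[(key[0], key[1])] for key in counts}
    R_hat = {key: reward_sums[key] / counts[key] for key in counts}
    return T_hat, R_hat
```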
- Model-FREE Monte Carlo -
- Try to estimate the Q-values directly from the data, without building a model (as we’ll see below, this estimates Qπ rather than Qopt)
- On vs Off Policy
- On-policy: estimate the value of the policy we are following (the policy that generated the data); off-policy: estimate the value of a different policy than the one that generated the data
- Model-free Monte Carlo is on-policy; model-based Monte Carlo is off-policy (once you have a model, you can evaluate any policy)
- Variations on Model-Free Monte Carlo
- Averages
- Combinations (Convex combinations)
- Stochastic gradient (implied objective –> least squares )
- Note the low scores for policy value for random approach
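A sketch of the convex-combination form of the model-free Monte Carlo update mentioned above. All names are assumptions; `u` is the utility observed from (s, a) onward in one episode, and the step size eta = 1/(# of updates) makes this equivalent to averaging (it is also a stochastic gradient step on the least-squares objective).

```python
from collections import defaultdict

Q_hat = defaultdict(float)      # Q-hat_pi(s, a), initialized to 0
num_updates = defaultdict(int)  # how many times (s, a) has been updated

def mc_update(s, a, u):
    """Model-free Monte Carlo: nudge Q-hat_pi(s, a) toward the observed
    utility u using a convex combination."""
    num_updates[(s, a)] += 1
    eta = 1.0 / num_updates[(s, a)]                    # step size
    Q_hat[(s, a)] = (1 - eta) * Q_hat[(s, a)] + eta * u
```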
- SARSA (“bootstrapping” - using r + Q-hatπ(s’, a’) as the update target instead of the raw data)
- Why SARSA? → the name comes from the tuple (s, a, r, s’, a’)
- SARSA uses the estimated Q-hatπ instead of the raw data (u). You don’t know Qπ, but you use your current estimate Q-hatπ inside the target, so you are pulling yourself up by your bootstraps instead of relying only on the data provided (see the sketch below).
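A sketch of the SARSA update with the same assumed bookkeeping as the Monte Carlo sketch; the bootstrapped target r + gamma * Q-hatπ(s’, a’) replaces the full observed utility u.

```python
from collections import defaultdict

Q_hat = defaultdict(float)      # Q-hat_pi(s, a)
num_updates = defaultdict(int)

def sarsa_update(s, a, r, sp, ap, gamma=1.0):
    """SARSA: update Q-hat_pi(s, a) toward the bootstrapped target
    r + gamma * Q-hat_pi(s', a'), where a' is the action actually taken next."""
    num_updates[(s, a)] += 1
    eta = 1.0 / num_updates[(s, a)]
    target = r + gamma * Q_hat[(sp, ap)]
    Q_hat[(s, a)] = (1 - eta) * Q_hat[(s, a)] + eta * target
```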
- Comparisons Table

| Monte Carlo | SARSA |
|---|---|
| u | r + Q̂π(s’, a’) |
| Based on one path | Based on estimate |
| Unbiased (+) | Biased (-) |
| Large variance (-) | Small variance (+) |
| Wait until end to update (--) | Can update immediately (++) |

- Which algorithms estimate Qopt?
- Only model-based Monte Carlo gets at Qopt (estimate the MDP, then solve it with value iteration); model-free Monte Carlo and SARSA estimate Q-hatπ
- Well, that’s a problem! Model-free Monte Carlo and SARSA only estimate Qπ, but we want Qopt so we can act optimally.
| Output | MDP | Reinforcement Learning |
|---|---|---|
| Qπ | Policy evaluation | Model-free Monte Carlo, SARSA |
| Qopt | Value iteration | Q-learning (**new**) |
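A sketch of the Q-learning update (names and the `actions` helper are assumptions); the only change from SARSA is taking the max over next actions, which is what makes it target Qopt and work off-policy.

```python
from collections import defaultdict

Q_opt = defaultdict(float)      # Q-hat_opt(s, a)
num_updates = defaultdict(int)

def q_learning_update(s, a, r, sp, actions, gamma=1.0):
    """Q-learning: update Q-hat_opt(s, a) toward the target
    r + gamma * max_{a'} Q-hat_opt(s', a'), regardless of which action
    is actually taken next (off-policy)."""
    num_updates[(s, a)] += 1
    eta = 1.0 / num_updates[(s, a)]
    v_opt_sp = max((Q_opt[(sp, ap)] for ap in actions(sp)), default=0.0)
    Q_opt[(s, a)] = (1 - eta) * Q_opt[(s, a)] + eta * (r + gamma * v_opt_sp)
```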
- Concepts of exploitation and exploration when choosing actions at a given state
- See video at this mark to see the difference between the two.
- Key Idea - balance
- Need to balance exploration and exploitation
- Metaphor for life? In real life, you start out not knowing what’s going on, but you want to get rewards… at the same time, you have to learn how the world works to improve your policy
- See here for that balance
- Utility of 32!!!!!!!!!!!
- However, there is still a problem with large state spaces – they are hard to explore
- Epsilon-greedy: balances exploration with exploitation (with probability ε act randomly, otherwise act greedily with respect to Q̂opt), but on its own it doesn’t work well with large state spaces (see the sketch below)
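A minimal sketch of epsilon-greedy action selection, assuming Q_opt is a dict keyed by (state, action) and `actions(s)` returns the available actions.

```python
import random

def epsilon_greedy(Q_opt, s, actions, epsilon=0.1):
    """With probability epsilon explore (pick a random action);
    otherwise exploit (pick the action with the highest current Q-hat_opt)."""
    if random.random() < epsilon:
        return random.choice(actions(s))                          # explore
    return max(actions(s), key=lambda a: Q_opt.get((s, a), 0.0))  # exploit
```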
- Function approximation (basically, machine learning - e.g., a neural network or a linear model over state-action features) - helps generalize to unseen states (see the sketch below)
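And a sketch of Q-learning with linear function approximation, one common way to realize this idea (the feature function phi(s, a) and all other names are assumptions); states that share features share weights, which is what lets the estimate generalize to unseen states.

```python
from collections import defaultdict

w = defaultdict(float)   # weights over features

def q_approx(phi, s, a):
    """Q-hat_opt(s, a; w) = w . phi(s, a), where phi returns a sparse dict
    mapping feature names to values."""
    return sum(w[f] * v for f, v in phi(s, a).items())

def q_learning_fa_update(phi, s, a, r, sp, actions, gamma=1.0, eta=0.05):
    """Semi-gradient step on the squared error between the prediction and the
    target r + gamma * max_{a'} Q-hat_opt(s', a'; w):
    w <- w - eta * (prediction - target) * phi(s, a)."""
    prediction = q_approx(phi, s, a)
    v_opt_sp = max((q_approx(phi, sp, ap) for ap in actions(sp)), default=0.0)
    target = r + gamma * v_opt_sp
    for f, v in phi(s, a).items():
        w[f] -= eta * (prediction - target) * v
```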
### Summary
- Online setting: learn and take actions in the real world; balance the exploration/exploitation tradeoff
- Monte Carlo: estimate transitions, rewards, Q-values from data
- Bootstrapping: update towards a target that depends on the current estimate rather than just the raw data.
#### Additional reading that I explored