Day 36 - Stanford AI/ML - Game Play (2 of 2)
December 29 - Stanford AI/ML - Game Play (2 of 2)
(Stanford AI/ML)
Game Playing 2 - TD Learning, Game Theory | Stanford CS221: Artificial Intelligence (Autumn 2019)
- Learning the evaluation function instead of "hand-crafting" it
- Eval(s) = V(s; w)
- Neural Network!
- V(s; w, v_1:k) = Σ_{j=1}^{k} w_j · σ(v_j · φ(s))
- The v_j's are the weight vectors of the k hidden units in a single hidden layer (k is the number of hidden units, not the number of layers); see the code sketch below
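A minimal numpy sketch of this one-hidden-layer evaluation function; the feature values and weights below are made up, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def eval_nn(phi_s, w, V):
    """One-hidden-layer evaluation: V(s; w, v_1:k) = sum_j w_j * sigmoid(v_j . phi(s))."""
    hidden = sigmoid(V @ phi_s)   # one activation per hidden-unit weight vector v_j
    return float(w @ hidden)      # weighted sum of hidden activations = Eval(s)

# toy usage with made-up numbers
phi_s = np.array([1.0, 0.0, 2.0])          # phi(s): feature vector of the state
V = np.array([[0.1, -0.2, 0.3],            # v_1 ... v_k stacked as rows (k = 2 here)
              [0.0,  0.5, -0.1]])
w = np.array([0.7, -0.4])                  # output weights w_1 ... w_k
print(eval_nn(phi_s, w, V))
```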
- Backgammon game example, then simplified. A feature template for the state could be (code sketch after this list)…
- Number of (O or X) in column [1…4]
- Number of (O or X) on bar
- Fraction of (O or X) removed
- It is (O or X’s) turn
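A sketch of how that feature template might look in code; the state fields (columns, bar, removed, total, turn) are hypothetical names, not from the lecture:

```python
def feature_vector(state):
    """Feature template for the simplified backgammon-style example.

    `state` is assumed to be a dict with hypothetical fields:
      'columns': list of 4 lists of piece labels ('O'/'X'),
      'bar': list of piece labels on the bar,
      'removed' / 'total': per-player piece counts,
      'turn': 'O' or 'X'.
    """
    phi = {}
    for p in ('O', 'X'):
        for col in range(1, 5):
            phi[f'num_{p}_col{col}'] = state['columns'][col - 1].count(p)
        phi[f'num_{p}_bar'] = state['bar'].count(p)
        phi[f'frac_{p}_removed'] = state['removed'][p] / state['total'][p]
        phi[f'is_{p}_turn'] = 1.0 if state['turn'] == p else 0.0
    return phi
```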
- Temporal Difference (TD) Learning Algo - Using Neural Network to find the utilities.
- Exploration vs getting stuck in a state? This game has dice, so we don't need to introduce epsilon-greedy here; the dice already add stochasticity to the mix
- Episodes → full paths through the game: s0, a1, r1, s1, a2, r2, s2, … (note that the reward in this game is zero until the very end)
- On each (s, a, r, s’)
- Objective function: ½ (prediction(w) - target)²
- Gradient: (prediction(w) - target) ∇_w prediction(w)
- Update: w ← w - η (prediction(w) - target) ∇_w prediction(w)
- Note: the update contains the gradient term above (it's just a gradient-descent step)
- Update: w ← w - η [V(s; w) - (r + γ V(s'; w))] ∇_w V(s; w)
- If the value function is linear, V(s; w) = w · φ(s), so the update becomes (code sketch below)
- w ← w - η [w · φ(s) - (r + γ · w · φ(s'))] φ(s)
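A minimal sketch of one TD update for the linear case, following the rule above; η, γ, and the feature vectors are placeholders:

```python
import numpy as np

def td_update(w, phi_s, phi_s_next, r, eta=0.01, gamma=1.0, terminal=False):
    """One TD update for a linear value function V(s; w) = w · phi(s):
    w ← w - eta * [V(s; w) - (r + gamma * V(s'; w))] * phi(s)
    At the end of an episode the target is just the final reward r.
    """
    prediction = w @ phi_s
    target = r if terminal else r + gamma * (w @ phi_s_next)
    return w - eta * (prediction - target) * phi_s

# toy usage with made-up numbers
w = np.zeros(3)
w = td_update(w, phi_s=np.array([1.0, 0.0, 2.0]),
              phi_s_next=np.array([0.0, 1.0, 1.0]), r=0.0)
```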
- Interesting audience question about features: in blackjack, what's the difference between a 10 and a J? "Picking features is an art"
- Comparison to Q-learning
- The Q-function operates over states and actions; TD learning operates on a value function of the state only
- Q-learning is off-policy: its value is based on an estimate of the optimal policy. TD learning is on-policy: its value is based on the exploration policy
- Q-learning doesn't require knowing the MDP transitions T(s, a, s'); TD learning requires knowing the rules of the game, i.e., the successor function Succ(s, a) (see the sketch below contrasting the two targets)
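A tiny sketch contrasting the two bootstrapped targets; the function names here are mine, not standard API:

```python
def q_learning_target(r, gamma, q_next_values):
    """Q-learning (off-policy): bootstrap with the max over next actions,
    estimating the optimal policy; no model T(s, a, s') is needed."""
    return r + gamma * max(q_next_values)

def td_learning_target(r, gamma, v_next):
    """TD learning (on-policy): bootstrap with V(s') of the state actually
    reached while following the exploration policy; acting needs Succ(s, a)."""
    return r + gamma * v_next
```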
- Summary so far…
- Turn-based games (you go, I go, etc.) that are zero-sum (one player's win is the other's loss)
- Parameterized eval functions using features
- TD Learning: learn an evaluation function
- (prediction(w) - target)^2
- Up-next
- Simultaneous games
- Non-zero-sum
- Simultaneous Games, Zero-Sum
- With pure strategies, going second is better (responding to a revealed strategy is an advantage)
- If player A plays a mixed strategy, player B's best response can always be a pure strategy
- (von Neumann's minimax theorem) For zero-sum games with mixed strategies, the order of play doesn't matter: you can reveal your optimal mixed strategy first and the value of the game stays the same!
- Showed the linear equations for player A maximizing their payoff while player B tries to minimize it, and vice versa; see where the lines cross for the max-min / min-max value (a small numeric sketch follows)
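A small numeric sketch of that "where the lines cross" picture, using a hypothetical 2x2 zero-sum payoff matrix (not the lecture's example):

```python
import numpy as np

# Hypothetical payoffs to player A (B's payoffs are the negation, zero-sum).
A = np.array([[ 3.0, -1.0],
              [-2.0,  1.0]])

# If A plays row 0 with probability p, the expected payoff against each of B's
# pure strategies is a line in p; B picks the lower line, A picks p to maximize
# that minimum, i.e. the crossing point of the two lines.
ps = np.linspace(0.0, 1.0, 10001)
lines = np.outer(ps, A[0]) + np.outer(1.0 - ps, A[1])   # shape (len(ps), 2)
worst_case = lines.min(axis=1)
i = worst_case.argmax()
print(f"A's optimal mixed strategy: p ≈ {ps[i]:.3f}, value of the game ≈ {worst_case[i]:.3f}")
```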
- Real life vs games: Prisoner's dilemma (both testify = 5 years in jail each; both refuse = 0 years each; if one testifies and the other refuses, the testifier gets 0 years and the refuser 10). Can't apply von Neumann's theorem here; it isn't a zero-sum game
- Nash (A Beautiful Mind!): a Nash equilibrium is a strategy pair (πA*, πB*) such that no player has an incentive to change their strategy (a small checker sketch follows)
- V_A(πA*, πB*) >= V_A(πA, πB*) for all πA
- V_B(πA*, πB*) >= V_B(πA*, πB) for all πB
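A small sketch that checks the two inequalities above for pure strategies, using the prisoner's-dilemma payoffs from the bullet above written as utilities (negative years in jail):

```python
import itertools

def is_pure_nash(U_A, U_B, a_star, b_star):
    """Check V_A(a*, b*) >= V_A(a, b*) for all a and
    V_B(a*, b*) >= V_B(a*, b) for all b (higher utility = better)."""
    a_ok = all(U_A[a_star][b_star] >= U_A[a][b_star] for a in range(len(U_A)))
    b_ok = all(U_B[a_star][b_star] >= U_B[a_star][b] for b in range(len(U_B[0])))
    return a_ok and b_ok

# Actions: 0 = testify, 1 = refuse; utilities = -(years in jail) from the note above.
U_A = [[-5,   0],
       [-10,  0]]
U_B = [[-5, -10],
       [ 0,   0]]
for a, b in itertools.product(range(2), range(2)):
    if is_pure_nash(U_A, U_B, a, b):
        print("equilibrium:", ("testify", "refuse")[a], "/", ("testify", "refuse")[b])
```

With these particular numbers both (testify, testify) and (refuse, refuse) pass the weak inequalities; the usual textbook version gives mutual refusal a small sentence (e.g. 1 year each) so that mutual testimony is the unique equilibrium.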