Day 36 - Stanford AI/ML - Game Play (2 of 2)
December 29 - Stanford AI/ML - Game Play (2 of 2)
(Stanford AI/ML)
Game Playing 2 - TD Learning, Game Theory | Stanford CS221: Artificial Intelligence (Autumn 2019)
- Learning the evaluation function instead of "hand-crafting" it
- Eval(s) = V(s; w)
- Neural Network!
- V(s; w, v_1:k) = Σ_{j=1}^{k} w_j · σ(v_j · φ(s))
- The v_j's are the weight vectors of the k hidden units in a single hidden layer (k is the number of hidden units, not the number of layers); see the code sketch below
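A minimal numpy sketch of this one-hidden-layer evaluation function; the feature values and weights below are made up, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def eval_nn(phi_s, w, V):
    """One-hidden-layer evaluation: V(s; w, v_1:k) = sum_j w_j * sigmoid(v_j . phi(s))."""
    hidden = sigmoid(V @ phi_s)   # one activation per hidden-unit weight vector v_j
    return float(w @ hidden)      # weighted sum of hidden activations = Eval(s)

# toy usage with made-up numbers
phi_s = np.array([1.0, 0.0, 2.0])          # phi(s): feature vector of the state
V = np.array([[0.1, -0.2, 0.3],            # v_1 ... v_k stacked as rows (k = 2 here)
              [0.0,  0.5, -0.1]])
w = np.array([0.7, -0.4])                  # output weights w_1 ... w_k
print(eval_nn(phi_s, w, V))
```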
- Backgammon game example, then simplified. A feature template for the state could be (code sketch after this list)…
- Number of (O or X) in column [1…4]
- Number of (O or X) on bar
- Fraction of (O or X) removed
- It is (O or X’s) turn
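A sketch of how that feature template might look in code; the state fields (columns, bar, removed, total, turn) are hypothetical names, not from the lecture:

```python
def feature_vector(state):
    """Feature template for the simplified backgammon-style example.

    `state` is assumed to be a dict with hypothetical fields:
      'columns': list of 4 lists of piece labels ('O'/'X'),
      'bar': list of piece labels on the bar,
      'removed' / 'total': per-player piece counts,
      'turn': 'O' or 'X'.
    """
    phi = {}
    for p in ('O', 'X'):
        for col in range(1, 5):
            phi[f'num_{p}_col{col}'] = state['columns'][col - 1].count(p)
        phi[f'num_{p}_bar'] = state['bar'].count(p)
        phi[f'frac_{p}_removed'] = state['removed'][p] / state['total'][p]
        phi[f'is_{p}_turn'] = 1.0 if state['turn'] == p else 0.0
    return phi
```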
- Temporal Difference (TD) Learning Algo - Using Neural Network to find the utilities.
- Exploration vs getting stuck in a state? This game has dice, so we don't need to introduce epsilon-greedy here; the dice already add stochasticity to the mix
- Episodes → full paths through the game: s0, a1, r1, s1, a2, r2, s2, … (note that the reward in this game is zero until the very end)
- On each (s, a, r, s’)
- Objective function: ½ (prediction(w) - target)²
- Gradient: (prediction(w) - target) ∇_w prediction(w)
- Update: w ← w - η (prediction(w) - target) ∇_w prediction(w)
- Note: the update contains the gradient term above (it's just a gradient-descent step)
- Update: w ← w - η [V(s; w) - (r + γ V(s'; w))] ∇_w V(s; w)
- If the value function is linear, V(s; w) = w · φ(s), so the update becomes (code sketch below)
- w ← w - η [w · φ(s) - (r + γ · w · φ(s'))] φ(s)
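A minimal sketch of one TD update for the linear case, following the rule above; η, γ, and the feature vectors are placeholders:

```python
import numpy as np

def td_update(w, phi_s, phi_s_next, r, eta=0.01, gamma=1.0, terminal=False):
    """One TD update for a linear value function V(s; w) = w · phi(s):
    w ← w - eta * [V(s; w) - (r + gamma * V(s'; w))] * phi(s)
    At the end of an episode the target is just the final reward r.
    """
    prediction = w @ phi_s
    target = r if terminal else r + gamma * (w @ phi_s_next)
    return w - eta * (prediction - target) * phi_s

# toy usage with made-up numbers
w = np.zeros(3)
w = td_update(w, phi_s=np.array([1.0, 0.0, 2.0]),
              phi_s_next=np.array([0.0, 1.0, 1.0]), r=0.0)
```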
- Interesting audience question about features: in blackjack, what's the difference between a 10 and a J? "Picking features is an art"
- Comparison to Q-learning
- The Q-function operates over states and actions; TD learning operates on a value function of the state only
- Q-learning is off-policy: its value is based on an estimate of the optimal policy. TD learning is on-policy: its value is based on the exploration policy
- Q-learning doesn't require knowing the MDP transitions T(s, a, s'); TD learning requires knowing the rules of the game, i.e., the successor function Succ(s, a) (see the sketch below contrasting the two targets)
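A tiny sketch contrasting the two bootstrapped targets; the function names here are mine, not standard API:

```python
def q_learning_target(r, gamma, q_next_values):
    """Q-learning (off-policy): bootstrap with the max over next actions,
    estimating the optimal policy; no model T(s, a, s') is needed."""
    return r + gamma * max(q_next_values)

def td_learning_target(r, gamma, v_next):
    """TD learning (on-policy): bootstrap with V(s') of the state actually
    reached while following the exploration policy; acting needs Succ(s, a)."""
    return r + gamma * v_next
```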
- Summary so far…
- Turn-based games (you go, I go, etc.) that are zero-sum (one player's win is the other's loss)
- Parameterized eval functions using features
- TD Learning: learn an evaluation function
- (prediction(w) - target)^2
- Up-next
- Simultaneous games
- Non-zero-sum
- Simultaneous Games, Zero-Sum
- With pure strategies, going second is better (responding to a revealed strategy is an advantage)
- If player A plays a mixed strategy, player B's best response can always be a pure strategy
- (von Neumann's minimax theorem) For zero-sum games with mixed strategies, the order of play doesn't matter: you can reveal your optimal mixed strategy first and the value of the game stays the same!
- Showed the linear equations for player A maximizing their payoff while player B tries to minimize it, and vice versa; see where the lines cross for the max-min / min-max value (a small numeric sketch follows)
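A small numeric sketch of that "where the lines cross" picture, using a hypothetical 2x2 zero-sum payoff matrix (not the lecture's example):

```python
import numpy as np

# Hypothetical payoffs to player A (B's payoffs are the negation, zero-sum).
A = np.array([[ 3.0, -1.0],
              [-2.0,  1.0]])

# If A plays row 0 with probability p, the expected payoff against each of B's
# pure strategies is a line in p; B picks the lower line, A picks p to maximize
# that minimum, i.e. the crossing point of the two lines.
ps = np.linspace(0.0, 1.0, 10001)
lines = np.outer(ps, A[0]) + np.outer(1.0 - ps, A[1])   # shape (len(ps), 2)
worst_case = lines.min(axis=1)
i = worst_case.argmax()
print(f"A's optimal mixed strategy: p ≈ {ps[i]:.3f}, value of the game ≈ {worst_case[i]:.3f}")
```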
- Real life vs games: Prisoner's dilemma (both testify = 5 years in jail each; both refuse = 0 years each; if one testifies and the other refuses, the testifier gets 0 years and the refuser 10). Can't apply von Neumann's theorem here; it isn't a zero-sum game
- Nash (A Beautiful Mind!): a Nash equilibrium is a strategy pair (πA*, πB*) such that no player has an incentive to change their strategy (a small checker sketch follows)
- V_A(πA*, πB*) >= V_A(πA, πB*) for all πA
- V_B(πA*, πB*) >= V_B(πA*, πB) for all πB
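A small sketch that checks the two inequalities above for pure strategies, using the prisoner's-dilemma payoffs from the bullet above written as utilities (negative years in jail):

```python
import itertools

def is_pure_nash(U_A, U_B, a_star, b_star):
    """Check V_A(a*, b*) >= V_A(a, b*) for all a and
    V_B(a*, b*) >= V_B(a*, b) for all b (higher utility = better)."""
    a_ok = all(U_A[a_star][b_star] >= U_A[a][b_star] for a in range(len(U_A)))
    b_ok = all(U_B[a_star][b_star] >= U_B[a_star][b] for b in range(len(U_B[0])))
    return a_ok and b_ok

# Actions: 0 = testify, 1 = refuse; utilities = -(years in jail) from the note above.
U_A = [[-5,   0],
       [-10,  0]]
U_B = [[-5, -10],
       [ 0,   0]]
for a, b in itertools.product(range(2), range(2)):
    if is_pure_nash(U_A, U_B, a, b):
        print("equilibrium:", ("testify", "refuse")[a], "/", ("testify", "refuse")[b])
```

With these particular numbers both (testify, testify) and (refuse, refuse) pass the weak inequalities; the usual textbook version gives mutual refusal a small sentence (e.g. 1 year each) so that mutual testimony is the unique equilibrium.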