December 29 - Stanford AI/ML - Game Play (2 of 2)

(Stanford AI/ML)

Game Playing 2 - TD Learning, Game Theory | Stanford CS221: Artificial Intelligence (Autumn 2019)

  • Learning the evaluation instead of “hand crafting”
    • Eval(State) = V(s;w)
    • Neural Network!
      • V(s; w, v1:k) = sum from j=1 to k of ( wj * sigmoid( vj · phi(s) ) )
      • The vj’s are the weight vectors of the k hidden units (a single hidden layer), not the number of layers
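A minimal sketch of that one-hidden-layer evaluation function; the feature extractor phi and the array shapes are my own assumptions for illustration:

```python
import numpy as np

def eval_nn(s, w, v, phi):
    """V(s; w, v_1:k) = sum_j w_j * sigmoid(v_j . phi(s)).

    w: (k,) output weights; v: (k, d) hidden-unit weight vectors;
    phi: assumed feature extractor mapping a state s to a (d,) vector.
    """
    h = 1.0 / (1.0 + np.exp(-(v @ phi(s))))  # k hidden-unit activations
    return float(w @ h)                      # weighted sum = evaluation of s
```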
  • Backgammon game example, then simplified. Feature template for state would be…
    • Number of (O or X) in column [1…4]
    • Number of (O or X) on bar
    • Fraction of (O or X) removed
    • It is (O or X’s) turn
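A rough sketch of that feature template for the simplified backgammon state; the state layout (a dict of column counts, bar counts, etc.) is entirely an assumption here:

```python
def phi(state):
    """Feature vector from the template above.

    Assumed state layout:
      state = {'columns': {'O': [c1, c2, c3, c4], 'X': [...]},
               'bar': {'O': n, 'X': n},
               'removed_frac': {'O': f, 'X': f},
               'turn': 'O' or 'X'}
    """
    features = []
    for player in ('O', 'X'):
        features.extend(state['columns'][player])       # pieces in columns 1..4
        features.append(state['bar'][player])           # pieces on the bar
        features.append(state['removed_frac'][player])  # fraction removed
    features.append(1.0 if state['turn'] == 'O' else 0.0)  # whose turn it is
    return features
```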
  • Temporal Difference (TD) Learning algorithm - using the neural network to learn the utilities (the evaluation function)
    • Exploration vs. getting stuck in a state? This game has dice, so we don’t need to introduce epsilon-greedy here; the dice already add stochasticity to the mix
    • Episodes → paths through the game: s0; a1, r1, s1; a2, r2, s2; etc. (note that the reward in this game is zero until the very end)
    • On each (s, a, r, s’)
      • Objective function: ½ (prediction(w) - target)^2
      • Gradient: (prediction(w) - target) * grad_w prediction(w)
      • Update: w ← w - eta * (prediction(w) - target) * grad_w prediction(w)
      • Note: the update is just a gradient step on the objective
      • For TD, prediction = V(s; w) and target = r + gamma * V(s’; w), so: w ← w - eta * [V(s; w) - (r + gamma * V(s’; w))] * grad_w V(s; w)
      • If V is linear, then V(s; w) = w · phi(s) and grad_w V(s; w) = phi(s), therefore
      • w ← w - eta * [w · phi(s) - (r + gamma * w · phi(s’))] * phi(s)  (see the sketch after these bullets)
      • Interesting question about features… e.g. in blackjack, what’s the difference between a 10 and a J? “Picking features is an art”
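A sketch of one TD update step for the linear case V(s; w) = w · phi(s); the learning rate, discount, and terminal handling are assumptions, and the feature vectors are numpy arrays:

```python
import numpy as np

def td_update(w, phi_s, r, phi_s_next, eta=0.01, gamma=1.0, terminal=False):
    """One TD step on (s, a, r, s') with linear V(s; w) = w . phi(s)."""
    prediction = w @ phi_s
    # Target bootstraps from the value of the next state (0 at the end of the game).
    target = r + (0.0 if terminal else gamma * (w @ phi_s_next))
    # w <- w - eta * (prediction - target) * grad_w V(s; w), where grad = phi(s).
    return w - eta * (prediction - target) * phi_s
```

Applied to every (s, a, r, s’) in an episode, with the reward zero everywhere except the final transition.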
    • Comparison to Q-learning
      • The Q function operates over states and actions; TD learning operates on a value function of just the state
      • Q-learning is off-policy: its value is based on an estimate of the optimal policy; TD learning is on-policy: its value is based on the exploration policy
      • Q-learning doesn’t require you to know the MDP transitions T(s, a, s’); TD learning requires you to know the rules of the game, i.e. the successor function Succ(s, a) of a state and action (see the sketch below)
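To make that contrast concrete, a hedged sketch of the two sides with linear function approximation; the feature maps and the successor function are assumptions:

```python
import numpy as np

def q_learning_update(w, phi_sa, r, phi_next_sa_list, eta=0.01, gamma=1.0):
    """Q-learning: works on (state, action) features, no model of the game needed.
    Off-policy: the target maxes over next actions regardless of what was played."""
    target = r + gamma * max((w @ p for p in phi_next_sa_list), default=0.0)
    return w - eta * (w @ phi_sa - target) * phi_sa

def td_action_value(w, s, a, succ, phi):
    """TD learning: values states only, so comparing actions needs the game's
    rules via an assumed successor function succ(s, a) -> s'."""
    return w @ phi(succ(s, a))
```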
    • Summary so far…
      • Turn-based games (you go, I go, etc.) that are zero-sum (one player’s gain is the other’s loss)
      • Parameterized eval functions using features
      • TD Learning: learn an evaluation function
        • minimize ½ (prediction(w) - target)^2
      • Up-next
        • Simultaneous games
        • Non-zero-sum
    • Simultaneous Games, Zero-Sum
      • With pure strategies, going 2nd is better (the 2nd player gets to see and respond to the 1st player’s move)
      • If Player A plays (and reveals) a mixed strategy, Player B always has a pure strategy that is a best response
      • (von Neumann’s minimax theorem) - with mixed strategies, the order of play doesn’t matter (you can reveal your optimal mixed strategy and the value of the game stays the same!)
        • Showed the linear equations for Player A maximizing their value while Player B tries to minimize it, and vice versa; where the lines cross gives the max-min / min-max value (see the sketch below)
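A small sketch of the “where the lines cross” computation for a 2×2 zero-sum game; the payoff matrix in the example is made up, not from the lecture:

```python
def solve_2x2_zero_sum(A):
    """Optimal mixed strategy for the row player in a 2x2 zero-sum game.

    A[i][j] = payoff to the row player when row i meets column j.
    Assumes no pure-strategy saddle point, so the two payoff lines cross.
    """
    denom = A[0][0] - A[1][0] - A[0][1] + A[1][1]
    p = (A[1][1] - A[1][0]) / denom          # probability of playing row 0
    value = p * A[0][0] + (1 - p) * A[1][0]  # game value at the crossing point
    return p, value

# Made-up example: p = 2/5, value = 1/5, the same no matter which column B plays.
p, v = solve_2x2_zero_sum([[2, -1], [-1, 1]])
```

Setting the row player’s payoff against each of B’s columns equal is exactly the “lines crossing” step sketched in the lecture.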
    • Real life vs. games - Prisoner’s dilemma (both testify = 5 years in jail each; both refuse = 0 years each; if one testifies and the other refuses, the refuser gets 10 years and the testifier gets 0). Von Neumann’s theorem can’t be applied here: it isn’t a zero-sum game.
      • Nash (A Beautiful Mind!) - a Nash equilibrium is a pair (pi*_A, pi*_B) such that no player has an incentive to change their strategy
        • V_A(pi*_A, pi*_B) >= V_A(pi_A, pi*_B) for all pi_A
        • V_B(pi*_A, pi*_B) >= V_B(pi*_A, pi_B) for all pi_B
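A quick sketch that checks this condition for pure strategies on the prisoner’s dilemma payoffs above (utilities written as negative years in jail; the action labels are mine):

```python
def is_nash(profile, payoffs, actions):
    """Pure-strategy Nash check: no player gains by unilaterally deviating.
    payoffs[(aA, aB)] = (utility_A, utility_B)."""
    aA, aB = profile
    uA, uB = payoffs[(aA, aB)]
    ok_A = all(uA >= payoffs[(alt, aB)][0] for alt in actions)
    ok_B = all(uB >= payoffs[(aA, alt)][1] for alt in actions)
    return ok_A and ok_B

# Prisoner's dilemma with the payoffs from the notes (utility = -years in jail).
actions = ['testify', 'refuse']
payoffs = {
    ('testify', 'testify'): (-5, -5),
    ('refuse', 'refuse'): (0, 0),
    ('testify', 'refuse'): (0, -10),   # testifier goes free, refuser gets 10
    ('refuse', 'testify'): (-10, 0),
}
print(is_nash(('testify', 'testify'), payoffs, actions))  # True
```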