Nov 21 2024

(Stanford)

Search 2 - A* | Stanford CS221: Artificial Intelligence (Autumn 2019)

  • Here is a great video explaining A* Search
  • Can I use algorithm X? What's the cost structure? Is the state graph cyclic? (Could it be broken into backward/forward passes, i.e. kept linear by adding a 'direction' component to the state?)
  • Where do you get the heuristic? A good first step is to relax the problem by removing constraints (knock down walls, allow tiles to overlap, etc.); see the A* sketch after this list.
  • The heuristic needs to be tested for consistency: h(s) ≤ Cost(s, a) + h(Succ(s, a)). This holds because h(s) = FutureCost_rel(s) and future costs are minimum-cost path costs, so taking a particular action a from state s can't beat taking the best action from s (all inside the relaxed search problem P_rel); and relaxing only removes constraints, so Cost_rel(s, a) ≤ Cost(s, a).
  • Tradeoffs:
    • Efficiency → h(s) = FutureCost_rel(s) must be easy to compute [closed form, easier search, independent subproblems]
    • Tightness → heuristic h(s) should be close to FutureCost(s) (don’t remove too many constraints)
    • So the art of designing heuristics is to balance informativeness with computational efficiency.
    • You can use more than one heuristic and just take their max; the max of consistent heuristics is still consistent (you don't have to choose just one)
  • Now turn to “learning” when you have a search problem
    • Going back to the transportation example (walk: s → s+1, tram: s → 2s)
    • But if we don't know the costs, and instead know the optimal path, how can we learn the costs?
    • Learning is the inverse of Search
      • Forward problem (search): Cost(s, a) → (a1, a2, …, ak)
      • Inverse problem (learning): (a1, a2, …, ak) → Cost(s, a)
      • Structured Perceptron (simplified): see slide 108
      • Structured Perceptron (updated by Collins, 2002): see slide 112. A sketch of the simplified version follows this list.
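
To make the relaxation idea concrete, here is a minimal A* sketch on a toy grid (my own example, not from the slides). The heuristic is FutureCost_rel(s) of the relaxed problem with the walls knocked down, which on a 4-connected grid is just Manhattan distance; it is consistent, since each unit-cost move changes the distance by at most 1. Combining several heuristics, per the bullet above, would just mean replacing manhattan with something like max(h1(s), h2(s)).

```python
import heapq

def manhattan(s, goal):
    # FutureCost_rel(s): exact cost-to-go once the walls are removed.
    return abs(s[0] - goal[0]) + abs(s[1] - goal[1])

def astar(start, goal, walls, width, height):
    # Frontier entries are (f, g, state) with f = g + h(state).
    frontier = [(manhattan(start, goal), 0, start)]
    best_g = {start: 0}
    while frontier:
        _, g, s = heapq.heappop(frontier)
        if s == goal:
            return g
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            t = (s[0] + dx, s[1] + dy)
            if (0 <= t[0] < width and 0 <= t[1] < height
                    and t not in walls and g + 1 < best_g.get(t, float("inf"))):
                best_g[t] = g + 1
                heapq.heappush(frontier, (g + 1 + manhattan(t, goal), g + 1, t))
    return None  # goal unreachable

# 4x4 grid with two walls; optimal cost from corner to corner is 6.
print(astar((0, 0), (3, 3), walls={(1, 1), (1, 2)}, width=4, height=4))
```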
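
And a hedged sketch of the simplified structured perceptron on the transportation example. The update rule (lower the costs of actions on the observed optimal path, raise them on the predicted path) follows the slides; the helper code, the variable names, and the clamp that keeps costs nonnegative for the search are my own assumptions.

```python
import heapq

def min_cost_path(start, goal, costs):
    # Forward problem (search): min-cost action sequence from start to goal,
    # via uniform-cost search; walk: s -> s+1, tram: s -> 2s.
    frontier = [(0.0, start, ())]
    seen = set()
    while frontier:
        c, s, path = heapq.heappop(frontier)
        if s == goal:
            return list(path)
        if s in seen or s > goal:
            continue
        seen.add(s)
        heapq.heappush(frontier, (c + costs["walk"], s + 1, path + ("walk",)))
        heapq.heappush(frontier, (c + costs["tram"], 2 * s, path + ("tram",)))
    return []

def learn_costs(examples, epochs=20, eta=1.0):
    # Inverse problem (learning): observed optimal paths -> per-action costs.
    costs = {"walk": 0.0, "tram": 0.0}
    for _ in range(epochs):
        for start, goal, true_path in examples:
            pred = min_cost_path(start, goal, costs)
            if pred != true_path:
                for a in true_path:
                    costs[a] -= eta      # make the observed path cheaper
                for a in pred:
                    costs[a] += eta      # make the wrong prediction dearer
                for a in costs:          # clamp: keep costs nonnegative so
                    costs[a] = max(costs[a], 0.0)  # the search stays valid
    return costs

# One observed optimal route from state 1 to state 8: walk, then tram twice.
print(learn_costs([(1, 8, ["walk", "tram", "tram"])]))
```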

(Fast.ai)

Lesson 7

  • Road to the top (part 3)
  • Gradient accumulation: you don't need a bigger GPU (more memory); accumulate gradients over several small batches before each optimizer step (used here for the ConvNeXt models); see the sketch after this list
  • Batch size? Pick the largest that fits on your GPU and make it a multiple of 8
    • Rule of thumb: if you halve the batch size, halve the learning rate (worked below)
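
A hedged, PyTorch-style sketch of what gradient accumulation does (Lesson 7 uses fastai's GradientAccumulation callback; this explicit loop and its argument names are my own illustration):

```python
def train_epoch(model, loader, opt, loss_fn, accum=4):
    # Effective batch size = accum * the loader's per-batch size.
    model.train()
    opt.zero_grad()
    for i, (xb, yb) in enumerate(loader):
        loss = loss_fn(model(xb), yb) / accum  # scale so the summed gradients
        loss.backward()                        # match one large-batch step
        if (i + 1) % accum == 0:
            opt.step()
            opt.zero_grad()
```

Because the effective batch size is unchanged, the learning rate chosen for the large-batch run can stay as-is.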
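
And the batch-size rule of thumb as arithmetic; base_lr and base_bs here are hypothetical reference values:

```python
base_lr, base_bs = 1e-3, 64   # hypothetical reference run
bs = base_bs // 2             # halve the batch size...
lr = base_lr * bs / base_bs   # ...and the rule halves the lr: 5e-4
print(lr)
```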