Day 18 - Stanford AI: A* Search, Heuristics and Fast.ai Lesson 7 (Road to the Top part 3)
Nov 21 2024
(Stanford)
Search 2 - A* | Stanford CS221: Artificial Intelligence (Autumn 2019)
- Here is a great video explaining A* Search
- Which algorithm can I use? What's the cost? Is the state graph cyclic? (Sometimes a cyclic problem can be made linear by adding a 'direction' component to the state, splitting it into backward and forward passes.)
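- A minimal A* sketch in Python (my own reconstruction, not code from the lecture): A* expands states in order of f = g + h(s), where g is the cost so far. It assumes a `successors(s)` function yielding (action, next_state, cost) triples; with h = 0 it reduces to uniform cost search.

```python
import heapq, itertools

def a_star(start, is_goal, successors, h):
    """Minimal A*: expand states in order of f = g + h(s).
    `successors(s)` yields (action, next_state, edge_cost) triples."""
    tie = itertools.count()          # tie-breaker so states are never compared directly
    frontier = [(h(start), next(tie), 0, start, [])]
    done = set()
    while frontier:
        _, _, g, s, actions = heapq.heappop(frontier)
        if s in done:
            continue
        done.add(s)
        if is_goal(s):
            return actions, g        # optimal if h is consistent
        for a, s2, cost in successors(s):
            if s2 not in done:
                heapq.heappush(frontier,
                               (g + cost + h(s2), next(tie), g + cost, s2, actions + [a]))
    return None, float("inf")

# The lecture's transportation example: walk s -> s+1 (cost 1), tram s -> 2s (cost 2).
N = 10
def succ(s):
    if s + 1 <= N: yield ("walk", s + 1, 1)
    if 2 * s <= N: yield ("tram", 2 * s, 2)

print(a_star(1, lambda s: s == N, succ, h=lambda s: 0))  # h = 0 reduces A* to UCS
```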
- Where do you get the heuristic? Well, you could try first by making the problem simpler by removing constraints (knock down walls, allow tiles to overlap, etc.)
- The heuristic needs to be checked for consistency. If h(s) = FutureCost_rel(s) for a relaxed problem P_rel, then h(s) ≤ Cost_rel(s, a) + h(Succ(s, a)) ≤ Cost(s, a) + h(Succ(s, a)). The first inequality holds because future costs are minimum-cost path costs, so taking a particular action a from state s can do no better than taking the best action from s (all within P_rel); the second holds because relaxing only removes constraints, so Cost_rel(s, a) ≤ Cost(s, a).
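- For a small, enumerable state space you can just test consistency exhaustively. A sketch, reusing the `succ`/`N` convention from the A* snippet above:

```python
def is_consistent(states, successors, h, is_goal):
    """Check h(s) <= Cost(s, a) + h(Succ(s, a)) on every edge, and h = 0 at goals."""
    for s in states:
        if is_goal(s) and h(s) != 0:
            return False
        for _, s2, cost in successors(s):
            if h(s) > cost + h(s2):
                return False
    return True

print(is_consistent(range(1, N + 1), succ, lambda s: 0, lambda s: s == N))  # True
```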
- Tradeoffs:
- Efficiency → h(s) = FutureCost_rel(s) must be easy to compute (closed form, easier search, independent subproblems)
- Tightness → heuristic h(s) should be close to FutureCost(s) (don’t remove too many constraints)
- So the art of designing heuristics is to balance informativeness with computational efficiency.
- You can use more than one heuristic and take their pointwise max; the max of consistent heuristics is still consistent, so you don't have to choose just one (sketch below)
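- A tiny sketch of combining heuristics; `h_walls_removed` and `h_overlap_allowed` are hypothetical heuristics from two different relaxations:

```python
def max_heuristic(*hs):
    """Pointwise max of heuristics: still consistent if each h is,
    and at least as tight as any individual one."""
    return lambda s: max(h(s) for h in hs)

# h = max_heuristic(h_walls_removed, h_overlap_allowed)  # hypothetical relaxations
```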
- Now turn to “learning” when you have a search problem
- Going back to the transportation example (walk: s → s+1, tram: s → 2s)
- But what if we don't know the costs, yet we do know the optimal path? How can we learn the costs?
- Learning is the inverse of Search
- Forward problem (search): Cost(s, a) → (a1, a2, …, ak)
- Inverse problem (learning): (a1, a2, …, ak) → Cost(s, a)
- Structured perceptron (simplified) - see slide 108
- Structured perceptron, updated version (Collins, 2002) - see slide 112
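- A sketch of the simplified version from memory (per-action costs; my reconstruction, not the slide verbatim). `search(start, is_goal, w)` is an assumed routine that returns the min-cost action sequence under the current costs w:

```python
def structured_perceptron(examples, action_set, search, epochs=10):
    """Learn one cost per action type from demonstrated optimal paths.
    `examples` is a list of (start, is_goal, true_actions) triples;
    `search(start, is_goal, w)` returns the min-cost action sequence under costs w."""
    w = {a: 0.0 for a in action_set}
    for _ in range(epochs):
        for start, is_goal, y_true in examples:
            y_pred = search(start, is_goal, w)
            if y_pred != y_true:
                for a in y_true:
                    w[a] -= 1    # make the demonstrated path cheaper
                for a in y_pred:
                    w[a] += 1    # make the wrongly predicted path more expensive
    return w
```

One caveat: these updates can drive costs negative, which breaks the non-negative-cost assumption of UCS/A*, so in practice you'd clip or otherwise keep costs non-negative.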
(Fast.ai)
Lesson 7
- Road to the top (part 3)
- Gradient accumulation - you don't need a bigger GPU (more memory) to train with a large effective batch size: sum the gradients over several small batches and only then take an optimizer step (used here to fit ConvNeXt; sketch below)
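- A plain-PyTorch sketch of the idea (fastai packages this as its GradientAccumulation callback); `model`, `opt`, `loss_fn`, and `loader` are assumed to already exist:

```python
accum = 4                                  # effective batch size = accum * loader batch size
opt.zero_grad()
for i, (xb, yb) in enumerate(loader):
    loss = loss_fn(model(xb), yb) / accum  # scale so the summed grads match one big batch
    loss.backward()                        # .grad buffers accumulate across iterations
    if (i + 1) % accum == 0:
        opt.step()                         # one optimizer step per `accum` mini-batches
        opt.zero_grad()
```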
- Batch size? Pick the largest one that fits in your GPU's memory, and make it a multiple of 8
- Rule of thumb - if you halve the batch size, halve the learning rate (worked example below)
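- As a worked example of the rule of thumb (all numbers made up):

```python
base_bs, base_lr = 64, 1e-2    # hypothetical settings that fit on a big GPU
bs = 32                        # halved to fit a smaller GPU
lr = base_lr * bs / base_bs    # 1e-2 * 32/64 = 5e-3: half the batch, half the lr
```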