Day 21 - Fast.ai - Finish Lesson 7, Collaborative Filtering, Embeddings, (Mid-Level API)
Dec 6 2024
(Fast.ai)
- OOPS! I didn’t finish Lesson 7 that I started on Nov 21st!
- Lesson 7
- Road to the top (part 3)
- Key idea – we need to care about the mechanics of the first and last layers of the model, even if we don’t need to understand the internals of the model
- Batching
- Gradient accumulation - not needing bigger GPUs (more memory) because you do gradient accumulation instead (works cleanly for ConvNeXt; models that use batch normalization may not give exactly identical results) - see the sketch at the end of this batching section
- The optimizer update is only performed after the whole effective batch is processed (all the gradients are accumulated)
- Batch size?
- Rule of thumb - pick the largest batch size that fits in your GPU memory and make it a multiple of 8
- Rule of thumb - if you divide the batch size by 2, divide the learning rate by 2 as well
- NVIDIA RTX cards for consumers were 3x cheaper than the ones in cloud computing environments (just a licensing thing for NVIDIA to make more money!)
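- A minimal PyTorch-style sketch of the gradient accumulation idea (the model, optimizer, and data below are stand-ins, not the lesson's code):

```python
import torch
from torch import nn

# Gradient accumulation sketch: effective batch = physical batch * accum_steps.
model   = nn.Linear(10, 2)                      # stand-in model
opt     = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
dl      = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(8)]  # fake dataloader
accum_steps = 4

opt.zero_grad()
for i, (xb, yb) in enumerate(dl):
    loss = loss_fn(model(xb), yb) / accum_steps  # scale so the summed grads match one big batch
    loss.backward()                              # grads accumulate in .grad across sub-batches
    if (i + 1) % accum_steps == 0:
        opt.step()                               # update only after all gradients are accumulated
        opt.zero_grad()
```

- Here the effective batch size is 8 * 4 = 32, but peak GPU memory only ever has to hold 8 items at a time.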
- Multi-Target: Road to the Top Part 4
- Working with two outputs (two targets, two “Ys”)
- Since we have two “Ys”, we need to tell the model how to handle the loss (and metrics)
- CROSS ENTROPY LOSS -
- Convert model outputs to probabilities
- The Softmax Cross-Entropy Loss is a commonly used loss function in machine learning, particularly in classification problems. It combines the softmax function and the cross-entropy loss into a single step to calculate the error of the predicted probabilities compared to the actual labels. We use it when we want to predict one thing and one thing only.
- Start with the SoftMax…
- P(y_i) using SoftMax = exp(z_i) / SUM(from j=1 to K) of exp(z_j)
- Numerator
- exp(z_i) - e raised to the power of the output for class i
- Denominator
- K = the number of categories (in this example, we have 5 categories)
- SUM of all the exp(z_j), i.e. e raised to the power of the output for each category j
- Dividing everything by the denominator makes it all add up to 1, and higher outputs are still larger numbers (e.g., closer to one)
- For each i we get the predicted probability for class i
- Now we need to do the Cross-Entropy Loss. The cross-entropy loss measures the difference between the predicted probability distribution P and the true distribution Q, which is typically a one-hot encoded vector (where the correct class has a probability of 1, and all others are 0).
- L = - SUM(from i=1 to C) of Q(y_i) log(P(y_i))
- For a single data point with a one-hot encoded target, where the true class is y, the loss simplifies to the following (because the true class gets 1 and the rest are zero)
- L = - log(P(y))
- Key Idea: It penalizes the model more when the predicted probability for the correct class is lower. (A small worked example follows below.)
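- A small worked example of the two formulas above, checking the hand computation against PyTorch's F.cross_entropy (the logit values are made up):

```python
import torch
import torch.nn.functional as F

# Made-up outputs (logits) for one item across 5 classes; the true class is index 2.
z = torch.tensor([1.0, 2.0, 3.0, 0.5, -1.0])
target = 2

# SoftMax: P(y_i) = exp(z_i) / sum over j of exp(z_j)
p = z.exp() / z.exp().sum()
print(p, p.sum())                                # probabilities, and they sum to 1

# Cross-entropy with a one-hot target reduces to -log(P(y_true))
manual_loss = -p[target].log()

# PyTorch's cross_entropy does softmax + negative log-likelihood in one call
builtin_loss = F.cross_entropy(z.unsqueeze(0), torch.tensor([target]))
print(manual_loss.item(), builtin_loss.item())   # the two values match
```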
- Getting back to the model parameters for a multi-target (more than one Y) setup, we need to specify how many activations the last layer creates (the size of the output matrix): 10 possible diseases and 10 possible varieties means we need 20 activations.
- Note: the overall loss function will be the sum of the disease loss and the variety loss; the same goes for the metrics. We need to give the learner a combined loss and combined metrics covering both disease and variety (a sketch follows below).
- Note: Sometimes multi-target models might do better than a single-target model, or they might do worse in comparison.
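- A minimal sketch of what a combined loss and metric could look like, assuming the first 10 activations score diseases and the last 10 score varieties (the split and function names are illustrative, not the lesson's exact code):

```python
import torch
import torch.nn.functional as F

# Hypothetical split: first 10 activations = disease scores, last 10 = variety scores.
def combined_loss(preds, disease_targ, variety_targ):
    disease_loss = F.cross_entropy(preds[:, :10], disease_targ)
    variety_loss = F.cross_entropy(preds[:, 10:], variety_targ)
    return disease_loss + variety_loss           # overall loss = sum of the two losses

def combined_metrics(preds, disease_targ, variety_targ):
    disease_acc = (preds[:, :10].argmax(dim=1) == disease_targ).float().mean()
    variety_acc = (preds[:, 10:].argmax(dim=1) == variety_targ).float().mean()
    return disease_acc, variety_acc              # report accuracy for each target separately

# Quick check with fake data: a batch of 4 items, 20 activations each.
preds  = torch.randn(4, 20)
d_targ = torch.randint(0, 10, (4,))
v_targ = torch.randint(0, 10, (4,))
print(combined_loss(preds, d_targ, v_targ), combined_metrics(preds, d_targ, v_targ))
```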
- Collaborative Filtering Deep Dive - Example with movie reviews
- Learning the latent factors - (implying features like action, critically acclaimed, dialog, etc. for movies, and things like age, education level, etc. for users)
- Loss function - mean squared error between the predictions and the actual ratings
- Predictions are the dot product of (initially random) latent factor vectors for both movies and users (using 5 factors – to be discovered)
- Now we can optimize the loss function (a sketch of this dot-product model follows below)
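- A minimal sketch of that dot-product model in PyTorch, in the spirit of the chapter's version (user/movie counts and data are made up):

```python
import torch
from torch import nn

# Dot-product collaborative filtering sketch: latent factors for users and movies,
# prediction = dot product, loss = mean squared error against the ratings.
class DotProduct(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=5):
        super().__init__()
        self.user_factors  = nn.Embedding(n_users, n_factors)   # latent factors per user
        self.movie_factors = nn.Embedding(n_movies, n_factors)  # latent factors per movie

    def forward(self, user_ids, movie_ids):
        u = self.user_factors(user_ids)
        m = self.movie_factors(movie_ids)
        return (u * m).sum(dim=1)                # dot product = predicted rating

model   = DotProduct(n_users=100, n_movies=50)
users   = torch.randint(0, 100, (16,))
movies  = torch.randint(0, 50, (16,))
ratings = torch.randint(1, 6, (16,)).float()     # fake 1-5 star ratings
loss = nn.functional.mse_loss(model(users, movies), ratings)
loss.backward()                                  # gradients nudge the latent factors
```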
- Embeddings -
- David: Are ‘embeddings’ really as simple as a lookup in an indexed array?
- ChatGPT 4o:
- Yes, for static embeddings (e.g., Word2Vec, GloVe), the process during inference is as simple as looking up a vector in an indexed array.
- …and No, because embeddings are not simply precomputed arrays; they are learned representations optimized during training. For contextual embeddings (e.g., transformer-based models), the process involves dynamically generating embeddings based on input context, which is much more complex than a simple lookup.
- In essence, the “lookup” explanation is accurate at a functional level but oversimplifies the underlying mechanics of how embeddings are learned and used in modern machine learning models.
- David: Okay, so the embeddings start out as simple array lookups, but then they start to change as they are optimized through a loss and error metric?
- ChatGPT 4o: Exactly! Embeddings start as simple array lookups, where each entity (like a word or item) is initialized with a random or heuristic-based vector, but over time, these embeddings evolve and are optimized during training.
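- A tiny check of the “lookup” view: an nn.Embedding forward pass really is an indexed lookup into a weight matrix, but the rows are trainable parameters that receive gradients:

```python
import torch
from torch import nn

emb = nn.Embedding(10, 4)                            # 10 items, each with a 4-dimensional vector
idx = torch.tensor([3])

looked_up = emb(idx)                                 # the forward pass is an indexed lookup...
print(torch.allclose(looked_up[0], emb.weight[3]))   # ...True: it's just row 3 of the weight matrix

# ...but the rows are parameters, so they change as a loss is optimized.
looked_up.sum().backward()
print(emb.weight.grad[3])                            # the gradient lands in the looked-up row; other rows stay zero
```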
- Tried a sigmoid on the model’s output (since ratings are just 1, 2, 3, 4, 5, the predictions are squashed into the range 0 to 5.5) - see the sketch below
- Adding User and Movie Bias to see impact on model
- Loss got better and then it got worse → sign of OVERFITTING!!
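- A hedged sketch of those two tweaks together - squashing the dot product into (0, 5.5) with a sigmoid and adding per-user/per-movie biases - loosely following the book's approach (names and sizes are illustrative):

```python
import torch
from torch import nn

def sigmoid_range(x, lo, hi):
    # Squash x into (lo, hi); using 5.5 as the high end keeps a full 5-star prediction reachable.
    return torch.sigmoid(x) * (hi - lo) + lo

class DotProductBias(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=5, y_range=(0, 5.5)):
        super().__init__()
        self.user_factors  = nn.Embedding(n_users, n_factors)
        self.movie_factors = nn.Embedding(n_movies, n_factors)
        self.user_bias     = nn.Embedding(n_users, 1)    # "this user rates everything high/low"
        self.movie_bias    = nn.Embedding(n_movies, 1)   # "this movie is just generally good/bad"
        self.y_range = y_range

    def forward(self, user_ids, movie_ids):
        res = (self.user_factors(user_ids) * self.movie_factors(movie_ids)).sum(dim=1)
        res = res + self.user_bias(user_ids).squeeze(1) + self.movie_bias(movie_ids).squeeze(1)
        return sigmoid_range(res, *self.y_range)         # predictions land inside (0, 5.5)

model = DotProductBias(n_users=100, n_movies=50)
print(model(torch.randint(0, 100, (4,)), torch.randint(0, 50, (4,))))
```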
- WEIGHT DECAY - A classic way to avoid overfitting is to use something called WEIGHT DECAY, also known as (AKA) L2 REGULARIZATION (sketch below)
- Basically, before we compute the gradients, we are going to add to our loss function result, the sum of the weights squared. Why? Because when computing the gradients, it will add a contribution to them that will encourage the weights to be as small as possible.
- Ironically, the weights being too big can also lead to overfitting!
- BALANCE!
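- A minimal sketch of weight decay as described above: add wd times the sum of squared weights to the loss before calling backward (the model and data are stand-ins):

```python
import torch
from torch import nn

# Weight decay sketch: penalize large weights by adding wd * sum of squared weights to the loss.
model = nn.Linear(10, 1)                          # stand-in model
opt   = torch.optim.SGD(model.parameters(), lr=0.01)
wd    = 0.1

x, y = torch.randn(16, 10), torch.randn(16, 1)    # fake data
loss = nn.functional.mse_loss(model(x), y)
loss = loss + wd * sum((p ** 2).sum() for p in model.parameters())
loss.backward()                                    # each gradient gets an extra 2 * wd * p term
opt.step()                                         # which nudges the weights toward zero
```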
- NOTE: In a self-reinforcing system like this, we should probably expect these kinds of feedback loops to be the norm, not the exception. Therefore, you should assume that you will see them, plan for that, and identify up front how you will deal with these issues. Try to think about all of the ways in which feedback loops may be represented in your system, and how you might be able to identify them in your data. In the end, this is coming back to our original advice about how to avoid disaster when rolling out any kind of machine learning system. It’s all about ensuring that there are humans in the loop; that there is careful monitoring, and a gradual and thoughtful rollout.
- The End of Finetuning — with Jeremy Howard of Fast.ai
- Read Chapter 8 of Fast.ai - Collaborative Filtering Book
- Skimmed Chapter 11 of Fast.ai - Using Fast AI’s Mid-level API for more fine-grained control on data loading, transforming, normalizing, and pipelines *