Nov 18 2024

(Fast.ai)

  • Lesson 5 (Chapter 9) - Linear and Neural networks, Sigmoid, Gradient Descent Step, random forests
    • Showed how a plain linear model (one matrix multiplication), a 2-layer neural net, and a deep learning model gave roughly the same results, but that’s largely because of the simplicity of the data and features. Still useful to know, because you can sometimes get away with a much cheaper model than deep learning…though most of the time you really just want to jump to a deep learning framework, except with tabular data. (See the minimal sketch below.)
    • Linear Model and Neural Net from Scratch (in Kaggle this time, instead of a spreadsheet) https://github.com/fastai/course22/blob/master/clean/05-linear-model-and-neural-net-from-scratch.ipynb
    • Why use a Framework - https://github.com/fastai/course22/blob/master/clean/06-why-you-should-use-a-framework.ipynb
    • We started on Random Forests (binary splits / 1R) https://github.com/fastai/course22/blob/master/clean/07-how-random-forests-really-work.ipynb

      Side note: Chapter 9 is the longest chapter in the book!! 10k lines whereas the next closest chapter was 5k lines
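
    • A minimal sketch (mine, not the notebook’s code) of the “it’s all just matrix multiplies plus gradient descent steps” point - the data, sigmoid output, L1 loss, and learning rate below are all made-up assumptions for illustration:

```python
# Sketch only: a linear model and a 2-layer neural net are both matrix
# multiplies (the latter with a ReLU in between), trained by repeating
# the same gradient descent step. Data and hyperparameters are made up.
import torch

torch.manual_seed(0)
n, n_feats = 200, 5
x = torch.randn(n, n_feats)            # made-up independent variables
y = (x.sum(dim=1) > 0).float()         # made-up binary dependent variable

def linear_model(params, x):
    (coeffs,) = params
    return torch.sigmoid(x @ coeffs)   # one matrix multiply + sigmoid

def two_layer_net(params, x):
    w1, w2 = params
    return torch.sigmoid(torch.relu(x @ w1) @ w2)  # hidden ReLU layer

def train(params, model, lr=0.5, epochs=200):
    for _ in range(epochs):
        loss = torch.abs(model(params, x).squeeze() - y).mean()  # L1 loss
        loss.backward()
        with torch.no_grad():
            for p in params:
                p -= p.grad * lr       # the gradient descent step
                p.grad.zero_()
    return loss.item()

lin_params = [torch.randn(n_feats, 1).requires_grad_()]
nn_params = [(torch.randn(n_feats, 8) * 0.1).requires_grad_(),
             (torch.randn(8, 1) * 0.1).requires_grad_()]
print(train(lin_params, linear_model), train(nn_params, two_layer_net))
```

    • The only structural difference between the two models is the extra matrix multiply and the ReLU in between - on simple data and features that often isn’t worth much, which was the point of the lesson.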

  • Chapter 9 Reading
    • The good news is that modern machine learning can be distilled down to a couple of key techniques that are widely applicable. Recent studies have shown that the vast majority of datasets can be best modeled with just two methods:
      1. Ensembles of decision trees (i.e., random forests and gradient boosting machines), mainly for structured data (such as you might find in a database table at most companies)
      2. Multilayered neural networks learned with SGD (i.e., shallow and/or deep learning), mainly for unstructured data (such as audio, images, and natural language)
    • Although deep learning is nearly always clearly superior for unstructured data, these two approaches tend to give quite similar results for many kinds of structured data. But ensembles of decision trees tend to train faster, are often easier to interpret, do not require special GPU hardware for inference at scale, and often require less hyperparameter tuning. They have also been popular for quite a lot longer than deep learning, so there is a more mature ecosystem of tooling and documentation around them.
    • Most importantly, the critical step of interpreting a model of tabular data is significantly easier for decision tree ensembles. There are tools and methods for answering the pertinent questions, like: Which columns in the dataset were the most important for your predictions? How are they related to the dependent variable? How do they interact with each other? And which particular features were most important for some particular observation?
    • Therefore, ensembles of decision trees are our first approach for analyzing a new tabular dataset.
      • The exception to this guideline is when the dataset meets one of these conditions:
        • There are some high-cardinality categorical variables that are very important (“cardinality” refers to the number of discrete levels representing categories, so a high-cardinality categorical variable is something like a zip code, which can take on thousands of possible levels).
        • There are some columns that contain data that would be best understood with a neural network, such as plain text data.
      • In practice, when we deal with datasets that meet these exceptional conditions, we always try both decision tree ensembles and deep learning to see which works best. But start with Random Forest (Decision trees)
  • Random Forests (“Bagging Predictors”) - based on a deep and important insight: although each of the models trained on a subset of data will make more errors than a model trained on the full dataset, those errors will not be correlated with each other. Different models will make different errors. The average of those errors, therefore, will be zero! (A sketch follows below.)

    • So if we take the average of all of the models’ predictions, then we should end up with a prediction that gets closer and closer to the correct answer, the more models we have. This is an extraordinary result—it means that we can improve the accuracy of nearly any kind of machine learning algorithm by training it multiple times, each time on a different random subset of the data, and averaging its predictions.
    • In 2001 Leo Breiman went on to demonstrate that this approach to building models, when applied to decision tree building algorithms, was particularly powerful. He went even further than just randomly choosing rows for each model’s training, but also randomly selected from a subset of columns when choosing each split in each decision tree. He called this method the random forest. Today it is, perhaps, the most widely used and practically important machine learning method.
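    • A hedged sketch of the bagging insight using sklearn on synthetic data (not the course’s code): train many trees, each on a different bootstrap sample of rows, then average their predictions:

```python
# Sketch only: bagging by hand with sklearn decision trees on made-up data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = X[:, 0] * 2 + X[:, 1] ** 2 + rng.normal(scale=0.3, size=1000)

def one_bagged_tree(X, y):
    idx = rng.choice(len(X), size=len(X), replace=True)  # bootstrap sample of rows
    return DecisionTreeRegressor(max_leaf_nodes=25).fit(X[idx], y[idx])

trees = [one_bagged_tree(X, y) for _ in range(100)]
preds = np.stack([t.predict(X) for t in trees])

# Each tree's errors are bigger than a full-data tree's would be, but they are
# largely uncorrelated, so the averaged prediction comes out better.
print("single tree MAE:", np.abs(preds[0] - y).mean())
print("bagged     MAE:", np.abs(preds.mean(axis=0) - y).mean())
```

    • The gap between the two printed errors is the whole point of bagging.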
  • There’s a counterpart to bagging called boosting: you fit shallow trees, take the residual, and then use the next shallow tree to try to predict that new residual, and so on. Instead of averaging (like in bagging) you sum the model results. (See the sketch below.)
    • See https://explained.ai/gradient-boosting/
    • Boosting is typically more accurate than random forests, but you can absolutely overfit (there are ways to avoid that, but because it is breakable, it’s not Jeremy’s first choice).
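    • A rough sketch of boosting on synthetic data (sklearn trees; the depth-3 trees, 0.3 learning rate, and 50 rounds are arbitrary choices for illustration): each shallow tree is fit to the current residual, and the trees’ outputs are summed:

```python
# Sketch only: gradient boosting by hand - fit a shallow tree to the residual,
# add its (shrunken) prediction to the running sum, repeat.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X[:, 0] * 2 + np.sin(X[:, 1] * 3) + rng.normal(scale=0.2, size=1000)

preds = np.zeros(len(y))     # running SUM of all trees so far (vs. averaging in bagging)
trees, lr = [], 0.3          # lr shrinks each tree's contribution
for _ in range(50):
    residual = y - preds                                         # what we still get wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)   # shallow tree on the residual
    preds += lr * tree.predict(X)
    trees.append(tree)

print("boosted MAE:", np.abs(preds - y).mean())
```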
  • Out-of-Bag Error - Recall that in a random forest, each tree is trained on a different subset of the training data. The OOB error is a way of measuring prediction error on the training set by only including in the calculation of a row’s error trees where that row was not included in training. This allows us to see whether the model is overfitting, without needing a separate validation set.
    • A: My intuition for this is that, since every tree was trained with a different randomly selected subset of rows, out-of-bag error is a little like imagining that every tree therefore also has its own validation set. That validation set is simply the rows that were not selected for that tree’s training.
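    • A sketch of that intuition on synthetic data (sklearn trees, bootstrapping done by hand): each row’s OOB prediction averages only the trees that never saw that row. In practice sklearn gives you this for free via RandomForestRegressor(oob_score=True):

```python
# Sketch only: out-of-bag error computed by hand on made-up data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 4))
y = X[:, 0] * 2 + X[:, 1] ** 2 + rng.normal(scale=0.3, size=n)

oob_sum, oob_count = np.zeros(n), np.zeros(n)
for _ in range(100):
    idx = rng.choice(n, size=n, replace=True)    # bootstrap rows for this tree
    oob = np.setdiff1d(np.arange(n), idx)        # rows this tree never saw
    tree = DecisionTreeRegressor(max_leaf_nodes=25).fit(X[idx], y[idx])
    oob_sum[oob] += tree.predict(X[oob])
    oob_count[oob] += 1

oob_pred = oob_sum / np.maximum(oob_count, 1)    # average over "unseen" trees only
print("OOB MAE:", np.abs(oob_pred - y).mean())
```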

Conclusion: Our Advice for Tabular Modeling

  • We have discussed two approaches to tabular modeling: decision tree ensembles and neural networks. We’ve also mentioned two different decision tree ensembles: 1) random forests, and 2) gradient boosting machines. Each is very effective, but each also has compromises:
    • Random forests are the easiest to train, because they are extremely resilient to hyperparameter choices and require very little preprocessing. They are very fast to train, and should not overfit if you have enough trees. But they can be a little less accurate, especially if extrapolation is required, such as predicting future time periods.
    • Gradient boosting machines in theory are just as fast to train as random forests, but in practice you will have to try lots of different hyperparameters. They can overfit, but they are often a little more accurate than random forests.
    • Neural networks take the longest time to train, and require extra preprocessing, such as normalization; this normalization needs to be used at inference time as well. They can provide great results and extrapolate well, but only if you are careful with your hyperparameters and take care to avoid overfitting.
  • We suggest starting your analysis with a random forest. This will give you a strong baseline, and you can be confident that it’s a reasonable starting point. You can then use that model for feature selection and partial dependence analysis, to get a better understanding of your data.
  • From that foundation, you can try neural nets and GBMs, and if they give you significantly better results on your validation set in a reasonable amount of time, you can use them.
  • If decision tree ensembles are working well for you, try adding the embeddings for the categorical variables to the data, and see if that helps your decision trees learn better.
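    • A very rough sketch of that idea (synthetic data, made-up column meanings, plain PyTorch instead of fastai’s tabular learner): learn an embedding for a high-cardinality categorical column with a small neural net, then hand the learned embedding vectors to the random forest in place of the raw category codes:

```python
# Sketch only: feed neural-net embeddings of a categorical column to a random forest.
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, n_cats = 2000, 50
cat = rng.integers(0, n_cats, size=n)        # e.g. a store-id-style categorical column
cont = rng.normal(size=(n, 3))               # a few continuous columns
cat_effect = rng.normal(size=n_cats)         # hidden per-category effect to recover
y = cat_effect[cat] + cont[:, 0] + rng.normal(scale=0.1, size=n)

class TinyTabular(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(n_cats, 4)   # 4-dim embedding per category
        self.head = nn.Sequential(nn.Linear(4 + 3, 16), nn.ReLU(), nn.Linear(16, 1))
    def forward(self, xc, xf):
        return self.head(torch.cat([self.emb(xc), xf], dim=1)).squeeze(1)

model = TinyTabular()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
xc = torch.tensor(cat)
xf = torch.tensor(cont, dtype=torch.float32)
yt = torch.tensor(y, dtype=torch.float32)
for _ in range(200):                         # quick full-batch training loop
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(xc, xf), yt)
    loss.backward()
    opt.step()

# Replace each raw category code with its learned embedding vector, then train the forest.
emb = model.emb.weight.detach().numpy()
X_rf = np.hstack([emb[cat], cont])
rf = RandomForestRegressor(n_estimators=100).fit(X_rf, y)
print("train R^2 with embeddings:", rf.score(X_rf, y))   # training-set score, illustration only
```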