Ensembling is another type of supervised learning. It combines the predictions of multiple machine learning models that are individually weak to produce a more accurate prediction on a new sample. By combining individual models, the ensemble model tends to be more flexible 🤸‍♀️ (less bias) and less data-sensitive 🧘‍♀️ (less variance).
The idea is that ensembles of learners perform better than single learners.
In the next two lessons we will learn about two ensemble techniques, bagging with random forests and boosting with XGBoost.
What does bagging mean?
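Bagging (short for bootstrap aggregating) means training a bunch of individual models in parallel. Each model is trained on a random subset of the data, typically sampled with replacement, and the models' predictions are then combined.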
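To make the idea concrete, here is a minimal sketch of bagging done by hand. The synthetic dataset, the number of trees (25) and the variable names are assumptions chosen for illustration only; in practice scikit-learn's ensemble classes handle this for you.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

n_trees = 25  # odd, so a binary majority vote can never tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Collect every tree's prediction: shape (n_trees, n_samples).
all_preds = np.stack([tree.predict(X) for tree in trees])

# Majority vote over the trees for labels 0/1.
bagged_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the bagged vote:", (bagged_pred == y).mean())
```

Each tree sees a different bootstrap sample, so the trees differ from one another even though they all use the same learning algorithm.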
How does the Random Forest model work?
To understand the random forest model, we first need to understand the decision tree, the basic building block of a random forest. We all use decision trees in our daily life; even if you don’t know the term, you’ll recognize the process.
Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.
Some deeper explanation:
Unlike a single decision tree, where each node is split on whichever feature minimizes error, a Random Forest considers only a random selection of features when searching for the best split at each node. The reason for this randomness: even with bagging, trees that always choose the best feature to split on end up with similar structure and correlated predictions. Restricting each split to a random subset of features makes the predictions of the individual trees less correlated.
The number of features to be searched at each split point is specified as a parameter to the Random Forest algorithm.
Thus, in bagging with Random Forest, each tree is constructed using a random sample of records and each split is constructed using a random sample of predictors.
To clarify the difference between bagging and Random Forest: Random Forest is an ensemble method that uses bagged decision trees with a random feature subset chosen at each split point. It then either averages the predictions of the trees (regression) or takes a vote across the trees (classification) to make the final prediction.
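The aggregation step can be checked directly in scikit-learn. The sketch below uses synthetic data and arbitrary settings (50 trees, max_features="sqrt") as assumptions. Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities (a "soft" vote) rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data, purely illustrative.
X_clf, y_clf = make_classification(n_samples=300, n_features=10, random_state=0)
X_reg, y_reg = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

clf = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
clf.fit(X_clf, y_clf)
reg = RandomForestRegressor(n_estimators=50, max_features="sqrt", random_state=0)
reg.fit(X_reg, y_reg)

# Regression: the forest prediction is the mean of the per-tree predictions.
tree_mean = np.mean([tree.predict(X_reg) for tree in reg.estimators_], axis=0)
reg_matches = np.allclose(tree_mean, reg.predict(X_reg))

# Classification: the forest averages per-tree class probabilities,
# then predicts the class with the highest average probability.
tree_proba = np.mean([tree.predict_proba(X_clf) for tree in clf.estimators_], axis=0)
clf_matches = np.allclose(tree_proba, clf.predict_proba(X_clf))

print("regressor equals mean of trees:", reg_matches)
print("classifier equals averaged tree probabilities:", clf_matches)
```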
The reason why they work so well: 'A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models'. Low correlation is the key.
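One rough way to look at tree correlation yourself is to compare the per-tree predictions of a forest that may use every feature at each split (bagging only) with one restricted to a random subset. Everything in this sketch is an assumption made for illustration: the synthetic regression data, the helper name mean_tree_correlation, and the choice of 4 out of 12 features. The numbers will vary with the data, so treat it as an experiment rather than a proof.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=12, n_informative=6,
                       noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def mean_tree_correlation(max_features):
    """Average pairwise correlation between per-tree predictions on held-out data."""
    forest = RandomForestRegressor(n_estimators=30, max_features=max_features,
                                   random_state=0)
    forest.fit(X_train, y_train)
    # One row of predictions per tree: shape (n_trees, n_test_samples).
    preds = np.stack([tree.predict(X_test) for tree in forest.estimators_])
    corr = np.corrcoef(preds)
    # Drop the diagonal (each tree's correlation with itself).
    off_diag = corr[~np.eye(len(corr), dtype=bool)]
    return float(off_diag.mean())

corr_all_features = mean_tree_correlation(None)  # every feature considered at each split
corr_feature_subset = mean_tree_correlation(4)   # 4 random features considered at each split

print("all features per split:", corr_all_features)
print("4 random features per split:", corr_feature_subset)
```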
A good place to start is always the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
The most important settings are:
n_estimators = the number of decision trees in the forest
max_features = max number of features that are evaluated for splitting at each node
We can also try a wide range of values for other hyperparameters, such as:
max_depth = max number of levels in each decision tree
min_samples_split = min number of data points placed in a node before the node is split
min_samples_leaf = min number of data points allowed in a leaf node
bootstrap = method for sampling data points (with or without replacement)
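As a quick illustration, the settings listed above map directly onto constructor arguments of RandomForestClassifier. The specific values below (200 trees, depth 10, and so on) are arbitrary assumptions, not recommendations, and the iris dataset is used only because it ships with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,       # number of trees in the forest
    max_features="sqrt",    # features evaluated at each split: sqrt(n_features)
    max_depth=10,           # max number of levels in each tree
    min_samples_split=5,    # min samples in a node before it may be split
    min_samples_leaf=2,     # min samples allowed in a leaf node
    bootstrap=True,         # sample training rows with replacement
    random_state=42,        # fixed seed so the run is reproducible
)
forest.fit(X, y)

deepest = max(tree.get_depth() for tree in forest.estimators_)
print("trees:", len(forest.estimators_), "deepest tree:", deepest)
```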
Let's see how we could implement a RandomizedSearchCV to find optimal hyperparameters:
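The search code itself is not included in these notes, so the following is a minimal sketch rather than the original setup. The grid values are assumptions, chosen only so that the number of candidates per hyperparameter matches a 2 * 12 * 2 * 3 * 3 * 10 grid; the dataset is synthetic, and n_iter and cv are kept small so the example runs quickly.

```python
import math
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical candidate values (assumed for illustration).
param_distributions = {
    "bootstrap": [True, False],                          # 2 options
    "max_depth": list(range(10, 111, 10)) + [None],      # 12 options
    "max_features": ["sqrt", "log2"],                    # 2 options
    "min_samples_leaf": [1, 2, 4],                       # 3 options
    "min_samples_split": [2, 5, 10],                     # 3 options
    "n_estimators": list(range(20, 201, 20)),            # 10 options
}

# Size of the full grid that an exhaustive search would have to try.
n_settings = math.prod(len(values) for values in param_distributions.values())

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # only 5 randomly sampled settings are evaluated
    cv=3,              # 3-fold cross-validation per setting
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
print("grid size:", n_settings)
print("best sampled setting:", best_params)
```

Raising n_iter explores more of the grid at the cost of runtime; the best setting found is only the best among those sampled, not necessarily the best in the whole grid.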
On each iteration, the algorithm will choose a different combination of hyperparameter values. Altogether, there are 2 * 12 * 2 * 3 * 3 * 10 = 4320 settings! However, the benefit of a random search is that we do not try every combination; instead we sample settings at random, which still covers a wide range of values.
Is the random forest model prone to overfitting?
No, random forest models are generally not prone to overfitting, because bagging and randomized feature selection tend to average out noise in the model. Adding more trees does not cause overfitting, since the randomization process continues to average out noise (more trees generally reduce overfitting in a random forest).
In general, bagging algorithms are robust to overfitting.
Having said that, it is possible to overfit with random forest models if the underlying decision trees have extremely high variance. This happens when the trees are extremely deep, min_samples_split is low, and a large percentage of the features is considered at each split point. For example, if every tree is identical, the random forest may overfit the data just as a single tree would.
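A simple way to check for this in practice is to compare training and held-out accuracy for a high-variance configuration and a more constrained one. The sketch below assumes a noisy synthetic dataset and a hypothetical helper named train_test_scores; the specific hyperparameter values are illustrative, and the size of the gap will depend on your data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Noisy synthetic data: flip_y randomly flips 10% of the labels.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def train_test_scores(**params):
    """Fit a forest with the given settings; return (train accuracy, test accuracy)."""
    forest = RandomForestClassifier(n_estimators=100, random_state=0, **params)
    forest.fit(X_train, y_train)
    return forest.score(X_train, y_train), forest.score(X_test, y_test)

# Fully grown trees that may use every feature at each split.
high_variance = train_test_scores(max_depth=None, min_samples_split=2, max_features=None)

# Shallower trees, larger leaves, random feature subsets.
constrained = train_test_scores(max_depth=5, min_samples_leaf=5, max_features="sqrt")

print("high-variance forest (train, test):", high_variance)
print("constrained forest   (train, test):", constrained)
```

A large difference between the training and test score is the usual sign of overfitting.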
How can my random forest make accurate class predictions?
We need features that have at least some predictive power.
The trees of the forest, and their predictions, need to be uncorrelated (or at least have low correlation with each other). The features and hyperparameters we select affect how correlated the trees end up being.