In a preceding section, we noted the crucial role played by the input, or training, dataset; in this section, we reiterate its importance. From an ML algorithm standpoint, the training dataset is what the Random Forest algorithm uses to train, or fit, the model, generating the parameters the model needs to produce the next best-predicted value. In this chapter, we will put the Random Forest algorithm to work on training (and testing) splits of the Iris dataset. The next paragraph starts with a discussion of the Random Forest algorithm, or simply Random Forests.
The Random Forest algorithm is a decision tree-based supervised learning method. It can be viewed as a composite whole comprising a large number of decision trees; in ML terminology, a Random Forest is an ensemble of many decision trees.
A decision tree, as the name implies, models a progressive decision-making process, made up of a root node followed by successive subtrees. Starting with the root node, the decision tree algorithm works its way through the tree, stopping at every node to pose a do-you-belong-to-a-certain-category question about a feature. Depending on whether the answer is yes or no, the algorithm follows one branch or the other until the next node is encountered, where it repeats its interrogation. At each node, the answer determines the next branch to take; the process terminates at a leaf, which holds the predicted outcome.
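To make the traversal concrete, the following is a minimal sketch of a single decision tree in Scala. The Node type, the feature indices, and the split thresholds here are illustrative assumptions, not values learned from the Iris data:

```scala
// A minimal sketch of a decision tree and its traversal (illustrative only)
sealed trait Node
case class Leaf(prediction: String) extends Node
case class Internal(feature: Int, threshold: Double, left: Node, right: Node) extends Node

def predict(node: Node, features: Array[Double]): String = node match {
  case Leaf(prediction) => prediction
  case Internal(feature, threshold, left, right) =>
    // At each node, pose a yes/no question about one feature value
    if (features(feature) <= threshold) predict(left, features)
    else predict(right, features)
}

// An assumed two-level tree over petal length (index 2) and petal width (index 3)
val tree: Node = Internal(2, 2.45,
  Leaf("setosa"),
  Internal(3, 1.75, Leaf("versicolor"), Leaf("virginica")))

predict(tree, Array(5.1, 3.5, 1.4, 0.2)) // petal length 1.4 <= 2.45, so "setosa"
```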
Speaking of trees, branches, and nodes, the fitted model can be viewed as a tree made up of multiple subtrees. Each decision at a node, and the algorithm's choice of a branch, is the result of an optimal split over the feature variables. A Random Forest algorithm creates multiple decision trees, each trained on a random sample of the data and considering a random subset of the feature variables at each split. That brings us to what Random Forests are: an ensemble of a multitude of decision trees.
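As a sketch of how this randomization is exposed in practice, Spark MLlib's RandomForestClassifier lets us set the size of the forest and the per-split feature sampling strategy. The column names and parameter values below are assumptions for illustration; the actual setup for our project comes later in the chapter:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// Assumes a DataFrame with a numeric "label" column and a "features" vector column
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(20)                  // number of trees in the forest (illustrative)
  .setFeatureSubsetStrategy("auto") // random subset of features considered at each split

// val model = rf.fit(trainingData) // trainingData is prepared later in the chapter
```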
It is to be noted that one decision tree by itself does not work well for a small sample like the Iris dataset; it tends to overfit the training data. This is where the Random Forest algorithm steps in: it brings together, or aggregates, the predictions from its forest of decision trees, typically by majority vote for classification. The aggregated results from the individual decision trees in this forest form one ensemble, better known as a Random Forest.
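To illustrate the aggregation step, here is a minimal majority-vote sketch over the hand-built Node trees from the earlier example; Spark MLlib performs this aggregation internally, so this is purely for intuition:

```scala
// Majority vote over a forest of the illustrative trees defined earlier
def forestPredict(trees: Seq[Node], features: Array[Double]): String =
  trees
    .map(tree => predict(tree, features)) // each tree casts one vote
    .groupBy(identity)                    // tally the votes per class
    .maxBy(_._2.size)._1                  // the most common class wins
```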
We chose the Random Forest method to make our predictions for a good reason: the net prediction formed out of an ensemble of predictions is typically more accurate, and more resistant to overfitting, than that of any single decision tree.
In the next section, we formulate our classification problem, and in the Getting started with Spark section that follows, we give the implementation details for the project.