
PySpark Cookbook

In this recipe, we will build two regression models that will predict forest elevation: the random forest regression model and the gradient-boosted trees regressor.
To execute this recipe, you will need a working Spark environment, and you should have already loaded the data into the forest DataFrame.
No other prerequisites are required.
In this recipe, we will build only a two-stage Pipeline, consisting of the .VectorAssembler(...) and .RandomForestRegressor(...) stages. We will skip the feature selection stage, as it is not currently an automated process.
You can do this manually: see the Selecting the most predictable features recipe earlier in this chapter.
Here's the full code:
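import pyspark.ml.feature as feat
import pyspark.ml.regression as rg

# Collate every column except the first into a single feature vector
vectorAssembler = feat.VectorAssembler(
    inputCols=forest.columns[1:]
    , outputCol='features')

# Random forest regressor that predicts the Elevation column
rf_obj = rg.RandomForestRegressor(
    labelCol='Elevation'
    , maxDepth=10
    , minInstancesPerNode=10
    , minInfoGain=0.1
    , numTrees=10
)
...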