
Therefore, it is time to create our final dataset that we will use to build our models. We will convert our DataFrame into an RDD of LabeledPoints.

A LabeledPoint is an MLlib structure that is used to train machine learning models. It consists of two attributes: label and features. The label is our target variable and features can be a NumPy array, a list, a pyspark.mllib.linalg.SparseVector, a pyspark.mllib.linalg.DenseVector, or a scipy.sparse column matrix.
Before we build our final dataset, we first need to deal with one final obstacle: our 'BIRTH_PLACE' feature is still a string. While any of the other categorical variables can be used as is (as they are now dummy variables), we will use a hashing trick to encode the 'BIRTH_PLACE' feature:
import pyspark.mllib.feature as ft
import pyspark.mllib.regression as reg

hashing = ft.HashingTF(7)

births_hashed = births_transformed \
    .rdd \
    .map(lambda row: [
        list(hashing.transform...
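The snippet above is cut off mid-expression. A minimal sketch of how the full transformation could look is given below; it assumes that births_transformed still holds the columns listed in features_to_keep, in that order, with the target variable in the first position, and the continuation past hashing.transform is a reconstruction rather than necessarily the exact original code:

import pyspark.mllib.feature as ft
import pyspark.mllib.linalg as ln
import pyspark.mllib.regression as reg

# Hash the string-valued 'BIRTH_PLACE' column into a 7-element numeric vector
hashing = ft.HashingTF(7)

births_hashed = births_transformed \
    .rdd \
    .map(lambda row: [
            # replace BIRTH_PLACE with its hashed representation
            # (the code is treated as a short document of characters);
            # keep every other column unchanged
            list(hashing.transform(row[i]).toArray())
                if col == 'BIRTH_PLACE'
                else row[i]
            for i, col in enumerate(features_to_keep)]) \
    .map(lambda row: [e if isinstance(e, list) else [e]
                      for e in row]) \
    .map(lambda row: [item for sublist in row
                      for item in sublist]) \
    .map(lambda row: reg.LabeledPoint(
            row[0],
            ln.Vectors.dense(row[1:])))

The chain first swaps the hashed vector in for the string, then wraps every scalar in a single-element list so the whole record can be flattened into one flat list of numbers, and finally packs each record into a LabeledPoint with the target as the label and the remaining values as a dense feature vector.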