
PySpark Cookbook

Machine learning (ML) is a field of study that uses machines (computers) to model real-world phenomena and predict their behavior. To build an ML model, all of our data needs to be numeric. Since almost all of our features are categorical, we need to transform them first. In this recipe, we will learn how to use the hashing trick and dummy encoding.
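Before turning to Spark, it helps to see what these two encodings do conceptually. The sketch below is plain Python, not the Spark API: the hashing trick maps each (column, value) pair into one of a fixed number of buckets, while dummy (one-hot) encoding creates one 0/1 indicator per distinct level. The bucket count and helper names here are illustrative assumptions, not part of the recipe.

```python
# Conceptual sketch of the two encodings (pure Python, for intuition only).
# NOTE: NUM_BUCKETS and both helpers are illustrative, not a Spark API.

NUM_BUCKETS = 8  # hypothetical fixed size of the hashed feature space

def hash_feature(column, value, num_buckets=NUM_BUCKETS):
    """Hashing trick: map a (column, value) pair to a bucket index.

    Python's built-in hash() is salted per process for strings; real
    implementations use a stable hash (e.g. MurmurHash) instead.
    """
    return hash((column, value)) % num_buckets

def dummy_encode(value, levels):
    """Dummy (one-hot) encoding: one 0/1 indicator per distinct level."""
    return [1 if value == level else 0 for level in levels]
```

The trade-off this illustrates: dummy encoding grows with the number of distinct levels, whereas the hashing trick caps the feature space at `num_buckets` (at the cost of possible collisions).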
To execute this recipe, you need to have a working Spark environment. You should have already gone through the Loading the data recipe, where we loaded the census data into a DataFrame.
No other prerequisites are required.
We will be reducing the dimensionality of our dataset roughly by half, so first we need to extract the total number of distinct values in each column:
len_ftrs = []

for col in cols_cat:
    len_ftrs.append(
        (col, census.select(col).distinct().count())
    )
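To see how these per-column counts feed into halving the dimensionality, here is a small sketch. It assumes len_ftrs holds (column, distinct_count) tuples as built above; the column names and counts below are illustrative stand-ins, not the actual census results.

```python
# Illustrative stand-in for the (column, distinct_count) tuples
# collected into len_ftrs above; the counts here are hypothetical.
len_ftrs = [('workclass', 9), ('education', 16), ('occupation', 15)]

# Total number of distinct levels across all categorical columns;
# this is the size a full dummy encoding would need.
total_levels = sum(count for _, count in len_ftrs)

# Reducing dimensionality roughly by half: use about half as many
# hash buckets as there are distinct levels.
num_hash_features = total_levels // 2
```

With these stand-in counts, a full dummy encoding would need 40 columns, while the hashed representation would use only 20 buckets.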