
PySpark Cookbook
Jumping straight into modeling the data is a misstep almost every new data scientist makes: eager to reach the reward stage, we forget that most of a project's time is actually spent on the unglamorous work of cleaning the data and getting familiar with it. In this recipe, we will explore the census dataset.
To execute this recipe, you need a working Spark environment, and you should have already gone through the previous recipe, where we loaded the census data into a DataFrame.
No other prerequisites are required.
First, we list all the columns we want to keep:
cols_to_keep = census.dtypes
cols_to_keep = (
    ['label', 'age', 'capital-gain', 'capital-loss', 'hours-per-week']
    + [e[0] for e in cols_to_keep[:-1] if e[1] == 'string']
)
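To see why this comprehension works, note that DataFrame.dtypes returns a list of (column_name, dtype_string) tuples, and slicing with [:-1] drops the last entry, which here is the label column. The following minimal sketch illustrates the shape of that list; the one-row toy DataFrame and the SparkSession name spark are our own assumptions for illustration, not part of the recipe:

# Hypothetical one-row DataFrame mimicking the census schema.
toy = spark.createDataFrame(
    [(39, 'State-gov', 2174, 0, 40, '<=50K')],
    ['age', 'workclass', 'capital-gain', 'capital-loss',
     'hours-per-week', 'label'])
print(toy.dtypes)
# [('age', 'bigint'), ('workclass', 'string'), ('capital-gain', 'bigint'),
#  ('capital-loss', 'bigint'), ('hours-per-week', 'bigint'), ('label', 'string')]
# The comprehension keeps the string-typed columns while the slice
# skips the final entry (the label):
print([e[0] for e in toy.dtypes[:-1] if e[1] == 'string'])
# ['workclass']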
Next, we split the features into numerical and categorical, as we will be exploring them separately:
census_subset = census.select(cols_to_keep)
# Numeric features are the int-typed columns; categorical features are
# the string-typed columns, excluding the label at the end.
cols_num = [e[0] for e in census_subset.dtypes if e[1] == 'int']
cols_cat = [e[0] for e in census_subset.dtypes[:-1] if e[1] == 'string']
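With the two lists in hand, here is a brief sketch of what exploring them separately might look like; these particular calls are illustrative choices on our part, not prescribed by the recipe:

# Summary statistics (count, mean, stddev, min, max) for the numeric columns.
census_subset.select(cols_num).describe().show()
# Frequency counts for the first categorical column, most common value first.
census_subset.groupBy(cols_cat[0]).count().orderBy('count', ascending=False).show()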