Essential PySpark for Scalable Data Analytics

Feature transformation is the process of reviewing the variable types present in the training data, such as categorical and continuous variables, and choosing the transformation for each type that gives the best model performance. This section describes, with code examples, how to transform a few common types of variables found in machine learning datasets, such as text and numerical variables.
Categorical variables take discrete values from a limited, finite set. They are usually text-based, but they can also be numerical. Examples include country codes and the month of the year. The previous section covered a few techniques for extracting features from text variables. This section explores a few other algorithms for transforming categorical variables.
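To make the idea of transforming a categorical variable concrete, here is a minimal plain-Python sketch of two common encodings: mapping each category to an integer index, and expanding that index into a one-hot vector. It is an illustration only, not the Spark API. The helper names `fit_label_index` and `one_hot`, the sample country codes, and the choice to assign index 0 to the most frequent category are all assumptions made for this example.

```python
from collections import Counter


def fit_label_index(values):
    """Map each distinct category to an integer index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: idx for idx, category in enumerate(ordered)}


def one_hot(index, size):
    """Return a list of length `size` with 1.0 at position `index`, 0.0 elsewhere."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


# Hypothetical sample column of country codes (a categorical variable).
country_codes = ["US", "IN", "US", "DE", "IN", "US"]

# Learn the category-to-index mapping from the data, then apply it.
label_index = fit_label_index(country_codes)
indexed = [label_index[code] for code in country_codes]

# Expand each integer index into a one-hot vector.
encoded = [one_hot(i, len(label_index)) for i in indexed]

for code, idx, vec in zip(country_codes, indexed, encoded):
    print(code, idx, vec)
```

The integer index alone implies an ordering between categories that usually does not exist, which is why the one-hot step is often applied before feeding such a variable to a model that treats its inputs as numeric magnitudes.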
The Tokenizer
class...