
Data Cleaning and Exploration with Machine Learning

Categorical features can be either nominal or ordinal. Nominal features, such as gender, species name, or country, have a limited number of possible values and are either strings or numbers that carry no intrinsic numerical meaning. For example, if country is represented by 1 for Afghanistan, 2 for Albania, and so on, the data is numerical, but it does not make sense to perform arithmetic operations on those values.
Ordinal features also have a limited number of possible values but are different from nominal features in that the order of the values matters. A Likert scale rating (ranging from 1 for very unlikely to 5 for very likely) is an example of an ordinal feature. Nonetheless, arithmetic operations would not typically make sense because there is no uniform and meaningful distance between values.
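To make the distinction concrete, here is a minimal sketch (the tiny DataFrame and its column names are made up for illustration, not taken from the NLS data) showing how pandas can store a nominal feature as an unordered category and a Likert-style ordinal feature as an ordered category:

import pandas as pd

# hypothetical data for illustration only
demo = pd.DataFrame({
    "country": ["Albania", "Afghanistan", "Albania"],   # nominal feature
    "likelihood": [1, 5, 3]                             # ordinal Likert rating
})

# nominal: unordered category; only equality comparisons are meaningful
demo["country"] = demo["country"].astype("category")

# ordinal: ordered category; 1 < 3 < 5 is meaningful, but differences are not
demo["likelihood"] = pd.Categorical(demo["likelihood"],
                                    categories=[1, 2, 3, 4, 5],
                                    ordered=True)

# ordered comparisons work for the ordinal feature
print(demo["likelihood"] >= 3)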
Before we begin modeling, we want counts of all the possible values for the categorical features we may use. This is typically referred to as a one-way frequency distribution. Fortunately, pandas makes this very easy to do. We can quickly select columns from a pandas DataFrame and use the value_counts method to generate counts for each categorical value:

import pandas as pd
import numpy as np
nls97 = pd.read_csv("data/nls97.csv")
nls97.set_index("personid", inplace=True)
nls97abb = nls97.iloc[:,:20]
nls97abb.dtypes
gender                    object
birthmonth                 int64
birthyear                  int64
highestgradecompleted    float64
maritalstatus             object
childathome              float64
childnotathome           float64
wageincome               float64
weeklyhrscomputer         object
weeklyhrstv               object
nightlyhrssleep          float64
satverbal                float64
satmath                  float64
gpaoverall               float64
gpaenglish               float64
gpamath                  float64
gpascience               float64
highestdegree             object
govprovidejobs            object
govpricecontrols          object
dtype: object
Note
Recall from the previous section how column and row selection works with the loc and iloc accessors. The colon to the left of the comma indicates that we want all the rows, while :20 to the right of the comma gets us the first 20 columns.
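As a quick reminder (a sketch that assumes the columns appear in the order shown by dtypes above), the same selection can be written with either accessor:

# iloc selects by position: all rows, the first 20 columns
nls97abb = nls97.iloc[:, :20]

# loc selects by label: the same 20 columns, named by their first and last labels
nls97abb = nls97.loc[:, "gender":"govpricecontrols"]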
Let's use value_counts to see the counts for each value of maritalstatus. We can also pass dropna=False to value_counts to show the missing values (NaN):

nls97abb.maritalstatus.value_counts(dropna=False)
Married          3066
Never-married    2766
NaN              2312
Divorced          663
Separated         154
Widowed            23
Name: maritalstatus, dtype: int64
If we just want the number of missing values, we can chain the isnull and sum methods. isnull returns a Boolean Series containing True values when maritalstatus is missing and False otherwise. sum then counts the number of True values, since it interprets True values as 1 and False values as 0:

nls97abb.maritalstatus.isnull().sum()
2312
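Relatedly, since the mean of a Boolean Series is the share of True values, chaining isnull and mean gives the proportion missing rather than the count (a small sketch of my own, not from the original text):

# proportion of missing maritalstatus values; about 0.26,
# consistent with the NaN share shown in the normalized frequencies below
nls97abb.maritalstatus.isnull().mean()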
The maritalstatus values were sorted by frequency by default. You can sort them alphabetically by value instead by sorting the index. We can do this by taking advantage of the fact that value_counts returns a Series with the values as the index:

marstatcnt = nls97abb.maritalstatus.value_counts(dropna=False)
type(marstatcnt)
<class 'pandas.core.series.Series'>
marstatcnt.index
Index(['Married', 'Never-married', nan, 'Divorced', 'Separated', 'Widowed'], dtype='object')
Now we can sort by the marital status values with sort_index:

marstatcnt.sort_index()
Divorced          663
Married          3066
Never-married    2766
Separated         154
Widowed            23
NaN              2312
Name: maritalstatus, dtype: int64
We could have gotten the same result in one step with nls97.maritalstatus.value_counts(dropna=False).sort_index(). We can also show ratios instead of counts by setting normalize to True. In the following code, we can see that 34% of the responses were Married (notice that we set dropna to False, so missing values have been included):

nls97.maritalstatus.\
  value_counts(normalize=True, dropna=False).\
  sort_index()
Divorced         0.07
Married          0.34
Never-married    0.31
Separated        0.02
Widowed          0.00
NaN              0.26
Name: maritalstatus, dtype: float64
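The two-decimal output above suggests that a display option was set earlier in the book; if your own output shows many more decimal places, something like the following (an assumption about setup, not shown in the original excerpt) produces the same formatting:

# display floats with two decimal places; affects display only, not the stored values
pd.options.display.float_format = '{:,.2f}'.format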
It is often convenient to convert all the object columns to the category data type. In the following code, we're using select_dtypes to gather the names of the object columns into catcols. Then, we're looping through those columns and using astype to change the data type to category:

catcols = nls97abb.select_dtypes(include=["object"]).columns
for col in nls97abb[catcols].columns:
...   nls97abb[col] = nls97abb[col].astype('category')
...
nls97abb[catcols].dtypes
gender               category
maritalstatus        category
weeklyhrscomputer    category
weeklyhrstv          category
highestdegree        category
govprovidejobs       category
govpricecontrols     category
dtype: object
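As an aside (my own variation, not the author's code), the loop is not strictly necessary; passing the column list to astype converts all of them in one step:

# convert every object column to category in a single call
nls97abb[catcols] = nls97abb[catcols].astype('category')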
Let's check our categorical features for missing values. There are no missing values for gender and very few for highestdegree. But the overwhelming majority of values for govprovidejobs (the government should provide jobs) and govpricecontrols (the government should control prices) are missing. This means that those features probably won't be useful for most modeling:

nls97abb[catcols].isnull().sum()
gender                  0
maritalstatus        2312
weeklyhrscomputer    2274
weeklyhrstv          2273
highestdegree          31
govprovidejobs       7151
govpricecontrols     7125
dtype: int64
We can generate frequencies for multiple features at once by passing a value_counts call to apply. We can use filter to select the columns that we want – in this case, all the columns with gov in their name. Note that the missing values for each feature have been omitted since we did not set dropna to False:

nls97abb.filter(like="gov").apply(pd.value_counts, normalize=True)
                   govprovidejobs  govpricecontrols
1. Definitely                0.25              0.54
2. Probably                  0.34              0.33
3. Probably not              0.25              0.09
4. Definitely not            0.16              0.04
If we want these frequencies for just one segment of the data – say, people who are married – we can subset the rows with nls97abb[nls97abb.maritalstatus=="Married"] before filter:

nls97abb.loc[nls97abb.maritalstatus=="Married"].\
  filter(like="gov").\
  apply(pd.value_counts, normalize=True)
                   govprovidejobs  govpricecontrols
1. Definitely                0.17              0.46
2. Probably                  0.33              0.38
3. Probably not              0.31              0.11
4. Definitely not            0.18              0.05
We could have gotten the same result by naming the columns explicitly with loc instead of using filter:

nls97abb.loc[nls97abb.maritalstatus=="Married",
  ['govprovidejobs','govpricecontrols']].\
  apply(pd.value_counts, normalize=True)
                   govprovidejobs  govpricecontrols
1. Definitely                0.17              0.46
2. Probably                  0.33              0.38
3. Probably not              0.31              0.11
4. Definitely not            0.18              0.05
Nonetheless, it will often be easier to use filter, since it is not unusual to have to do the same cleaning or exploration task on groups of features with similar names.
There are times when we may want to model a continuous or discrete feature as categorical. The NLS DataFrame contains highestgradecompleted. A one-year increase from 5 to 6 may not be as important as an increase from 11 to 12 in terms of its impact on a target. Let's create a dichotomous feature instead – that is, 1 when the person has completed 12 or more grades, 0 if they have completed fewer than that, and missing when highestgradecompleted is missing.
highestgradecompleted has two logical missing values – an actual NaN value that pandas recognizes as missing, and a value of 95 that the survey designers intend for us to also treat as missing for most use cases. Let's use replace to fix that before moving on:

nls97abb.highestgradecompleted.\
  replace(95, np.nan, inplace=True)
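Calling replace with inplace=True on a column selected by attribute can trigger chained-assignment warnings in newer pandas versions, so assigning the result back to the column is a safer pattern (my own suggestion, not from the original):

# same cleanup without inplace modification of a selected column
nls97abb['highestgradecompleted'] = \
  nls97abb['highestgradecompleted'].replace(95, np.nan)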
Now, we can use NumPy's where function to assign values to highschoolgrad based on the values of highestgradecompleted. If highestgradecompleted is null (NaN), we assign NaN to our new column, highschoolgrad. If the value for highestgradecompleted is not null, the next clause tests for a value less than 12, setting highschoolgrad to 0 if that is true, and to 1 otherwise. We can confirm that the new column, highschoolgrad, contains the values we want by using groupby to get the minimum and maximum values of highestgradecompleted at each level of highschoolgrad:

nls97abb['highschoolgrad'] = \
  np.where(nls97abb.highestgradecompleted.isnull(), np.nan, \
  np.where(nls97abb.highestgradecompleted<12, 0, 1))
nls97abb.groupby(['highschoolgrad'], dropna=False) \
  ['highestgradecompleted'].agg(['min','max','size'])
                min  max  size
highschoolgrad
0                 5   11  1231
1                12   20  5421
nan             nan  nan  2332
nls97abb['highschoolgrad'] = \
...   nls97abb['highschoolgrad'].astype('category')
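If the nested where calls are hard to read, NumPy's select lists the conditions in order and is an equivalent alternative (a sketch of my own, not the author's code):

# same dichotomization written with np.select; 'highschoolgrad2' is a hypothetical
# column name used here so the original column is left untouched
conditions = [nls97abb.highestgradecompleted.isnull(),
              nls97abb.highestgradecompleted < 12]
choices = [np.nan, 0]
nls97abb['highschoolgrad2'] = np.select(conditions, choices, default=1)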
While 12 makes conceptual sense as the threshold for classifying our new feature, highschoolgrad, this would present some modeling challenges if we intended to use highschoolgrad as a target. There is a pretty substantial class imbalance, with the highschoolgrad equal to 1 group being more than four times the size of the 0 group. We should explore using more groups to represent highestgradecompleted.
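A quick way to quantify that imbalance is to normalize the frequencies for the new feature (a minimal sketch consistent with the counts above):

# roughly 0.81 versus 0.19 once missing values are excluded
nls97abb['highschoolgrad'].value_counts(normalize=True)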
One way to create more groups is with the pandas qcut function. We can set the q parameter of qcut to 6 to create six groups that are as evenly distributed as possible. These groups are now closer to being balanced:

nls97abb['highgradegroup'] = \
  pd.qcut(nls97abb['highestgradecompleted'],
    q=6, labels=[1,2,3,4,5,6])
nls97abb.groupby(['highgradegroup'])['highestgradecompleted'].\
  agg(['min','max','size'])
                min  max  size
highgradegroup
1                 5   11  1231
2                12   12  1389
3                13   14  1288
4                15   16  1413
5                17   17   388
6                18   20   943
nls97abb['highgradegroup'] = \
  nls97abb['highgradegroup'].astype('category')
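If you would rather choose the breakpoints yourself instead of letting qcut find quantiles, pd.cut accepts explicit bin edges. The edges and the gradegroup2 column name below are illustrative assumptions, not from the original text:

# explicit bins: less than high school, high school, some college, college or beyond
nls97abb['gradegroup2'] = pd.cut(nls97abb['highestgradecompleted'],
                                 bins=[4, 11, 12, 16, 20],
                                 labels=['<HS', 'HS', 'some college', 'college+'])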
Finally, it is often useful to save one-way frequencies for all of the categorical features in a DataFrame to a file. We can do that by looping over the category columns and calling value_counts each time:

freqout = open('views/frequencies.txt', 'w')
for col in nls97abb.select_dtypes(include=["category"]):
  print(col, "----------------------",
    "frequencies",
    nls97abb[col].value_counts(dropna=False).sort_index(),
    "percentages",
    nls97abb[col].value_counts(normalize=True).\
      sort_index(),
    sep="\n\n", end="\n\n\n", file=freqout)
freqout.close()
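One small refinement (my own suggestion, not in the original): opening the file in a with block closes it automatically even if the loop raises an error:

# same report written with a context manager
with open('views/frequencies.txt', 'w') as freqout:
    for col in nls97abb.select_dtypes(include=["category"]):
        print(col, "----------------------",
              "frequencies",
              nls97abb[col].value_counts(dropna=False).sort_index(),
              "percentages",
              nls97abb[col].value_counts(normalize=True).sort_index(),
              sep="\n\n", end="\n\n\n", file=freqout)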
These are the key techniques for generating one-way frequencies for the categorical features in your data. The real star of the show has been the value_counts method. We can use value_counts to create frequencies one Series at a time, use it with apply for multiple columns, or iterate over several columns and call value_counts each time. We have looked at examples of each in this section. Next, let's explore some techniques for examining the distribution of continuous features.