Pandas 1.x Cookbook

Many articles are written every year on the different aspects and impacts of diversity on college campuses. Various organizations have developed metrics attempting to measure diversity. US News is a leader in providing rankings for many different categories of colleges, with diversity being one of them. Their 10 most diverse colleges, along with each school's Diversity Index, are as follows:
>>> pd.read_csv(
... "data/college_diversity.csv", index_col="School"
... )
                                                  Diversity Index
School
Rutgers University--Newark Newark, NJ                        0.76
Andrews University Berrien Springs, MI                       0.74
Stanford University Stanford, CA                             0.74
University of Houston Houston, TX                            0.74
University of Nevada--Las Vegas Las Vegas, NV                0.74
University of San Francisco San Francisco, CA                0.74
San Francisco State University San Francisco, CA             0.73
University of Illinois--Chicago Chicago, IL                  0.73
New Jersey Institute of Technology Newark, NJ                0.72
Texas Woman's University Denton, TX                          0.72
Our college dataset classifies race into nine different categories. When trying to quantify something without an obvious definition, such as diversity, it helps to start with something simple. In this recipe, our diversity metric will equal the number of races that each make up 15% or more of the student population.
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
>>> (
... college_ugds.isnull()
... .sum(axis="columns")
... .sort_values(ascending=False)
... .head()
... )
INSTNM
Excel Learning Center-San Antonio South 9
Philadelphia College of Osteopathic Medicine 9
Assemblies of God Theological Seminary 9
Episcopal Divinity School 9
Phillips Graduate Institute 9
dtype: int64
Use the .dropna method to drop all rows that have all nine race percentages missing. We can then count the remaining missing values:
>>> college_ugds = college_ugds.dropna(how="all")
>>> college_ugds.isnull().sum()
UGDS_WHITE 0
UGDS_BLACK 0
UGDS_HISP 0
UGDS_ASIAN 0
UGDS_AIAN 0
UGDS_NHPI 0
UGDS_2MOR 0
UGDS_NRA 0
UGDS_UNKN 0
dtype: int64
Use the greater than or equal to method, .ge, to return a DataFrame with a Boolean value for each cell:
>>> college_ugds.ge(0.15)
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Alabama A...       False        True  ...     False      False
Universit...        True        True  ...     False      False
Amridge U...        True        True  ...     False       True
Universit...        True       False  ...     False      False
Alabama S...       False        True  ...     False      False
...                  ...         ...  ...       ...        ...
Hollywood...        True        True  ...     False      False
Hollywood...       False        True  ...     False      False
Coachella...        True       False  ...     False      False
Dewey Uni...       False       False  ...     False      False
Coastal P...        True        True  ...     False      False
Use the .sum method to count the True values for each college. Notice that a Series is returned:
>>> diversity_metric = college_ugds.ge(0.15).sum(
... axis="columns"
... )
>>> diversity_metric.head()
INSTNM
Alabama A & M University 1
University of Alabama at Birmingham 2
Amridge University 3
University of Alabama in Huntsville 1
Alabama State University 1
dtype: int64
To see how the metric is distributed, use the .value_counts method on this Series:
>>> diversity_metric.value_counts()
1 3042
2 2884
3 876
4 63
0 7
5 2
dtype: int64
Two schools have five race categories at 15% or more. Sort the diversity_metric Series to find out which ones they are:
>>> diversity_metric.sort_values(ascending=False).head()
INSTNM
Regency Beauty Institute-Austin 5
Central Texas Beauty College-Temple 5
Sullivan and Cogliano Training Center 4
Ambria College of Nursing 4
Berkeley College-New York 4
dtype: int64
Use .loc to select rows based on the index label:
>>> college_ugds.loc[
... [
... "Regency Beauty Institute-Austin",
... "Central Texas Beauty College-Temple",
... ]
... ]
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Regency B...      0.1867      0.2133  ...       0.0     0.2667
Central T...      0.1616      0.2323  ...       0.0     0.1515
Look up how the top five US News schools score on our metric:
>>> us_news_top = [
... "Rutgers University-Newark",
... "Andrews University",
... "Stanford University",
... "University of Houston",
... "University of Nevada-Las Vegas",
... ]
>>> diversity_metric.loc[us_news_top]
INSTNM
Rutgers University-Newark 4
Andrews University 3
Stanford University 3
University of Houston 3
University of Nevada-Las Vegas 3
dtype: int64
Step 2 counts and then displays the schools with the highest number of missing values. As there are nine columns in the DataFrame, the maximum number of missing values per school is nine. Many schools are missing values for each column. Step 3 removes rows that have all their values missing. The .dropna method in step 3 has the how parameter, which defaults to the string 'any', but may also be changed to 'all'. When set to 'any', it drops rows that contain one or more missing values. When set to 'all', it only drops rows where all values are missing.
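The difference between the two settings is easiest to see on a small frame. The following is a minimal sketch on made-up data; the `toy` DataFrame and its school names are hypothetical and are not part of the college dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three schools, two race-percentage columns.
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.6, np.nan, np.nan],
        "UGDS_BLACK": [0.4, 0.3, np.nan],
    },
    index=["School A", "School B", "School C"],
)

# how="any" (the default) drops every row with at least one missing value.
dropped_any = toy.dropna(how="any")

# how="all" drops only the rows in which every value is missing.
dropped_all = toy.dropna(how="all")

print(list(dropped_any.index))  # ['School A']
print(list(dropped_all.index))  # ['School A', 'School B']
```

School B survives how="all" because only one of its two values is missing.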
In this case, we conservatively drop rows that are missing all values. This is because it's possible that some missing values represent 0 percent. This did not happen to be the case here, as there were no missing values after the .dropna method was performed. If there were still missing values, we could have run the .fillna(0) method to fill all the remaining values with 0.
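As a sketch of that fallback, the snippet below chains the two calls on made-up data. The `toy` frame is hypothetical, and treating a remaining missing value as 0 percent is the assumption discussed above, not something the data guarantees:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: School B has one missing value, School C has only missing values.
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.6, np.nan, np.nan],
        "UGDS_ASIAN": [0.4, 0.3, np.nan],
    },
    index=["School A", "School B", "School C"],
)

# Drop rows with no information at all, then assume any remaining gap means 0 percent.
cleaned = toy.dropna(how="all").fillna(0)

# Total number of missing values left in the frame.
remaining_missing = int(cleaned.isnull().sum().sum())

print(cleaned)
print(remaining_missing)  # 0
```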
Step 5 begins our diversity metric calculation using the greater than or equal to method, .ge. This results in a DataFrame of all Booleans, which is summed horizontally by setting axis='columns'.
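To make the row-wise count concrete, here is a minimal sketch on a hypothetical three-school frame (the values are invented for illustration). It also shows why the choice of .ge over .gt matters when a share is exactly 0.15:

```python
import pandas as pd

# Hypothetical toy data; each row sums to 1.
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.70, 0.15, 0.30],
        "UGDS_BLACK": [0.20, 0.40, 0.30],
        "UGDS_HISP": [0.10, 0.45, 0.40],
    },
    index=["School A", "School B", "School C"],
)

# Boolean DataFrame: True where a race makes up 15% or more of the students.
at_least_15 = toy.ge(0.15)

# True counts as 1 and False as 0, so summing across columns counts the Trues per school.
metric = at_least_15.sum(axis="columns")

# A strict comparison excludes the share that sits exactly on the threshold.
metric_strict = toy.gt(0.15).sum(axis="columns")

print(metric.tolist())         # [2, 3, 3]
print(metric_strict.tolist())  # [2, 2, 3]
```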
The .value_counts method is used in step 6 to produce a distribution of our diversity metric. Most schools have only one or two races at 15% or more of the undergraduate student population. Three or more is uncommon (941 of the 6,874 schools), and four or more is quite rare (65 schools). Step 7 and step 8 find the two schools that are the most diverse based on our metric. Although they are diverse, it appears that many of the races are not fully accounted for and are defaulted into the unknown and two or more categories.
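The same kind of distribution can be sketched on a small, made-up Series. The values below are hypothetical; normalize=True and .sort_index are standard pandas options that the recipe itself does not use:

```python
import pandas as pd

# Hypothetical diversity metric for seven schools.
metric = pd.Series(
    [1, 2, 1, 3, 2, 1, 0],
    index=[f"School {letter}" for letter in "ABCDEFG"],
    name="diversity_metric",
)

# Frequency of each metric value, most common value first.
counts = metric.value_counts()

# The same distribution as proportions, ordered by metric value instead of frequency.
shares = metric.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```

Sorting by index reads more naturally when the values themselves are ordered, as they are for a count of races.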
Step 9 selects the top five schools from the US News article. It then selects their diversity metric from our newly created Series. It turns out that these schools also score highly with our simple ranking system.
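Selecting with a list of labels only works when every label matches the index exactly. The US News table writes "Rutgers University--Newark" with a double hyphen, while the list passed to .loc uses a single hyphen. The sketch below, on a hypothetical three-school Series whose values are copied from the outputs above, shows one way to guard against such mismatches:

```python
import pandas as pd

# Hypothetical slice of the diversity metric, indexed by school name.
metric = pd.Series(
    [4, 3, 1],
    index=pd.Index(
        [
            "Rutgers University-Newark",
            "Stanford University",
            "Alabama State University",
        ],
        name="INSTNM",
    ),
)

# The first name uses the double-hyphen spelling, so it is not in the index.
wanted = ["Rutgers University--Newark", "Stanford University"]

# .loc raises KeyError when any requested label is missing.
try:
    metric.loc[wanted]
    raised = False
except KeyError:
    raised = True

# Split the request into labels that exist and labels that need fixing.
present = [name for name in wanted if name in metric.index]
missing = [name for name in wanted if name not in metric.index]
selected = metric.loc[present]

print(raised)             # True
print(missing)            # ['Rutgers University--Newark']
print(selected.tolist())  # [3]
```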
Alternatively, we can find the schools that are least diverse by ordering them by their maximum race percentage:
>>> (
... college_ugds.max(axis=1)
... .sort_values(ascending=False)
... .head(10)
... )
INSTNM
Dewey University-Manati 1.0
Yeshiva and Kollel Harbotzas Torah 1.0
Mr Leon's School of Hair Design-Lewiston 1.0
Dewey University-Bayamon 1.0
Shepherds Theological Seminary 1.0
Yeshiva Gedolah Kesser Torah 1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias 1.0
Yeshiva Shaar Hatorah 1.0
Bais Medrash Elyon 1.0
Yeshiva of Nitra Rabbinical College 1.0
dtype: float64
We can also determine if any school has all nine race categories exceeding 1%:
>>> (college_ugds > 0.01).all(axis=1).any()
True
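Both of these checks reduce each row to a single value and then summarize across schools. A minimal sketch on hypothetical data (three invented schools and three race columns instead of nine) shows the intermediate results:

```python
import pandas as pd

# Hypothetical toy data; each row sums to 1.
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [1.0, 0.5, 0.34],
        "UGDS_BLACK": [0.0, 0.495, 0.33],
        "UGDS_HISP": [0.0, 0.005, 0.33],
    },
    index=["School A", "School B", "School C"],
)

# Largest single race share per school; a value of 1.0 means only one race is present.
largest_share = toy.max(axis=1).sort_values(ascending=False)

# True only for schools where every race category exceeds 1%.
every_race_above_1pct = (toy > 0.01).all(axis=1)

# Collapse to one answer: does at least one such school exist?
any_school = bool(every_race_above_1pct.any())

print(list(largest_share.index))       # ['School A', 'School B', 'School C']
print(every_race_above_1pct.tolist())  # [False, False, True]
print(any_school)                      # True
```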