import numpy as np
import pandas as pd
import seaborn as sns
pd.options.mode.copy_on_write = "warn"
pd.set_option("display.max_rows", 10)7 Intro to Exploratory Data Analysis with Python
In this chapter, we will learn the basics of exploratory data analysis in Python using Pandas (and just a tiny bit of seaborn).
7.1 Introduction
Exploratory Data Analysis, or EDA for short, is the process of building an understanding of your data. Before jumping into complex statistical analyses or building predictive models, EDA helps you understand what your data actually contains. It’s about visually and statistically summarizing your dataset, identifying patterns, spotting anomalies, and generating hypotheses.
EDA is a critical step in discovery-based research (sometimes known as foundational or exploratory research). As biologists, you will be familiar with hypothesis-driven research, whereby you start with the answer (the hypothesis), and try to work back to either prove or disprove it using the scientific method. Discovery-based research fits in even before hypothesis-driven research can begin, and is especially useful in cases where we know so little about the topic or system in question that we can’t craft useful hypotheses. One of its main goals is to build understanding of complex systems and generate hypotheses that can be tested in the more classical style of hypothesis-driven research, and EDA is a critical step in this process.
While EDA is often closely connected with discovery-based research, it is important to note that it is also a critical aspect of hypothesis-driven processes as well. For example, EDA can be a powerful tool for data quality control and assumption checking. It’s important to identify missing values, outliers, or other bad data that could compromise your analysis. Further, many statistical methods have assumptions about your data (like normality or constant variance of errors). EDA helps you verify if these assumptions are reasonable. Without proper exploration of your data, you might miss critical insights or, worse, draw incorrect conclusions from your analyses.
EDA has a role in helping you to build an intuition for your problem domain and your data. Regularly engaging with EDA will help you get a “feel” for your data and better understand its strengths and limitations. This intuition is critical for effective communication of your findings and for productive discussion of your data and problem domain with collaborators and stakeholders, or in publications.
Important Python libraries for doing EDA
Python has a strong set of libraries for exploratory data analysis (EDA). Here are some of the more common ones:
- Pandas: Essential for working with tabular data, offering powerful DataFrame operations.
- NumPy: Provides fast array operations, forming the backbone of numerical computing in Python.
- SciPy: Useful for advanced statistical analysis and scientific computing.
- Statsmodels: Extends on the statistical models provided by SciPy and provides and alternative interface.
- Matplotlib: A versatile library for creating static, animated, and interactive plots.
- Seaborn: Simplifies statistical visualization with built-in themes and functions.
- Jupyter and Quarto: Computational notebooks
There are many more, but you will see these popping up again and again.
Why We’re Using Pandas
We’re using Pandas in this chapter because:
- Works well with tabular data: Most biological data is structured like a spreadsheet or database table, and Pandas is built for handling this format.
- Widely used: It’s a common tool in both academia and industry.
- Relatively easy to use: Pandas provides a straightforward way to explore and manipulate data.
- Has useful built-ins: Filtering, grouping, summarizing, and plotting data often take just a few lines of code.
- Plays well with others: Pandas integrates smoothly with visualization and statistical tools.
- Uses similar concepts to R’s tidyverse, which many of you have experience with from your previous coursework
We’ll also use a bit of seaborn for visualization, as it can help with certain types of plots or when data is in a certain format.
Practical Examples
In this chapter, we’ll learn exploratory data analysis (EDA) by working through real research questions with real datasets. Instead of covering every Pandas function upfront, we’ll introduce tools as we need them.
We’ll use datasets from the CORGIS collection, which offers accessible real-world data. Our examples include:
- State Demographics: Analyzing population patterns and economic indicators across the U.S.
- Cancer Statistics: Examining cancer rates and their potential links to demographics.
- Vaccination Impact: Exploring historical disease data to see how vaccines have shaped public health.
These examples will help you learn Pandas in context, and use techniques that are similar to those you could use when getting started with a real research project.
Key Pandas Functionality
After working through the examples, we’ll summarize the essential Pandas operations, including:
- Loading and examining data
- Selecting, filtering, and sorting
- Grouping and aggregating
- Basic visualization
- Merging and joining datasets
By focusing on core functions, you’ll gain practical skills without getting lost in the details. Let’s dive in!
7.2 Import Needed Libraries
The first thing we need to do is to import some libraries. We are only using numpy for a couple things in this chapter: specifying data types the NaN value.
Note: While we don’t need to import it, you will also need to have SciPy installed to run the clustered heatmaps.
7.3 State Demographics Data
To start, we are going to look at the state demographics data from The Collection of Really Great, Interesting, Situated Datasets (CORGIS). CORGIS is a collection of datasets that have been cleaned and otherwise made ready for teaching/learning purposes. It was created in part by Dr. Cory Bart, who is a professor at UD.
This state demographics data includes a lot of info about states that we will be able to use to try and explain some of the cancer trends that we see in the next section. To give you an idea of the kinds of data we’ll be working with, here are some of the data categories:
- Population
- Age
- Ethnicities
- Housing
- Income
- Employment
Importing Data
The first thing we need to do is import the data. To do that, we can use the read_csv() function:
state_demographics = pd.read_csv("../_data/state_demographics.csv")Data Overview
After importing data, it’s always a good idea to check out its basic info, things like shape, column names, basic summary statistics, etc.
To get the number of rows and columns in a data frame, we use [shape()][https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html]:
state_demographics.shape(51, 48)That’s not too much data, so let’s look at the table directly. We will use the head() function to take only the first few rows:
state_demographics.head()| State | Population.Population Percent Change | Population.2014 Population | Population.2010 Population | Age.Percent Under 5 Years | Age.Percent Under 18 Years | Age.Percent 65 and Older | Miscellaneous.Percent Female | Ethnicities.White Alone | Ethnicities.Black Alone | ... | Employment.Nonemployer Establishments | Employment.Firms.Total | Employment.Firms.Men-Owned | Employment.Firms.Women-Owned | Employment.Firms.Minority-Owned | Employment.Firms.Nonminority-Owned | Employment.Firms.Veteran-Owned | Employment.Firms.Nonveteran-Owned | Population.Population per Square Mile | Miscellaneous.Land Area | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Connecticut | -10.2 | 3605944 | 3574097 | 5.1 | 20.4 | 17.7 | 51.2 | 79.7 | 12.2 | ... | 286874 | 326693 | 187845 | 106678 | 56113 | 259614 | 31056 | 281182 | 738.1 | 4842.36 | 
| 1 | Delaware | 8.4 | 989948 | 897934 | 5.6 | 20.9 | 19.4 | 51.7 | 69.2 | 23.2 | ... | 68623 | 73418 | 38328 | 23964 | 14440 | 54782 | 7206 | 60318 | 460.8 | 1948.54 | 
| 2 | District of Columbia | 17.3 | 689545 | 601723 | 6.4 | 18.2 | 12.4 | 52.6 | 46.0 | 46.0 | ... | 62583 | 63408 | 30237 | 27064 | 29983 | 29521 | 5070 | 54217 | 9856.5 | 61.05 | 
| 3 | Florida | 14.2 | 21538187 | 18801310 | 5.3 | 19.7 | 20.9 | 51.1 | 77.3 | 16.9 | ... | 2388050 | 2100187 | 1084885 | 807817 | 926112 | 1121749 | 185756 | 1846686 | 350.6 | 53624.76 | 
| 4 | Georgia | 9.6 | 10711908 | 9687653 | 6.2 | 23.6 | 14.3 | 51.4 | 60.2 | 32.6 | ... | 955621 | 929864 | 480578 | 376506 | 371588 | 538893 | 96787 | 800585 | 168.4 | 57513.49 | 
5 rows × 48 columns
To get summary statistics of the numeric rows of a data frame, we use describe(). This can help you to get an overall sense of your data.
state_demographics.describe()| Population.Population Percent Change | Population.2014 Population | Population.2010 Population | Age.Percent Under 5 Years | Age.Percent Under 18 Years | Age.Percent 65 and Older | Miscellaneous.Percent Female | Ethnicities.White Alone | Ethnicities.Black Alone | Ethnicities.American Indian and Alaska Native Alone | ... | Employment.Nonemployer Establishments | Employment.Firms.Total | Employment.Firms.Men-Owned | Employment.Firms.Women-Owned | Employment.Firms.Minority-Owned | Employment.Firms.Nonminority-Owned | Employment.Firms.Veteran-Owned | Employment.Firms.Nonveteran-Owned | Population.Population per Square Mile | Miscellaneous.Land Area | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 51.000000 | 5.100000e+01 | 5.100000e+01 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | ... | 5.100000e+01 | 5.100000e+01 | 5.100000e+01 | 5.100000e+01 | 5.100000e+01 | 5.100000e+01 | 51.000000 | 5.100000e+01 | 51.000000 | 51.000000 | 
| mean | 5.147059 | 6.499006e+06 | 6.053834e+06 | 5.958824 | 22.139216 | 16.878431 | 50.598039 | 78.068627 | 11.872549 | 2.005882 | ... | 5.193242e+05 | 5.452549e+05 | 2.923801e+05 | 1.938421e+05 | 1.560321e+05 | 3.737521e+05 | 49611.882353 | 4.733260e+05 | 384.403922 | 69253.047843 | 
| std | 6.870760 | 7.408023e+06 | 6.823984e+06 | 0.607018 | 1.996805 | 2.008812 | 0.836777 | 13.024907 | 10.704057 | 3.105441 | ... | 6.688605e+05 | 6.614342e+05 | 3.524479e+05 | 2.468993e+05 | 2.992485e+05 | 3.615840e+05 | 51941.581563 | 5.892086e+05 | 1377.354603 | 85526.076023 | 
| min | -13.300000 | 5.768510e+05 | 5.636260e+05 | 4.700000 | 18.200000 | 11.400000 | 47.900000 | 25.500000 | 0.600000 | 0.300000 | ... | 5.304200e+04 | 6.242700e+04 | 3.003900e+04 | 1.934400e+04 | 2.354000e+03 | 2.952100e+04 | 5070.000000 | 5.135300e+04 | 1.200000 | 61.050000 | 
| 25% | 1.950000 | 1.816411e+06 | 1.696962e+06 | 5.700000 | 21.050000 | 16.100000 | 50.200000 | 71.250000 | 3.650000 | 0.500000 | ... | 1.216330e+05 | 1.431060e+05 | 7.392400e+04 | 4.478700e+04 | 1.472200e+04 | 1.261350e+05 | 14892.500000 | 1.200765e+05 | 45.800000 | 33334.515000 | 
| 50% | 4.100000 | 4.505836e+06 | 4.339367e+06 | 6.000000 | 22.100000 | 16.900000 | 50.700000 | 80.600000 | 8.500000 | 0.700000 | ... | 3.026530e+05 | 3.393050e+05 | 1.878450e+05 | 1.230150e+05 | 6.125200e+04 | 2.762690e+05 | 36273.000000 | 2.887900e+05 | 101.200000 | 53624.760000 | 
| 75% | 9.800000 | 7.428392e+06 | 6.636084e+06 | 6.200000 | 23.250000 | 17.850000 | 51.200000 | 86.900000 | 16.300000 | 1.600000 | ... | 5.568705e+05 | 5.790585e+05 | 3.276305e+05 | 2.041645e+05 | 1.288060e+05 | 4.462370e+05 | 58167.500000 | 4.975955e+05 | 221.450000 | 80692.730000 | 
| max | 17.300000 | 3.953822e+07 | 3.725396e+07 | 7.700000 | 29.000000 | 21.200000 | 52.600000 | 94.400000 | 46.000000 | 15.600000 | ... | 3.453769e+06 | 3.548449e+06 | 1.852580e+06 | 1.320085e+06 | 1.619857e+06 | 1.819107e+06 | 252377.000000 | 3.176341e+06 | 9856.500000 | 570640.950000 | 
8 rows × 47 columns
Do you notice how the values in the table have a lot of precision, and are all using scientific notation? Sometimes this is what we want, but we really don’t need all that here, and it’s only serving to clutter up the view. We can control the precision of the numbers in the table using the .style.format() pattern. Since we have some large numbers in there, let’s add a thousands place separator as well.
state_demographics.describe().style.format(
    # Set precision of numbers to 2 decimal places
    precision=2,
    # Use a comma to separate out the thousands in the big numbers
    thousands=",",
)| Population.Population Percent Change | Population.2014 Population | Population.2010 Population | Age.Percent Under 5 Years | Age.Percent Under 18 Years | Age.Percent 65 and Older | Miscellaneous.Percent Female | Ethnicities.White Alone | Ethnicities.Black Alone | Ethnicities.American Indian and Alaska Native Alone | Ethnicities.Asian Alone | Ethnicities.Native Hawaiian and Other Pacific Islander Alone | Ethnicities.Two or More Races | Ethnicities.Hispanic or Latino | Ethnicities.White Alone, not Hispanic or Latino | Miscellaneous.Veterans | Miscellaneous.Foreign Born | Housing.Housing Units | Housing.Homeownership Rate | Housing.Median Value of Owner-Occupied Units | Housing.Households | Housing.Persons per Household | Miscellaneous.Living in Same House +1 Years | Miscellaneous.Language Other than English at Home | Housing.Households with a computer | Housing.Households with a Internet | Education.High School or Higher | Education.Bachelor's Degree or Higher | Miscellaneous.Percent Under 66 Years With a Disability | Miscellaneous.Percent Under 65 Years Without Health insurance | Sales.Accommodation and Food Services Sales | Miscellaneous.Manufacturers Shipments | Sales.Retail Sales | Miscellaneous.Mean Travel Time to Work | Income.Median Houseold Income | Income.Per Capita Income | Income.Persons Below Poverty Level | Employment.Nonemployer Establishments | Employment.Firms.Total | Employment.Firms.Men-Owned | Employment.Firms.Women-Owned | Employment.Firms.Minority-Owned | Employment.Firms.Nonminority-Owned | Employment.Firms.Veteran-Owned | Employment.Firms.Nonveteran-Owned | Population.Population per Square Mile | Miscellaneous.Land Area | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 49.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 
| mean | 5.15 | 6,499,005.51 | 6,053,834.08 | 5.96 | 22.14 | 16.88 | 50.60 | 78.07 | 11.87 | 2.01 | 4.55 | 0.40 | 3.11 | 12.25 | 67.67 | 357,457.29 | 9.41 | 2,738,906.75 | 65.67 | 233,176.47 | 2,367,765.65 | 2.56 | 85.24 | 14.98 | 89.98 | 82.05 | 89.54 | 31.77 | 9.16 | 9.96 | 13,885,070.55 | 116,411,562.24 | 82,741,605.31 | 24.80 | 63,097.86 | 33,743.06 | 12.17 | 519,324.16 | 545,254.90 | 292,380.10 | 193,842.12 | 156,032.14 | 373,752.12 | 49,611.88 | 473,326.00 | 384.40 | 69,253.05 | 
| std | 6.87 | 7,408,022.55 | 6,823,984.27 | 0.61 | 2.00 | 2.01 | 0.84 | 13.02 | 10.70 | 3.11 | 5.52 | 1.41 | 3.21 | 10.35 | 16.18 | 351,951.29 | 6.11 | 2,888,415.00 | 5.44 | 109,205.70 | 2,531,758.54 | 0.17 | 2.06 | 9.77 | 2.55 | 3.79 | 2.70 | 6.43 | 1.77 | 3.64 | 16,274,397.82 | 131,704,461.49 | 91,289,768.45 | 3.98 | 10,715.13 | 5,689.58 | 2.68 | 668,860.54 | 661,434.25 | 352,447.95 | 246,899.31 | 299,248.47 | 361,584.05 | 51,941.58 | 589,208.57 | 1,377.35 | 85,526.08 | 
| min | -13.30 | 576,851.00 | 563,626.00 | 4.70 | 18.20 | 11.40 | 47.90 | 25.50 | 0.60 | 0.30 | 0.80 | 0.00 | 1.30 | 1.70 | 21.70 | 26,156.00 | 1.70 | 280,291.00 | 41.60 | 119,000.00 | 230,101.00 | 2.30 | 80.50 | 2.60 | 83.80 | 71.50 | 83.30 | 20.60 | 6.40 | 3.50 | 1,564,272.00 | 309,832.00 | 4,439,933.00 | 17.20 | 45,081.00 | 24,369.00 | 7.30 | 53,042.00 | 62,427.00 | 30,039.00 | 19,344.00 | 2,354.00 | 29,521.00 | 5,070.00 | 51,353.00 | 1.20 | 61.05 | 
| 25% | 1.95 | 1,816,411.00 | 1,696,961.50 | 5.70 | 21.05 | 16.10 | 50.20 | 71.25 | 3.65 | 0.50 | 1.80 | 0.10 | 2.10 | 5.30 | 57.40 | 116,811.50 | 4.75 | 801,166.00 | 64.10 | 159,750.00 | 681,296.50 | 2.46 | 83.95 | 7.30 | 88.80 | 80.70 | 87.30 | 27.85 | 7.85 | 7.30 | 4,171,798.50 | 24,553,072.00 | 23,908,598.50 | 22.25 | 55,560.50 | 30,276.50 | 10.10 | 121,633.00 | 143,106.00 | 73,924.00 | 44,787.00 | 14,722.00 | 126,135.00 | 14,892.50 | 120,076.50 | 45.80 | 33,334.51 | 
| 50% | 4.10 | 4,505,836.00 | 4,339,367.00 | 6.00 | 22.10 | 16.90 | 50.70 | 80.60 | 8.50 | 0.70 | 3.00 | 0.10 | 2.40 | 9.80 | 71.10 | 270,775.00 | 7.20 | 2,006,358.00 | 66.30 | 194,500.00 | 1,734,618.00 | 2.52 | 85.30 | 11.80 | 89.90 | 82.50 | 90.20 | 31.30 | 8.90 | 9.30 | 9,542,068.00 | 81,927,799.00 | 54,869,978.00 | 24.80 | 61,439.00 | 32,176.00 | 11.80 | 302,653.00 | 339,305.00 | 187,845.00 | 123,015.00 | 61,252.00 | 276,269.00 | 36,273.00 | 288,790.00 | 101.20 | 53,624.76 | 
| 75% | 9.80 | 7,428,391.50 | 6,636,084.50 | 6.20 | 23.25 | 17.85 | 51.20 | 86.90 | 16.30 | 1.60 | 5.10 | 0.20 | 2.90 | 13.90 | 79.10 | 459,667.50 | 13.65 | 3,135,492.50 | 69.00 | 271,750.00 | 2,732,946.50 | 2.62 | 86.70 | 21.00 | 91.65 | 84.80 | 91.70 | 34.45 | 10.20 | 12.15 | 17,652,438.00 | 139,960,482.00 | 101,458,882.50 | 27.35 | 71,463.50 | 36,454.00 | 13.50 | 556,870.50 | 579,058.50 | 327,630.50 | 204,164.50 | 128,806.00 | 446,237.00 | 58,167.50 | 497,595.50 | 221.45 | 80,692.73 | 
| max | 17.30 | 39,538,223.00 | 37,253,956.00 | 7.70 | 29.00 | 21.20 | 52.60 | 94.40 | 46.00 | 15.60 | 37.60 | 10.10 | 24.20 | 49.30 | 93.00 | 1,574,531.00 | 26.80 | 14,366,336.00 | 73.20 | 615,300.00 | 13,044,266.00 | 3.12 | 89.80 | 44.20 | 95.30 | 88.30 | 93.60 | 58.50 | 14.00 | 20.80 | 90,830,372.00 | 702,603,073.00 | 481,800,461.00 | 33.60 | 86,420.00 | 56,147.00 | 19.60 | 3,453,769.00 | 3,548,449.00 | 1,852,580.00 | 1,320,085.00 | 1,619,857.00 | 1,819,107.00 | 252,377.00 | 3,176,341.00 | 9,856.50 | 570,640.95 | 
One more thing we can do to clean up this view is to sort the variables by their name.
(
    state_demographics.describe()
    # Sort the data frame by the row index (a.k.a., the row names)
    .sort_index(axis="columns")
    .style.format(precision=2, thousands=",")
)| Age.Percent 65 and Older | Age.Percent Under 18 Years | Age.Percent Under 5 Years | Education.Bachelor's Degree or Higher | Education.High School or Higher | Employment.Firms.Men-Owned | Employment.Firms.Minority-Owned | Employment.Firms.Nonminority-Owned | Employment.Firms.Nonveteran-Owned | Employment.Firms.Total | Employment.Firms.Veteran-Owned | Employment.Firms.Women-Owned | Employment.Nonemployer Establishments | Ethnicities.American Indian and Alaska Native Alone | Ethnicities.Asian Alone | Ethnicities.Black Alone | Ethnicities.Hispanic or Latino | Ethnicities.Native Hawaiian and Other Pacific Islander Alone | Ethnicities.Two or More Races | Ethnicities.White Alone | Ethnicities.White Alone, not Hispanic or Latino | Housing.Homeownership Rate | Housing.Households | Housing.Households with a Internet | Housing.Households with a computer | Housing.Housing Units | Housing.Median Value of Owner-Occupied Units | Housing.Persons per Household | Income.Median Houseold Income | Income.Per Capita Income | Income.Persons Below Poverty Level | Miscellaneous.Foreign Born | Miscellaneous.Land Area | Miscellaneous.Language Other than English at Home | Miscellaneous.Living in Same House +1 Years | Miscellaneous.Manufacturers Shipments | Miscellaneous.Mean Travel Time to Work | Miscellaneous.Percent Female | Miscellaneous.Percent Under 65 Years Without Health insurance | Miscellaneous.Percent Under 66 Years With a Disability | Miscellaneous.Veterans | Population.2010 Population | Population.2014 Population | Population.Population Percent Change | Population.Population per Square Mile | Sales.Accommodation and Food Services Sales | Sales.Retail Sales | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 49.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 51.00 | 
| mean | 16.88 | 22.14 | 5.96 | 31.77 | 89.54 | 292,380.10 | 156,032.14 | 373,752.12 | 473,326.00 | 545,254.90 | 49,611.88 | 193,842.12 | 519,324.16 | 2.01 | 4.55 | 11.87 | 12.25 | 0.40 | 3.11 | 78.07 | 67.67 | 65.67 | 2,367,765.65 | 82.05 | 89.98 | 2,738,906.75 | 233,176.47 | 2.56 | 63,097.86 | 33,743.06 | 12.17 | 9.41 | 69,253.05 | 14.98 | 85.24 | 116,411,562.24 | 24.80 | 50.60 | 9.96 | 9.16 | 357,457.29 | 6,053,834.08 | 6,499,005.51 | 5.15 | 384.40 | 13,885,070.55 | 82,741,605.31 | 
| std | 2.01 | 2.00 | 0.61 | 6.43 | 2.70 | 352,447.95 | 299,248.47 | 361,584.05 | 589,208.57 | 661,434.25 | 51,941.58 | 246,899.31 | 668,860.54 | 3.11 | 5.52 | 10.70 | 10.35 | 1.41 | 3.21 | 13.02 | 16.18 | 5.44 | 2,531,758.54 | 3.79 | 2.55 | 2,888,415.00 | 109,205.70 | 0.17 | 10,715.13 | 5,689.58 | 2.68 | 6.11 | 85,526.08 | 9.77 | 2.06 | 131,704,461.49 | 3.98 | 0.84 | 3.64 | 1.77 | 351,951.29 | 6,823,984.27 | 7,408,022.55 | 6.87 | 1,377.35 | 16,274,397.82 | 91,289,768.45 | 
| min | 11.40 | 18.20 | 4.70 | 20.60 | 83.30 | 30,039.00 | 2,354.00 | 29,521.00 | 51,353.00 | 62,427.00 | 5,070.00 | 19,344.00 | 53,042.00 | 0.30 | 0.80 | 0.60 | 1.70 | 0.00 | 1.30 | 25.50 | 21.70 | 41.60 | 230,101.00 | 71.50 | 83.80 | 280,291.00 | 119,000.00 | 2.30 | 45,081.00 | 24,369.00 | 7.30 | 1.70 | 61.05 | 2.60 | 80.50 | 309,832.00 | 17.20 | 47.90 | 3.50 | 6.40 | 26,156.00 | 563,626.00 | 576,851.00 | -13.30 | 1.20 | 1,564,272.00 | 4,439,933.00 | 
| 25% | 16.10 | 21.05 | 5.70 | 27.85 | 87.30 | 73,924.00 | 14,722.00 | 126,135.00 | 120,076.50 | 143,106.00 | 14,892.50 | 44,787.00 | 121,633.00 | 0.50 | 1.80 | 3.65 | 5.30 | 0.10 | 2.10 | 71.25 | 57.40 | 64.10 | 681,296.50 | 80.70 | 88.80 | 801,166.00 | 159,750.00 | 2.46 | 55,560.50 | 30,276.50 | 10.10 | 4.75 | 33,334.51 | 7.30 | 83.95 | 24,553,072.00 | 22.25 | 50.20 | 7.30 | 7.85 | 116,811.50 | 1,696,961.50 | 1,816,411.00 | 1.95 | 45.80 | 4,171,798.50 | 23,908,598.50 | 
| 50% | 16.90 | 22.10 | 6.00 | 31.30 | 90.20 | 187,845.00 | 61,252.00 | 276,269.00 | 288,790.00 | 339,305.00 | 36,273.00 | 123,015.00 | 302,653.00 | 0.70 | 3.00 | 8.50 | 9.80 | 0.10 | 2.40 | 80.60 | 71.10 | 66.30 | 1,734,618.00 | 82.50 | 89.90 | 2,006,358.00 | 194,500.00 | 2.52 | 61,439.00 | 32,176.00 | 11.80 | 7.20 | 53,624.76 | 11.80 | 85.30 | 81,927,799.00 | 24.80 | 50.70 | 9.30 | 8.90 | 270,775.00 | 4,339,367.00 | 4,505,836.00 | 4.10 | 101.20 | 9,542,068.00 | 54,869,978.00 | 
| 75% | 17.85 | 23.25 | 6.20 | 34.45 | 91.70 | 327,630.50 | 128,806.00 | 446,237.00 | 497,595.50 | 579,058.50 | 58,167.50 | 204,164.50 | 556,870.50 | 1.60 | 5.10 | 16.30 | 13.90 | 0.20 | 2.90 | 86.90 | 79.10 | 69.00 | 2,732,946.50 | 84.80 | 91.65 | 3,135,492.50 | 271,750.00 | 2.62 | 71,463.50 | 36,454.00 | 13.50 | 13.65 | 80,692.73 | 21.00 | 86.70 | 139,960,482.00 | 27.35 | 51.20 | 12.15 | 10.20 | 459,667.50 | 6,636,084.50 | 7,428,391.50 | 9.80 | 221.45 | 17,652,438.00 | 101,458,882.50 | 
| max | 21.20 | 29.00 | 7.70 | 58.50 | 93.60 | 1,852,580.00 | 1,619,857.00 | 1,819,107.00 | 3,176,341.00 | 3,548,449.00 | 252,377.00 | 1,320,085.00 | 3,453,769.00 | 15.60 | 37.60 | 46.00 | 49.30 | 10.10 | 24.20 | 94.40 | 93.00 | 73.20 | 13,044,266.00 | 88.30 | 95.30 | 14,366,336.00 | 615,300.00 | 3.12 | 86,420.00 | 56,147.00 | 19.60 | 26.80 | 570,640.95 | 44.20 | 89.80 | 702,603,073.00 | 33.60 | 52.60 | 20.80 | 14.00 | 1,574,531.00 | 37,253,956.00 | 39,538,223.00 | 17.30 | 9,856.50 | 90,830,372.00 | 481,800,461.00 | 
That’s a pretty nice looking summary now!
Note: Do you see how we put that little pipeline in parentheses? This is so that we can separate operations on their own line, which can improve readability, and also let us add comments as needed.
I want to make something clear. We haven’t done anything to change the data frame that we imported.
state_demographics.head()| State | Population.Population Percent Change | Population.2014 Population | Population.2010 Population | Age.Percent Under 5 Years | Age.Percent Under 18 Years | Age.Percent 65 and Older | Miscellaneous.Percent Female | Ethnicities.White Alone | Ethnicities.Black Alone | ... | Employment.Nonemployer Establishments | Employment.Firms.Total | Employment.Firms.Men-Owned | Employment.Firms.Women-Owned | Employment.Firms.Minority-Owned | Employment.Firms.Nonminority-Owned | Employment.Firms.Veteran-Owned | Employment.Firms.Nonveteran-Owned | Population.Population per Square Mile | Miscellaneous.Land Area | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Connecticut | -10.2 | 3605944 | 3574097 | 5.1 | 20.4 | 17.7 | 51.2 | 79.7 | 12.2 | ... | 286874 | 326693 | 187845 | 106678 | 56113 | 259614 | 31056 | 281182 | 738.1 | 4842.36 | 
| 1 | Delaware | 8.4 | 989948 | 897934 | 5.6 | 20.9 | 19.4 | 51.7 | 69.2 | 23.2 | ... | 68623 | 73418 | 38328 | 23964 | 14440 | 54782 | 7206 | 60318 | 460.8 | 1948.54 | 
| 2 | District of Columbia | 17.3 | 689545 | 601723 | 6.4 | 18.2 | 12.4 | 52.6 | 46.0 | 46.0 | ... | 62583 | 63408 | 30237 | 27064 | 29983 | 29521 | 5070 | 54217 | 9856.5 | 61.05 | 
| 3 | Florida | 14.2 | 21538187 | 18801310 | 5.3 | 19.7 | 20.9 | 51.1 | 77.3 | 16.9 | ... | 2388050 | 2100187 | 1084885 | 807817 | 926112 | 1121749 | 185756 | 1846686 | 350.6 | 53624.76 | 
| 4 | Georgia | 9.6 | 10711908 | 9687653 | 6.2 | 23.6 | 14.3 | 51.4 | 60.2 | 32.6 | ... | 955621 | 929864 | 480578 | 376506 | 371588 | 538893 | 96787 | 800585 | 168.4 | 57513.49 | 
5 rows × 48 columns
As you see, it’s the same data frame we started with. All the functions we have used so far have returned new data frames. You will see this pattern a lot–many Pandas functions return new data rather modifying existing data.
Why is it a good idea to look at a summary of data that you imported?
Filtering Columns
This table has a lot of different kinds of data about state demographics. The nice thing is that each of the different categories is used as a prefix to the column name, e.g., data about income is prefixed with Income, data about population is prefixed with Population, and so on. We can leverage this labeling scheme to chop our data frame into more manageable chunks.
We can use the filter() function to filter columns in a bunch of different ways. For now we will use the regular expression (regex) argument to specify that we want to match at the start of the column name. For example, getting all the data columns about ethnicities:
(
    state_demographics
    # Use set_index to convert the state column to the row names
    .set_index("State")
    # Use filter() to keep columns matching the given pattern
    .filter(regex=r"^Ethnicities")
    # Only display the first 5 rows
    .head()
)| Ethnicities.White Alone | Ethnicities.Black Alone | Ethnicities.American Indian and Alaska Native Alone | Ethnicities.Asian Alone | Ethnicities.Native Hawaiian and Other Pacific Islander Alone | Ethnicities.Two or More Races | Ethnicities.Hispanic or Latino | Ethnicities.White Alone, not Hispanic or Latino | |
|---|---|---|---|---|---|---|---|---|
| State | ||||||||
| Connecticut | 79.7 | 12.2 | 0.6 | 5.0 | 0.1 | 2.5 | 16.9 | 65.9 | 
| Delaware | 69.2 | 23.2 | 0.7 | 4.1 | 0.1 | 2.7 | 9.6 | 61.7 | 
| District of Columbia | 46.0 | 46.0 | 0.6 | 4.5 | 0.1 | 2.9 | 11.3 | 37.5 | 
| Florida | 77.3 | 16.9 | 0.5 | 3.0 | 0.1 | 2.2 | 26.4 | 53.2 | 
| Georgia | 60.2 | 32.6 | 0.5 | 4.4 | 0.1 | 2.2 | 9.9 | 52.0 | 
Note: If you need a refresher on regular expressions check out Appendix M.
We needed to use the set_index() here so that the state name would still be present on the resulting data frames.
This works for other categories as well. Here it is for Income:
state_demographics.set_index("State").filter(regex=r"^Income").head()| Income.Median Houseold Income | Income.Per Capita Income | Income.Persons Below Poverty Level | |
|---|---|---|---|
| State | |||
| Connecticut | 78444 | 44496 | 10.0 | 
| Delaware | 68287 | 35450 | 11.3 | 
| District of Columbia | 86420 | 56147 | 13.5 | 
| Florida | 55660 | 31619 | 12.7 | 
| Georgia | 58700 | 31067 | 13.3 | 
We can pass these filtered tables to describe() and other functions as well:
state_demographics.set_index("State").filter(regex=r"^Income").describe()| Income.Median Houseold Income | Income.Per Capita Income | Income.Persons Below Poverty Level | |
|---|---|---|---|
| count | 51.000000 | 51.000000 | 51.000000 | 
| mean | 63097.862745 | 33743.058824 | 12.170588 | 
| std | 10715.134497 | 5689.577086 | 2.678006 | 
| min | 45081.000000 | 24369.000000 | 7.300000 | 
| 25% | 55560.500000 | 30276.500000 | 10.100000 | 
| 50% | 61439.000000 | 32176.000000 | 11.800000 | 
| 75% | 71463.500000 | 36454.000000 | 13.500000 | 
| max | 86420.000000 | 56147.000000 | 19.600000 | 
This is another way that we can start to get a feel for our data.
Why might filtering columns by category prefixes (like “Population” or “Income”) be useful during exploratory data analysis?
Exploring Your Data
Now that we have a basic idea of what our data looks like, we can start to explore it a bit more. The best place to start is to actually look at the data. Pandas gives you the ability to create basic charts without having to use a 3rd-party package. As long as you don’t want anything too complex, it will be fine to start with.
Basic Population Plots
Let’s start by plotting some basic state population info. We can use Panda’s plot() function for this:
(
    state_demographics
    # Take only the columns that start with Population.201X, where X is some digit
    # E.g., Population.2010, or Population.2014.
    .filter(regex=r"^Population\.201\d")
    # Draw a bar plot
    .plot(kind="bar")
)
That’s not bad, but the axes are a bit weird. Let’s adjust them. The simplest way to do that is to be more specific about which data we will need in our chart. Then, we can explicitly set the x and y axes.
# Make a list of the columns that we want to keep
columns = ["State", "Population.2010 Population", "Population.2014 Population"]
# Use the "bracket notation" to select only those columns specified in the list we just
# created.
plot_data = (
    state_demographics[columns]
    # Rename the population columns to something shorter.
    # It will make the chart legends look nicer.
    .rename(
        # We want to rename columns.  The keys of this dictionary are the old column
        # names, and the values are the new column names.
        columns={
            "Population.2010 Population": "2010 Population",
            "Population.2014 Population": "2014 Population",
        }
    )
)
# Plot the data subset
plot_data.plot(
    # Make it a bar chart
    kind="bar",
    # Put "State" on the x-axis
    x="State",
    # And put both population columns on the y-axis
    y=["2010 Population", "2014 Population"],
    figsize=(8, 3),
)
Not bad! The plotting function was smart enough to put both population data series on the chart, and to include a nice legend so that we can tell them apart.
At least we have the State names on the x-axis now, but they are pretty smooshed together. There are a bunch of ways we could fix it:
- Shrink the label size
- Adjust the chart proportions
- Use a horizontal bar chart (and adjust the proportions)
Shrinking the label size:
plot_data.plot(
    kind="bar",
    x="State",
    y=["2010 Population", "2014 Population"],
    fontsize=6,
)
That works, but now the labels are tiny, and it shrunk both the x and y axis labels. You can shrink just the x-axis tick labels, but to do so you need to “eject” out of the pandas API and drop down into matplotlib code.
import matplotlib.pyplot as plt
# Create your plot
ax = plot_data.plot(
    kind="bar",
    x="State",
    y=["2010 Population", "2014 Population"],
)
# Adjust only x-axis tick labels
# 8 is the fontsize, adjust it as needed
ax.tick_params(axis="x", labelsize=8)
# Show the chart
plt.show()
Don’t worry too much about the details of this. I just wanted to show you that the plots returned by Pandas are really matplotlib objects, and can be interacted with in the usual way when required.
Let’s adjust the figure proportions next:
plot_data.plot(
    kind="bar",
    x="State",
    y=["2010 Population", "2014 Population"],
    # Set the width to 8 units, and the height to 3
    figsize=(8, 3),
    # Set the font size of all tick labels to 9
    fontsize=9
)
That’s fairly readable. Above, I mentioned how long data and plots tend to fit better on the screen both in your reports and when actively doing the analysis. So let’s switch to a horizontal bar chart. That way, we can give the states a little more room on the plot.
plot_data.plot(
    # Draw a horizontal bar chart. Note the `h` at the end of `barh`.
    kind="barh",
    x="State",
    y=["2010 Population", "2014 Population"],
    figsize=(5, 8),
)
Note: You might find it a little weird that we still specify State as the x values and Population as the y values, even thought the chart shows the on the opposite axis. Just roll with it :)
That’s looking pretty good now! You might think it is a little bit weird for the state names to be going in reverse alphabetical order as you go down the page. I suppose this is chosen because in “normal” math plots, the origin is 0, and the values “increase” as you move away from the origin. However, it just feels weird for it to do this when the y-axis is categorical rather than continuous. So let’s reverse it.
(
    plot_data
    # Reverse sort the rows based on the State column
    .sort_values("State", ascending=False).plot(
        kind="barh",
        x="State",
        y=["2010 Population", "2014 Population"],
        figsize=(5, 8),
    )
)
One last thing. Let’s add an axis label to the x-axis. I know there is a legend there, but I think it is still good practice to label all axes in a plot.
(
    plot_data
    # Reverse sort the rows based on the State column
    .sort_values("State", ascending=False).plot(
        kind="barh",
        x="State",
        y=["2010 Population", "2014 Population"],
        figsize=(5, 8),
        xlabel="Population"
    )
)
That’s what I’m talking about! As you can see, with just a few lines of code, you can make totally reasonable looking plots.
Depending on your use case, you might want to sort the data so that the bars are always decreasing. That way, it’s easier for the viewer to look at overall trends, rather than being able to quickly pick out specific states.
To do this we will sort by population. However, we are plotting two different years on the chart, so we need to decide which way to sort it. Reasonable options might be:
- Pick one of the years and sort by that one
- Sort by the mean
- Sort by the max
Any of them could work depending on your situation, but let’s keep it simple and sort by the 2014 population:
(
    plot_data
    # Reverse sort the rows based on the State column
    .sort_values("2014 Population").plot(
        kind="barh",
        x="State",
        y=["2010 Population", "2014 Population"],
        figsize=(5, 8),
        xlabel="Population",
    )
)
Why might it be important to have clean, professional looking data visualizations, even during the exploratory data analysis phase of your project?
Percent Population Change
Notice any trends? One thing that we can kind of see is that there are some states that look like they had bigger population changes than other states. One of our data columns already tracks this: Population.Population Percent Change. Let’s take a look at the basic plot. We will use a lot of the same options as we did in the last one.
state_demographics.plot(
    kind="barh",
    x="State",
    y="Population.Population Percent Change",
    figsize=(5, 8),
    xlabel="Population",
)
This plot is pretty good, but we can do better with a bit more effort. Here’s a couple things that will improve it:
- Remove the legend since there is only one series to plot
- Adjust the bar color so that population increase blue and decrease is orange
- Sort states from most increase to most decrease
plot_data = state_demographics[
    ["State", "Population.Population Percent Change"]
].sort_values("Population.Population Percent Change")
# Create list with colors based on positive/negative values
colors = [
    # Negative values will be orange, positive values will be blue
    "tab:orange" if x < 0 else "tab:blue"
    # Do this calculation for each value in the series that we want to plot.
    for x in plot_data["Population.Population Percent Change"]
]
plot_data.plot(
    kind="barh",
    x="State",
    y="Population.Population Percent Change",
    figsize=(5, 8),
    xlabel="Population Change (%)",
    # Specify the list of colors to use
    color=colors,
    # Remove the legend
    legend=False,
)
# Put a gray line at x=0 to help guide the viewer's attention.
plt.axvline(x=0, color="#666666")
To adjust the colors, we had to create a list of colors the same length as the data that we wanted to plot. In this way, each row will be given its correct color.
Note: tab:blue and tab:orange are built-in colors in matplotlib.
Note: You can specify these labeled arguments in any order.
That’s really not a bad plot! I wonder if the states with the biggest population changes were already some of the most populous states?
state_demographics.plot(
    kind="scatter",
    x="Population.2010 Population",
    y="Population.Population Percent Change",
)
Doesn’t look like a super strong trend there, but let’s log the x scale as maybe orders of magnitude matter here.
state_demographics.plot(
    kind="scatter",
    x="Population.2010 Population",
    y="Population.Population Percent Change",
    # Log the x axis
    logx=True,
)
Nope, nothing there.
When might you use a logarithmic scale on either axis of a plot, and what insights can this reveal that a linear scale might not?
Do you think any of the state demographic data is correlated with the percent population change? To try and answer this question, we can calculate all the correlation values for the columns in the state demographics data frame.
# Get the correlation between columns in the data frame
correlation_matrix = state_demographics.corr(
    # Restrict the calculation to only numeric columns
    numeric_only=True
)What can correlation values tell us during exploratory data analysis, and what are their limitations?
Now that we have the correlation values, let’s take a look at them. First, we need to make a data frame with the data in a nice format that makes it easy to plot the data.
# Display the correlations values between the percent population change and the other
# variables.
plot_data = (
    # Select only the percent change column
    correlation_matrix[["Population.Population Percent Change"]]
    # Then sort the rows based on their correlation to the percent population change
    .sort_values("Population.Population Percent Change")
    # Remove the Population.Population Percent Change row from the results, since
    # we don't care about the "self-correlation"
    .drop("Population.Population Percent Change")
    # Convert the row names to a column called "Variable"
    .reset_index(names="Variable")
    # Rename the 2nd column for a nicer looking chart
    .rename(columns={"Population.Population Percent Change": "Correlation"})
)
plot_data| Variable | Correlation | |
|---|---|---|
| 0 | Miscellaneous.Living in Same House +1 Years | -0.690011 | 
| 1 | Age.Percent 65 and Older | -0.423079 | 
| 2 | Housing.Homeownership Rate | -0.343682 | 
| 3 | Miscellaneous.Percent Under 66 Years With a Di... | -0.306390 | 
| 4 | Ethnicities.White Alone, not Hispanic or Latino | -0.269663 | 
| ... | ... | ... | 
| 41 | Housing.Persons per Household | 0.277861 | 
| 42 | Age.Percent Under 18 Years | 0.323823 | 
| 43 | Miscellaneous.Percent Under 65 Years Without H... | 0.330227 | 
| 44 | Housing.Households with a computer | 0.455228 | 
| 45 | Age.Percent Under 5 Years | 0.459584 | 
46 rows × 2 columns
Note: earlier we used drop() to remove columns. Now you see that it can also be used to drop rows, depending on the arguments used. You will find Pandas has a lot of functions like this.
Next, we can create a list of colors for the bars of the plot. We want to color variables with a positive correlation to the percent population change blue, and a negative correlation orange.
# Create list with colors based on positive/negative values
colors = [
    # Negative values will be orange, positive values will be blue
    "tab:orange" if x < 0 else "tab:blue"
    # Do this calculation for each correlation value in the series.
    for x in plot_data["Correlation"]
]Finally, let’s make the plot. The code will be very similar to the previous plots we have made.
plot_data.plot(
    kind="barh",
    x="Variable",
    y="Correlation",
    figsize=(5, 10),
    xlabel="Correlation with Percent Population Change",
    legend=False,
    color=colors,
)
plt.axvline(x=0, color="#666666")
That’s nice! Do you see any interesting trends?
- Age data
- Increasing proportion of young population is strongly correlated with positive population change
- Increasing proportion of elderly population is highly correlated with decreasing population.
- Makes sense…
 
- Ethnicity data
- These are all positively correlated with population change: “Hispanic or Latino”, “Languages other than English at home”, “Foreign Born”.
- In contrast, “White Alone” and “White Alone, not Hispanic or Latino” are negatively correlated with population change.
- Is this suggesting immigration is driving some of the population growth?
 
- Education & Income data
- “Households with a computer”, “Households with internet”, “Bachelor’s Degree or Higher” are positively correlated with population change
- “Persons Below Poverty Level” is negatively correlated with population change
- Maybe it’s suggesting people are moving to more educated or prosperous areas?
 
These are all avenues that you might want to take a look at if this data was important to your research.
As you can see, there are a lot variables there and most of them have pretty weak correlation. This is a good time to show you how to filter rows based on some criteria of the data. Let’s filter out any rows that have correlation values between -0.15 and 0.15. To do this, we can use query():
plot_data_with_some_correlation = plot_data.query(
    "Correlation < -0.15 or Correlation > 0.15"
)The query() function is very cool and highly flexible and dynamic. We will see a few more examples of it later in the chapter. For now, just know that you can access columns of your data frame and use them to filter rows in a natural looking way.
Once we have the filtered data, we can regenerate the color list, and plot the data.
# We need to redo the colors again.
colors = [
    "tab:orange" if x < 0 else "tab:blue"
    for x in plot_data_with_some_correlation["Correlation"]
]
plot_data_with_some_correlation.plot(
    kind="barh",
    x="Variable",
    y="Correlation",
    figsize=(5, 10),
    xlabel="Correlation with Percent Population Change",
    legend=False,
    color=colors,
)
plt.axvline(x=0, color="#666666")
It’s generally not a good idea to take correlation values at face value: it doesn’t measure all types of dependencies and it can be tempting to assign causation to things that have high correlation. So for correlation, it’s always a good idea to look at your data whenever possible. Let’s do that now with some of the most highly correlated or anti-correlated variables. We are going to use seaborn for this, as it makes it super simple to compare multiple variables in a single plot. We can even put regression lines with confidence intervals on the plots by setting the kind="reg" argument!
Here is a plot containing some of the most highly correlated variables:
columns = [
    "Population.Population Percent Change",
    "Age.Percent Under 5 Years",
    "Housing.Households with a computer",
    "Ethnicities.Hispanic or Latino",
]
sns.pairplot(
    state_demographics[columns],
    kind="reg",
    x_vars=columns[0],
    y_vars=columns[1:],
    height=4,
)
And one with the most highly anti-correlated variables:
columns = [
    "Population.Population Percent Change",
    "Ethnicities.White Alone, not Hispanic or Latino",
    "Age.Percent 65 and Older",
    "Miscellaneous.Living in Same House +1 Years",
]
sns.pairplot(
    state_demographics[columns],
    kind="reg",
    x_vars=columns[0],
    y_vars=columns[1:],
    height=4,
)
Neat, at least we have discovered a couple of variables that we may want to look into. One definite potential issue I see here is that the states that have dropped in population seem to be off on their own in all these plots. It would probably be a good idea to see if we are violating any major assumptions of the basic linear model with them, or at least see if they (or any other points) have too much leverage and are misleading us. But that is a topic for a different course!
Before we move on, let’s do one more thing with correlation as it is so common: a correlation heatmap! We can use seaborn’s clustermap() for this:
clustermap_plot = sns.clustermap(
    correlation_matrix,
    # Specify the complete linkage for calculating clusters
    method="complete",
    # The relative space the dendrograms will occupy
    dendrogram_ratio=0.05,
    # Use the "icefire" diverging palette
    cmap="icefire",
    # Make sure the min color value occurs at -1
    vmin=-1,
    # Make sure the max color value occurs at 1
    vmax=1,
    figsize=(12, 12),
    # Remove the x-axis tick labels
    xticklabels=True,
    # Remove the y-axis tick labels
    yticklabels=True,
    # Set the options for the color palette legend
    cbar_kws={
        "label": "Correlation",  # Set the label for the color palette legend
        "location": "bottom",  # Set the location of the color palette legend
    },
    # Set the location for the color palette legend
    # This is for top left
    # cbar_pos=(
    #     0.03,  # Distance from the left
    #     0.92,  # Distance from the bottom
    #     0.10,  # Width
    #     0.05,  # Height
    # ),
    cbar_pos=(
        0.65,  # Distance from the left
        0.20,  # Distance from the bottom
        0.25,  # Width
        0.14,  # Height
    ),
)
Cool! There are a couple things to note about this:
- One important consideration is setting the minimum and maximum values of your color palette. While you don’t always need to adjust these parameters (and sometimes only need to set one or two of them), being aware of this option is important. Making these adjustments ensures the most informative part of your color palette corresponds to the most relevant range in your data.
- For fine-tuning your colorbar, check out the cbar_kwsparameter, which passes arguments directly to matplotlib’s colorbar() method. This pattern of documentation referral is something you’ll encounter frequently in the Python ecosystem. Libraries will often direct you to another component’s documentation for parameter details, especially when they’re simply passing those arguments through to underlying functions.
- Don’t forget about customizing your clustering approach! The linkage type and distance metric can impact your hierarchical clustering results. The SciPy documentation provides comprehensive details on these options, allowing you to select methods that best represent the relationships in your data.
Why is it important to ensure that your color palettes represent the correct data? For example,
- How would it change your interpretation if the center of the palette (the black part) was on 0.2 rather than zero?
- How would it change your interpretation if the brightest blue was -0.2 but the brightest orange was 1.0?
There is a lot to unpack with this figure, but the most obvious thing that I see is that blob of bright orange in the bottom right. If you look at the data in those columns, you will see that they are numbers with real magnitude that will be pretty highly influenced by the number of people in the state. A lot of the other columns are not like this. Wouldn’t it be interesting to take those counting-style numbers and normalize them by the state population? E.g., something like manufacturers shipments per 10k people. Let’s do that now.
# Make a copy of the state demographics data
normalized_state_demographics = state_demographics.copy()
# These are the columns that we want to divide by the population
columns = [
    "Employment.Firms.Men-Owned",
    "Employment.Firms.Minority-Owned",
    "Employment.Firms.Nonminority-Owned",
    "Employment.Firms.Nonveteran-Owned",
    "Employment.Firms.Total",
    "Employment.Firms.Veteran-Owned",
    "Employment.Firms.Women-Owned",
    "Employment.Nonemployer Establishments",
    "Housing.Households",
    "Housing.Housing Units",
    "Miscellaneous.Manufacturers Shipments",
    "Miscellaneous.Veterans",
    "Sales.Accommodation and Food Services Sales",
    "Sales.Retail Sales",
]
# Loop through each of the columns
for column in columns:
    # Normalize the data: X / Population * 10,000 people
    normalized_data = (
        normalized_state_demographics[column]
        / state_demographics["Population.2010 Population"]
        * 10_000
    )
    # Replace the original column with the normalized column
    normalized_state_demographics[column] = normalized_dataNow that we have normalized data, we can generate the correlation matrix and plot the heatmap.
# Generate a correlation matrix of the numeric columns
correlation_matrix = normalized_state_demographics.corr(numeric_only=True)
# Draw the clustered heatmap
clustermap_plot = sns.clustermap(
    correlation_matrix,
    # Specify the complete linkage for calculating clusters
    method="complete",
    # The relative space the dendrograms will occupy
    dendrogram_ratio=0.05,
    # Use the "icefire" diverging palette
    cmap="icefire",
    # Make sure the min color value occurs at -1
    vmin=-1,
    # Make sure the max color value occurs at 1
    vmax=1,
    figsize=(12, 12),
    # Remove the x-axis tick labels
    xticklabels=True,
    # Remove the y-axis tick labels
    yticklabels=True,
    # Set the options for the color palette legend
    cbar_kws={
        "label": "Correlation",  # Set the label for the color palette legend
        "location": "bottom",  # Set the location of the color palette legend
    },
    # Set the location for the color palette legend
    cbar_pos=(
        0.65,  # Distance from the left
        0.20,  # Distance from the bottom
        0.25,  # Width
        0.14,  # Height
    ),
)
After “normalizing out” the effect of population on a bunch of the variables, we can see some trends that were a bit masked before. For example, we can now see some correlation between education, wealth, and income, as well as some potentially interesting trends around ethnicity and age.
What insights can a correlation heatmap with clustering provide that a simple correlation table cannot?
State Demographics Wrap-Up
We went over a lot of material in this section! I hope it gave you a sense of how exploratory data analysis can go using Pandas and Seaborn: you begin by exploring the data to get a sense of it, identify patterns and trends, and then dive deeper into those patterns to better understand the system you’re analyzing.
7.4 Cancer data
Now that you have a basic understanding of using Pandas for EDA, let’s take a look at some public health data.
We’re going to look at the cancer data from CORGIS. This dataset contains information about cancer deaths between 2007 and 2013 in each state. Specifically, these are deaths from breast, lung, and colorectal cancer. In addition to the total death rate, the rates have also been broken down by age, race, and sex.
Basic Info
To start, we need to read the CSV file with the cancer data.
cancer = pd.read_csv("../_data/cancer.csv")
cancer.head()| State | Total.Rate | Total.Number | Total.Population | Rates.Age.< 18 | Rates.Age.18-45 | Rates.Age.45-64 | Rates.Age.> 64 | Rates.Age and Sex.Female.< 18 | Rates.Age and Sex.Male.< 18 | ... | Types.Lung.Age and Sex.Male.45 - 64 | Types.Lung.Age and Sex.Female.> 64 | Types.Lung.Age and Sex.Male.> 64 | Types.Lung.Race.White | Types.Lung.Race.White non-Hispanic | Types.Lung.Race.Black | Types.Lung.Race.Black non-Hispanic | Types.Lung.Race.Asian | Types.Lung.Race.Indigenous | Types.Lung.Race.Hispanic | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 214.2 | 71529.0 | 33387205.0 | 2.0 | 18.5 | 244.7 | 1017.8 | 2.0 | 2.1 | ... | 102.9 | 221.7 | 457.4 | 59.9 | 60.4 | 52.6 | 52.8 | 23.0 | 22.9 | 14.8 | 
| 1 | Alaska | 128.1 | 6361.0 | 4966180.0 | 1.7 | 11.8 | 170.9 | 965.2 | 0.0 | 0.0 | ... | 50.3 | 268.3 | 335.0 | 48.7 | 49.5 | 45.6 | 47.9 | 33.0 | 74.4 | 0.0 | 
| 2 | Arizona | 165.6 | 74286.0 | 44845598.0 | 2.5 | 13.6 | 173.6 | 840.2 | 2.6 | 2.5 | ... | 47.0 | 191.9 | 275.8 | 39.5 | 42.2 | 38.2 | 40.4 | 21.3 | 11.1 | 21.6 | 
| 3 | Arkansas | 223.9 | 45627.0 | 20382448.0 | 2.3 | 17.6 | 250.1 | 1048.3 | 2.6 | 2.0 | ... | 106.5 | 248.7 | 484.7 | 63.4 | 64.2 | 62.9 | 63.0 | 18.1 | 16.2 | 14.6 | 
| 4 | California | 150.9 | 393980.0 | 261135696.0 | 2.6 | 13.7 | 163.7 | 902.4 | 2.4 | 2.8 | ... | 36.8 | 192.5 | 269.0 | 37.2 | 42.6 | 46.5 | 48.6 | 25.8 | 18.4 | 18.3 | 
5 rows × 75 columns
Like before, we will use describe() to get a basic overview of the numeric data columns.
cancer.describe().style.format(precision=1)| Total.Rate | Total.Number | Total.Population | Rates.Age.< 18 | Rates.Age.18-45 | Rates.Age.45-64 | Rates.Age.> 64 | Rates.Age and Sex.Female.< 18 | Rates.Age and Sex.Male.< 18 | Rates.Age and Sex.Female.18 - 45 | Rates.Age and Sex.Male.18 - 45 | Rates.Age and Sex.Female.45 - 64 | Rates.Age and Sex.Male.45 - 64 | Rates.Age and Sex.Female.> 64 | Rates.Age and Sex.Male.> 64 | Rates.Race.White | Rates.Race.White non-Hispanic | Rates.Race.Black | Rates.Race.Asian | Rates.Race.Indigenous | Rates.Race and Sex.Female.White | Rates.Race and Sex.Female.White non-Hispanic | Rates.Race and Sex.Female.Black | Rates.Race and Sex.Female.Black non-Hispanic | Rates.Race and Sex.Female.Asian | Rates.Race and Sex.Female.Indigenous | Rates.Race and Sex.Male.White | Rates.Race and Sex.Male.White non-Hispanic | Rates.Race and Sex.Male.Black | Rates.Race and Sex.Male.Black non-Hispanic | Rates.Race and Sex.Male.Asian | Rates.Race and Sex.Male.Indigenous | Rates.Race.Hispanic | Rates.Race and Sex.Female.Hispanic | Rates.Race and Sex.Male.Hispanic | Types.Breast.Total | Types.Breast.Age.18 - 44 | Types.Breast.Age.45 - 64 | Types.Breast.Age.> 64 | Types.Breast.Race.White | Types.Breast.Race.White non-Hispanic | Types.Breast.Race.Black | Types.Breast.Race.Black non-Hispanic | Types.Breast.Race.Asian | Types.Breast.Race.Indigenous | Types.Breast.Race.Hispanic | Types.Colorectal.Total | Types.Colorectal.Age and Sex.Female.18 - 44 | Types.Colorectal.Age and Sex.Male.18 - 44 | Types.Colorectal.Age and Sex.Female.45 - 64 | Types.Colorectal.Age and Sex.Male.45 - 64 | Types.Colorectal.Age and Sex.Female.> 64 | Types.Colorectal.Age and Sex.Male.> 64 | Types.Colorectal.Race.White | Types.Colorectal.Race.White non-Hispanic | Types.Colorectal.Race.Black | Types.Colorectal.Race.Black non-Hispanic | Types.Colorectal.Race.Asian | Types.Colorectal.Race.Indigenous | Types.Colorectal.Race.Hispanic | Types.Lung.Total | Types.Lung.Age and Sex.Female.18 - 44 | Types.Lung.Age and Sex.Male.18 - 44 | Types.Lung.Age and Sex.Female.45 - 64 | Types.Lung.Age and Sex.Male.45 - 64 | Types.Lung.Age and Sex.Female.> 64 | Types.Lung.Age and Sex.Male.> 64 | Types.Lung.Race.White | Types.Lung.Race.White non-Hispanic | Types.Lung.Race.Black | Types.Lung.Race.Black non-Hispanic | Types.Lung.Race.Asian | Types.Lung.Race.Indigenous | Types.Lung.Race.Hispanic | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 51.0 | 
| mean | 190.7 | 78723.7 | 42401510.5 | 2.1 | 14.8 | 197.6 | 980.9 | 1.7 | 2.0 | 16.0 | 13.5 | 177.8 | 218.3 | 826.7 | 1181.8 | 171.0 | 173.2 | 182.8 | 99.0 | 111.1 | 145.6 | 147.7 | 142.7 | 146.6 | 85.0 | 88.7 | 206.4 | 208.7 | 222.9 | 228.2 | 107.2 | 129.6 | 98.2 | 82.2 | 115.5 | 26.0 | 4.0 | 34.9 | 102.3 | 21.2 | 21.6 | 22.6 | 23.2 | 6.0 | 5.0 | 8.6 | 17.5 | 1.1 | 1.5 | 14.7 | 21.4 | 81.8 | 101.5 | 15.2 | 15.4 | 17.0 | 17.4 | 7.0 | 6.9 | 8.2 | 53.2 | 1.3 | 1.4 | 45.6 | 64.8 | 224.7 | 355.0 | 48.2 | 49.3 | 44.1 | 44.5 | 19.3 | 27.3 | 16.2 | 
| std | 28.6 | 80861.3 | 47842444.3 | 0.5 | 2.2 | 31.3 | 75.2 | 0.8 | 0.9 | 2.5 | 2.0 | 23.0 | 41.0 | 65.4 | 105.2 | 15.0 | 15.0 | 48.7 | 21.6 | 68.3 | 11.4 | 11.5 | 58.1 | 58.2 | 23.2 | 61.7 | 21.5 | 21.7 | 83.5 | 83.1 | 41.7 | 90.6 | 24.1 | 24.5 | 36.2 | 3.1 | 0.9 | 4.9 | 9.4 | 1.2 | 1.5 | 12.5 | 12.7 | 5.5 | 8.1 | 5.8 | 2.7 | 0.6 | 0.7 | 2.6 | 4.2 | 9.5 | 10.9 | 1.7 | 1.8 | 9.2 | 9.3 | 5.4 | 9.9 | 5.1 | 12.6 | 0.8 | 0.8 | 11.2 | 21.3 | 33.3 | 69.9 | 9.4 | 9.3 | 20.5 | 21.4 | 10.4 | 26.4 | 8.0 | 
| min | 98.5 | 6361.0 | 3931624.0 | 0.0 | 10.0 | 132.3 | 735.8 | 0.0 | 0.0 | 10.3 | 9.8 | 126.1 | 138.5 | 611.6 | 884.9 | 127.8 | 129.1 | 0.0 | 0.0 | 0.0 | 109.9 | 110.5 | 0.0 | 0.0 | 0.0 | 0.0 | 145.1 | 146.2 | 0.0 | 0.0 | 0.0 | 0.0 | 39.5 | 0.0 | 0.0 | 17.4 | 0.0 | 27.8 | 62.3 | 17.5 | 18.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 0.0 | 0.0 | 10.1 | 14.5 | 59.7 | 72.4 | 10.0 | 9.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15.6 | 0.0 | 0.0 | 18.7 | 26.0 | 92.7 | 153.2 | 20.4 | 20.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 
| 25% | 176.5 | 20631.0 | 11869909.5 | 2.0 | 13.4 | 175.0 | 943.5 | 1.8 | 2.0 | 14.1 | 12.4 | 162.9 | 187.9 | 803.5 | 1112.8 | 162.8 | 165.7 | 162.6 | 90.7 | 58.0 | 140.4 | 143.6 | 135.6 | 146.3 | 79.5 | 43.7 | 195.4 | 197.0 | 200.1 | 216.6 | 96.4 | 66.6 | 82.4 | 72.5 | 91.5 | 24.5 | 3.5 | 31.2 | 98.7 | 20.4 | 20.8 | 21.1 | 23.6 | 0.0 | 0.0 | 0.0 | 16.1 | 1.0 | 1.4 | 13.2 | 18.2 | 74.5 | 93.3 | 14.3 | 14.5 | 15.1 | 16.4 | 0.0 | 0.0 | 5.9 | 46.5 | 0.9 | 0.9 | 38.8 | 49.0 | 209.6 | 313.3 | 42.8 | 43.7 | 38.2 | 40.0 | 18.4 | 4.5 | 13.7 | 
| 50% | 196.1 | 54930.0 | 30348057.0 | 2.2 | 14.6 | 189.3 | 999.6 | 2.0 | 2.3 | 16.0 | 13.4 | 172.2 | 208.3 | 845.5 | 1195.0 | 171.3 | 173.9 | 194.4 | 97.6 | 102.2 | 146.2 | 149.9 | 161.7 | 165.6 | 85.6 | 78.7 | 205.4 | 207.8 | 246.2 | 252.7 | 112.5 | 114.2 | 98.0 | 84.8 | 119.4 | 26.6 | 4.2 | 34.6 | 102.2 | 21.1 | 21.4 | 27.4 | 28.4 | 8.5 | 0.0 | 10.5 | 17.7 | 1.3 | 1.7 | 14.3 | 21.0 | 82.4 | 102.1 | 15.2 | 15.2 | 21.2 | 21.5 | 9.1 | 0.0 | 9.0 | 52.4 | 1.4 | 1.4 | 45.7 | 59.7 | 225.8 | 342.6 | 48.1 | 48.8 | 47.0 | 48.6 | 22.2 | 20.9 | 18.1 | 
| 75% | 210.8 | 93328.0 | 46503256.0 | 2.4 | 16.2 | 217.8 | 1031.4 | 2.2 | 2.5 | 17.7 | 14.4 | 193.3 | 243.3 | 871.1 | 1257.8 | 180.2 | 183.4 | 220.8 | 112.7 | 162.1 | 152.4 | 155.2 | 181.7 | 182.8 | 94.8 | 127.7 | 218.1 | 220.2 | 278.7 | 282.1 | 129.1 | 198.4 | 112.2 | 95.5 | 136.9 | 27.6 | 4.7 | 37.7 | 107.8 | 22.1 | 22.2 | 31.1 | 31.9 | 10.7 | 10.6 | 12.2 | 19.4 | 1.5 | 1.9 | 15.3 | 23.5 | 88.7 | 110.7 | 16.6 | 16.7 | 23.2 | 23.5 | 10.6 | 13.9 | 11.6 | 61.8 | 1.8 | 1.8 | 53.4 | 77.6 | 248.2 | 391.0 | 54.5 | 55.0 | 60.0 | 60.3 | 25.5 | 44.2 | 20.4 | 
| max | 254.6 | 393980.0 | 261135696.0 | 2.7 | 20.3 | 263.9 | 1110.2 | 2.9 | 3.2 | 22.8 | 19.0 | 233.2 | 310.0 | 916.2 | 1372.7 | 204.9 | 205.9 | 235.1 | 149.0 | 248.0 | 170.5 | 171.2 | 195.6 | 198.5 | 130.3 | 219.4 | 253.6 | 255.1 | 319.5 | 321.2 | 172.2 | 303.3 | 168.5 | 140.8 | 202.3 | 31.8 | 5.6 | 56.6 | 129.6 | 24.0 | 25.5 | 34.7 | 35.0 | 15.8 | 27.4 | 18.2 | 24.4 | 2.0 | 3.0 | 21.9 | 34.1 | 99.8 | 124.6 | 19.1 | 19.2 | 26.7 | 27.2 | 18.0 | 34.7 | 18.0 | 81.3 | 2.9 | 3.1 | 77.1 | 112.6 | 298.3 | 519.1 | 71.5 | 71.9 | 73.2 | 73.6 | 33.9 | 87.5 | 35.2 | 
Effect of Population
In the state demographics data, we saw a strong dependency between some of the variables and state population. We would also expect there to be a pretty strong dependency between the total number of cancer deaths and the total state population. Let’s see if that is the case:
cancer.plot(
    kind="scatter",
    x="Total.Population",
    y="Total.Number",
    loglog=True,
)
Just to make it super clear, let’s do the same plot, but this time use the rate of cancer deaths per 100k people rather than the raw totals. (I bet you can guess how it will look!)
cancer.plot(
    kind="scatter",
    x="Total.Population",
    y="Total.Rate",
    logx=True,
)
Because of this, we will use the rates per 100k people rather than total numbers for this section.
Comparing States
Let’s see if there are any high-level differences between individual states and rates of cancer deaths.
(
    cancer
    # Sort the values by the rate of cancer deaths
    .sort_values("Total.Rate").plot(
        # Make a horizontal bar chart
        kind="barh",
        x="State",
        y="Total.Rate",
        # Adjust the figure size so the labels print nicely
        figsize=(5, 8),
        # Give an informative x-axis label
        xlabel="Cancer Rate (per 100k people)",
        # Don't bother with the legend as we only have one data series to plot
        legend=False,
    )
)
There is about a 2.5 times difference between the state with the highest rate of cancer deaths (West Virginia) as compared to the state with the lowest (Utah).
Let’s make a boxplot to see the spread of the data.
cancer.plot(kind="box", y="Total.Rate")
Cool, so we see some variation in the rate of cancer deaths across states. Let’s try and find out if there are any variables in the state demographics data that are correlated with death rates for different types of cancer.
Cancer Deaths and Demographics
This dataset has data for three types of cancer, breast cancer, colorectal cancer, and lung cancer, so we will want to pull out those columns.
cancer_death_rates = cancer[
    ["State", "Types.Breast.Total", "Types.Colorectal.Total", "Types.Lung.Total"]
]
cancer_death_rates.head()| State | Types.Breast.Total | Types.Colorectal.Total | Types.Lung.Total | |
|---|---|---|---|---|
| 0 | Alabama | 27.4 | 19.4 | 66.4 | 
| 1 | Alaska | 17.8 | 11.9 | 36.6 | 
| 2 | Arizona | 23.3 | 14.9 | 42.3 | 
| 3 | Arkansas | 27.9 | 21.2 | 73.3 | 
| 4 | California | 23.0 | 14.0 | 34.5 | 
We want to include the normalized state demographic data in with the cancer data, but we don’t want all the columns. Earlier, we saw that we can use filter() to select columns using regular expressions. We will do that again here to select only the categories of variables that we are interested in.
# Create a filtered version of normalized_state_demographics data frame containing:
# - State column
# - Columns starting with "Age"
# - Columns starting with "Education"
# - Columns starting with "Ethnicities"
# - Columns starting with "Housing"
# - Columns starting with "Income"
# This uses regex patterns with filter() to select columns,
# then combines them using pd.concat()
filtered_normalized_state_demographics = pd.concat(
    [
        normalized_state_demographics.filter(["State"]),
        normalized_state_demographics.filter(regex=r"^Age"),
        normalized_state_demographics.filter(regex=r"^Education"),
        normalized_state_demographics.filter(regex=r"^Ethnicities"),
        normalized_state_demographics.filter(regex=r"^Housing"),
        normalized_state_demographics.filter(regex=r"^Income"),
    ],
    axis="columns",
)
filtered_normalized_state_demographics.head()| State | Age.Percent Under 5 Years | Age.Percent Under 18 Years | Age.Percent 65 and Older | Education.High School or Higher | Education.Bachelor's Degree or Higher | Ethnicities.White Alone | Ethnicities.Black Alone | Ethnicities.American Indian and Alaska Native Alone | Ethnicities.Asian Alone | ... | Housing.Housing Units | Housing.Homeownership Rate | Housing.Median Value of Owner-Occupied Units | Housing.Households | Housing.Persons per Household | Housing.Households with a computer | Housing.Households with a Internet | Income.Median Houseold Income | Income.Per Capita Income | Income.Persons Below Poverty Level | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Connecticut | 5.1 | 20.4 | 17.7 | 90.6 | 39.3 | 79.7 | 12.2 | 0.6 | 5.0 | ... | 4266.789625 | 66.1 | 275400 | 3835.223275 | 2.53 | 90.8 | 85.5 | 78444 | 44496 | 10.0 | 
| 1 | Delaware | 5.6 | 20.9 | 19.4 | 90.0 | 32.0 | 69.2 | 23.2 | 0.7 | 4.1 | ... | 4942.245198 | 71.2 | 251100 | 4046.199387 | 2.57 | 91.6 | 85.0 | 68287 | 35450 | 11.3 | 
| 2 | District of Columbia | 6.4 | 18.2 | 12.4 | 90.9 | 58.5 | 46.0 | 46.0 | 0.6 | 4.5 | ... | 5364.478340 | 41.6 | 601500 | 4726.194611 | 2.30 | 91.8 | 82.6 | 86420 | 56147 | 13.5 | 
| 3 | Florida | 5.3 | 19.7 | 20.9 | 88.2 | 29.9 | 77.3 | 16.9 | 0.5 | 3.0 | ... | 5145.217009 | 65.4 | 215300 | 4114.772322 | 2.65 | 91.5 | 83.0 | 55660 | 31619 | 12.7 | 
| 4 | Georgia | 6.2 | 23.6 | 14.3 | 87.1 | 31.3 | 60.2 | 32.6 | 0.5 | 4.4 | ... | 4519.558039 | 63.3 | 176000 | 3879.988270 | 2.70 | 90.2 | 81.3 | 58700 | 31067 | 13.3 | 
5 rows × 24 columns
Now, we can merge the data:
cancer_demographics = cancer_death_rates.merge(
    filtered_normalized_state_demographics, on="State", how="inner"
)
cancer_demographics.head()| State | Types.Breast.Total | Types.Colorectal.Total | Types.Lung.Total | Age.Percent Under 5 Years | Age.Percent Under 18 Years | Age.Percent 65 and Older | Education.High School or Higher | Education.Bachelor's Degree or Higher | Ethnicities.White Alone | ... | Housing.Housing Units | Housing.Homeownership Rate | Housing.Median Value of Owner-Occupied Units | Housing.Households | Housing.Persons per Household | Housing.Households with a computer | Housing.Households with a Internet | Income.Median Houseold Income | Income.Per Capita Income | Income.Persons Below Poverty Level | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 27.4 | 19.4 | 66.4 | 6.0 | 22.2 | 17.3 | 86.2 | 25.5 | 69.1 | ... | 4780.278660 | 68.8 | 142700 | 3907.941778 | 2.55 | 85.5 | 76.4 | 50536 | 27928 | 15.5 | 
| 1 | Alaska | 17.8 | 11.9 | 36.6 | 7.0 | 24.6 | 12.5 | 92.8 | 29.6 | 65.3 | ... | 4503.520686 | 64.3 | 270400 | 3567.092960 | 2.80 | 94.1 | 85.5 | 77640 | 36787 | 10.1 | 
| 2 | Arizona | 23.3 | 14.9 | 42.3 | 5.9 | 22.5 | 18.0 | 87.1 | 29.5 | 82.6 | ... | 4812.222809 | 64.4 | 225500 | 4022.623845 | 2.68 | 91.7 | 84.1 | 58945 | 30694 | 13.5 | 
| 3 | Arkansas | 27.9 | 21.2 | 73.3 | 6.2 | 23.2 | 17.4 | 86.6 | 23.0 | 79.0 | ... | 4763.950838 | 65.6 | 127800 | 3971.548583 | 2.52 | 86.2 | 73.0 | 47597 | 26577 | 16.2 | 
| 4 | California | 23.0 | 14.0 | 34.5 | 6.0 | 22.5 | 14.8 | 83.3 | 33.9 | 71.9 | ... | 3856.324950 | 54.8 | 505000 | 3501.444518 | 2.95 | 93.0 | 86.7 | 75235 | 36955 | 11.8 | 
5 rows × 27 columns
Let’s do another correlation matrix:
cancer_demographics_full_correlation_matrix = cancer_demographics.corr(
    numeric_only=True
)
cancer_demographics_full_correlation_matrix| Types.Breast.Total | Types.Colorectal.Total | Types.Lung.Total | Age.Percent Under 5 Years | Age.Percent Under 18 Years | Age.Percent 65 and Older | Education.High School or Higher | Education.Bachelor's Degree or Higher | Ethnicities.White Alone | Ethnicities.Black Alone | ... | Housing.Housing Units | Housing.Homeownership Rate | Housing.Median Value of Owner-Occupied Units | Housing.Households | Housing.Persons per Household | Housing.Households with a computer | Housing.Households with a Internet | Income.Median Houseold Income | Income.Per Capita Income | Income.Persons Below Poverty Level | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Types.Breast.Total | 1.000000 | 0.820604 | 0.738585 | -0.388781 | -0.498962 | 0.451119 | -0.131677 | -0.019331 | 0.051875 | 0.419682 | ... | 0.262656 | 0.042882 | -0.288228 | 0.375702 | -0.654101 | -0.541995 | -0.398493 | -0.275447 | 0.054492 | 0.331260 | 
| Types.Colorectal.Total | 0.820604 | 1.000000 | 0.861898 | -0.314083 | -0.381218 | 0.600918 | -0.158118 | -0.361382 | 0.097213 | 0.156050 | ... | 0.280906 | 0.186077 | -0.418381 | 0.264358 | -0.618531 | -0.753502 | -0.612936 | -0.502661 | -0.269516 | 0.443643 | 
| Types.Lung.Total | 0.738585 | 0.861898 | 1.000000 | -0.435992 | -0.411047 | 0.609561 | -0.216806 | -0.437903 | 0.146942 | 0.188569 | ... | 0.303359 | 0.331816 | -0.530434 | 0.203042 | -0.571173 | -0.706384 | -0.577125 | -0.574831 | -0.363581 | 0.450317 | 
| Age.Percent Under 5 Years | -0.388781 | -0.314083 | -0.435992 | 1.000000 | 0.879837 | -0.753746 | -0.070116 | -0.171279 | -0.169575 | 0.064308 | ... | -0.205294 | -0.165581 | -0.069280 | -0.084758 | 0.401366 | 0.095203 | -0.098394 | -0.037176 | -0.190507 | 0.113660 | 
| Age.Percent Under 18 Years | -0.498962 | -0.381218 | -0.411047 | 0.879837 | 1.000000 | -0.617604 | -0.128687 | -0.408223 | 0.064120 | -0.122481 | ... | -0.343985 | 0.172119 | -0.309091 | -0.300107 | 0.515735 | 0.062107 | -0.115255 | -0.207411 | -0.458687 | 0.107673 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| Housing.Households with a computer | -0.541995 | -0.753502 | -0.706384 | 0.095203 | 0.062107 | -0.364809 | 0.471508 | 0.575901 | -0.000697 | -0.280283 | ... | -0.127756 | -0.264749 | 0.589177 | -0.010558 | 0.370020 | 1.000000 | 0.923669 | 0.749979 | 0.565945 | -0.791368 | 
| Housing.Households with a Internet | -0.398493 | -0.612936 | -0.577125 | -0.098394 | -0.115255 | -0.201042 | 0.516615 | 0.637379 | 0.045414 | -0.323449 | ... | -0.215326 | -0.206746 | 0.602584 | -0.039747 | 0.257074 | 0.923669 | 1.000000 | 0.820815 | 0.656077 | -0.878474 | 
| Income.Median Houseold Income | -0.275447 | -0.502661 | -0.574831 | -0.037176 | -0.207411 | -0.332490 | 0.433715 | 0.825304 | -0.311793 | -0.017949 | ... | -0.301793 | -0.429184 | 0.799816 | -0.114904 | 0.247522 | 0.749979 | 0.820815 | 1.000000 | 0.895940 | -0.755590 | 
| Income.Per Capita Income | 0.054492 | -0.269516 | -0.363581 | -0.190507 | -0.458687 | -0.233495 | 0.384923 | 0.925678 | -0.274597 | 0.169388 | ... | -0.059754 | -0.547328 | 0.734335 | 0.185974 | -0.099105 | 0.565945 | 0.656077 | 0.895940 | 1.000000 | -0.586132 | 
| Income.Persons Below Poverty Level | 0.331260 | 0.443643 | 0.450317 | 0.113660 | 0.107673 | 0.068212 | -0.743634 | -0.566515 | -0.176359 | 0.434675 | ... | 0.074541 | -0.024846 | -0.435180 | -0.057847 | -0.051614 | -0.791368 | -0.878474 | -0.755590 | -0.586132 | 1.000000 | 
26 rows × 26 columns
This will give every variable against all other variables, but in this case we don’t want to plot all that. We just want to see the correlation of the state demographic data to the cancer death data, and not the state demographic data with itself again.
So, let’s filter out the rows and columns that we don’t need.
cancer_columns = ["Types.Breast.Total", "Types.Colorectal.Total", "Types.Lung.Total"]
cancer_demographics_correlation_matrix = cancer_demographics_full_correlation_matrix[
    cancer_columns
].drop(cancer_columns, axis="rows")
cancer_demographics_correlation_matrix| Types.Breast.Total | Types.Colorectal.Total | Types.Lung.Total | |
|---|---|---|---|
| Age.Percent Under 5 Years | -0.388781 | -0.314083 | -0.435992 | 
| Age.Percent Under 18 Years | -0.498962 | -0.381218 | -0.411047 | 
| Age.Percent 65 and Older | 0.451119 | 0.600918 | 0.609561 | 
| Education.High School or Higher | -0.131677 | -0.158118 | -0.216806 | 
| Education.Bachelor's Degree or Higher | -0.019331 | -0.361382 | -0.437903 | 
| ... | ... | ... | ... | 
| Housing.Households with a computer | -0.541995 | -0.753502 | -0.706384 | 
| Housing.Households with a Internet | -0.398493 | -0.612936 | -0.577125 | 
| Income.Median Houseold Income | -0.275447 | -0.502661 | -0.574831 | 
| Income.Per Capita Income | 0.054492 | -0.269516 | -0.363581 | 
| Income.Persons Below Poverty Level | 0.331260 | 0.443643 | 0.450317 | 
23 rows × 3 columns
And now we can generate another heatmap.
sns.clustermap(
    cancer_demographics_correlation_matrix,
    # Specify the complete linkage for calculating clusters
    method="complete",
    # The relative space the dendrograms will occupy
    dendrogram_ratio=0.15,
    # Use the "icefire" diverging palette
    cmap="icefire",
    # Make sure the min color value occurs at -1
    vmin=-1,
    # Make sure the max color value occurs at 1
    vmax=1,
    # figsize=(12, 12),
    # Remove the x-axis tick labels
    xticklabels=True,
    # Remove the y-axis tick labels
    yticklabels=True,
    # Set the options for the color palette legend
    cbar_kws={
        "label": "Correlation",  # Set the label for the color palette legend
        "location": "bottom",  # Set the location of the color palette legend
    },
    # Set the location for the color palette legend
    # This is for top left
    cbar_pos=(
        0.03,  # Distance from the left
        0.92,  # Distance from the bottom
        0.10,  # Width
        0.05,  # Height
    ),
)
We can definitely see some patterns emerging. The groups most highly correlated with cancer deaths were people 65 and older, followed by people below the poverty line. In contrast, groups that were least correlated with cancer deaths included those who lived in households with a computer and fewer people per household.
There are also some interesting patterns related to ethnicity. For example, there is almost no correlation between ethnicity and colorectal or lung cancer death for Pacific Islanders, but a negative correlation for breast cancer. Meanwhile, there is a higher correlation between breast cancer deaths and ethnicity for Black people than for colorectal or lung cancer, which are both close to zero.
These patterns bring up some interesting questions. Are Pacific Islanders less likely to develop breast cancer than Black people? Or do the two groups tend to develop different types of breast cancer? Or are there social determinants or biases in healthcare that make breast cancer more deadly for Black people?
Exploratory data analysis, as you’ve seen, will show trends, but it is up to scientists and other domain experts to interpret the data and to determine causes.
Cancer Data Wrap-Up
In this section we learned some tricks about how to combine multiple datasets, and how to look for interesting data trends by including more metadata into our analysis.
You may have noticed that there are a lot more variables in the cancer dataset. You could definitely imagine doing a lot more with this data!
7.5 Public Health Data
Let’s check out some public health data next. This data is a bit different in that it has data for the states over many years, and it includes multiple diseases. This makes it a neat resource to learn a few more Pandas tricks!
Basics
We start by importing the data.
disease = pd.read_csv("../_data/health.csv")
disease.head()| disease | increase | loc | number | population | year | |
|---|---|---|---|---|---|---|
| 0 | MEASLES | 334.99 | ALABAMA | 8843 | 2640000 | 1928 | 
| 1 | MEASLES | 200.75 | ARIZONA | 847 | 422000 | 1928 | 
| 2 | MEASLES | 481.77 | ARKANSAS | 8899 | 1847000 | 1928 | 
| 3 | MEASLES | 69.22 | CALIFORNIA | 3698 | 5344000 | 1928 | 
| 4 | MEASLES | 206.98 | COLORADO | 2099 | 1014000 | 1928 | 
The first thing I want to do is clean it up a little bit. I like columns to use title case and to avoid abbreviations that aren’t in common usage. Additionally, the other datasets we looked at didn’t have entries in all caps, so I would like to fix that as well.
disease = (
    # Start with the disease DataFrame
    disease
    # Rename the columns to more readable format with capital letters
    .rename(
        columns={
            # Change 'disease' to 'Disease'
            "disease": "Disease",
            # Change 'increase' to 'Increase'
            "increase": "Increase",
            # Change 'loc' to 'State'
            "loc": "State",
            # Change 'number' to 'Cases'
            "number": "Cases",
            # Change 'population' as 'Population'
            "population": "Population",
            # Change 'year' to 'Year'
            "year": "Year",
        }
    )
    # Use assign() to create or modify columns without modifying the original
    # DataFrame
    .assign(
        # Convert the Disease column text to title case
        # (first letter of each word capitalized)
        Disease=lambda df: df["Disease"].str.title(),
        # Convert the State column text to title case
        # (first letter of each word capitalized)
        State=lambda df: df["State"].str.title(),
    )
)
disease| Disease | Increase | State | Cases | Population | Year | |
|---|---|---|---|---|---|---|
| 0 | Measles | 334.99 | Alabama | 8843 | 2640000 | 1928 | 
| 1 | Measles | 200.75 | Arizona | 847 | 422000 | 1928 | 
| 2 | Measles | 481.77 | Arkansas | 8899 | 1847000 | 1928 | 
| 3 | Measles | 69.22 | California | 3698 | 5344000 | 1928 | 
| 4 | Measles | 206.98 | Colorado | 2099 | 1014000 | 1928 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 14260 | Pertussis | 2.06 | Virginia | 171 | 8096604 | 2011 | 
| 14261 | Pertussis | 10.68 | Washington | 731 | 6830038 | 2011 | 
| 14262 | Pertussis | 1.99 | West Virginia | 37 | 1855364 | 2011 | 
| 14263 | Pertussis | 6.20 | Wisconsin | 351 | 5711767 | 2011 | 
| 14264 | Pertussis | 1.25 | Wyoming | 7 | 568158 | 2011 | 
14265 rows × 6 columns
We use the method chaining style again: each method returns a DataFrame that the next method operates on.
That lambda usage in assign() might be a bit obscure. Let’s break it down.
.assign(
    Disease=lambda df: df["Disease"].str.title()
)- Diseasewill be the column that holds the result of the evaluation of the- lambdafunction
- lambda df:- This creates a small anonymous function that takes a DataFramenameddfas input.
- Which data frame? Well, it’s the data frame that we are currently working on. If that is too mind-bending, just roll with it for now.
 
- This creates a small anonymous function that takes a 
- df["Disease"]selects just the Disease column from the data frame- df
- .str.title()is a Pandas string method that converts text to “Title Case” (capitalizes the first letter of each word).
So when you see Disease=lambda df: df["Disease"].str.title() in the assign() method, it means “create a new Disease column by taking the values from the existing Disease column and converting each value to title case.”
Note: Pandas has a special way of working with string data. To learn more, check out the working with text data docs.
Disease Rates
Similar to the other datasets, we will want to look at disease rates and not raw numbers since we don’t want the state population to influence our results. This time, let’s modify the existing data frame directly. We could have used assign() again, but I want to give you a little variety, in case you see something similar in the wild.
disease["CasesPer100k"] = disease["Cases"] / disease["Population"] * 100_000
disease.head()| Disease | Increase | State | Cases | Population | Year | CasesPer100k | |
|---|---|---|---|---|---|---|---|
| 0 | Measles | 334.99 | Alabama | 8843 | 2640000 | 1928 | 334.962121 | 
| 1 | Measles | 200.75 | Arizona | 847 | 422000 | 1928 | 200.710900 | 
| 2 | Measles | 481.77 | Arkansas | 8899 | 1847000 | 1928 | 481.808338 | 
| 3 | Measles | 69.22 | California | 3698 | 5344000 | 1928 | 69.199102 | 
| 4 | Measles | 206.98 | Colorado | 2099 | 1014000 | 1928 | 207.001972 | 
Some of these columns are no longer needed, so let’s drop them.
disease = disease.drop(columns=["Increase", "Population"])
disease| Disease | State | Cases | Year | CasesPer100k | |
|---|---|---|---|---|---|
| 0 | Measles | Alabama | 8843 | 1928 | 334.962121 | 
| 1 | Measles | Arizona | 847 | 1928 | 200.710900 | 
| 2 | Measles | Arkansas | 8899 | 1928 | 481.808338 | 
| 3 | Measles | California | 3698 | 1928 | 69.199102 | 
| 4 | Measles | Colorado | 2099 | 1928 | 207.001972 | 
| ... | ... | ... | ... | ... | ... | 
| 14260 | Pertussis | Virginia | 171 | 2011 | 2.111997 | 
| 14261 | Pertussis | Washington | 731 | 2011 | 10.702722 | 
| 14262 | Pertussis | West Virginia | 37 | 2011 | 1.994218 | 
| 14263 | Pertussis | Wisconsin | 351 | 2011 | 6.145209 | 
| 14264 | Pertussis | Wyoming | 7 | 2011 | 1.232052 | 
14265 rows × 5 columns
Grouping Data
Now that we have cleaned things up a little, let’s summarize the data.
disease.describe()| Cases | Year | CasesPer100k | |
|---|---|---|---|
| count | 14265.000000 | 14265.000000 | 14265.000000 | 
| mean | 1680.833719 | 1969.932983 | 57.365462 | 
| std | 6219.396618 | 23.050123 | 165.211789 | 
| min | 0.000000 | 1928.000000 | 0.000000 | 
| 25% | 17.000000 | 1948.000000 | 0.718467 | 
| 50% | 109.000000 | 1973.000000 | 3.928048 | 
| 75% | 677.000000 | 1989.000000 | 25.368169 | 
| max | 132342.000000 | 2011.000000 | 2974.789916 | 
Huh, that really didn’t give us useful information did it? The way this data is structured means that multiple columns are needed to identify unique observations. In this case a single observation is uniquely identified by a combination of state-year-disease. When we use describe, or other summarizing methods which we will see in a bit, we will first need to group the observations by subsets of the columns that uniquely identify them.
For example, to see averages for diseases per year across all states, we would group by Year and Disease:
disease.groupby(by=["Year", "Disease"]).describe()| Cases | CasesPer100k | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | ||
| Year | Disease | ||||||||||||||||
| 1928 | Measles | 47.0 | 10283.765957 | 16529.355868 | 0.0 | 1236.50 | 3698.0 | 8656.50 | 75391.0 | 47.0 | 360.119785 | 355.941940 | 0.000000 | 130.758467 | 256.223176 | 508.753856 | 1823.718365 | 
| Polio | 48.0 | 99.083333 | 155.322154 | 6.0 | 24.00 | 45.0 | 89.75 | 947.0 | 48.0 | 4.816878 | 4.951962 | 0.413365 | 1.563007 | 2.666954 | 6.917508 | 20.760799 | |
| Smallpox | 49.0 | 744.285714 | 924.561878 | 0.0 | 83.00 | 341.0 | 1285.00 | 3917.0 | 49.0 | 41.261417 | 49.954931 | 0.000000 | 2.793296 | 18.525449 | 73.181818 | 204.166667 | |
| 1929 | Measles | 46.0 | 7370.891304 | 11334.755452 | 211.0 | 1102.50 | 2164.5 | 7100.75 | 47605.0 | 46.0 | 262.191202 | 233.787770 | 12.133410 | 73.116765 | 182.629793 | 370.208359 | 1016.530334 | 
| Polio | 48.0 | 57.208333 | 91.662879 | 3.0 | 9.00 | 20.5 | 63.25 | 530.0 | 48.0 | 2.040845 | 1.718013 | 0.215983 | 0.839060 | 1.532462 | 2.824737 | 9.938144 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 2009 | Pertussis | 51.0 | 175.294118 | 208.747244 | 1.0 | 54.50 | 109.0 | 194.00 | 1081.0 | 51.0 | 3.254531 | 2.049602 | 0.166762 | 1.685645 | 2.778082 | 4.144554 | 10.021862 | 
| 2010 | Hepatitis A | 50.0 | 81.200000 | 396.252599 | 0.0 | 4.75 | 12.5 | 30.50 | 2815.0 | 50.0 | 0.667417 | 1.991235 | 0.000000 | 0.244203 | 0.397353 | 0.514911 | 14.378582 | 
| Pertussis | 51.0 | 276.705882 | 373.632830 | 6.0 | 43.00 | 132.0 | 326.00 | 1810.0 | 51.0 | 4.811068 | 4.255464 | 0.095149 | 1.927061 | 3.394460 | 5.986279 | 20.872745 | |
| 2011 | Hepatitis A | 48.0 | 22.958333 | 34.008108 | 0.0 | 4.00 | 8.5 | 28.00 | 176.0 | 48.0 | 0.289574 | 0.169517 | 0.000000 | 0.168690 | 0.278355 | 0.404930 | 0.798173 | 
| Pertussis | 51.0 | 217.647059 | 261.697522 | 5.0 | 46.50 | 111.0 | 275.00 | 1145.0 | 51.0 | 4.099466 | 3.043221 | 0.306022 | 1.679363 | 3.037787 | 6.325108 | 11.821290 | |
315 rows × 16 columns
Or, to see averages for diseases per state across all years, we would group by State and Disease:
disease.groupby(by=["State", "Disease"]).describe()| Cases | Year | CasesPer100k | ||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 25% | 50% | 75% | max | count | mean | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | ||
| State | Disease | |||||||||||||||||||||
| Alabama | Hepatitis A | 46.0 | 136.913043 | 130.868182 | 6.0 | 35.50 | 73.5 | 216.50 | 467.0 | 46.0 | 1988.500000 | ... | 1999.75 | 2011.0 | 46.0 | 3.648793 | 3.729064 | 0.132973 | 0.860860 | 1.717768 | 5.708617 | 12.877434 | 
| Measles | 66.0 | 3175.484848 | 4196.365912 | 0.0 | 23.50 | 1774.0 | 3784.75 | 18118.0 | 66.0 | 1961.166667 | ... | 1977.75 | 2002.0 | 66.0 | 107.593367 | 143.192331 | 0.000000 | 0.600964 | 53.786844 | 129.980033 | 590.279330 | |
| Mumps | 34.0 | 185.735294 | 285.659185 | 1.0 | 7.25 | 19.0 | 363.75 | 1025.0 | 34.0 | 1984.529412 | ... | 1992.75 | 2002.0 | 34.0 | 5.102450 | 7.967464 | 0.025053 | 0.167004 | 0.476587 | 9.596426 | 28.959711 | |
| Pertussis | 55.0 | 464.163636 | 679.201295 | 1.0 | 24.00 | 65.0 | 996.50 | 2249.0 | 55.0 | 1977.654545 | ... | 1997.50 | 2011.0 | 55.0 | 15.573160 | 23.568290 | 0.025173 | 0.587737 | 1.552425 | 33.401600 | 76.444596 | |
| Polio | 39.0 | 155.230769 | 186.117176 | 0.0 | 50.50 | 67.0 | 188.50 | 796.0 | 39.0 | 1947.076923 | ... | 1956.50 | 1968.0 | 39.0 | 5.265726 | 6.262218 | 0.000000 | 1.685470 | 2.348485 | 6.273355 | 27.429359 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| Wyoming | Mumps | 28.0 | 61.750000 | 142.390861 | 0.0 | 2.00 | 3.5 | 7.25 | 436.0 | 28.0 | 1983.821429 | ... | 1992.25 | 2001.0 | 28.0 | 17.632509 | 41.088007 | 0.000000 | 0.402704 | 0.726743 | 1.753256 | 123.148148 | 
| Pertussis | 50.0 | 98.660000 | 148.652534 | 0.0 | 4.00 | 12.0 | 154.50 | 559.0 | 50.0 | 1977.120000 | ... | 1998.75 | 2011.0 | 50.0 | 36.635964 | 58.913155 | 0.000000 | 0.846719 | 2.319264 | 55.790924 | 227.235772 | |
| Polio | 36.0 | 41.805556 | 60.885757 | 1.0 | 6.75 | 19.0 | 42.00 | 273.0 | 36.0 | 1945.527778 | ... | 1954.25 | 1964.0 | 36.0 | 15.125636 | 20.958616 | 0.296736 | 2.869318 | 6.709717 | 15.650277 | 93.174061 | |
| Rubella | 21.0 | 66.714286 | 191.147624 | 1.0 | 2.00 | 6.0 | 12.00 | 861.0 | 21.0 | 1978.095238 | ... | 1984.00 | 1995.0 | 21.0 | 19.367791 | 55.357490 | 0.203371 | 0.440929 | 1.242236 | 2.547296 | 247.880350 | |
| Smallpox | 24.0 | 74.250000 | 105.328162 | 1.0 | 2.00 | 9.5 | 119.00 | 334.0 | 24.0 | 1939.625000 | ... | 1945.25 | 1952.0 | 24.0 | 31.957904 | 45.915825 | 0.341297 | 0.779701 | 3.886957 | 50.388562 | 149.775785 | |
355 rows × 24 columns
And so on, with any combination required.
Note: We didn’t show it here, but you can also group by single columns, too.
You may have noticed that there is too much data to really get a sense of it in the interactive Quarto prompt. The quickest way around this is to use the to_clipboard() method, and then paste the data into your favorite spreadsheet program and you can get a better look.
disease.groupby(by=["State", "Disease"]).describe().to_clipboard()The to_clipboard() function can be very handy!
Plotting the Data
Having gotten ourselves acquainted with the data, we should make some plots! Let’s see how common the diseases were across all states and all years for which we have data.
(
    # Select only the columns we need for this chart
    disease[["Disease", "Cases"]]
    # Group the data by disease
    .groupby(by="Disease")
    # Take the sum of the cases
    .agg({"Cases": "sum"})
    # The groupby() operation will make Disease the row names,
    # but we want them as a column.
    .reset_index("Disease")
    # Sort the rows by number of cases
    .sort_values("Cases", ascending=False)
    # Plot the data
    .plot(
        # Do a bar chart
        kind="bar",
        # The x-axis is disease name
        x="Disease",
        # The y-axis is the number of cases
        y="Cases",
        # We want to log transform the y-axis
        logy=True,
        # Drop the legend since we only have a single series of data to plot
        legend=False,
        # Give the y-axis a label
        ylabel="Cases",
    )
)
Measles is by far the most common disease in the dataset. Let’s see if the prevalence of different diseases changes over time.
plot_data = (
    disease[["Disease", "Year", "Cases"]]
    .groupby(by=["Disease", "Year"])
    .agg({"Cases": "sum"})
    .reset_index(["Disease", "Year"])
)
plot_data.plot(
    kind="line",
    x="Year",
    y="Cases",
)
Oops! That’s not what we want. Pandas uses matplotlib under the hood, and by default, it works more naturally with a slightly different data format. For reference, here is the data we just tried to plot:
plot_data| Disease | Year | Cases | |
|---|---|---|---|
| 0 | Hepatitis A | 1966 | 32416 | 
| 1 | Hepatitis A | 1967 | 38280 | 
| 2 | Hepatitis A | 1968 | 45204 | 
| 3 | Hepatitis A | 1969 | 46012 | 
| 4 | Hepatitis A | 1970 | 55188 | 
| ... | ... | ... | ... | 
| 310 | Smallpox | 1948 | 446 | 
| 311 | Smallpox | 1949 | 67 | 
| 312 | Smallpox | 1950 | 42 | 
| 313 | Smallpox | 1951 | 19 | 
| 314 | Smallpox | 1952 | 16 | 
315 rows × 3 columns
Let’s adjust the format to one that works better for the plotting function:
plot_data = (
    # Select the columns we need
    disease[["Disease", "Year", "Cases"]]
    # Group by disease-year pairs
    .groupby(by=["Disease", "Year"])
    # Get the sum of the cases across the grouping variables
    .agg({"Cases": "sum"})
    # Convert the row names back to columns
    .reset_index(["Disease", "Year"])
    # Pivot! (See below for an explanation)
    .pivot(
        # Set the Year column to be the row index of the resulting table
        index="Year",
        # Take the new columns from the unique values in the Disease column
        columns="Disease",
        # Take the values for the new columns from the data in the Cases column
        values="Cases",
    )
    # Pivoting sets Year as the row names, so convert it back to a column
    .reset_index("Year")
)
# Don't worry about this--it's just to make the data print in a nicer way.
plot_data.columns.name = None
plot_data| Year | Hepatitis A | Measles | Mumps | Pertussis | Polio | Rubella | Smallpox | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1928 | NaN | 483337.0 | NaN | NaN | 4756.0 | NaN | 36470.0 | 
| 1 | 1929 | NaN | 339061.0 | NaN | NaN | 2746.0 | NaN | 38389.0 | 
| 2 | 1930 | NaN | 384597.0 | NaN | NaN | 8964.0 | NaN | 45728.0 | 
| 3 | 1931 | NaN | 438435.0 | NaN | NaN | 15743.0 | NaN | 28708.0 | 
| 4 | 1932 | NaN | 390114.0 | NaN | NaN | 3829.0 | NaN | 10740.0 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 79 | 2007 | 2243.0 | NaN | NaN | 6999.0 | NaN | NaN | NaN | 
| 80 | 2008 | 2030.0 | NaN | NaN | 6754.0 | NaN | NaN | NaN | 
| 81 | 2009 | 1580.0 | NaN | NaN | 8940.0 | NaN | NaN | NaN | 
| 82 | 2010 | 4060.0 | NaN | NaN | 14112.0 | NaN | NaN | NaN | 
| 83 | 2011 | 1102.0 | NaN | NaN | 11100.0 | NaN | NaN | NaN | 
84 rows × 8 columns
As you can see, we’ve taken the original data and made it wider. (If you’re familiar with R’s Tidyverse, you may recognize this operation as taking tidy data and turning into so-called “messy” data.)
The only new thing here is the pivot() function. Pivoting data can be a little tricky to get used to. The basic operation is taking unique values from a column and using each of those unique values to create a new column. (It’s actually more flexible than that, but that’s the general idea.)
In this case, we want to make new columns from the unique values in the Disease column (e.g., Measles, Mumps, and Rubella). The values of those new columns will be taken from the Cases column.
Now we can plot it!
disease_by_year_plot = plot_data.plot(
    # We want a line chart
    kind="line",
    # Put Year on the x-axis
    # (The y-axis is set implicitly as the other columns)
    x="Year",
    # Log transform the y-axis
    logy=True,
    # Give the y-axis a label
    ylabel="Cases",
)
What do you notice about this data? Let me list out some of the questions/observations that might come to mind when first looking at this plot:
- Why do many of the diseases have sharp declines and eventually drop out of the plot after a while?
- Why is there a gap in the Pertussis (whooping cough) data?
- Why is Pertussis the only one that has been on a steady increase since the 1970s?
- What’s with the big spike in measles cases in the 1980s?
- Why don’t all the diseases have data going back to the earliest years in the dataset?
- Are there any state-by-state trends, or do they mainly follow the national trends?
- Hang on, these are all diseases that we have good vaccination programs for….
Note: The corgis page with this data doesn’t provide much info, but I think it’s fairly safe to assume that when there are no cases reported for a given disease in a given year, that probably means there were no observed cases of that disease in that year.
Once you get to that last point, a lot of the other questions are probably pretty easy to explain! Let’s see if the introduction of the vaccine for these diseases corresponds with the decline in cases.
The polio vaccine was tested in 1954 and introduced in 1955 by Jonas Salk, becoming a significant breakthrough in the fight against polio. Let’s draw a little vertical dashed line at the year 1955 and see what it looks like.
plot_data.plot(kind="line", x="Year", logy=True, ylabel="Cases")
# Draw a gray, dashed, vertical line at year 1955
plt.axvline(x=1955, color="#999999", linestyle="--")
Yep! As you might have guessed, the polio cases really start a sharp decline after the introduction of the vaccine.
What about smallpox? The history of smallpox vaccination is a bit more circuitous than polio, but two key dates I found were 1939, which saw the start of egg-based smallpox vaccine production by the Texas Department of Health, and 1948, which was when they began to be used in vaccination campaigns (see Wikipedia’s Smallpox vaccine article). So let’s draw the plot again with those two dates highlighted.
plot_data.plot(kind="line", x="Year", logy=True, ylabel="Cases")
plt.axvline(x=1939, color="#999999", linestyle="--")
plt.axvline(x=1948, color="#999999", linestyle="--")
Again, that sharp decline we see lines up pretty well with those key dates in the history of the smallpox vaccination program.
So, we have now gotten a pretty solid working hypothesis that the drop offs in cases for these diseases are probably due to the introduction of vaccination programs! But what about the interesting variability shown in the measles and pertussis data? I will leave that as an exercise for the reader….
Public Health Data Wrap-Up
By examining public health data, we’ve gained experience working with time series data. We’ve also seen how the patterns we observe in this data can help direct our research questions and approaches.
7.6 Pandas Tour
To wrap up, let’s summarize everything by giving you a “little” reference that you can use for guidance.
For this, we can use the classic coffee data.
coffee = pd.read_csv("../_data/coffee_small.csv")
coffee| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 0 | Colombia | 2012 | 7.83 | 8.00 | 7.75 | 7.75 | 
| 1 | Colombia | 2012 | 7.75 | 7.92 | 7.83 | 7.75 | 
| 2 | Colombia | 2012 | 7.67 | 7.83 | 7.83 | 8.00 | 
| 3 | Colombia | 2012 | 7.42 | 7.67 | 7.75 | 7.75 | 
| 4 | Colombia | 2012 | 7.75 | 7.75 | 7.75 | 7.75 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 410 | Guatemala | 2015 | 7.50 | 7.25 | 7.00 | 7.25 | 
| 411 | Guatemala | 2015 | 7.50 | 7.33 | 7.00 | 7.42 | 
| 412 | Guatemala | 2015 | 7.25 | 7.17 | 6.75 | 7.25 | 
| 413 | Mexico | 2015 | 7.75 | 7.67 | 7.42 | 7.42 | 
| 414 | Mexico | 2015 | 7.58 | 7.50 | 7.33 | 7.50 | 
415 rows × 6 columns
Creating Data Frames
There are a ton of ways to create data frames in Pandas. (E.g., check out the docs for DataFrame and the pages linked from that page.) However, to keep it simple, we will stick with read_csv() and creating data frames from Python dictionaries.
Reading from a CSV file:
df = read_csv("/path/to/data.csv")Using a dictionary to specify columns:
df = pd.DataFrame(
    {
        "Gene": ["gene_1", "gene_2", "gene_3"],
        "Sequence": ["ACTG", "AAGT", "GGCT"],
        "Sample": ["Sample 1", "Sample 2", "Sample 2"],
    }
)
df| Gene | Sequence | Sample | |
|---|---|---|---|
| 0 | gene_1 | ACTG | Sample 1 | 
| 1 | gene_2 | AAGT | Sample 2 | 
| 2 | gene_3 | GGCT | Sample 2 | 
Subsetting and Filtering Rows and Columns
Selecting Columns
To get individual columns, use the bracket notation.
coffee["Country"]0       Colombia
1       Colombia
2       Colombia
3       Colombia
4       Colombia
         ...    
410    Guatemala
411    Guatemala
412    Guatemala
413       Mexico
414       Mexico
Name: Country, Length: 415, dtype: objectTo select multiple columns, use the bracket notation, but pass in a list of column names rather than a single value.
coffee[["Country", "Year"]]| Country | Year | |
|---|---|---|
| 0 | Colombia | 2012 | 
| 1 | Colombia | 2012 | 
| 2 | Colombia | 2012 | 
| 3 | Colombia | 2012 | 
| 4 | Colombia | 2012 | 
| ... | ... | ... | 
| 410 | Guatemala | 2015 | 
| 411 | Guatemala | 2015 | 
| 412 | Guatemala | 2015 | 
| 413 | Mexico | 2015 | 
| 414 | Mexico | 2015 | 
415 rows × 2 columns
It is often a good idea to save the columns that you want to get in a list, and pass that instead. This can help keep things clear and neat, especially if you have a ton of columns that you’re selecting. Additionally, it can be useful if you build up the list of columns to subset programmatically.
columns = ["Country", "Year"]
coffee[columns]| Country | Year | |
|---|---|---|
| 0 | Colombia | 2012 | 
| 1 | Colombia | 2012 | 
| 2 | Colombia | 2012 | 
| 3 | Colombia | 2012 | 
| 4 | Colombia | 2012 | 
| ... | ... | ... | 
| 410 | Guatemala | 2015 | 
| 411 | Guatemala | 2015 | 
| 412 | Guatemala | 2015 | 
| 413 | Mexico | 2015 | 
| 414 | Mexico | 2015 | 
415 rows × 2 columns
Sometimes, you may only want a single row, but you want the output to be a DataFrame rather than a Series.
coffee[["Country"]]| Country | |
|---|---|
| 0 | Colombia | 
| 1 | Colombia | 
| 2 | Colombia | 
| 3 | Colombia | 
| 4 | Colombia | 
| ... | ... | 
| 410 | Guatemala | 
| 411 | Guatemala | 
| 412 | Guatemala | 
| 413 | Mexico | 
| 414 | Mexico | 
415 rows × 1 columns
Sometimes you want to select columns based on patterns. For example, to take all the columns that start with Score, we can use filter():
coffee.filter(regex=r"^Score")| Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|
| 0 | 7.83 | 8.00 | 7.75 | 7.75 | 
| 1 | 7.75 | 7.92 | 7.83 | 7.75 | 
| 2 | 7.67 | 7.83 | 7.83 | 8.00 | 
| 3 | 7.42 | 7.67 | 7.75 | 7.75 | 
| 4 | 7.75 | 7.75 | 7.75 | 7.75 | 
| ... | ... | ... | ... | ... | 
| 410 | 7.50 | 7.25 | 7.00 | 7.25 | 
| 411 | 7.50 | 7.33 | 7.00 | 7.42 | 
| 412 | 7.25 | 7.17 | 6.75 | 7.25 | 
| 413 | 7.75 | 7.67 | 7.42 | 7.42 | 
| 414 | 7.58 | 7.50 | 7.33 | 7.50 | 
415 rows × 4 columns
That is using a regular expression (regex) to specify matching any column whose name starts with Score. (If you need an introduction to regular expressions see Appendix M.)
You can also use a substring matching version:
coffee.filter(like="Score")| Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|
| 0 | 7.83 | 8.00 | 7.75 | 7.75 | 
| 1 | 7.75 | 7.92 | 7.83 | 7.75 | 
| 2 | 7.67 | 7.83 | 7.83 | 8.00 | 
| 3 | 7.42 | 7.67 | 7.75 | 7.75 | 
| 4 | 7.75 | 7.75 | 7.75 | 7.75 | 
| ... | ... | ... | ... | ... | 
| 410 | 7.50 | 7.25 | 7.00 | 7.25 | 
| 411 | 7.50 | 7.33 | 7.00 | 7.42 | 
| 412 | 7.25 | 7.17 | 6.75 | 7.25 | 
| 413 | 7.75 | 7.67 | 7.42 | 7.42 | 
| 414 | 7.58 | 7.50 | 7.33 | 7.50 | 
415 rows × 4 columns
Filtering Data
A common operation is to keep or reject rows of data based on their values in certain columns. For example, to keep study participants aged 65 and over, or to reject any counties with median income greater than $35,000. We can do this using query().
Here are some examples with the coffee data.
Select rows where the country is Mexico:
coffee.query("Country == 'Mexico'")| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 74 | Mexico | 2012 | 8.17 | 8.25 | 8.17 | 8.00 | 
| 75 | Mexico | 2012 | 7.75 | 8.00 | 7.75 | 7.83 | 
| 76 | Mexico | 2012 | 7.83 | 7.58 | 7.67 | 7.92 | 
| 77 | Mexico | 2012 | 8.00 | 7.92 | 7.42 | 8.08 | 
| 78 | Mexico | 2012 | 7.92 | 7.83 | 7.75 | 7.75 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 381 | Mexico | 2014 | 7.58 | 7.75 | 7.42 | 7.50 | 
| 382 | Mexico | 2014 | 7.08 | 7.08 | 6.92 | 7.08 | 
| 383 | Mexico | 2014 | 6.92 | 7.00 | 6.83 | 6.92 | 
| 413 | Mexico | 2015 | 7.75 | 7.67 | 7.42 | 7.42 | 
| 414 | Mexico | 2015 | 7.58 | 7.50 | 7.33 | 7.50 | 
195 rows × 6 columns
Select rows where the country is Colombia and the year is 2014:
coffee.query("Country == 'Colombia' and Year == 2014")| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 327 | Colombia | 2014 | 7.92 | 7.75 | 7.67 | 7.92 | 
| 328 | Colombia | 2014 | 7.92 | 7.75 | 7.67 | 7.75 | 
| 329 | Colombia | 2014 | 7.67 | 7.75 | 7.67 | 7.67 | 
| 330 | Colombia | 2014 | 7.67 | 7.67 | 7.50 | 7.50 | 
| 331 | Colombia | 2014 | 7.83 | 7.75 | 7.58 | 7.58 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 337 | Colombia | 2014 | 7.58 | 7.50 | 7.50 | 7.67 | 
| 338 | Colombia | 2014 | 7.58 | 7.75 | 7.50 | 7.58 | 
| 339 | Colombia | 2014 | 7.83 | 7.67 | 7.50 | 7.58 | 
| 340 | Colombia | 2014 | 7.50 | 7.58 | 7.58 | 7.75 | 
| 341 | Colombia | 2014 | 7.67 | 7.42 | 7.42 | 7.58 | 
15 rows × 6 columns
Select rows where the aroma and flavor scores are both at least 8:
coffee.query("`Score.Aroma` >= 8 and `Score.Aroma` >= 8")| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 31 | Guatemala | 2012 | 8.00 | 8.00 | 7.83 | 8.33 | 
| 32 | Guatemala | 2012 | 8.25 | 8.00 | 7.67 | 7.92 | 
| 33 | Guatemala | 2012 | 8.00 | 7.83 | 7.67 | 7.83 | 
| 38 | Guatemala | 2012 | 8.17 | 8.17 | 7.83 | 7.67 | 
| 74 | Mexico | 2012 | 8.17 | 8.25 | 8.17 | 8.00 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 344 | Guatemala | 2014 | 8.00 | 7.92 | 7.67 | 7.58 | 
| 350 | Guatemala | 2014 | 8.00 | 7.50 | 7.33 | 8.00 | 
| 384 | Colombia | 2015 | 8.17 | 7.92 | 7.83 | 8.00 | 
| 385 | Colombia | 2015 | 8.17 | 7.83 | 7.58 | 8.08 | 
| 391 | Colombia | 2015 | 8.00 | 7.50 | 7.50 | 7.67 | 
20 rows × 6 columns
Do you see how we put backticks around the column names? This is because it had a dot (.) in the name. Whenever you have special characters, like dot or spaces, you will need to use backticks around the variable name.
Select rows where the aftertaste score was at least 8 and the acidity was below 7:
coffee.query("`Score.Aftertaste` >= 8 or `Score.Acidity` < 7")| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 30 | Guatemala | 2012 | 7.83 | 8.00 | 8.00 | 8.17 | 
| 74 | Mexico | 2012 | 8.17 | 8.25 | 8.17 | 8.00 | 
| 109 | Mexico | 2012 | 8.08 | 8.17 | 8.00 | 8.00 | 
| 193 | Mexico | 2012 | 7.33 | 7.17 | 7.00 | 6.75 | 
| 214 | Mexico | 2012 | 7.25 | 7.17 | 7.00 | 6.83 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 230 | Mexico | 2012 | 7.00 | 6.58 | 6.67 | 6.83 | 
| 233 | Mexico | 2012 | 6.92 | 6.92 | 6.92 | 6.92 | 
| 237 | Colombia | 2013 | 8.00 | 8.00 | 8.00 | 8.17 | 
| 383 | Mexico | 2014 | 6.92 | 7.00 | 6.83 | 6.92 | 
| 387 | Colombia | 2015 | 7.75 | 7.67 | 8.00 | 7.67 | 
13 rows × 6 columns
Select rows where the year is prior to 2015, or the Flavor and the Aftertaste are above 7.75.
coffee.query("Year < 2015 or (`Score.Flavor` > 7.75 and `Score.Aftertaste` > 7.75)")| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 0 | Colombia | 2012 | 7.83 | 8.00 | 7.75 | 7.75 | 
| 1 | Colombia | 2012 | 7.75 | 7.92 | 7.83 | 7.75 | 
| 2 | Colombia | 2012 | 7.67 | 7.83 | 7.83 | 8.00 | 
| 3 | Colombia | 2012 | 7.42 | 7.67 | 7.75 | 7.75 | 
| 4 | Colombia | 2012 | 7.75 | 7.75 | 7.75 | 7.75 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 380 | Mexico | 2014 | 7.25 | 7.17 | 7.17 | 7.25 | 
| 381 | Mexico | 2014 | 7.58 | 7.75 | 7.42 | 7.50 | 
| 382 | Mexico | 2014 | 7.08 | 7.08 | 6.92 | 7.08 | 
| 383 | Mexico | 2014 | 6.92 | 7.00 | 6.83 | 6.92 | 
| 384 | Colombia | 2015 | 8.17 | 7.92 | 7.83 | 8.00 | 
385 rows × 6 columns
You can also use the bracket notation for filtering data:
coffee[coffee["Year"] >= 2014]| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 327 | Colombia | 2014 | 7.92 | 7.75 | 7.67 | 7.92 | 
| 328 | Colombia | 2014 | 7.92 | 7.75 | 7.67 | 7.75 | 
| 329 | Colombia | 2014 | 7.67 | 7.75 | 7.67 | 7.67 | 
| 330 | Colombia | 2014 | 7.67 | 7.67 | 7.50 | 7.50 | 
| 331 | Colombia | 2014 | 7.83 | 7.75 | 7.58 | 7.58 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 410 | Guatemala | 2015 | 7.50 | 7.25 | 7.00 | 7.25 | 
| 411 | Guatemala | 2015 | 7.50 | 7.33 | 7.00 | 7.42 | 
| 412 | Guatemala | 2015 | 7.25 | 7.17 | 6.75 | 7.25 | 
| 413 | Mexico | 2015 | 7.75 | 7.67 | 7.42 | 7.42 | 
| 414 | Mexico | 2015 | 7.58 | 7.50 | 7.33 | 7.50 | 
88 rows × 6 columns
You will find that Pandas has many ways to achieve the same thing. While this flexibility can be helpful especially if you are doing some exploratory analysis for your research, it is often a good idea to stick to one way of doing things, especially within a single script or data pipeline.
Selecting Rows and Columns
Sometimes, you may need to select rows and columns at the same time. For this, you can use loc() and iloc(). These tend to feel more natural in situations in which you have a row index. Our data does not, so let’s use this little data frame instead to illustrate what I mean.
df = pd.DataFrame(
    {"A": [1, 3, 5, 6], "B": [2, 1, 4, 7], "C": [4, 3, 3, 1], "D": [9, 7, 4, 2]},
    index=["Patient 1", "Patient 2", "Patient 3", "Patient 4"],
)
df| A | B | C | D | |
|---|---|---|---|---|
| Patient 1 | 1 | 2 | 4 | 9 | 
| Patient 2 | 3 | 1 | 3 | 7 | 
| Patient 3 | 5 | 4 | 3 | 4 | 
| Patient 4 | 6 | 7 | 1 | 2 | 
In this data frame, row names are “Patient 1” to “Patient 4” and the column names are A-D.
Getting a single value:
df.loc["Patient 2", "C"]np.int64(3)Getting the values for a single row as a series:
df.loc["Patient 4", :]A    6
B    7
C    1
D    2
Name: Patient 4, dtype: int64Getting a single column as a series:
df.loc[:, "A"]Patient 1    1
Patient 2    3
Patient 3    5
Patient 4    6
Name: A, dtype: int64Getting multiple rows for a single column:
df.loc[["Patient 1", "Patient 3"], "B"]Patient 1    2
Patient 3    4
Name: B, dtype: int64Getting multiple contiguous rows for a single column. (We can use Python’s slice notation for this. Just be aware that the end of the slice is included in the output when you use loc(), unlike the usual Python slicing.)
df.loc["Patient 1":"Patient 3", "B"]Patient 1    2
Patient 2    1
Patient 3    4
Name: B, dtype: int64Getting multiple columns for a single row:
df.loc["Patient 2", ["B", "D"]]B    1
D    7
Name: Patient 2, dtype: int64Getting multiple contiguous columns for a single row using slicing.
df.loc["Patient 2", "B":"D"]B    1
C    3
D    7
Name: Patient 2, dtype: int64Getting multiple rows and multiple columns:
df.loc[["Patient 1", "Patient 4"], ["B", "D"]]| B | D | |
|---|---|---|
| Patient 1 | 2 | 9 | 
| Patient 4 | 7 | 2 | 
And of course, you can mix and match slicing as required to get exactly the data you need:
df.loc["Patient 1":"Patient 3", ["A", "C"]]| A | C | |
|---|---|---|
| Patient 1 | 1 | 4 | 
| Patient 2 | 3 | 3 | 
| Patient 3 | 5 | 3 | 
Subsetting Rows
There are a few nice functions for sampling and downsizing your data frames. This can be helpful if you’re working with huge data, or if you just want to get a better handle on things before working on the full data.
Get the first few rows of a table:
coffee.head()| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 0 | Colombia | 2012 | 7.83 | 8.00 | 7.75 | 7.75 | 
| 1 | Colombia | 2012 | 7.75 | 7.92 | 7.83 | 7.75 | 
| 2 | Colombia | 2012 | 7.67 | 7.83 | 7.83 | 8.00 | 
| 3 | Colombia | 2012 | 7.42 | 7.67 | 7.75 | 7.75 | 
| 4 | Colombia | 2012 | 7.75 | 7.75 | 7.75 | 7.75 | 
Get the last few rows of a table:
coffee.tail()| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 410 | Guatemala | 2015 | 7.50 | 7.25 | 7.00 | 7.25 | 
| 411 | Guatemala | 2015 | 7.50 | 7.33 | 7.00 | 7.42 | 
| 412 | Guatemala | 2015 | 7.25 | 7.17 | 6.75 | 7.25 | 
| 413 | Mexico | 2015 | 7.75 | 7.67 | 7.42 | 7.42 | 
| 414 | Mexico | 2015 | 7.58 | 7.50 | 7.33 | 7.50 | 
Sample 5% of the rows of a table:
coffee.sample(frac=0.05)| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 7 | Colombia | 2012 | 7.67 | 7.58 | 7.58 | 7.58 | 
| 68 | Guatemala | 2012 | 7.50 | 7.00 | 6.83 | 7.50 | 
| 135 | Mexico | 2012 | 7.67 | 7.50 | 7.25 | 7.58 | 
| 96 | Mexico | 2012 | 7.58 | 7.75 | 7.42 | 7.67 | 
| 408 | Guatemala | 2015 | 7.50 | 7.42 | 7.17 | 7.83 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 378 | Mexico | 2014 | 7.42 | 7.42 | 7.67 | 8.00 | 
| 65 | Guatemala | 2012 | 7.17 | 7.17 | 6.50 | 7.17 | 
| 230 | Mexico | 2012 | 7.00 | 6.58 | 6.67 | 6.83 | 
| 237 | Colombia | 2013 | 8.00 | 8.00 | 8.00 | 8.17 | 
| 114 | Mexico | 2012 | 7.50 | 7.58 | 7.50 | 7.50 | 
21 rows × 6 columns
Randomly select 10 rows from a table:
coffee.sample(n=10)| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 172 | Mexico | 2012 | 7.17 | 7.08 | 7.25 | 7.42 | 
| 42 | Guatemala | 2012 | 7.50 | 7.50 | 7.75 | 7.50 | 
| 298 | Guatemala | 2013 | 7.33 | 7.33 | 6.83 | 7.50 | 
| 388 | Colombia | 2015 | 7.67 | 7.67 | 7.75 | 7.75 | 
| 264 | Colombia | 2013 | 7.67 | 7.42 | 7.33 | 7.50 | 
| 24 | Colombia | 2012 | 7.67 | 7.50 | 7.33 | 7.50 | 
| 137 | Mexico | 2012 | 7.67 | 7.33 | 7.25 | 7.50 | 
| 333 | Colombia | 2014 | 7.67 | 7.67 | 7.67 | 7.92 | 
| 238 | Colombia | 2013 | 7.75 | 7.75 | 7.83 | 7.75 | 
| 6 | Colombia | 2012 | 7.67 | 7.67 | 7.58 | 7.67 | 
Select the top 10 highest entries for a column:
coffee.nlargest(10, "Score.Flavor")| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 74 | Mexico | 2012 | 8.17 | 8.25 | 8.17 | 8.00 | 
| 38 | Guatemala | 2012 | 8.17 | 8.17 | 7.83 | 7.67 | 
| 39 | Guatemala | 2012 | 7.67 | 8.17 | 7.50 | 7.42 | 
| 109 | Mexico | 2012 | 8.08 | 8.17 | 8.00 | 8.00 | 
| 0 | Colombia | 2012 | 7.83 | 8.00 | 7.75 | 7.75 | 
| 29 | Colombia | 2012 | 7.42 | 8.00 | 7.83 | 7.75 | 
| 30 | Guatemala | 2012 | 7.83 | 8.00 | 8.00 | 8.17 | 
| 31 | Guatemala | 2012 | 8.00 | 8.00 | 7.83 | 8.33 | 
| 32 | Guatemala | 2012 | 8.25 | 8.00 | 7.67 | 7.92 | 
| 35 | Guatemala | 2012 | 7.67 | 8.00 | 7.67 | 8.00 | 
Select the 10 smallest entries for a column:
coffee.nsmallest(10, "Score.Aroma")| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 234 | Mexico | 2012 | 6.50 | 6.67 | 6.42 | 7.17 | 
| 306 | Guatemala | 2013 | 6.50 | 6.33 | 6.50 | 7.50 | 
| 70 | Guatemala | 2012 | 6.67 | 6.50 | 6.33 | 7.00 | 
| 71 | Guatemala | 2012 | 6.75 | 6.50 | 6.17 | 7.00 | 
| 72 | Guatemala | 2012 | 6.75 | 6.67 | 6.17 | 7.17 | 
| 176 | Mexico | 2012 | 6.92 | 7.17 | 7.08 | 7.50 | 
| 194 | Mexico | 2012 | 6.92 | 7.17 | 7.17 | 7.17 | 
| 211 | Mexico | 2012 | 6.92 | 6.92 | 6.67 | 7.25 | 
| 231 | Mexico | 2012 | 6.92 | 6.42 | 6.17 | 7.33 | 
| 233 | Mexico | 2012 | 6.92 | 6.92 | 6.92 | 6.92 | 
Reshaping Data
You will often find yourself needing to change the layout of your data. This can include sorting, reindexing (i.e., adding or changing indices), renaming columns, concatenating, pivoting, and melting data frames.
Sorting Rows and Columns
Sorting a data frame by a specific column:
coffee.sort_values("Score.Aftertaste")| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 231 | Mexico | 2012 | 6.92 | 6.42 | 6.17 | 7.33 | 
| 66 | Guatemala | 2012 | 7.17 | 6.08 | 6.17 | 7.25 | 
| 72 | Guatemala | 2012 | 6.75 | 6.67 | 6.17 | 7.17 | 
| 67 | Guatemala | 2012 | 7.17 | 6.17 | 6.17 | 7.08 | 
| 71 | Guatemala | 2012 | 6.75 | 6.50 | 6.17 | 7.00 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 237 | Colombia | 2013 | 8.00 | 8.00 | 8.00 | 8.17 | 
| 387 | Colombia | 2015 | 7.75 | 7.67 | 8.00 | 7.67 | 
| 30 | Guatemala | 2012 | 7.83 | 8.00 | 8.00 | 8.17 | 
| 109 | Mexico | 2012 | 8.08 | 8.17 | 8.00 | 8.00 | 
| 74 | Mexico | 2012 | 8.17 | 8.25 | 8.17 | 8.00 | 
415 rows × 6 columns
The default sort order is ascending. If you want descending order, you must ask for it:
coffee.sort_values("Score.Aftertaste", ascending=False)| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 74 | Mexico | 2012 | 8.17 | 8.25 | 8.17 | 8.00 | 
| 237 | Colombia | 2013 | 8.00 | 8.00 | 8.00 | 8.17 | 
| 30 | Guatemala | 2012 | 7.83 | 8.00 | 8.00 | 8.17 | 
| 387 | Colombia | 2015 | 7.75 | 7.67 | 8.00 | 7.67 | 
| 109 | Mexico | 2012 | 8.08 | 8.17 | 8.00 | 8.00 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 231 | Mexico | 2012 | 6.92 | 6.42 | 6.17 | 7.33 | 
| 71 | Guatemala | 2012 | 6.75 | 6.50 | 6.17 | 7.00 | 
| 66 | Guatemala | 2012 | 7.17 | 6.08 | 6.17 | 7.25 | 
| 67 | Guatemala | 2012 | 7.17 | 6.17 | 6.17 | 7.08 | 
| 72 | Guatemala | 2012 | 6.75 | 6.67 | 6.17 | 7.17 | 
415 rows × 6 columns
You can sort values based on multiple columns as well. In addition, you can specify multiple values to the ascending option so that you can have some columns sorted ascending and others sorted descending. Let’s first sort by ascending year, and then by descending flavor (that is within a year, put the best scoring coffees at the top).
coffee.sort_values(["Year", "Score.Flavor"], ascending=[True, False])| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|---|---|
| 74 | Mexico | 2012 | 8.17 | 8.25 | 8.17 | 8.00 | 
| 38 | Guatemala | 2012 | 8.17 | 8.17 | 7.83 | 7.67 | 
| 39 | Guatemala | 2012 | 7.67 | 8.17 | 7.50 | 7.42 | 
| 109 | Mexico | 2012 | 8.08 | 8.17 | 8.00 | 8.00 | 
| 0 | Colombia | 2012 | 7.83 | 8.00 | 7.75 | 7.75 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 414 | Mexico | 2015 | 7.58 | 7.50 | 7.33 | 7.50 | 
| 408 | Guatemala | 2015 | 7.50 | 7.42 | 7.17 | 7.83 | 
| 411 | Guatemala | 2015 | 7.50 | 7.33 | 7.00 | 7.42 | 
| 410 | Guatemala | 2015 | 7.50 | 7.25 | 7.00 | 7.25 | 
| 412 | Guatemala | 2015 | 7.25 | 7.17 | 6.75 | 7.25 | 
415 rows × 6 columns
Finally, you can sort based on the names of row and column indices. Here is an example that sorts by the names of the columns.
coffee.sort_index(axis="columns")| Country | Score.Acidity | Score.Aftertaste | Score.Aroma | Score.Flavor | Year | |
|---|---|---|---|---|---|---|
| 0 | Colombia | 7.75 | 7.75 | 7.83 | 8.00 | 2012 | 
| 1 | Colombia | 7.75 | 7.83 | 7.75 | 7.92 | 2012 | 
| 2 | Colombia | 8.00 | 7.83 | 7.67 | 7.83 | 2012 | 
| 3 | Colombia | 7.75 | 7.75 | 7.42 | 7.67 | 2012 | 
| 4 | Colombia | 7.75 | 7.75 | 7.75 | 7.75 | 2012 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 410 | Guatemala | 7.25 | 7.00 | 7.50 | 7.25 | 2015 | 
| 411 | Guatemala | 7.42 | 7.00 | 7.50 | 7.33 | 2015 | 
| 412 | Guatemala | 7.25 | 6.75 | 7.25 | 7.17 | 2015 | 
| 413 | Mexico | 7.42 | 7.42 | 7.75 | 7.67 | 2015 | 
| 414 | Mexico | 7.50 | 7.33 | 7.58 | 7.50 | 2015 | 
415 rows × 6 columns
In pandas, every row and column needs a unique identifier called an index. While pandas will automatically assign numeric indexes if none are provided, you can also specify custom row and column names to better organize and access your data.
Renaming Columns
Renaming columns is very useful for dealing with your colleague’s messy data! You will learn this one like the back of your hand.
coffee.rename(
    columns={
        "Score.Acidity": "Acidity",
        "Score.Aftertaste": "Aftertaste",
        "Score.Aroma": "Aroma",
        "Score.Flavor": "Flavor",
    }
)| Country | Year | Aroma | Flavor | Aftertaste | Acidity | |
|---|---|---|---|---|---|---|
| 0 | Colombia | 2012 | 7.83 | 8.00 | 7.75 | 7.75 | 
| 1 | Colombia | 2012 | 7.75 | 7.92 | 7.83 | 7.75 | 
| 2 | Colombia | 2012 | 7.67 | 7.83 | 7.83 | 8.00 | 
| 3 | Colombia | 2012 | 7.42 | 7.67 | 7.75 | 7.75 | 
| 4 | Colombia | 2012 | 7.75 | 7.75 | 7.75 | 7.75 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 410 | Guatemala | 2015 | 7.50 | 7.25 | 7.00 | 7.25 | 
| 411 | Guatemala | 2015 | 7.50 | 7.33 | 7.00 | 7.42 | 
| 412 | Guatemala | 2015 | 7.25 | 7.17 | 6.75 | 7.25 | 
| 413 | Mexico | 2015 | 7.75 | 7.67 | 7.42 | 7.42 | 
| 414 | Mexico | 2015 | 7.58 | 7.50 | 7.33 | 7.50 | 
415 rows × 6 columns
In cases like this, where there is a pattern in the column names that you want to get rid of, you can use a Python function.
coffee.rename(columns=lambda name: name.replace("Score.", ""))| Country | Year | Aroma | Flavor | Aftertaste | Acidity | |
|---|---|---|---|---|---|---|
| 0 | Colombia | 2012 | 7.83 | 8.00 | 7.75 | 7.75 | 
| 1 | Colombia | 2012 | 7.75 | 7.92 | 7.83 | 7.75 | 
| 2 | Colombia | 2012 | 7.67 | 7.83 | 7.83 | 8.00 | 
| 3 | Colombia | 2012 | 7.42 | 7.67 | 7.75 | 7.75 | 
| 4 | Colombia | 2012 | 7.75 | 7.75 | 7.75 | 7.75 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 410 | Guatemala | 2015 | 7.50 | 7.25 | 7.00 | 7.25 | 
| 411 | Guatemala | 2015 | 7.50 | 7.33 | 7.00 | 7.42 | 
| 412 | Guatemala | 2015 | 7.25 | 7.17 | 6.75 | 7.25 | 
| 413 | Mexico | 2015 | 7.75 | 7.67 | 7.42 | 7.42 | 
| 414 | Mexico | 2015 | 7.58 | 7.50 | 7.33 | 7.50 | 
415 rows × 6 columns
Nice!
Dropping Columns
Dropping columns is easy with drop():
coffee.drop(columns=["Country", "Year"])| Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | |
|---|---|---|---|---|
| 0 | 7.83 | 8.00 | 7.75 | 7.75 | 
| 1 | 7.75 | 7.92 | 7.83 | 7.75 | 
| 2 | 7.67 | 7.83 | 7.83 | 8.00 | 
| 3 | 7.42 | 7.67 | 7.75 | 7.75 | 
| 4 | 7.75 | 7.75 | 7.75 | 7.75 | 
| ... | ... | ... | ... | ... | 
| 410 | 7.50 | 7.25 | 7.00 | 7.25 | 
| 411 | 7.50 | 7.33 | 7.00 | 7.42 | 
| 412 | 7.25 | 7.17 | 6.75 | 7.25 | 
| 413 | 7.75 | 7.67 | 7.42 | 7.42 | 
| 414 | 7.58 | 7.50 | 7.33 | 7.50 | 
415 rows × 4 columns
Reindexing
“Reindexing” sounds a bit weird, but it is all about editing row and column names. A common operation involves turning row names into an explicit column, or turning a column into row names. Let’s go back to our tiny data frame again.
df = pd.DataFrame(
    {"A": [1, 3, 5, 6], "B": [2, 1, 4, 7], "C": [4, 3, 3, 1], "D": [9, 7, 4, 2]},
    index=["Patient 1", "Patient 2", "Patient 3", "Patient 4"],
)
df| A | B | C | D | |
|---|---|---|---|---|
| Patient 1 | 1 | 2 | 4 | 9 | 
| Patient 2 | 3 | 1 | 3 | 7 | 
| Patient 3 | 5 | 4 | 3 | 4 | 
| Patient 4 | 6 | 7 | 1 | 2 | 
Currently, the row names are the patient IDs. Let’s change those to a column called "PatientID":
df = df.reset_index(names="PatientID")
df| PatientID | A | B | C | D | |
|---|---|---|---|---|---|
| 0 | Patient 1 | 1 | 2 | 4 | 9 | 
| 1 | Patient 2 | 3 | 1 | 3 | 7 | 
| 2 | Patient 3 | 5 | 4 | 3 | 4 | 
| 3 | Patient 4 | 6 | 7 | 1 | 2 | 
Now the row index is the numbers 0-3. If we want to turn the PatientID column back into the row index, we use set_index() like this:
df = df.set_index("PatientID")
df| A | B | C | D | |
|---|---|---|---|---|
| PatientID | ||||
| Patient 1 | 1 | 2 | 4 | 9 | 
| Patient 2 | 3 | 1 | 3 | 7 | 
| Patient 3 | 5 | 4 | 3 | 4 | 
| Patient 4 | 6 | 7 | 1 | 2 | 
There are times when having a row index can be convenient, but generally I have data as columns only as it can make for cleaner data manipulation.
Concatenating
Sometimes you have multiple data frames that you want to glue together. This can be done either by glueing rows and keeping columns the same, or by glueing columns and keeping rows the same.
To see it in action, let’s make two small data frames. One for group A:
group_a = pd.DataFrame({"Treatment": [1, 2, 3], "Control": [4, 5, 6]})
group_a| Treatment | Control | |
|---|---|---|
| 0 | 1 | 4 | 
| 1 | 2 | 5 | 
| 2 | 3 | 6 | 
And one for group B:
group_b = pd.DataFrame({"Treatment": [10, 20, 30], "Control": [40, 50, 60]})
group_b| Treatment | Control | |
|---|---|---|
| 0 | 10 | 40 | 
| 1 | 20 | 50 | 
| 2 | 30 | 60 | 
Glue Rows
Glue the rows (this also works for more than two data frames):
pd.concat([group_a, group_b])| Treatment | Control | |
|---|---|---|
| 0 | 1 | 4 | 
| 1 | 2 | 5 | 
| 2 | 3 | 6 | 
| 0 | 10 | 40 | 
| 1 | 20 | 50 | 
| 2 | 30 | 60 | 
That’s fine, but you probably still want the group info. You can add in the column before concatenating:
pd.concat(
    [
        group_a.assign(Group="A"),
        group_b.assign(Group="B"),
    ]
)| Treatment | Control | Group | |
|---|---|---|---|
| 0 | 1 | 4 | A | 
| 1 | 2 | 5 | A | 
| 2 | 3 | 6 | A | 
| 0 | 10 | 40 | B | 
| 1 | 20 | 50 | B | 
| 2 | 30 | 60 | B | 
There we go! Now we haven’t lost the info.
Glue Columns
Glue the columns:
print(pd.concat([group_a, group_b], axis="columns"))   Treatment  Control  Treatment  Control
0          1        4         10       40
1          2        5         20       50
2          3        6         30       60Now that is a little silly to glue the columns in this way since it results in duplicate column names. You might want to follow it with a rename operation. Let’s pretend that the two data frames represent two study groups, each having values for treatment and control. Then you might do something like this:
pd.concat(
    [
        group_a.add_prefix("GroupA"),
        group_b.add_prefix("GroupB"),
    ],
    axis="columns",
)| GroupATreatment | GroupAControl | GroupBTreatment | GroupBControl | |
|---|---|---|---|---|
| 0 | 1 | 4 | 10 | 40 | 
| 1 | 2 | 5 | 20 | 50 | 
| 2 | 3 | 6 | 30 | 60 | 
Nice!
Pivoting and Melting
Alright, now this can get a little bit tricky. These operations are sometimes called pivot wider and pivot longer. Let’s just talk about the basics, but know that you can get pretty fancy with this if required.
(We will use the renamed columns so the output looks a little nicer.)
coffee_renamed = coffee.rename(columns=lambda name: name.replace("Score.", ""))
coffee_tidy = coffee_renamed.melt(
    # These variables identify observations
    id_vars=["Country", "Year"],
    # The column name for the variables
    var_name="Category",
    # The column name for the values of those variables
    value_name="Score",
)
coffee_tidy| Country | Year | Category | Score | |
|---|---|---|---|---|
| 0 | Colombia | 2012 | Aroma | 7.83 | 
| 1 | Colombia | 2012 | Aroma | 7.75 | 
| 2 | Colombia | 2012 | Aroma | 7.67 | 
| 3 | Colombia | 2012 | Aroma | 7.42 | 
| 4 | Colombia | 2012 | Aroma | 7.75 | 
| ... | ... | ... | ... | ... | 
| 1655 | Guatemala | 2015 | Acidity | 7.25 | 
| 1656 | Guatemala | 2015 | Acidity | 7.42 | 
| 1657 | Guatemala | 2015 | Acidity | 7.25 | 
| 1658 | Mexico | 2015 | Acidity | 7.42 | 
| 1659 | Mexico | 2015 | Acidity | 7.50 | 
1660 rows × 4 columns
Certain operations require tidy data or are more natural when the data is in this format. For example, making plots using seaborn.
import seaborn as sns
sns.boxplot(coffee_tidy.sort_values("Category"), x="Category", y="Score", hue="Country")
(Converting the coffee_tidy data frame back to the messy form is a bit trickier because Country-Year pairs don’t form unique pairs. So we will use a different example for this.)
df = pd.DataFrame(
    {
        "Group": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "Condition": [
            "Treatment",
            "Treatment",
            "Control",
            "Control",
            "Treatment",
            "Treatment",
            "Control",
            "Control",
        ],
        "Replicate": [1, 2, 1, 2, 1, 2, 1, 2],
        "Result": [53.6, 57.2, 66.3, 61.4, 48.6, 49.2, 63.5, 67.8],
    }
)
df| Group | Condition | Replicate | Result | |
|---|---|---|---|---|
| 0 | A | Treatment | 1 | 53.6 | 
| 1 | A | Treatment | 2 | 57.2 | 
| 2 | A | Control | 1 | 66.3 | 
| 3 | A | Control | 2 | 61.4 | 
| 4 | B | Treatment | 1 | 48.6 | 
| 5 | B | Treatment | 2 | 49.2 | 
| 6 | B | Control | 1 | 63.5 | 
| 7 | B | Control | 2 | 67.8 | 
Now, we use pivot() to convert it back to the so-called “messy” format:
df_messy = df.pivot(
    index=["Group", "Replicate"],
    columns="Condition",
    values="Result",
).reset_index()
df_messy.columns.name = None
df_messy| Group | Replicate | Control | Treatment | |
|---|---|---|---|---|
| 0 | A | 1 | 66.3 | 53.6 | 
| 1 | A | 2 | 61.4 | 57.2 | 
| 2 | B | 1 | 63.5 | 48.6 | 
| 3 | B | 2 | 67.8 | 49.2 | 
To go back to the tidy format:
df_tidy = df_messy.melt(
    id_vars=["Group", "Replicate"],
    var_name="Condition",
    value_name="Result",
)
df_tidy| Group | Replicate | Condition | Result | |
|---|---|---|---|---|
| 0 | A | 1 | Control | 66.3 | 
| 1 | A | 2 | Control | 61.4 | 
| 2 | B | 1 | Control | 63.5 | 
| 3 | B | 2 | Control | 67.8 | 
| 4 | A | 1 | Treatment | 53.6 | 
| 5 | A | 2 | Treatment | 57.2 | 
| 6 | B | 1 | Treatment | 48.6 | 
| 7 | B | 2 | Treatment | 49.2 | 
Check it out:
a = (
    df.sort_values(by=["Group", "Condition", "Replicate"])
    .sort_index(axis="columns")
    .reset_index(drop=True)
)
display(a)
b = (
    df_tidy.sort_values(by=["Group", "Condition", "Replicate"])
    .sort_index(axis="columns")
    .reset_index(drop=True)
)
display(b)
assert a.equals(b)| Condition | Group | Replicate | Result | |
|---|---|---|---|---|
| 0 | Control | A | 1 | 66.3 | 
| 1 | Control | A | 2 | 61.4 | 
| 2 | Treatment | A | 1 | 53.6 | 
| 3 | Treatment | A | 2 | 57.2 | 
| 4 | Control | B | 1 | 63.5 | 
| 5 | Control | B | 2 | 67.8 | 
| 6 | Treatment | B | 1 | 48.6 | 
| 7 | Treatment | B | 2 | 49.2 | 
| Condition | Group | Replicate | Result | |
|---|---|---|---|---|
| 0 | Control | A | 1 | 66.3 | 
| 1 | Control | A | 2 | 61.4 | 
| 2 | Treatment | A | 1 | 53.6 | 
| 3 | Treatment | A | 2 | 57.2 | 
| 4 | Control | B | 1 | 63.5 | 
| 5 | Control | B | 2 | 67.8 | 
| 6 | Treatment | B | 1 | 48.6 | 
| 7 | Treatment | B | 2 | 49.2 | 
Note: we use display() here rather than print to get nice looking tables when we need to “print” multiple items in a single codeblock.
This is a bit of a trickier topic, but I still wanted to give you an idea of how it works for when you run into it in your own research.
Making New Columns
There are a couple of ways to make new columns in a data frame.
The first way is with assign() which creates a new data frame with the requested column added to it.
Let’s say that you wanted to make a score summary that is a linear combination of a few of the score columns.
coffee.assign(
    Score=lambda df: 1.5 * df["Score.Aroma"]
    + 2 * df["Score.Flavor"]
    + 0.75 * df["Score.Aftertaste"]
)| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | Score | |
|---|---|---|---|---|---|---|---|
| 0 | Colombia | 2012 | 7.83 | 8.00 | 7.75 | 7.75 | 33.5575 | 
| 1 | Colombia | 2012 | 7.75 | 7.92 | 7.83 | 7.75 | 33.3375 | 
| 2 | Colombia | 2012 | 7.67 | 7.83 | 7.83 | 8.00 | 33.0375 | 
| 3 | Colombia | 2012 | 7.42 | 7.67 | 7.75 | 7.75 | 32.2825 | 
| 4 | Colombia | 2012 | 7.75 | 7.75 | 7.75 | 7.75 | 32.9375 | 
| ... | ... | ... | ... | ... | ... | ... | ... | 
| 410 | Guatemala | 2015 | 7.50 | 7.25 | 7.00 | 7.25 | 31.0000 | 
| 411 | Guatemala | 2015 | 7.50 | 7.33 | 7.00 | 7.42 | 31.1600 | 
| 412 | Guatemala | 2015 | 7.25 | 7.17 | 6.75 | 7.25 | 30.2775 | 
| 413 | Mexico | 2015 | 7.75 | 7.67 | 7.42 | 7.42 | 32.5300 | 
| 414 | Mexico | 2015 | 7.58 | 7.50 | 7.33 | 7.50 | 31.8675 | 
415 rows × 7 columns
If you check the original data frame, you will see that the Score column that we just created is not there:
try:
    coffee["Score"]
except KeyError:
    print("no Score column!")no Score column!If you want to actually save that column in your original data frame, you need to use the bracket notation again.
coffee["Score"] = (
    1.5 * coffee["Score.Aroma"]
    + 2 * coffee["Score.Flavor"]
    + 0.75 * coffee["Score.Aftertaste"]
)
coffee| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | Score | |
|---|---|---|---|---|---|---|---|
| 0 | Colombia | 2012 | 7.83 | 8.00 | 7.75 | 7.75 | 33.5575 | 
| 1 | Colombia | 2012 | 7.75 | 7.92 | 7.83 | 7.75 | 33.3375 | 
| 2 | Colombia | 2012 | 7.67 | 7.83 | 7.83 | 8.00 | 33.0375 | 
| 3 | Colombia | 2012 | 7.42 | 7.67 | 7.75 | 7.75 | 32.2825 | 
| 4 | Colombia | 2012 | 7.75 | 7.75 | 7.75 | 7.75 | 32.9375 | 
| ... | ... | ... | ... | ... | ... | ... | ... | 
| 410 | Guatemala | 2015 | 7.50 | 7.25 | 7.00 | 7.25 | 31.0000 | 
| 411 | Guatemala | 2015 | 7.50 | 7.33 | 7.00 | 7.42 | 31.1600 | 
| 412 | Guatemala | 2015 | 7.25 | 7.17 | 6.75 | 7.25 | 30.2775 | 
| 413 | Mexico | 2015 | 7.75 | 7.67 | 7.42 | 7.42 | 32.5300 | 
| 414 | Mexico | 2015 | 7.58 | 7.50 | 7.33 | 7.50 | 31.8675 | 
415 rows × 7 columns
And now the Score column is part of the coffee data frame. Let’s drop it out so it doesn’t stick around for the rest of the chapter.
coffee = coffee.drop(columns="Score")Grouping Data
Grouping and aggregating data is a very common operation. For example, you might want to run aggregation functions on groups of data like, what’s the mean Flavor score for countries in this data set?
coffee.groupby("Country").agg({"Score.Flavor": "mean"})
# coffee.groupby("Country").agg({"Score.Aroma": ["mean", "min", "mean"]})| Score.Flavor | |
|---|---|
| Country | |
| Colombia | 7.602347 | 
| Guatemala | 7.461803 | 
| Mexico | 7.374513 | 
You can run the aggregations on multiple columns at once. Here we take the mean flavor and the sum of the aroma. (Yes, that’s a weird thing to do, it’s just here to show you that you don’t have to run the same function for each column.)
coffee.groupby("Country").agg({"Score.Flavor": "mean", "Score.Aroma": "sum"})| Score.Flavor | Score.Aroma | |
|---|---|---|
| Country | ||
| Colombia | 7.602347 | 752.53 | 
| Guatemala | 7.461803 | 920.60 | 
| Mexico | 7.374513 | 1454.48 | 
You can even run multiple aggregation functions on each column:
coffee.groupby("Country").agg(
    {"Score.Flavor": ["mean", "min", "max"], "Score.Aroma": "sum"}
)| Score.Flavor | Score.Aroma | |||
|---|---|---|---|---|
| mean | min | max | sum | |
| Country | ||||
| Colombia | 7.602347 | 7.25 | 8.00 | 752.53 | 
| Guatemala | 7.461803 | 6.08 | 8.17 | 920.60 | 
| Mexico | 7.374513 | 6.42 | 8.25 | 1454.48 | 
In many cases you need to group based on multiple columns. In this dataset, it makes sense to group by Country-Year pairs:
coffee.groupby(["Country", "Year"]).agg(
    {"Score.Flavor": ["mean", "std"], "Score.Aroma": ["mean", "std"]}
)| Score.Flavor | Score.Aroma | ||||
|---|---|---|---|---|---|
| mean | std | mean | std | ||
| Country | Year | ||||
| Colombia | 2012 | 7.621667 | 0.166900 | 7.637333 | 0.124677 | 
| 2013 | 7.546579 | 0.184028 | 7.646053 | 0.170493 | |
| 2014 | 7.617333 | 0.128478 | 7.728667 | 0.142672 | |
| 2015 | 7.690000 | 0.115388 | 7.795333 | 0.213972 | |
| Guatemala | 2012 | 7.401364 | 0.501006 | 7.508182 | 0.363237 | 
| ... | ... | ... | ... | ... | |
| 2015 | 7.595714 | 0.230508 | 7.625000 | 0.172571 | |
| Mexico | 2012 | 7.370741 | 0.324318 | 7.457284 | 0.275460 | 
| 2013 | 7.369500 | 0.236609 | 7.486500 | 0.161156 | |
| 2014 | 7.400909 | 0.234668 | 7.394545 | 0.239097 | |
| 2015 | 7.585000 | 0.120208 | 7.665000 | 0.120208 | |
12 rows × 4 columns
That would be the mean and standard deviation of flavor and aroma for all country-year pairs.
Combining Data Sets
Another useful bit of functionality is merging datasets that have overlapping columns. These is a lot like doing joins in a relational database. While Pandas does have a join() function, we are mainly going to be using merge() instead, as it doesn’t require that the tables have meaningful row indices.
Let’s say we have a data frame that includes useful information about the countries for which we have coffee data. (Okay, we’re not using necessarily “useful” data in this example, but work with me a bit.)
country_capitals = pd.DataFrame(
    {
        "Country": ["Colombia", "Guatemala", "Mexico"],
        "Capital": ["Bogotá", "Guatemala City", "Mexico City"],
    }
)
country_capitals| Country | Capital | |
|---|---|---|
| 0 | Colombia | Bogotá | 
| 1 | Guatemala | Guatemala City | 
| 2 | Mexico | Mexico City | 
Okay, now let’s say we wanted that info in our coffee data frame. We can join them using merge().
coffee.merge(country_capitals)| Country | Year | Score.Aroma | Score.Flavor | Score.Aftertaste | Score.Acidity | Capital | |
|---|---|---|---|---|---|---|---|
| 0 | Colombia | 2012 | 7.83 | 8.00 | 7.75 | 7.75 | Bogotá | 
| 1 | Colombia | 2012 | 7.75 | 7.92 | 7.83 | 7.75 | Bogotá | 
| 2 | Colombia | 2012 | 7.67 | 7.83 | 7.83 | 8.00 | Bogotá | 
| 3 | Colombia | 2012 | 7.42 | 7.67 | 7.75 | 7.75 | Bogotá | 
| 4 | Colombia | 2012 | 7.75 | 7.75 | 7.75 | 7.75 | Bogotá | 
| ... | ... | ... | ... | ... | ... | ... | ... | 
| 410 | Guatemala | 2015 | 7.50 | 7.25 | 7.00 | 7.25 | Guatemala City | 
| 411 | Guatemala | 2015 | 7.50 | 7.33 | 7.00 | 7.42 | Guatemala City | 
| 412 | Guatemala | 2015 | 7.25 | 7.17 | 6.75 | 7.25 | Guatemala City | 
| 413 | Mexico | 2015 | 7.75 | 7.67 | 7.42 | 7.42 | Mexico City | 
| 414 | Mexico | 2015 | 7.58 | 7.50 | 7.33 | 7.50 | Mexico City | 
415 rows × 7 columns
There is actually a lot to the merge() function which we won’t get into too much here, but one important thing is how missing data is handled between data frames. Let’s make some more data to look at this.
First, a data frame for jobs:
jobs = pd.DataFrame(
    {
        "Name": ["Rahaf", "Nanjin", "Lujain", "Lovisa"],
        "Job": ["Chef", "Courier", "Engineer", "Bookmaker"],
    }
)
jobs| Name | Job | |
|---|---|---|
| 0 | Rahaf | Chef | 
| 1 | Nanjin | Courier | 
| 2 | Lujain | Engineer | 
| 3 | Lovisa | Bookmaker | 
And another one for ages:
ages = pd.DataFrame(
    {
        "Name": ["Rahaf", "Nanjin", "Lujain", "Lovisa"],
        "Age": ["47", "26", "31", "61"],
    }
)
ages| Name | Age | |
|---|---|---|
| 0 | Rahaf | 47 | 
| 1 | Nanjin | 26 | 
| 2 | Lujain | 31 | 
| 3 | Lovisa | 61 | 
To demonstrate the joins, we will need to take subsets of the data frames in which some of the rows overlap and some of them don’t:
jobs_subset = jobs.head(3)
ages_subset = ages.tail(3)
display(jobs_subset)
display(ages_subset)| Name | Job | |
|---|---|---|
| 0 | Rahaf | Chef | 
| 1 | Nanjin | Courier | 
| 2 | Lujain | Engineer | 
| Name | Age | |
|---|---|---|
| 1 | Nanjin | 26 | 
| 2 | Lujain | 31 | 
| 3 | Lovisa | 61 | 
Inner Join
The inner join keeps rows that only belong to both sets of data. (This is like the inner join in SQL.)
jobs_subset.merge(ages_subset, how="inner")| Name | Job | Age | |
|---|---|---|---|
| 0 | Nanjin | Courier | 26 | 
| 1 | Lujain | Engineer | 31 | 
Outer Join
The outer join keeps all the rows, even if some of the rows are only present in one data frame or the other. Missing values will be NaN by default. (This is like the full outer join in SQL.)
jobs_subset.merge(ages_subset, how="outer")| Name | Job | Age | |
|---|---|---|---|
| 0 | Lovisa | NaN | 61 | 
| 1 | Lujain | Engineer | 31 | 
| 2 | Nanjin | Courier | 26 | 
| 3 | Rahaf | Chef | NaN | 
Left Join
The left join keeps all the rows in the left data frame, even if they are not present in the right data frame. Missing values will be NaN by default. (This is like the left outer join in SQL.)
jobs_subset.merge(ages_subset, how="left")| Name | Job | Age | |
|---|---|---|---|
| 0 | Rahaf | Chef | NaN | 
| 1 | Nanjin | Courier | 26 | 
| 2 | Lujain | Engineer | 31 | 
Right Join
The right join keeps all the rows in the right data frame, even if they are not present in the left data frame. Missing values will be NaN by default. (This is like the right outer join in SQL.)
jobs_subset.merge(ages_subset, how="right")| Name | Job | Age | |
|---|---|---|---|
| 0 | Nanjin | Courier | 26 | 
| 1 | Lujain | Engineer | 31 | 
| 2 | Lovisa | NaN | 61 | 
When merging datasets, what factors should you consider when choosing between inner, outer, left, and right joins?
Pandas Summary
Pandas is a huge and fairly complex library. But you can of get a lot of real work done by getting comfortable with a fairly small subset, and then working your way through more advanced concepts over time.
7.7 Wrap-Up
In this chapter, we went over the basics of exploratory data analysis (EDA) in Python using the Pandas library. We went through three examples datasets together letting our curiosity guide us. Then, we went through a big tour of the Pandas library for you to use as a reference to some of the most frequently used Pandas functions. Pandas is a big library with a lot of different functions and different ways to do things. This chapter should give you enough Pandas skills to start using it in your own research, and give you the basics you need to get more in-depth with the library later in your career.
7.8 Suggested Readings
Pandas is a massive library. These resources can help get a handle on it:
7.9 Practice Problems
Consider the following data when answering the following problems.
state_cancer_data = {
    "State": [
        "Delaware",
        "Maryland",
        "Virginia",
        "Pennsylvania",
        "New York",
        "New Jersey",
    ],
    "Cancer Deaths": [
        13_000,
        72_000,
        99_000,
        202_000,
        249_000,
        117_000,
    ],
    "Population": [
        6_300_000,
        40_500_000,
        56_100_000,
        88_800_000,
        135_700_000,
        61_500_000,
    ],
    "Percent Aged 65+": [
        19,
        16,
        16,
        19,
        17,
        17,
    ],
    "Median Household Income": [
        68_000,
        85_000,
        74_000,
        62_000,
        69_000,
        83_000,
    ],
}7.1
Create a Pandas DataFrame to represent the state cancer and demographic data given above.
7.2
Create a new column called Cancer Deaths Per 100k that represents the number of cancer cases per 100,000 people for each state.
7.3
Which states have a cancer rate of at least 180 cases per 100k and a median household income less than $68,500?
7.4
What is the relationship between median household income and the rate of cancer deaths?
7.5
What is the relationship between the percent of the population aged 65 and older and the rate of cancer deaths?
