Data management and exploration in pandas
This section provides a brief introduction to pandas. The pandas library is a key component for doing data science in Python for several reasons. Most importantly, it provides two data types, the series and the data frame, that allow you to store and manipulate data in a way that is useful for analysis. Second, it is incredibly useful for importing and exporting data in a wide variety of formats. Finally, it makes descriptive analysis, including both summary statistics and visualizations, straightforward. This section covers the main capabilities of pandas relevant to data analysis.
Most of the things that you will want to do in Python require importing libraries. By convention, pandas is imported as `pd`. Additionally, we enable the ability for pandas graphics to be displayed within the notebook with `%matplotlib inline`.
import pandas as pd
%matplotlib inline
Reading data
In the summer of 2017, the Washington Post produced a report on murder clearance rates in U.S. cities. They also released the data they collected on GitHub as a csv file. We can create a new dataframe, called `df`, using the pandas `read_csv` method.
df = pd.read_csv('data/homicide.csv')
df
uid | reported_date | victim_last | victim_first | victim_race | victim_age | victim_sex | city | state | lat | lon | disposition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alb-000001 | 20100504 | GARCIA | JUAN | Hispanic | 78.0 | Male | Albuquerque | NM | 35.095788 | -106.538555 | Closed without arrest |
1 | Alb-000002 | 20100216 | MONTOYA | CAMERON | Hispanic | 17.0 | Male | Albuquerque | NM | 35.056810 | -106.715321 | Closed by arrest |
2 | Alb-000003 | 20100601 | SATTERFIELD | VIVIANA | White | 15.0 | Female | Albuquerque | NM | 35.086092 | -106.695568 | Closed without arrest |
3 | Alb-000004 | 20100101 | MENDIOLA | CARLOS | Hispanic | 32.0 | Male | Albuquerque | NM | 35.078493 | -106.556094 | Closed by arrest |
4 | Alb-000005 | 20100102 | MULA | VIVIAN | White | 72.0 | Female | Albuquerque | NM | 35.130357 | -106.580986 | Closed without arrest |
5 | Alb-000006 | 20100126 | BOOK | GERALDINE | White | 91.0 | Female | Albuquerque | NM | 35.151110 | -106.537797 | Open/No arrest |
6 | Alb-000007 | 20100127 | MALDONADO | DAVID | Hispanic | 52.0 | Male | Albuquerque | NM | 35.111785 | -106.712614 | Closed by arrest |
7 | Alb-000008 | 20100127 | MALDONADO | CONNIE | Hispanic | 52.0 | Female | Albuquerque | NM | 35.111785 | -106.712614 | Closed by arrest |
8 | Alb-000009 | 20100130 | MARTIN-LEYVA | GUSTAVO | White | 56.0 | Male | Albuquerque | NM | 35.075380 | -106.553458 | Open/No arrest |
9 | Alb-000010 | 20100210 | HERRERA | ISRAEL | Hispanic | 43.0 | Male | Albuquerque | NM | 35.065930 | -106.572288 | Open/No arrest |
10 | Alb-000011 | 20100212 | BARRIUS-CAMPANIONI | HECTOR | Hispanic | 20.0 | Male | Albuquerque | NM | 35.077375 | -106.560569 | Closed by arrest |
11 | Alb-000012 | 20100218 | LUJAN | KEVIN | White | NaN | Male | Albuquerque | NM | 35.077011 | -106.564910 | Closed without arrest |
12 | Alb-000013 | 20100222 | COLLAMORE | JOHN | Hispanic | 46.0 | Male | Albuquerque | NM | 35.064076 | -106.608281 | Open/No arrest |
13 | Alb-000014 | 20100306 | CHIQUITO | CORIN | Other | 16.0 | Male | Albuquerque | NM | 35.041491 | -106.742147 | Closed by arrest |
14 | Alb-000015 | 20100308 | TORRES | HECTOR | Hispanic | 54.0 | Male | Albuquerque | NM | 35.070656 | -106.615845 | Closed by arrest |
15 | Alb-000016 | 20100308 | GRAY | STEFANIA | White | 43.0 | Female | Albuquerque | NM | 35.070656 | -106.615845 | Closed by arrest |
16 | Alb-000017 | 20100322 | LEYVA | JOEL | Hispanic | 52.0 | Male | Albuquerque | NM | 35.079082 | -106.495949 | Closed by arrest |
17 | Alb-000018 | 20100323 | DAVID | LARRY | White | 52.0 | Male | Albuquerque | NM | NaN | NaN | Closed by arrest |
18 | Alb-000019 | 20100402 | BRITO | ELIZABETH | White | 22.0 | Female | Albuquerque | NM | 35.110404 | -106.523668 | Closed by arrest |
19 | Alb-000020 | 20100420 | CHAVEZ | GREG SR. | Hispanic | 49.0 | Male | Albuquerque | NM | 35.089573 | -106.570078 | Closed by arrest |
20 | Alb-000021 | 20100423 | KING | TEVION | Black | 15.0 | Male | Albuquerque | NM | 35.095887 | -106.638081 | Closed by arrest |
21 | Alb-000022 | 20100423 | BOYKIN | CEDRIC | Black | 25.0 | Male | Albuquerque | NM | 35.095887 | -106.638081 | Closed by arrest |
22 | Alb-000023 | 20100518 | BARRAGAN | MIGUEL | White | 20.0 | Male | Albuquerque | NM | 35.083401 | -106.632941 | Closed by arrest |
23 | Alb-000024 | 20100602 | FORD | LUTHER | Other | 47.0 | Male | Albuquerque | NM | 35.102423 | -106.567674 | Closed by arrest |
24 | Alb-000025 | 20100602 | WRONSKI | VIOLA | White | 88.0 | Female | Albuquerque | NM | 35.095956 | -106.561135 | Closed without arrest |
25 | Alb-000026 | 20100603 | ASHFORD | GUADALUPE | Hispanic | 27.0 | Female | Albuquerque | NM | 35.078462 | -106.551755 | Closed by arrest |
26 | Alb-000027 | 20100712 | TURNER | MICHELLE | White | 36.0 | Female | Albuquerque | NM | 35.053748 | -106.531800 | Closed without arrest |
27 | Alb-000028 | 20100712 | CUNNINGHAM | SHARON | White | 47.0 | Female | Albuquerque | NM | 35.053748 | -106.531800 | Closed without arrest |
28 | Alb-000029 | 20100726 | NGUYEN | SELENAVI | Asian | 1.0 | Female | Albuquerque | NM | 35.064456 | -106.524458 | Closed without arrest |
29 | Alb-000030 | 20100728 | VALDEZ | BILL | Hispanic | 58.0 | Male | Albuquerque | NM | 35.061827 | -106.749104 | Open/No arrest |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
52149 | Was-001353 | 20160324 | TURNER | GABRIEL | Black | 46.0 | Male | Washington | DC | 38.857628 | -76.996482 | Closed by arrest |
52150 | Was-001355 | 20160518 | MERIEDY | THOMAS | Black | 24.0 | Male | Washington | DC | 38.862517 | -76.990880 | Open/No arrest |
52151 | Was-001356 | 20160906 | COOK | JOE | Black | 35.0 | Male | Washington | DC | 38.854721 | -76.990795 | Closed by arrest |
52152 | Was-001357 | 20160917 | PHILLIPS | SCORPIO | Black | 31.0 | Male | Washington | DC | 38.877003 | -76.995627 | Open/No arrest |
52153 | Was-001358 | 20160917 | HARRIS | ZORUAN | Black | 18.0 | Male | Washington | DC | 38.877003 | -76.995627 | Open/No arrest |
52154 | Was-001359 | 20161108 | PRATT | ANTINA | Black | 40.0 | Female | Washington | DC | 38.855699 | -76.993248 | Open/No arrest |
52155 | Was-001360 | 20161127 | SILVER | ORLANDO | Black | 37.0 | Male | Washington | DC | 38.860054 | -76.991825 | Closed by arrest |
52156 | Was-001361 | 20160304 | REZENE | NOEL | Black | 26.0 | Male | Washington | DC | 38.849150 | -76.971272 | Closed by arrest |
52157 | Was-001362 | 20160511 | DRAYTON | JAVION | Black | 3.0 | Male | Washington | DC | 38.853961 | -76.984712 | Closed without arrest |
52158 | Was-001363 | 20160702 | MCCULLOUGH | ANTOINE | Black | 30.0 | Male | Washington | DC | 38.843740 | -76.977839 | Closed by arrest |
52159 | Was-001365 | 20161111 | BUIE | SAMUELLE | Black | 25.0 | Male | Washington | DC | 38.851284 | -76.981029 | Open/No arrest |
52160 | Was-001366 | 20161111 | JENNINGS | RASSAAN | Black | 19.0 | Male | Washington | DC | 38.851284 | -76.981029 | Open/No arrest |
52161 | Was-001367 | 20161122 | TYLER | MARQUETT | Black | 26.0 | Male | Washington | DC | 38.849967 | -76.974421 | Open/No arrest |
52162 | Was-001368 | 20160302 | GARRIS | RUDOLPH | Black | 25.0 | Male | Washington | DC | 38.827191 | -76.998183 | Closed by arrest |
52163 | Was-001369 | 20160317 | DANSBURY | AUBREY | Black | 27.0 | Male | Washington | DC | 38.829442 | -76.993175 | Closed by arrest |
52164 | Was-001370 | 20160510 | HANEY | JAQUAN | Black | 22.0 | Male | Washington | DC | 38.833943 | -76.990742 | Open/No arrest |
52165 | Was-001371 | 20160519 | HAMILTON | DANA | Black | 44.0 | Male | Washington | DC | 38.828365 | -76.994431 | Closed by arrest |
52166 | Was-001372 | 20160609 | WRIGHT | RASHAWN | Black | 25.0 | Male | Washington | DC | 38.835947 | -76.990438 | Open/No arrest |
52167 | Was-001373 | 20160626 | BLACKWELL | WESTLEY | Black | 38.0 | Male | Washington | DC | 38.834756 | -76.992662 | Open/No arrest |
52168 | Was-001374 | 20161227 | DOWTIN | HERBERT | Black | 22.0 | Male | Washington | DC | 38.833190 | -76.994345 | Closed by arrest |
52169 | Was-001375 | 20160224 | MEDLAY | DEMETRIUS | Black | 22.0 | Male | Washington | DC | 38.843399 | -77.000104 | Closed by arrest |
52170 | Was-001376 | 20160731 | STEPHENS | JUDONNE | Black | 25.0 | Male | Washington | DC | 38.863322 | -76.995309 | Open/No arrest |
52171 | Was-001377 | 20160916 | PEOPLES | DARNELL | Black | 35.0 | Male | Washington | DC | 38.845871 | -76.998169 | Closed by arrest |
52172 | Was-001378 | 20160415 | IVEY | PAUL | Black | 37.0 | Male | Washington | DC | 38.826458 | -77.003590 | Closed by arrest |
52173 | Was-001379 | 20160715 | HARRIS | SHAROD | Black | 20.0 | Male | Washington | DC | 38.827266 | -77.001572 | Open/No arrest |
52174 | Was-001380 | 20160908 | WILLIAMS | EVAN | Black | 29.0 | Male | Washington | DC | 38.828704 | -77.002075 | Closed by arrest |
52175 | Was-001381 | 20160913 | SMITH | DEON | Black | 19.0 | Male | Washington | DC | 38.822852 | -77.001725 | Open/No arrest |
52176 | Was-001382 | 20161114 | WASHINGTON | WILLIE | Black | 23.0 | Male | Washington | DC | 38.828025 | -77.002511 | Open/No arrest |
52177 | Was-001383 | 20161130 | BARNES | MARCUS | Black | 24.0 | Male | Washington | DC | 38.820476 | -77.008640 | Open/No arrest |
52178 | Was-001384 | 20160901 | JACKSON | KEVIN | Black | 17.0 | Male | Washington | DC | 38.866689 | -76.982409 | Closed by arrest |
52179 rows × 12 columns
If you have the URL of a csv file, you can load it directly:
csv_url = 'https://raw.githubusercontent.com/nealcaren/data/master/sets/homicide.csv'
df = pd.read_csv(csv_url)
Learning about your dataframe
After loading a dataframe, best practice is to get a sense of the data with the `head`, `info` and `describe` methods. `head` shows the first five rows of the dataframe.
df.head()
uid | reported_date | victim_last | victim_first | victim_race | victim_age | victim_sex | city | state | lat | lon | disposition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alb-000001 | 20100504 | GARCIA | JUAN | Hispanic | 78.0 | Male | Albuquerque | NM | 35.095788 | -106.538555 | Closed without arrest |
1 | Alb-000002 | 20100216 | MONTOYA | CAMERON | Hispanic | 17.0 | Male | Albuquerque | NM | 35.056810 | -106.715321 | Closed by arrest |
2 | Alb-000003 | 20100601 | SATTERFIELD | VIVIANA | White | 15.0 | Female | Albuquerque | NM | 35.086092 | -106.695568 | Closed without arrest |
3 | Alb-000004 | 20100101 | MENDIOLA | CARLOS | Hispanic | 32.0 | Male | Albuquerque | NM | 35.078493 | -106.556094 | Closed by arrest |
4 | Alb-000005 | 20100102 | MULA | VIVIAN | White | 72.0 | Female | Albuquerque | NM | 35.130357 | -106.580986 | Closed without arrest |
In addition to the data in the csv file, an index has been created to identify each row. By default, this is an integer starting with 0.
If the dataset is wide, middle columns will not be displayed. Also, if text fields are long, only the first few characters will be shown. These can both be adjusted using pandas display settings.
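For example, the column and text-width limits can be raised with `pd.set_option` (the specific values below are arbitrary choices for illustration, not required defaults):

```python
import pandas as pd

# Show up to 40 columns instead of the default 20,
# and up to 200 characters per text cell instead of 50.
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_colwidth', 200)
```

Calling `pd.reset_option('display.max_columns')` restores the default.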
`info` can be used to explore the data types and the number of non-missing cases for each variable.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52179 entries, 0 to 52178
Data columns (total 12 columns):
uid 52179 non-null object
reported_date 52179 non-null int64
victim_last 52179 non-null object
victim_first 52179 non-null object
victim_race 52179 non-null object
victim_age 49180 non-null float64
victim_sex 52179 non-null object
city 52179 non-null object
state 52179 non-null object
lat 52119 non-null float64
lon 52119 non-null float64
disposition 52179 non-null object
dtypes: float64(3), int64(1), object(8)
memory usage: 4.8+ MB
`describe` provides summary statistics for all the numeric variables.
df.describe()
reported_date | victim_age | lat | lon | |
---|---|---|---|---|
count | 5.217900e+04 | 49180.000000 | 52119.000000 | 52119.000000 |
mean | 2.013090e+07 | 31.801220 | 37.026786 | -91.471094 |
std | 1.123420e+06 | 14.418692 | 4.348647 | 13.746378 |
min | 2.007010e+07 | 0.000000 | 25.725214 | -122.507779 |
25% | 2.010032e+07 | 22.000000 | 33.765203 | -95.997198 |
50% | 2.012122e+07 | 28.000000 | 38.524973 | -87.710286 |
75% | 2.015091e+07 | 40.000000 | 40.027627 | -81.755909 |
max | 2.015111e+08 | 102.000000 | 45.051190 | -71.011519 |
The column headers can be extracted using `keys`.
df.keys()
Index(['uid', 'reported_date', 'victim_last', 'victim_first', 'victim_race',
'victim_age', 'victim_sex', 'city', 'state', 'lat', 'lon',
'disposition'],
dtype='object')
If you want to look at the bottom of the dataframe, you can use `tail`. Both `head` and `tail` allow you to change the number of rows displayed from the default five.
df.tail(3)
uid | reported_date | victim_last | victim_first | victim_race | victim_age | victim_sex | city | state | lat | lon | disposition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
52176 | Was-001382 | 20161114 | WASHINGTON | WILLIE | Black | 23.0 | Male | Washington | DC | 38.828025 | -77.002511 | Open/No arrest |
52177 | Was-001383 | 20161130 | BARNES | MARCUS | Black | 24.0 | Male | Washington | DC | 38.820476 | -77.008640 | Open/No arrest |
52178 | Was-001384 | 20160901 | JACKSON | KEVIN | Black | 17.0 | Male | Washington | DC | 38.866689 | -76.982409 | Closed by arrest |
`sample` displays random rows from the dataframe.
df.sample(5)
uid | reported_date | victim_last | victim_first | victim_race | victim_age | victim_sex | city | state | lat | lon | disposition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
28043 | Las-001175 | 20170121 | SANCHEZ | ALBERTO | Hispanic | 24.0 | Male | Las Vegas | NV | 36.189362 | -115.061763 | Closed by arrest |
13424 | Cin-700043 | 20170728 | ROBERTSON | RICKEY | Black | 58.0 | Male | Cincinnati | OH | 39.140126 | -84.476951 | Open/No arrest |
13253 | Cin-000947 | 20140531 | GASSETT | JOSHUA | Black | 24.0 | Male | Cincinnati | OH | 39.162160 | -84.382659 | Closed by arrest |
51164 | Was-000343 | 20130907 | SWEET | KENNETH | Black | 30.0 | Male | Washington | DC | 38.877011 | -77.002498 | Open/No arrest |
16190 | Den-000087 | 20120318 | NEWSON | BILLY | Black | 29.0 | Male | Denver | CO | 39.740021 | -104.979913 | Closed by arrest |
Your turn
Display the first four rows of the dataframe `df`.
Sample answer code
df.head(4)
Working with variables
Individual columns, which pandas calls series, are selected using bracket notation with the column name. The summary methods that work on a dataframe also work on a single series.
df.head(2)
df['victim_age'].describe()
df['victim_age'].value_counts()
As this has many values, pandas only displays the top and bottom 30 cases. The `values` method can be used to produce an array containing all the values in order.
Your turn
Explore the `disposition` and `victim_race` columns in the dataframe.
Sample answer code
df['disposition'].value_counts()
df['victim_race'].value_counts()
df[['disposition', 'victim_race']].describe()
All of the values of a specific variable can be extracted.
ages = df['victim_age'].values
len(ages)
first_age = ages[0]
print(first_age)
Your turn
Display seven values from the middle of our age variable.
Sample answer code
ages[1201:1208]
Your turn
A well-known data set is the list of Titanic passengers. A version can be found in the data folder, called "titanic.csv". Open the file as a new dataframe, `titanic_df`. How many cases? How many columns? What can you find out about the data?
Sample answer code
titanic_df = pd.read_csv('data/titanic.csv')
titanic_df.describe()
titanic_df.info()
titanic_df.sample(5)
Plots (optional)
pandas also has plotting capabilities, such as histograms (`hist`) and a correlation matrix (`scatter_matrix`).
%matplotlib inline
df['victim_age'].hist()
Plots of individual variables, or series in pandas terminology, are attributes of the data type. That is, you start with the thing you want plotted, in this case `df['victim_age']`, and append what you want to do, such as `.hist()`.
A second type of plot, such as the scatter plot, is a method of the dataframe.
df.plot.scatter(x='lon', y='lat')
You can look at the other dataframe plotting methods on the helpful pandas visualizations page. Alternatively, pressing tab after typing `df.plot.` also reveals your options.
Want to know more about `hexbin`? Again, the help page on the web is useful, but appending a question mark to the end of the command will bring up the documentation.
<img src="images/docstring.png" width = "80%" align="left"/>
A third group of plots are part of the pandas plotting library. In these cases, the thing you want plotted is the first, or only, parameter passed, as is the case with the correlation matrix.
pd.plotting.scatter_matrix(df)
Finally, you can also create subplots using the `by` option. Note that `by` accepts a series, or dataframe column, rather than a column name.
df['victim_age'].hist(by = df['victim_sex'],
bins = 20)
By default, `by` produces separate x and y scales for each subgraph. This is why there appears to be a relatively large number of deaths of very young females. The numbers for men and women at this age are comparable, but the very large number of male deaths in their 20s results in very different x scales for the graphs. This behavior can be changed with the `sharex` and `sharey` options.
df['victim_age'].hist(by = df['victim_sex'],
bins = 20,
sharex = True,
sharey = True)
Other descriptives
Pandas also has a method for producing crosstabs.
pd.crosstab(df['victim_race'], df['disposition'])
Note that since this is a pandas method, and not one of a specific dataframe, you need to be explicit about which dataframe each variable is coming from. That is why the first parameter is not `'victim_race'` but `df['victim_race']`.
`normalize` can be used to display percentages instead of frequencies. A value of `index` normalizes by row, `columns` by column, and `all` by all values.
pd.crosstab(df['victim_race'],
df['disposition'],
normalize='index')
Since this returns a dataframe, it can be saved or plotted.
cross_tab = pd.crosstab(df['victim_race'],
df['disposition'],
normalize='index')
cross_tab
cross_tab.to_csv('data/crosstab.csv')
Your turn
In your titanic dataframe, run a crosstab between sex and survived. Anything interesting?
Sample answer code
pd.crosstab(titanic_df['Survived'],
            titanic_df['Sex'], normalize='index')
In order to highlight meaningful characteristics of the data, you can sort before plotting.
cross_tab.sort_values(by='Closed by arrest')
cross_tab.sort_values(by='Closed by arrest').plot(kind = 'barh',
title = 'Disposition by race')
Subsets
Similar to a list, a dataframe or series can be sliced to subset the data being shown. For example, `df[:2]` will return the first two rows of the dataframe. (This is identical to `df.head(2)`.)
df[:2]
df.head(2)
This also works for specific columns.
df['reported_date'][:3]
Dates (optional)
A new variable can be created from `reported_date` that pandas understands is a date variable using the `to_datetime` method. The format is `%Y%m%d` because the original date is in the "YYYYMMDD" format, and `errors='coerce'` places missing values where the data cannot be converted, rather than stopping the variable creation completely.
df['reported_date'].head()
df['date'] = pd.to_datetime(df['reported_date'],
format='%Y%m%d',
errors='coerce')
df['date'][:3]
From the new series, we can extract specific elements, such as the year.
df['year'] = df['date'].dt.year
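The `.dt` accessor exposes other elements too, such as the month or day. A small sketch with made-up date strings (the malformed third entry is purely illustrative) also shows what `errors='coerce'` does:

```python
import pandas as pd

# Made-up date strings; the last one is deliberately malformed.
raw = pd.Series(['20100504', '20100216', '2010abcd'])
dates = pd.to_datetime(raw, format='%Y%m%d', errors='coerce')
print(dates)           # the malformed entry becomes NaT (not a time)
print(dates.dt.year)   # its year is therefore NaN
print(dates.dt.month)  # months of the valid entries: 5 and 2
```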
As before, `value_counts` and plots can give some sense of the distribution of the values.
df['year'].value_counts()
Value counts returns a pandas series with an index equal to the original values, in this case the year, and the series values based on the frequency. Since years have an order, it makes sense to sort by the index before plotting them.
df['year'].value_counts().sort_index(ascending = False).plot(kind='barh')
`crosstab` can also group based on more than one variable for the x or y axis. In that case, you pass a list rather than a single variable or series. To make this clearer, you can create the lists before creating the crosstab.
y_vars = [df['state'], df['city']]
x_vars = df['year']
pd.crosstab(y_vars, x_vars)
Crosstab returns a dataframe with the column and index names from the values in the original dataset. Since a list was passed, the dataframe has a `MultiIndex`. This can be useful for cases where you have nested data, like cities within states or annual data on multiple countries.
pd.crosstab(y_vars, x_vars).index.names
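As a sketch of why the `MultiIndex` is useful, the tiny made-up dataframe below (illustrative data, not the homicide file) shows how `.loc` with an outer-level value pulls out all the rows nested under it:

```python
import pandas as pd

# Made-up data: two cities nested within two states.
demo = pd.DataFrame({'state': ['NM', 'NM', 'DC', 'DC'],
                     'city':  ['Albuquerque', 'Albuquerque',
                               'Washington', 'Washington'],
                     'year':  [2016, 2017, 2016, 2016]})
ct = pd.crosstab([demo['state'], demo['city']], demo['year'])
# Selecting with the outer level ('state') returns all cities in that state.
print(ct.loc['NM'])
```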
Index
df.head()
By default, the index is a sequence of integers starting with 0. If your data has unique identifiers, however, it is helpful to use them as the index, especially if you intend on merging your data with other data sources. In this dataframe, each row has a unique value for `uid`.
df.set_index('uid', inplace=True)
df[:5]
Your turn
In your Titanic dataframe, set the index to the `PassengerId` column. Confirm that it did what you wanted it to do.
Sample answer code
titanic_df.set_index('PassengerId', inplace=True)
titanic_df.sample(3)
Subsetting
You can view a subset of a dataframe based on the value of a column.
Let's say that you wanted to look at the cases where the victim's first name was "Juan". You could create a new series which is either `True` or `False` for each case.
df['victim_first'] == 'JUAN'
You could store this new true/false series. If you placed this in brackets after the name of the dataframe, pandas would display only the rows with a True value.
is_juan = df['victim_first'] == 'JUAN'
df[is_juan]
More commonly, the two statements are combined.
df[df['victim_first'] == 'JUAN']
With this method of subsetting, pandas doesn't return a new dataframe, but rather is just hiding some of the rows. So if you want to create a new dataframe based on this subset, you need to append `copy()` to the end.
new_df = df[df['victim_first'] == 'JUAN'].copy()
new_df.head()
As this selection method returns a dataframe, it can be stored. The following creates two dataframes, one with just the 2017 cases and one with just the 2016 cases.
df_2017 = df[df['year'] == 2017].copy()
df_2016 = df[df['year'] == 2016].copy()
df_2017['year'].value_counts()
df_2016['year'].value_counts()
`value_counts` confirms that the correct cases were grabbed.
Alternatively, you may want to limit your dataset by column. In this case, you create a list of the columns you want. This list is also placed in brackets after the dataframe name.
Your turn
Create a new dataframe with just the female passengers. Check your work.
Sample answer code
mask = titanic_df['Sex'] == 'female'
titanic_df_f = titanic_df[mask].copy()
titanic_df_f['Sex'].value_counts()
More subsets
columns_to_keep = ['victim_last', 'victim_first', 'victim_race', 'victim_age', 'victim_sex']
df[columns_to_keep]
As before, you can use `copy` to create a new dataframe.
victim_df = df[columns_to_keep].copy()
victim_df.head()
As with the row selection, you don’t need to store the column names in a list first. By convention, these two steps are combined. However, combining the steps does create an awkward pair of double brackets.
place_df = df[['city', 'state', 'lat', 'lon']].copy()
place_df.head()
Merging
There are several different ways to combine datasets. The most straightforward is to merge two different datasets that share a key in common. To merge `place_df` with `victim_df`, for example, you can use the dataframe `merge` method.
merged_df = place_df.merge(victim_df, left_index=True, right_index=True)
merged_df.head()
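Here the two dataframes are merged on their shared index. When the key lives in a column instead, `merge` can join on it with the `on` parameter. A small sketch with made-up data:

```python
import pandas as pd

# Two made-up tables that share a 'uid' column.
victims = pd.DataFrame({'uid': ['a1', 'a2'],
                        'victim_age': [78, 17]})
places = pd.DataFrame({'uid': ['a1', 'a2'],
                       'city': ['Albuquerque', 'Albuquerque']})

# Rows are matched wherever the 'uid' values agree.
combined = victims.merge(places, on='uid')
print(combined)
```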
Stacking dataframes
If you have multiple dataframes in which each row represents the same kind of object, such as passengers on a ship or murder victims, you can stack them using `pd.concat`. Stacking is most effective when column names overlap between the datasets. Below, I create a new dataframe by stacking together victims from 2016 and 2017.
df_2016 = df[df['year'] == 2016]
len(df_2016)
recent_df = pd.concat([df_2017, df_2016])
len(recent_df)
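A small sketch with made-up frames shows what happens to the index when stacking: `concat` keeps the original row labels unless you pass `ignore_index=True`.

```python
import pandas as pd

# Two made-up frames with the same column.
a = pd.DataFrame({'year': [2016, 2016]}, index=[0, 1])
b = pd.DataFrame({'year': [2017]}, index=[0])

stacked = pd.concat([a, b])
print(len(stacked))            # 3 -- rows from both frames
print(stacked.index.tolist())  # [0, 1, 0] -- original labels are kept

# ignore_index=True renumbers the rows from 0.
restacked = pd.concat([a, b], ignore_index=True)
print(restacked.index.tolist())  # [0, 1, 2]
```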
New features
New columns in a dataframe can be created in the same manner that new objects are created in Python.
df['birth_year'] = df['year'] - df['victim_age']
df['birth_year'].describe()
df['minor'] = df['victim_age'] <= 18
df['minor'][:10]
df['minor'].mean()
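The last line works because pandas treats `True` as 1 and `False` as 0, so the mean of a boolean column is the proportion of `True` values. A minimal illustration with made-up ages:

```python
import pandas as pd

ages = pd.Series([15, 40, 17, 30])  # made-up ages
minor = ages <= 18
print(minor.mean())  # 0.5 -- half of these cases are minors
```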
Your turn
Create a new variable in your Titanic dataframe which marks the people who paid a fare in the top 25% of all fares paid.
Sample answer code
titanic_df['Fare'].describe()
threshold = titanic_df['Fare'].quantile(.75)  # top 25% means above the 75th percentile
titanic_df['Top_25_Fare'] = titanic_df['Fare'] > threshold
titanic_df['Top_25_Fare'].value_counts()
Back to some pandas string manipulation fun.
As discussed in the first notebook, Python programming involves creating functions, such as one to do a simple string manipulation.
def title_case(text):
return text.title()
title_case('JUAN')
The apply magic
You can `apply` a function to a dataframe column in order to transform each value. The results can also be stored as a new feature.
df['victim_first'].apply(title_case)
df['victim_first2'] = df['victim_first'].apply(title_case)
df['victim_first2'].sample(4)
df[['victim_first', 'victim_first2']].sample(5)
Your turn
Write a function that extracts the last name from the name field on your Titanic dataframe.
Create a new variable called `Family_Name` to store the results. What is the most common family name?
Sample answer code
def make_name(name):
    family_name = name.split(', ')[0]
    return family_name

titanic_df['Family_Name'] = titanic_df['Name'].apply(make_name)
titanic_df['Family_Name'].value_counts().head()
Working on more than one column (optional)
If you want a function to work with more than one column, you apply the function to the dataframe. The function will evaluate each row, so you have to specify in the function which columns you want to use.
def victim_name(row):
first_name = row['victim_first']
last_name = row['victim_last']
name = last_name + ', ' + first_name
name = title_case(name)
return name
df.apply(victim_name, axis=1)
Note that we include `axis=1` because we want the function to be applied to each row. The default is to apply the function to each column.
df['victim_name'] = df.apply(victim_name, axis=1)
cols_2_show = ['victim_first', 'victim_last', 'victim_name']
df[cols_2_show].sample(5)
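To see the row-wise behavior concretely, here is a small sketch with a made-up dataframe (the `demo` frame and `full_name` function are illustrative, not part of the homicide data):

```python
import pandas as pd

demo = pd.DataFrame({'first': ['JUAN', 'VIOLA'],
                     'last':  ['GARCIA', 'WRONSKI']})  # made-up names

def full_name(row):
    # With axis=1, `row` is a single row of the dataframe,
    # so individual columns are pulled out by name.
    return (row['last'] + ', ' + row['first']).title()

print(demo.apply(full_name, axis=1))
```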
Your turn
Write a function that tags all Titanic passengers who are under 18 and who are traveling without parents (`Parch == 0`). Use the function to create a new variable.
Sample answer code
def unac_minor(row):
    minor = row['Age'] < 18
    no_parents = row['Parch'] == 0
    if minor and no_parents:
        return True
    return False

titanic_df['Unaccomp_Minor'] = titanic_df.apply(unac_minor, axis=1)
titanic_df['Unaccomp_Minor'].value_counts()
Congratulations! You’ve now been introduced to the basics of data management in pandas for social scientists.