Data management and exploration in pandas
This section provides a brief introduction to pandas. The pandas library is a key component for doing data science in Python for several reasons. Most importantly, it provides two data types, the series and the data frame, that allow you to store and manipulate data in a way that is useful for analysis. Second, it is incredibly useful for importing and exporting data in a wide variety of formats. Finally, it makes descriptive analysis, including both summary statistics and visualizations, straightforward. This section covers the main capabilities of pandas relevant to data analysis.
Most of the things that you will want to do in Python require importing libraries. By convention, pandas is imported as `pd`. Additionally, we enable the ability for pandas graphics to be displayed within the notebook with `%matplotlib inline`.
import pandas as pd
%matplotlib inline
Reading data
In the summer of 2017, the Washington Post produced a report on murder clearance rates in U.S. cities. They also released the data they collected on GitHub as a csv file. We can create a new dataframe, called `df`, using the pandas `read_csv` method.
df = pd.read_csv('data/homicide.csv')
df
uid | reported_date | victim_last | victim_first | victim_race | victim_age | victim_sex | city | state | lat | lon | disposition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alb-000001 | 20100504 | GARCIA | JUAN | Hispanic | 78.0 | Male | Albuquerque | NM | 35.095788 | -106.538555 | Closed without arrest |
1 | Alb-000002 | 20100216 | MONTOYA | CAMERON | Hispanic | 17.0 | Male | Albuquerque | NM | 35.056810 | -106.715321 | Closed by arrest |
2 | Alb-000003 | 20100601 | SATTERFIELD | VIVIANA | White | 15.0 | Female | Albuquerque | NM | 35.086092 | -106.695568 | Closed without arrest |
3 | Alb-000004 | 20100101 | MENDIOLA | CARLOS | Hispanic | 32.0 | Male | Albuquerque | NM | 35.078493 | -106.556094 | Closed by arrest |
4 | Alb-000005 | 20100102 | MULA | VIVIAN | White | 72.0 | Female | Albuquerque | NM | 35.130357 | -106.580986 | Closed without arrest |
5 | Alb-000006 | 20100126 | BOOK | GERALDINE | White | 91.0 | Female | Albuquerque | NM | 35.151110 | -106.537797 | Open/No arrest |
6 | Alb-000007 | 20100127 | MALDONADO | DAVID | Hispanic | 52.0 | Male | Albuquerque | NM | 35.111785 | -106.712614 | Closed by arrest |
7 | Alb-000008 | 20100127 | MALDONADO | CONNIE | Hispanic | 52.0 | Female | Albuquerque | NM | 35.111785 | -106.712614 | Closed by arrest |
8 | Alb-000009 | 20100130 | MARTIN-LEYVA | GUSTAVO | White | 56.0 | Male | Albuquerque | NM | 35.075380 | -106.553458 | Open/No arrest |
9 | Alb-000010 | 20100210 | HERRERA | ISRAEL | Hispanic | 43.0 | Male | Albuquerque | NM | 35.065930 | -106.572288 | Open/No arrest |
10 | Alb-000011 | 20100212 | BARRIUS-CAMPANIONI | HECTOR | Hispanic | 20.0 | Male | Albuquerque | NM | 35.077375 | -106.560569 | Closed by arrest |
11 | Alb-000012 | 20100218 | LUJAN | KEVIN | White | NaN | Male | Albuquerque | NM | 35.077011 | -106.564910 | Closed without arrest |
12 | Alb-000013 | 20100222 | COLLAMORE | JOHN | Hispanic | 46.0 | Male | Albuquerque | NM | 35.064076 | -106.608281 | Open/No arrest |
13 | Alb-000014 | 20100306 | CHIQUITO | CORIN | Other | 16.0 | Male | Albuquerque | NM | 35.041491 | -106.742147 | Closed by arrest |
14 | Alb-000015 | 20100308 | TORRES | HECTOR | Hispanic | 54.0 | Male | Albuquerque | NM | 35.070656 | -106.615845 | Closed by arrest |
15 | Alb-000016 | 20100308 | GRAY | STEFANIA | White | 43.0 | Female | Albuquerque | NM | 35.070656 | -106.615845 | Closed by arrest |
16 | Alb-000017 | 20100322 | LEYVA | JOEL | Hispanic | 52.0 | Male | Albuquerque | NM | 35.079082 | -106.495949 | Closed by arrest |
17 | Alb-000018 | 20100323 | DAVID | LARRY | White | 52.0 | Male | Albuquerque | NM | NaN | NaN | Closed by arrest |
18 | Alb-000019 | 20100402 | BRITO | ELIZABETH | White | 22.0 | Female | Albuquerque | NM | 35.110404 | -106.523668 | Closed by arrest |
19 | Alb-000020 | 20100420 | CHAVEZ | GREG SR. | Hispanic | 49.0 | Male | Albuquerque | NM | 35.089573 | -106.570078 | Closed by arrest |
20 | Alb-000021 | 20100423 | KING | TEVION | Black | 15.0 | Male | Albuquerque | NM | 35.095887 | -106.638081 | Closed by arrest |
21 | Alb-000022 | 20100423 | BOYKIN | CEDRIC | Black | 25.0 | Male | Albuquerque | NM | 35.095887 | -106.638081 | Closed by arrest |
22 | Alb-000023 | 20100518 | BARRAGAN | MIGUEL | White | 20.0 | Male | Albuquerque | NM | 35.083401 | -106.632941 | Closed by arrest |
23 | Alb-000024 | 20100602 | FORD | LUTHER | Other | 47.0 | Male | Albuquerque | NM | 35.102423 | -106.567674 | Closed by arrest |
24 | Alb-000025 | 20100602 | WRONSKI | VIOLA | White | 88.0 | Female | Albuquerque | NM | 35.095956 | -106.561135 | Closed without arrest |
25 | Alb-000026 | 20100603 | ASHFORD | GUADALUPE | Hispanic | 27.0 | Female | Albuquerque | NM | 35.078462 | -106.551755 | Closed by arrest |
26 | Alb-000027 | 20100712 | TURNER | MICHELLE | White | 36.0 | Female | Albuquerque | NM | 35.053748 | -106.531800 | Closed without arrest |
27 | Alb-000028 | 20100712 | CUNNINGHAM | SHARON | White | 47.0 | Female | Albuquerque | NM | 35.053748 | -106.531800 | Closed without arrest |
28 | Alb-000029 | 20100726 | NGUYEN | SELENAVI | Asian | 1.0 | Female | Albuquerque | NM | 35.064456 | -106.524458 | Closed without arrest |
29 | Alb-000030 | 20100728 | VALDEZ | BILL | Hispanic | 58.0 | Male | Albuquerque | NM | 35.061827 | -106.749104 | Open/No arrest |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
52149 | Was-001353 | 20160324 | TURNER | GABRIEL | Black | 46.0 | Male | Washington | DC | 38.857628 | -76.996482 | Closed by arrest |
52150 | Was-001355 | 20160518 | MERIEDY | THOMAS | Black | 24.0 | Male | Washington | DC | 38.862517 | -76.990880 | Open/No arrest |
52151 | Was-001356 | 20160906 | COOK | JOE | Black | 35.0 | Male | Washington | DC | 38.854721 | -76.990795 | Closed by arrest |
52152 | Was-001357 | 20160917 | PHILLIPS | SCORPIO | Black | 31.0 | Male | Washington | DC | 38.877003 | -76.995627 | Open/No arrest |
52153 | Was-001358 | 20160917 | HARRIS | ZORUAN | Black | 18.0 | Male | Washington | DC | 38.877003 | -76.995627 | Open/No arrest |
52154 | Was-001359 | 20161108 | PRATT | ANTINA | Black | 40.0 | Female | Washington | DC | 38.855699 | -76.993248 | Open/No arrest |
52155 | Was-001360 | 20161127 | SILVER | ORLANDO | Black | 37.0 | Male | Washington | DC | 38.860054 | -76.991825 | Closed by arrest |
52156 | Was-001361 | 20160304 | REZENE | NOEL | Black | 26.0 | Male | Washington | DC | 38.849150 | -76.971272 | Closed by arrest |
52157 | Was-001362 | 20160511 | DRAYTON | JAVION | Black | 3.0 | Male | Washington | DC | 38.853961 | -76.984712 | Closed without arrest |
52158 | Was-001363 | 20160702 | MCCULLOUGH | ANTOINE | Black | 30.0 | Male | Washington | DC | 38.843740 | -76.977839 | Closed by arrest |
52159 | Was-001365 | 20161111 | BUIE | SAMUELLE | Black | 25.0 | Male | Washington | DC | 38.851284 | -76.981029 | Open/No arrest |
52160 | Was-001366 | 20161111 | JENNINGS | RASSAAN | Black | 19.0 | Male | Washington | DC | 38.851284 | -76.981029 | Open/No arrest |
52161 | Was-001367 | 20161122 | TYLER | MARQUETT | Black | 26.0 | Male | Washington | DC | 38.849967 | -76.974421 | Open/No arrest |
52162 | Was-001368 | 20160302 | GARRIS | RUDOLPH | Black | 25.0 | Male | Washington | DC | 38.827191 | -76.998183 | Closed by arrest |
52163 | Was-001369 | 20160317 | DANSBURY | AUBREY | Black | 27.0 | Male | Washington | DC | 38.829442 | -76.993175 | Closed by arrest |
52164 | Was-001370 | 20160510 | HANEY | JAQUAN | Black | 22.0 | Male | Washington | DC | 38.833943 | -76.990742 | Open/No arrest |
52165 | Was-001371 | 20160519 | HAMILTON | DANA | Black | 44.0 | Male | Washington | DC | 38.828365 | -76.994431 | Closed by arrest |
52166 | Was-001372 | 20160609 | WRIGHT | RASHAWN | Black | 25.0 | Male | Washington | DC | 38.835947 | -76.990438 | Open/No arrest |
52167 | Was-001373 | 20160626 | BLACKWELL | WESTLEY | Black | 38.0 | Male | Washington | DC | 38.834756 | -76.992662 | Open/No arrest |
52168 | Was-001374 | 20161227 | DOWTIN | HERBERT | Black | 22.0 | Male | Washington | DC | 38.833190 | -76.994345 | Closed by arrest |
52169 | Was-001375 | 20160224 | MEDLAY | DEMETRIUS | Black | 22.0 | Male | Washington | DC | 38.843399 | -77.000104 | Closed by arrest |
52170 | Was-001376 | 20160731 | STEPHENS | JUDONNE | Black | 25.0 | Male | Washington | DC | 38.863322 | -76.995309 | Open/No arrest |
52171 | Was-001377 | 20160916 | PEOPLES | DARNELL | Black | 35.0 | Male | Washington | DC | 38.845871 | -76.998169 | Closed by arrest |
52172 | Was-001378 | 20160415 | IVEY | PAUL | Black | 37.0 | Male | Washington | DC | 38.826458 | -77.003590 | Closed by arrest |
52173 | Was-001379 | 20160715 | HARRIS | SHAROD | Black | 20.0 | Male | Washington | DC | 38.827266 | -77.001572 | Open/No arrest |
52174 | Was-001380 | 20160908 | WILLIAMS | EVAN | Black | 29.0 | Male | Washington | DC | 38.828704 | -77.002075 | Closed by arrest |
52175 | Was-001381 | 20160913 | SMITH | DEON | Black | 19.0 | Male | Washington | DC | 38.822852 | -77.001725 | Open/No arrest |
52176 | Was-001382 | 20161114 | WASHINGTON | WILLIE | Black | 23.0 | Male | Washington | DC | 38.828025 | -77.002511 | Open/No arrest |
52177 | Was-001383 | 20161130 | BARNES | MARCUS | Black | 24.0 | Male | Washington | DC | 38.820476 | -77.008640 | Open/No arrest |
52178 | Was-001384 | 20160901 | JACKSON | KEVIN | Black | 17.0 | Male | Washington | DC | 38.866689 | -76.982409 | Closed by arrest |
52179 rows × 12 columns
If you have the URL of a csv file, you can load it directly:
csv_url = 'https://raw.githubusercontent.com/nealcaren/data/master/sets/homicide.csv'
df = pd.read_csv(csv_url)
Learning about your dataframe
After loading a dataframe, best practice is to get a sense of the data with the `head`, `info` and `describe` methods. `head` shows the first five rows of the dataframe.
df.head()
uid | reported_date | victim_last | victim_first | victim_race | victim_age | victim_sex | city | state | lat | lon | disposition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alb-000001 | 20100504 | GARCIA | JUAN | Hispanic | 78.0 | Male | Albuquerque | NM | 35.095788 | -106.538555 | Closed without arrest |
1 | Alb-000002 | 20100216 | MONTOYA | CAMERON | Hispanic | 17.0 | Male | Albuquerque | NM | 35.056810 | -106.715321 | Closed by arrest |
2 | Alb-000003 | 20100601 | SATTERFIELD | VIVIANA | White | 15.0 | Female | Albuquerque | NM | 35.086092 | -106.695568 | Closed without arrest |
3 | Alb-000004 | 20100101 | MENDIOLA | CARLOS | Hispanic | 32.0 | Male | Albuquerque | NM | 35.078493 | -106.556094 | Closed by arrest |
4 | Alb-000005 | 20100102 | MULA | VIVIAN | White | 72.0 | Female | Albuquerque | NM | 35.130357 | -106.580986 | Closed without arrest |
In addition to the data in the csv file, an index has been created to identify each row. By default, this is an integer starting with 0.
If the dataset is wide, middle columns will not be displayed. Also, if text fields are long, only the first few characters will be shown. These can both be adjusted using pandas display settings.
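For example, the column and text-width limits can be raised with `pd.set_option` (the specific values below are arbitrary choices for illustration, not required defaults):

```python
import pandas as pd

# Show up to 40 columns instead of the default 20,
# and up to 200 characters per text cell instead of 50.
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_colwidth', 200)
```

Calling `pd.reset_option('display.max_columns')` restores the default.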
`info` can be used to explore the data types and the number of non-missing cases for each variable.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52179 entries, 0 to 52178
Data columns (total 12 columns):
uid 52179 non-null object
reported_date 52179 non-null int64
victim_last 52179 non-null object
victim_first 52179 non-null object
victim_race 52179 non-null object
victim_age 49180 non-null float64
victim_sex 52179 non-null object
city 52179 non-null object
state 52179 non-null object
lat 52119 non-null float64
lon 52119 non-null float64
disposition 52179 non-null object
dtypes: float64(3), int64(1), object(8)
memory usage: 4.8+ MB
`describe` provides summary statistics for all the numeric variables.
df.describe()
reported_date | victim_age | lat | lon | |
---|---|---|---|---|
count | 5.217900e+04 | 49180.000000 | 52119.000000 | 52119.000000 |
mean | 2.013090e+07 | 31.801220 | 37.026786 | -91.471094 |
std | 1.123420e+06 | 14.418692 | 4.348647 | 13.746378 |
min | 2.007010e+07 | 0.000000 | 25.725214 | -122.507779 |
25% | 2.010032e+07 | 22.000000 | 33.765203 | -95.997198 |
50% | 2.012122e+07 | 28.000000 | 38.524973 | -87.710286 |
75% | 2.015091e+07 | 40.000000 | 40.027627 | -81.755909 |
max | 2.015111e+08 | 102.000000 | 45.051190 | -71.011519 |
The column headers can be extracted using `keys`.
df.keys()
Index(['uid', 'reported_date', 'victim_last', 'victim_first', 'victim_race',
'victim_age', 'victim_sex', 'city', 'state', 'lat', 'lon',
'disposition'],
dtype='object')
If you want to look at the bottom of the dataframe, you can use `tail`. Both `head` and `tail` allow you to change the number of rows displayed from the default five.
df.tail(3)
uid | reported_date | victim_last | victim_first | victim_race | victim_age | victim_sex | city | state | lat | lon | disposition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
52176 | Was-001382 | 20161114 | WASHINGTON | WILLIE | Black | 23.0 | Male | Washington | DC | 38.828025 | -77.002511 | Open/No arrest |
52177 | Was-001383 | 20161130 | BARNES | MARCUS | Black | 24.0 | Male | Washington | DC | 38.820476 | -77.008640 | Open/No arrest |
52178 | Was-001384 | 20160901 | JACKSON | KEVIN | Black | 17.0 | Male | Washington | DC | 38.866689 | -76.982409 | Closed by arrest |
`sample` displays random rows from the dataframe.
df.sample(5)
uid | reported_date | victim_last | victim_first | victim_race | victim_age | victim_sex | city | state | lat | lon | disposition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
28043 | Las-001175 | 20170121 | SANCHEZ | ALBERTO | Hispanic | 24.0 | Male | Las Vegas | NV | 36.189362 | -115.061763 | Closed by arrest |
13424 | Cin-700043 | 20170728 | ROBERTSON | RICKEY | Black | 58.0 | Male | Cincinnati | OH | 39.140126 | -84.476951 | Open/No arrest |
13253 | Cin-000947 | 20140531 | GASSETT | JOSHUA | Black | 24.0 | Male | Cincinnati | OH | 39.162160 | -84.382659 | Closed by arrest |
51164 | Was-000343 | 20130907 | SWEET | KENNETH | Black | 30.0 | Male | Washington | DC | 38.877011 | -77.002498 | Open/No arrest |
16190 | Den-000087 | 20120318 | NEWSON | BILLY | Black | 29.0 | Male | Denver | CO | 39.740021 | -104.979913 | Closed by arrest |
Your turn
Display the first four rows of the dataframe `df`.
Sample answer code
df.head(4)
Working with variables
Individual columns, which pandas calls series, are selected using bracket notation with the column name. The summary methods that work on a dataframe also work on a single series.
df.head(2)
df['victim_age'].describe()
df['victim_age'].value_counts()
As this has many values, pandas only displays the top and bottom 30 cases. The `values` method can be used to produce an array containing all the values in order.
Your turn
Explore the `disposition` and `victim_race` columns in the dataframe.
Sample answer code
df['disposition'].value_counts()
df['victim_race'].value_counts()
df[['disposition', 'victim_race']].describe()
All of the values of a specific variable can be extracted.
ages = df['victim_age'].values
len(ages)
first_age = ages[0]
print(first_age)
Your turn
Display seven values from the middle of our age variable.
Sample answer code
ages[1201:1208]
Your turn
A well-known data set is the list of Titanic passengers. A version can be found in the data folder, called "titanic.csv". Open the file as a new dataframe, `titanic_df`. How many cases? How many columns? What can you find out about the data?
Sample answer code
titanic_df = pd.read_csv('data/titanic.csv')
titanic_df.describe()
titanic_df.info()
titanic_df.sample(5)
Plots (optional)
pandas also has plotting capabilities, such as histograms (`hist`) and a correlation matrix (`scatter_matrix`).
%matplotlib inline
df['victim_age'].hist()
Plots of individual variables, or series in pandas terminology, are attributes of the data type. That is, you start with the thing you want plotted, in this case `df['victim_age']`, and append what you want to do, such as `.hist()`.
A second type of plot, such as the scatter plot, is a method of the dataframe.
df.plot.scatter(x='lon', y='lat')
You can look at the other dataframe plotting methods on the helpful pandas visualizations page. Alternatively, pressing tab after typing `df.plot.` also reveals your options.
Want to know more about `hexbin`? Again, the help page on the web is useful, but appending a question mark to the end of the command will bring up the documentation.
<img src="images/docstring.png" width = "80%" align="left"/>
A third group of plots are part of the pandas plotting library. In these cases, the thing you want plotted is the first, or only, parameter passed, as is the case with the correlation matrix.
pd.plotting.scatter_matrix(df)
Finally, you can also create subplots using the `by` option. Note that `by` accepts a series, or dataframe column, rather than a column name.
df['victim_age'].hist(by = df['victim_sex'],
bins = 20)
By default, `by` produces separate x and y scales for each subgraph. This is why there appears to be a relatively large number of deaths of very young females. The numbers for men and women at this age are comparable, but the very large number of male deaths in their 20s results in very different x scales for the graphs. This behavior can be changed with the `sharex` and `sharey` options.
df['victim_age'].hist(by = df['victim_sex'],
bins = 20,
sharex = True,
sharey = True)
Other descriptives
Pandas also has a method for producing crosstabs.
pd.crosstab(df['victim_race'], df['disposition'])
Note that since this is a pandas method, and not one of a specific dataframe, you need to be explicit about which dataframe each variable is coming from. That is why the first parameter is not `'victim_race'` but `df['victim_race']`.
`normalize` can be used to display percentages instead of frequencies. A value of `index` normalizes by row, `columns` by column, and `all` by all values.
pd.crosstab(df['victim_race'],
df['disposition'],
normalize='index')
Since this returns a dataframe, it can be saved or plotted.
cross_tab = pd.crosstab(df['victim_race'],
df['disposition'],
normalize='index')
cross_tab
cross_tab.to_csv('data/crosstab.csv')
Your turn
In your titanic dataframe, run a crosstab between sex and survived. Anything interesting?
Sample answer code
pd.crosstab(titanic_df['Survived'],
            titanic_df['Sex'], normalize='index')
In order to highlight meaningful characteristics of the data, you can sort before plotting.
cross_tab.sort_values(by='Closed by arrest')
cross_tab.sort_values(by='Closed by arrest').plot(kind = 'barh',
title = 'Disposition by race')
Subsets
Similar to a list, a dataframe or series can be sliced to subset the data being shown. For example, `df[:2]` will return the first two rows of the dataframe. (This is identical to `df.head(2)`.)
df[:2]
df.head(2)
This also works for specific columns.
df['reported_date'][:3]
Dates (optional)
A new variable can be created from `reported_date` that pandas understands is a date variable using the `to_datetime` method. The format is `%Y%m%d` because the original date is in the "YYYYMMDD" format, and `errors='coerce'` places missing values where the data cannot be converted, rather than stopping the variable creation completely.
df['reported_date'].head()
df['date'] = pd.to_datetime(df['reported_date'],
format='%Y%m%d',
errors='coerce')
df['date'][:3]
From the new series, we can extract specific elements, such as the year.
df['year'] = df['date'].dt.year
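The `.dt` accessor exposes other elements too, such as the month or day. A small sketch with made-up date strings (the malformed third entry is purely illustrative) also shows what `errors='coerce'` does:

```python
import pandas as pd

# Made-up date strings; the last one is deliberately malformed.
raw = pd.Series(['20100504', '20100216', '2010abcd'])
dates = pd.to_datetime(raw, format='%Y%m%d', errors='coerce')
print(dates)           # the malformed entry becomes NaT (not a time)
print(dates.dt.year)   # its year is therefore NaN
print(dates.dt.month)  # months of the valid entries: 5 and 2
```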
As before, `value_counts` and plots can give some sense of the distribution of the values.
df['year'].value_counts()
Value counts returns a pandas series with an index equal to the original values, in this case the year, and the series values based on the frequency. Since years have an order, it makes sense to sort by the index before plotting them.
df['year'].value_counts().sort_index(ascending = False).plot(kind='barh')
`crosstab` can also group based on more than one variable for the x or y axis. In that case, you pass a list rather than a single variable or series. To make this clearer, you can create the lists before creating the crosstab.
y_vars = [df['state'], df['city']]
x_vars = df['year']
pd.crosstab(y_vars, x_vars)
Crosstab returns a dataframe with the column and index names from the values in the original dataset. Since a list was passed, the dataframe has a `MultiIndex`. This can be useful for cases where you have nested data, like cities within states or annual data on multiple countries.
pd.crosstab(y_vars, x_vars).index.names
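As a sketch of why the `MultiIndex` is useful, the tiny made-up dataframe below (illustrative data, not the homicide file) shows how `.loc` with an outer-level value pulls out all the rows nested under it:

```python
import pandas as pd

# Made-up data: two cities nested within two states.
demo = pd.DataFrame({'state': ['NM', 'NM', 'DC', 'DC'],
                     'city':  ['Albuquerque', 'Albuquerque',
                               'Washington', 'Washington'],
                     'year':  [2016, 2017, 2016, 2016]})
ct = pd.crosstab([demo['state'], demo['city']], demo['year'])
# Selecting with the outer level ('state') returns all cities in that state.
print(ct.loc['NM'])
```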
Index
df.head()
By default, the index is a sequence of integers starting with 0. If your data has unique identifiers, however, it is helpful to use them as the index, especially if you intend on merging your data with other data sources. In this dataframe, each row has a unique value for `uid`.
df.set_index('uid', inplace=True)
df[:5]
Your turn
In your Titanic dataframe, set the index to the `PassengerId` column. Confirm that it did what you wanted it to do.
Sample answer code
titanic_df.set_index('PassengerId', inplace=True)
titanic_df.sample(3)
Subsetting
You can view a subset of a dataframe based on the value of a column.
Let's say that you wanted to look at the cases where the victim's first name was "Juan". You could create a new series which is either `True` or `False` for each case.
df['victim_first'] == 'JUAN'
You could store this new true/false series. If you placed this in brackets after the name of the dataframe, pandas would display only the rows with a True value.
is_juan = df['victim_first'] == 'JUAN'
df[is_juan]
More commonly, the two statements are combined.
df[df['victim_first'] == 'JUAN']
With this method of subsetting, pandas doesn't return a new dataframe, but rather is just hiding some of the rows. So if you want to create a new dataframe based on this subset, you need to append `copy()` to the end.
new_df = df[df['victim_first'] == 'JUAN'].copy()
new_df.head()
As this selection method returns a dataframe, it can be stored. The following creates two dataframes, one with just the 2017 cases and one with just the 2016 cases.
df_2017 = df[df['year'] == 2017].copy()
df_2016 = df[df['year'] == 2016].copy()
df_2017['year'].value_counts()
df_2016['year'].value_counts()
`value_counts` confirms that the correct cases were grabbed.
Alternatively, you may want to limit your dataset by column. In this case, you create a list of the columns you want. This list is also placed in brackets after the dataframe name.
Your turn
Create a new dataframe with just the female passengers. Check your work.
Sample answer code
mask = titanic_df['Sex'] == 'female'
titanic_df_f = titanic_df[mask].copy()
titanic_df_f['Sex'].value_counts()
More subsets
columns_to_keep = ['victim_last', 'victim_first', 'victim_race', 'victim_age', 'victim_sex']
df[columns_to_keep]
As before, you can use `copy` to create a new dataframe.
victim_df = df[columns_to_keep].copy()
victim_df.head()
As with the row selection, you don’t need to store the column names in a list first. By convention, these two steps are combined. However, combining the steps does create an awkward pair of double brackets.
place_df = df[['city', 'state', 'lat', 'lon']].copy()
place_df.head()
Merging
There are several different ways to combine datasets. The most straightforward is to merge two different datasets that share a key in common. To merge `place_df` with `victim_df`, for example, you can use the dataframe `merge` method.
merged_df = place_df.merge(victim_df, left_index=True, right_index=True)
merged_df.head()
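Here the two dataframes are merged on their shared index. When the key lives in a column instead, `merge` can join on it with the `on` parameter. A small sketch with made-up data:

```python
import pandas as pd

# Two made-up tables that share a 'uid' column.
victims = pd.DataFrame({'uid': ['a1', 'a2'],
                        'victim_age': [78, 17]})
places = pd.DataFrame({'uid': ['a1', 'a2'],
                       'city': ['Albuquerque', 'Albuquerque']})

# Rows are matched wherever the 'uid' values agree.
combined = victims.merge(places, on='uid')
print(combined)
```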
Stacking dataframes
If you have multiple dataframes in which each row represents the same kind of object, such as passengers on a ship or murder victims, you can stack them using `pd.concat`. Stacking is most effective when column names overlap between the datasets. Below, I create a new dataframe by stacking together victims from 2016 and 2017.
df_2016 = df[df['year'] == 2016]
len(df_2016)
recent_df = pd.concat([df_2017, df_2016])
len(recent_df)
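A small sketch with made-up frames shows what happens to the index when stacking: `concat` keeps the original row labels unless you pass `ignore_index=True`.

```python
import pandas as pd

# Two made-up frames with the same column.
a = pd.DataFrame({'year': [2016, 2016]}, index=[0, 1])
b = pd.DataFrame({'year': [2017]}, index=[0])

stacked = pd.concat([a, b])
print(len(stacked))            # 3 -- rows from both frames
print(stacked.index.tolist())  # [0, 1, 0] -- original labels are kept

# ignore_index=True renumbers the rows from 0.
restacked = pd.concat([a, b], ignore_index=True)
print(restacked.index.tolist())  # [0, 1, 2]
```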
New features
New columns in a dataframe can be created in the same manner that new objects are created in Python.
df['birth_year'] = df['year'] - df['victim_age']
df['birth_year'].describe()
df['minor'] = df['victim_age'] <= 18
df['minor'][:10]
df['minor'].mean()
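The last line works because pandas treats `True` as 1 and `False` as 0, so the mean of a boolean column is the proportion of `True` values. A minimal illustration with made-up ages:

```python
import pandas as pd

ages = pd.Series([15, 40, 17, 30])  # made-up ages
minor = ages <= 18
print(minor.mean())  # 0.5 -- half of these cases are minors
```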
Your turn
Create a new variable in your Titanic dataframe which marks the people who paid a fare in the top 25% of all fares paid.
Sample answer code
titanic_df['Fare'].describe()
threshold = titanic_df['Fare'].quantile(.75)  # top 25% means above the 75th percentile
titanic_df['Top_25_Fare'] = titanic_df['Fare'] > threshold
titanic_df['Top_25_Fare'].value_counts()
Back to some pandas string manipulation fun.
As discussed in the first notebook, Python programming involves creating functions, such as one to do a simple string manipulation.
def title_case(text):
return text.title()
title_case('JUAN')
The apply magic
You can `apply` a function to a dataframe column in order to transform each value. The results can also be stored as a new feature.
df['victim_first'].apply(title_case)
df['victim_first2'] = df['victim_first'].apply(title_case)
df['victim_first2'].sample(4)
df[['victim_first', 'victim_first2']].sample(5)
Your turn
Write a function that extracts the last name from the name field on your Titanic dataframe.
Create a new variable called `Family_Name` to store the results. What is the most common family name?
Sample answer code
def make_name(name):
    family_name = name.split(', ')[0]
    return family_name

titanic_df['Family_Name'] = titanic_df['Name'].apply(make_name)
titanic_df['Family_Name'].value_counts().head()
Working on more than one column (optional)
If you want a function to work with more than one column, you apply the function to the dataframe. The function will evaluate each row, so you have to specify in the function which columns you want to use.
def victim_name(row):
first_name = row['victim_first']
last_name = row['victim_last']
name = last_name + ', ' + first_name
name = title_case(name)
return name
df.apply(victim_name, axis=1)
Note that we include `axis=1` because we want the function to be applied to each row. The default is to apply the function to each column.
df['victim_name'] = df.apply(victim_name, axis=1)
cols_2_show = ['victim_first', 'victim_last', 'victim_name']
df[cols_2_show].sample(5)
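To see the row-wise behavior concretely, here is a small sketch with a made-up dataframe (the `demo` frame and `full_name` function are illustrative, not part of the homicide data):

```python
import pandas as pd

demo = pd.DataFrame({'first': ['JUAN', 'VIOLA'],
                     'last':  ['GARCIA', 'WRONSKI']})  # made-up names

def full_name(row):
    # With axis=1, `row` is a single row of the dataframe,
    # so individual columns are pulled out by name.
    return (row['last'] + ', ' + row['first']).title()

print(demo.apply(full_name, axis=1))
```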
Your turn
Write a function that tags all Titanic passengers who are under 18 and who are traveling without parents (`Parch == 0`). Use the function to create a new variable.
Sample answer code
def unac_minor(row):
    minor = row['Age'] < 18
    no_parents = row['Parch'] == 0
    if minor and no_parents:
        return True
    return False

titanic_df['Unaccomp_Minor'] = titanic_df.apply(unac_minor, axis=1)
titanic_df['Unaccomp_Minor'].value_counts()
Congratulations! You’ve now been introduced to the basics of data management in pandas for social scientists.