Data management and exploration in pandas

This section provides a brief introduction to pandas. The pandas library is a key component of doing data science in Python for several reasons. Most importantly, it provides two data types, the series and the dataframe, that allow you to store and manipulate data in a way that is useful for analysis. Second, it is incredibly useful for importing and exporting data in a wide variety of formats. Finally, it makes descriptive analysis, including both summary statistics and visualizations, straightforward. The rest of this section walks through the pandas capabilities most relevant to data analysis.

Most of the things that you will want to do in Python require importing libraries. By convention, pandas is imported as pd. Additionally, %matplotlib inline enables pandas graphics to be displayed within the notebook.

import pandas as pd

%matplotlib inline

Reading data

In the summer of 2017, the Washington Post produced a report on murder clearance rates in U.S. cities. They also released the data they collected on GitHub as a CSV file. We can create a new dataframe, called df, using the pandas read_csv method.

df = pd.read_csv('data/homicide.csv')
df
uid reported_date victim_last victim_first victim_race victim_age victim_sex city state lat lon disposition
0 Alb-000001 20100504 GARCIA JUAN Hispanic 78.0 Male Albuquerque NM 35.095788 -106.538555 Closed without arrest
1 Alb-000002 20100216 MONTOYA CAMERON Hispanic 17.0 Male Albuquerque NM 35.056810 -106.715321 Closed by arrest
2 Alb-000003 20100601 SATTERFIELD VIVIANA White 15.0 Female Albuquerque NM 35.086092 -106.695568 Closed without arrest
3 Alb-000004 20100101 MENDIOLA CARLOS Hispanic 32.0 Male Albuquerque NM 35.078493 -106.556094 Closed by arrest
4 Alb-000005 20100102 MULA VIVIAN White 72.0 Female Albuquerque NM 35.130357 -106.580986 Closed without arrest
5 Alb-000006 20100126 BOOK GERALDINE White 91.0 Female Albuquerque NM 35.151110 -106.537797 Open/No arrest
6 Alb-000007 20100127 MALDONADO DAVID Hispanic 52.0 Male Albuquerque NM 35.111785 -106.712614 Closed by arrest
7 Alb-000008 20100127 MALDONADO CONNIE Hispanic 52.0 Female Albuquerque NM 35.111785 -106.712614 Closed by arrest
8 Alb-000009 20100130 MARTIN-LEYVA GUSTAVO White 56.0 Male Albuquerque NM 35.075380 -106.553458 Open/No arrest
9 Alb-000010 20100210 HERRERA ISRAEL Hispanic 43.0 Male Albuquerque NM 35.065930 -106.572288 Open/No arrest
10 Alb-000011 20100212 BARRIUS-CAMPANIONI HECTOR Hispanic 20.0 Male Albuquerque NM 35.077375 -106.560569 Closed by arrest
11 Alb-000012 20100218 LUJAN KEVIN White NaN Male Albuquerque NM 35.077011 -106.564910 Closed without arrest
12 Alb-000013 20100222 COLLAMORE JOHN Hispanic 46.0 Male Albuquerque NM 35.064076 -106.608281 Open/No arrest
13 Alb-000014 20100306 CHIQUITO CORIN Other 16.0 Male Albuquerque NM 35.041491 -106.742147 Closed by arrest
14 Alb-000015 20100308 TORRES HECTOR Hispanic 54.0 Male Albuquerque NM 35.070656 -106.615845 Closed by arrest
15 Alb-000016 20100308 GRAY STEFANIA White 43.0 Female Albuquerque NM 35.070656 -106.615845 Closed by arrest
16 Alb-000017 20100322 LEYVA JOEL Hispanic 52.0 Male Albuquerque NM 35.079082 -106.495949 Closed by arrest
17 Alb-000018 20100323 DAVID LARRY White 52.0 Male Albuquerque NM NaN NaN Closed by arrest
18 Alb-000019 20100402 BRITO ELIZABETH White 22.0 Female Albuquerque NM 35.110404 -106.523668 Closed by arrest
19 Alb-000020 20100420 CHAVEZ GREG SR. Hispanic 49.0 Male Albuquerque NM 35.089573 -106.570078 Closed by arrest
20 Alb-000021 20100423 KING TEVION Black 15.0 Male Albuquerque NM 35.095887 -106.638081 Closed by arrest
21 Alb-000022 20100423 BOYKIN CEDRIC Black 25.0 Male Albuquerque NM 35.095887 -106.638081 Closed by arrest
22 Alb-000023 20100518 BARRAGAN MIGUEL White 20.0 Male Albuquerque NM 35.083401 -106.632941 Closed by arrest
23 Alb-000024 20100602 FORD LUTHER Other 47.0 Male Albuquerque NM 35.102423 -106.567674 Closed by arrest
24 Alb-000025 20100602 WRONSKI VIOLA White 88.0 Female Albuquerque NM 35.095956 -106.561135 Closed without arrest
25 Alb-000026 20100603 ASHFORD GUADALUPE Hispanic 27.0 Female Albuquerque NM 35.078462 -106.551755 Closed by arrest
26 Alb-000027 20100712 TURNER MICHELLE White 36.0 Female Albuquerque NM 35.053748 -106.531800 Closed without arrest
27 Alb-000028 20100712 CUNNINGHAM SHARON White 47.0 Female Albuquerque NM 35.053748 -106.531800 Closed without arrest
28 Alb-000029 20100726 NGUYEN SELENAVI Asian 1.0 Female Albuquerque NM 35.064456 -106.524458 Closed without arrest
29 Alb-000030 20100728 VALDEZ BILL Hispanic 58.0 Male Albuquerque NM 35.061827 -106.749104 Open/No arrest
... ... ... ... ... ... ... ... ... ... ... ... ...
52149 Was-001353 20160324 TURNER GABRIEL Black 46.0 Male Washington DC 38.857628 -76.996482 Closed by arrest
52150 Was-001355 20160518 MERIEDY THOMAS Black 24.0 Male Washington DC 38.862517 -76.990880 Open/No arrest
52151 Was-001356 20160906 COOK JOE Black 35.0 Male Washington DC 38.854721 -76.990795 Closed by arrest
52152 Was-001357 20160917 PHILLIPS SCORPIO Black 31.0 Male Washington DC 38.877003 -76.995627 Open/No arrest
52153 Was-001358 20160917 HARRIS ZORUAN Black 18.0 Male Washington DC 38.877003 -76.995627 Open/No arrest
52154 Was-001359 20161108 PRATT ANTINA Black 40.0 Female Washington DC 38.855699 -76.993248 Open/No arrest
52155 Was-001360 20161127 SILVER ORLANDO Black 37.0 Male Washington DC 38.860054 -76.991825 Closed by arrest
52156 Was-001361 20160304 REZENE NOEL Black 26.0 Male Washington DC 38.849150 -76.971272 Closed by arrest
52157 Was-001362 20160511 DRAYTON JAVION Black 3.0 Male Washington DC 38.853961 -76.984712 Closed without arrest
52158 Was-001363 20160702 MCCULLOUGH ANTOINE Black 30.0 Male Washington DC 38.843740 -76.977839 Closed by arrest
52159 Was-001365 20161111 BUIE SAMUELLE Black 25.0 Male Washington DC 38.851284 -76.981029 Open/No arrest
52160 Was-001366 20161111 JENNINGS RASSAAN Black 19.0 Male Washington DC 38.851284 -76.981029 Open/No arrest
52161 Was-001367 20161122 TYLER MARQUETT Black 26.0 Male Washington DC 38.849967 -76.974421 Open/No arrest
52162 Was-001368 20160302 GARRIS RUDOLPH Black 25.0 Male Washington DC 38.827191 -76.998183 Closed by arrest
52163 Was-001369 20160317 DANSBURY AUBREY Black 27.0 Male Washington DC 38.829442 -76.993175 Closed by arrest
52164 Was-001370 20160510 HANEY JAQUAN Black 22.0 Male Washington DC 38.833943 -76.990742 Open/No arrest
52165 Was-001371 20160519 HAMILTON DANA Black 44.0 Male Washington DC 38.828365 -76.994431 Closed by arrest
52166 Was-001372 20160609 WRIGHT RASHAWN Black 25.0 Male Washington DC 38.835947 -76.990438 Open/No arrest
52167 Was-001373 20160626 BLACKWELL WESTLEY Black 38.0 Male Washington DC 38.834756 -76.992662 Open/No arrest
52168 Was-001374 20161227 DOWTIN HERBERT Black 22.0 Male Washington DC 38.833190 -76.994345 Closed by arrest
52169 Was-001375 20160224 MEDLAY DEMETRIUS Black 22.0 Male Washington DC 38.843399 -77.000104 Closed by arrest
52170 Was-001376 20160731 STEPHENS JUDONNE Black 25.0 Male Washington DC 38.863322 -76.995309 Open/No arrest
52171 Was-001377 20160916 PEOPLES DARNELL Black 35.0 Male Washington DC 38.845871 -76.998169 Closed by arrest
52172 Was-001378 20160415 IVEY PAUL Black 37.0 Male Washington DC 38.826458 -77.003590 Closed by arrest
52173 Was-001379 20160715 HARRIS SHAROD Black 20.0 Male Washington DC 38.827266 -77.001572 Open/No arrest
52174 Was-001380 20160908 WILLIAMS EVAN Black 29.0 Male Washington DC 38.828704 -77.002075 Closed by arrest
52175 Was-001381 20160913 SMITH DEON Black 19.0 Male Washington DC 38.822852 -77.001725 Open/No arrest
52176 Was-001382 20161114 WASHINGTON WILLIE Black 23.0 Male Washington DC 38.828025 -77.002511 Open/No arrest
52177 Was-001383 20161130 BARNES MARCUS Black 24.0 Male Washington DC 38.820476 -77.008640 Open/No arrest
52178 Was-001384 20160901 JACKSON KEVIN Black 17.0 Male Washington DC 38.866689 -76.982409 Closed by arrest

52179 rows × 12 columns

If you have the URL of a CSV file, you can load it directly:

csv_url = 'https://raw.githubusercontent.com/nealcaren/data/master/sets/homicide.csv'

df = pd.read_csv(csv_url)

Learning about your dataframe

After loading a dataframe, best practice is to get a sense of the data with the head, info and describe methods. head shows the first five rows of the dataframe.

df.head()
uid reported_date victim_last victim_first victim_race victim_age victim_sex city state lat lon disposition
0 Alb-000001 20100504 GARCIA JUAN Hispanic 78.0 Male Albuquerque NM 35.095788 -106.538555 Closed without arrest
1 Alb-000002 20100216 MONTOYA CAMERON Hispanic 17.0 Male Albuquerque NM 35.056810 -106.715321 Closed by arrest
2 Alb-000003 20100601 SATTERFIELD VIVIANA White 15.0 Female Albuquerque NM 35.086092 -106.695568 Closed without arrest
3 Alb-000004 20100101 MENDIOLA CARLOS Hispanic 32.0 Male Albuquerque NM 35.078493 -106.556094 Closed by arrest
4 Alb-000005 20100102 MULA VIVIAN White 72.0 Female Albuquerque NM 35.130357 -106.580986 Closed without arrest

In addition to the data in the CSV file, an index has been created to identify each row. By default, this is an integer starting with 0.
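If you want to inspect it directly, the index is available as an attribute of the dataframe:

df.index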

If the dataset is wide, middle columns will not be displayed. Also, if text fields are long, only the first few characters will be shown. These can both be adjusted using pandas display settings.
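For example, here is a minimal sketch of raising those limits with set_option (the specific values are arbitrary):

pd.set_option('display.max_columns', 50)    # show up to 50 columns before truncating
pd.set_option('display.max_colwidth', 100)  # show up to 100 characters of each text field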

info can be used to explore the data types and the number of non-missing cases for each variable.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52179 entries, 0 to 52178
Data columns (total 12 columns):
uid              52179 non-null object
reported_date    52179 non-null int64
victim_last      52179 non-null object
victim_first     52179 non-null object
victim_race      52179 non-null object
victim_age       49180 non-null float64
victim_sex       52179 non-null object
city             52179 non-null object
state            52179 non-null object
lat              52119 non-null float64
lon              52119 non-null float64
disposition      52179 non-null object
dtypes: float64(3), int64(1), object(8)
memory usage: 4.8+ MB

describe provides summary statistics for all the numeric variables.

df.describe()
reported_date victim_age lat lon
count 5.217900e+04 49180.000000 52119.000000 52119.000000
mean 2.013090e+07 31.801220 37.026786 -91.471094
std 1.123420e+06 14.418692 4.348647 13.746378
min 2.007010e+07 0.000000 25.725214 -122.507779
25% 2.010032e+07 22.000000 33.765203 -95.997198
50% 2.012122e+07 28.000000 38.524973 -87.710286
75% 2.015091e+07 40.000000 40.027627 -81.755909
max 2.015111e+08 102.000000 45.051190 -71.011519

The column headers can be extracted using keys.

df.keys()
Index(['uid', 'reported_date', 'victim_last', 'victim_first', 'victim_race',
       'victim_age', 'victim_sex', 'city', 'state', 'lat', 'lon',
       'disposition'],
      dtype='object')

If you want to look at the bottom of the dataframe, you can use tail. Both head and tail allow you to change the number of rows displayed from the default five.

df.tail(3)
uid reported_date victim_last victim_first victim_race victim_age victim_sex city state lat lon disposition
52176 Was-001382 20161114 WASHINGTON WILLIE Black 23.0 Male Washington DC 38.828025 -77.002511 Open/No arrest
52177 Was-001383 20161130 BARNES MARCUS Black 24.0 Male Washington DC 38.820476 -77.008640 Open/No arrest
52178 Was-001384 20160901 JACKSON KEVIN Black 17.0 Male Washington DC 38.866689 -76.982409 Closed by arrest

sample displays random rows from the dataframe.

df.sample(5)
uid reported_date victim_last victim_first victim_race victim_age victim_sex city state lat lon disposition
28043 Las-001175 20170121 SANCHEZ ALBERTO Hispanic 24.0 Male Las Vegas NV 36.189362 -115.061763 Closed by arrest
13424 Cin-700043 20170728 ROBERTSON RICKEY Black 58.0 Male Cincinnati OH 39.140126 -84.476951 Open/No arrest
13253 Cin-000947 20140531 GASSETT JOSHUA Black 24.0 Male Cincinnati OH 39.162160 -84.382659 Closed by arrest
51164 Was-000343 20130907 SWEET KENNETH Black 30.0 Male Washington DC 38.877011 -77.002498 Open/No arrest
16190 Den-000087 20120318 NEWSON BILLY Black 29.0 Male Denver CO 39.740021 -104.979913 Closed by arrest

Your turn

Display the first four rows of the dataframe df.


Sample answer code df.head(4)

Working with variables

A single column, or series, can be selected with bracket notation and summarized with the same methods used on the full dataframe.

df.head(2)
df['victim_age'].describe()
df['victim_age'].value_counts()

As this has many values, pandas only displays the top and bottom 30 cases. The values method can be used to produce an array containing all the values in order.

Your turn

Explore the disposition and victim race columns in the dataframe.


Sample answer code
df['disposition'].value_counts()
df['victim_race'].value_counts()
df[['disposition', 'victim_race']].describe()

All of the values of a specific variable can be extracted.

ages = df['victim_age'].values
len(ages)
first_age = ages[0]
print(first_age)

Your turn

Display seven values from the middle of our age variable.


Sample answer code ages[1201:1208]

Your turn

A well-known data set is the list of Titanic passengers. A version can be found in the data folder called "titanic.csv". Open the file as a new dataframe, titanic_df. How many cases are there? How many columns? What can you find out about the data?


Sample answer code
titanic_df = pd.read_csv('data/titanic.csv')
titanic_df.describe()
titanic_df.info()
titanic_df.sample(5)

Plots (optional)

pandas also has plotting capabilities, such as histograms (hist) and a correlation matrix (scatter_matrix).

%matplotlib inline
df['victim_age'].hist()

Plots of individual variables, or series in pandas terminology, are attributes of the data type. That is, you start with the thing you want plotted, in this case df['victim_age'], and append what you want to do, such as .hist().

A second type of plot, such as the scatter plot, is a method of the dataframe.

df.plot.scatter(x='lon', y='lat')

You can look at the other dataframe plotting methods on the helpful pandas visualization page. Alternatively, pressing Tab after typing df.plot. also reveals your options.

Want to know about hexbin? Again, the help page on the web is useful, but appending a question mark to the end of the command will bring up the documentation.
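For example, running the following in a notebook cell brings up the hexbin docstring:

df.plot.hexbin?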


<img src="images/docstring.png" width = "80%" align="left"/>

A third group of plots is part of the pandas plotting library. In these cases, the thing you want plotted is the first, or only, parameter passed, as is the case with the correlation matrix.


pd.plotting.scatter_matrix(df)

Finally, you can also create subplots using the by option. Note that by accepts a series, or dataframe column, rather than a column name.

df['victim_age'].hist(by = df['victim_sex'],
                      bins = 20)

By default, by produces separate x and y scales for each subgraph. This is why there appears to be a relatively large number of deaths of very young females. The counts for men and women at this age are comparable, but the very large number of male deaths in their 20s results in very different y-axis scales for the two plots. This behavior can be changed with the sharex and sharey options.

df['victim_age'].hist(by = df['victim_sex'],
                      bins   = 20,
                      sharex = True,
                      sharey = True)

Other descriptives

Pandas also has a method for producing crosstabs.

pd.crosstab(df['victim_race'], df['disposition'])

Note that since this is a pandas method, and not one of a specific dataframe, you need to be explicit about which dataframe each variable is coming from. That is why the first parameter is not 'victim_race' but df['victim_race'].

normalize can be used to display proportions instead of frequencies. A value of index normalizes by row, columns by column, and all by all values.

pd.crosstab(df['victim_race'],
            df['disposition'],
            normalize='index')

Since this returns a dataframe, it can be saved or plotted.

cross_tab = pd.crosstab(df['victim_race'],
                        df['disposition'],
                        normalize='index')

cross_tab
cross_tab.to_csv('data/crosstab.csv')

Your turn

In your titanic dataframe, run a crosstab between sex and survived. Anything interesting?


Sample answer code pd.crosstab(titanic_df['Survived'], titanic_df['Sex'], normalize='index')

In order to highlight a meaningful characteristic of the data, you can sort before plotting.

cross_tab.sort_values(by='Closed by arrest')
cross_tab.sort_values(by='Closed by arrest').plot(kind   = 'barh',
                                                  title  = 'Disposition by race')

Subsets

Similar to a list, a dataframe or series can be sliced to subset the data being shown. For example, df[:2] will return the first two rows of the dataframe. (This is identical to df.head(2).)

df[:2]
df.head(2)

This also works for specific columns.

df['reported_date'][:3]

Dates (optional)

Using the to_datetime method, we can create a new variable from reported_date that pandas understands as a date. The format is %Y%m%d because the original date is in the "YYYYMMDD" format, and errors='coerce' inserts missing values where the data can't be converted, rather than stopping the variable creation completely.

df['reported_date'].head()
df['date'] = pd.to_datetime(df['reported_date'],
                            format='%Y%m%d',
                            errors='coerce')
df['date'][:3]

From the new series, we can extract specific elements, such as the year.

df['year'] = df['date'].dt.year
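Other components of the date are available the same way through the .dt accessor; for instance, the month or the day of the week (shown here without saving them as new columns):

df['date'].dt.month[:3]
df['date'].dt.dayofweek[:3]    # 0 is Monday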

As before, value_counts and plots can give some sense of the distribution of the values.

df['year'].value_counts()

value_counts returns a pandas series with an index equal to the original values, in this case the year, and the series values based on the frequency. Since years have an order, it makes sense to sort by the index before plotting them.

df['year'].value_counts().sort_index(ascending = False).plot(kind='barh')

crosstab can also group based on more than one variable for the x or y axis. In that case, you pass a list rather than a single variable or series. To make this clearer, you can create the lists before creating the crosstab.

y_vars = [df['state'], df['city']]
x_vars = df['year']

pd.crosstab(y_vars, x_vars)

crosstab returns a dataframe with the column and index names taken from the values in the original dataset. Since a list was passed, the dataframe has a MultiIndex. This can be useful for cases where you have nested data, like cities within states or annual data on multiple countries.

pd.crosstab(y_vars, x_vars).index.names
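One payoff of the MultiIndex is that .loc can select on the outer level. As a small sketch using a state value that appears in these data, the following pulls out every city-by-year row for New Mexico:

state_city_year = pd.crosstab(y_vars, x_vars)
state_city_year.loc['NM']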

Index

df.head()

By default, the index is a series of integers that starts with 0. If your data has unique identifiers, however, it is helpful to use them as the index, especially if you intend to merge your data with other data sources. In this dataframe, each row has a unique value for uid.

df.set_index('uid', inplace=True)
df[:5]

Your turn

In your Titanic dataframe, set the index to the PassengerId column. Confirm that it did what you wanted it to do.


Sample answer code
titanic_df.set_index('PassengerId', inplace=True)
titanic_df.sample(3)

Subsetting

You can view a subset of a dataframe based on the value of a column.

Let’s say that you wanted to look at the cases where the victim’s first name was “Juan”. You could create a new series which is either True or False for each case.

df['victim_first'] == 'JUAN'

You could store this new true/false series. If you placed this in brackets after the name of the dataframe, pandas would display only the rows with a True value.

is_juan = df['victim_first'] == 'JUAN'
df[is_juan]

More commonly, the two statements are combined.

df[df['victim_first'] == 'JUAN']

With this method of subsetting, pandas doesn't return a new dataframe, but rather just hides some of the rows. So if you want to create a new dataframe based on this subset, you need to append copy() to the end.

new_df = df[df['victim_first'] == 'JUAN'].copy()
new_df.head()

As this selection method returns a dataframe, it can be stored. The following creates two dataframes, one with just the 2017 cases and one with just the 2016 cases.

df_2017 = df[df['year'] == 2017].copy()
df_2016 = df[df['year'] == 2016].copy()


df_2017['year'].value_counts()
df_2016['year'].value_counts()

value_counts confirms that the correct cases were grabbed.

Alternatively, you may want to limit your dataset by column. In this case, you create a list of the columns you want to keep. This list is also placed in brackets after the dataframe name, as shown below under More subsets.

Your turn

Create a new dataframe with just the female passengers. Check your work.


Sample answer code
mask = titanic_df['Sex'] == 'female'
titanic_df_f = titanic_df[mask]
titanic_df_f['Sex'].value_counts()

More subsets

columns_to_keep = ['victim_last', 'victim_first', 'victim_race', 'victim_age', 'victim_sex']
df[columns_to_keep]

As before, you can use copy to create a new dataset.

victim_df = df[columns_to_keep].copy()
victim_df.head()

As with the row selection, you don’t need to store the column names in a list first. By convention, these two steps are combined. However, combining the steps does create an awkward pair of double brackets.

place_df = df[['city', 'state', 'lat', 'lon']].copy()
place_df.head()

Merging

There are several different ways to combine datasets. The most straightforward is to merge two datasets that share a key. To merge place_df with victim_df, for example, you can use the dataframe merge method.

merged_df = place_df.merge(victim_df, left_index=True, right_index=True)

merged_df.head()
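Here the merge lines up rows by their shared uid index. If the key were stored as an ordinary column instead, you could name it with the on parameter. A brief sketch, assuming you first move uid back into a column with reset_index:

place_cols = place_df.reset_index()      # uid becomes a regular column again
victim_cols = victim_df.reset_index()
merged_on_uid = place_cols.merge(victim_cols, on='uid')
merged_on_uid.head()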

Stacking dataframes

If you have multiple dataframes in which each row represents the same kind of object, such as passengers on a ship or murder victims, you can stack them using pd.concat. Stacking is most effective when the column names overlap between the datasets. Below, I create a new dataframe by stacking together victims from 2016 and 2017.

df_2016 = df[df['year'] == 2016]
len(df_2016)
recent_df = pd.concat([df_2017, df_2016])
len(recent_df)
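When the column names do not fully overlap, concat keeps every column and fills the gaps with missing values. A tiny illustration with two made-up dataframes (hypothetical data, not from the homicide file):

left = pd.DataFrame({'name': ['A', 'B'], 'age': [30, 40]})
right = pd.DataFrame({'name': ['C'], 'city': ['Tucson']})
pd.concat([left, right])    # 'age' and 'city' are NaN where a frame lacks them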

New features

New columns in a dataframe can be created in the same manner that new objects are created in Python.

df['birth_year'] = df['year'] - df['victim_age']
df['birth_year'].describe()
df['minor'] = df['victim_age'] <= 18
df['minor'][:10]
df['minor'].mean()

Your turn

Create a new variable in your Titanic dataframe which marks the people who paid a fare in the top 25% of all fares paid.


Sample answer code
titanic_df['Fare'].describe()
threshold = titanic_df['Fare'].quantile(0.75)    # fares above the 75th percentile are the top 25%
titanic_df['Top_25_Fare'] = titanic_df['Fare'] > threshold
titanic_df['Top_25_Fare'].value_counts()

Back to some pandas string manipulation fun.

As discussed in the first notebook, Python programming involves creating functions, such as one to do a simple string manipulation.

def title_case(text):
    return text.title()
title_case('JUAN')

The apply magic

You can apply a function to a dataframe column in order to transform each value. The results can also be stored as a new feature.

df['victim_first'].apply(title_case)
df['victim_first2'] = df['victim_first'].apply(title_case)
df['victim_first2'].sample(4)
df[['victim_first', 'victim_first2']].sample(5)

Your turn

Write a function that extracts the last name from the name field on your Titanic dataframe. Create a new variable called Family_Name to store the results. What is the most common family name?


Sample answer code
def make_name(name):
    family_name = name.split(', ')[0]
    return family_name

titanic_df['Family_Name'] = titanic_df['Name'].apply(make_name)
titanic_df['Family_Name'].sample(3)

Working on more than one column (optional)

If you want a function to work with more than one column, you apply the function to the dataframe. The function will evaluate each row, so you have to specify in the function which columns you want to use.

def victim_name(row):
    first_name = row['victim_first']
    last_name  = row['victim_last']
    name       = last_name + ', ' + first_name
    name       = title_case(name)
    return name
df.apply(victim_name, axis=1)

Note that we include axis=1 because we want the function to be applied to each row. The default is to apply the function to each column.

df['victim_name'] = df.apply(victim_name, axis=1)
cols_2_show = ['victim_first', 'victim_last', 'victim_name']

df[cols_2_show].sample(5)

Your turn

Write a function that tags all Titanic passengers who are under 18 and traveling without parents (Parch == 0). Use the function to create a new variable.


Sample answer code
def unac_minor(row):
    minor = row['Age'] < 18
    no_parents = row['Parch'] == 0
    if minor == True and no_parents == True:
        return True
    return False

titanic_df['Unaccomp_Minor'] = titanic_df.apply(unac_minor, axis=1)
titanic_df['Unaccomp_Minor'].value_counts()

Congratulations! You’ve now been introduced to the basics of data management in pandas for social scientists.
