Wednesday, November 4, 2015

Python for Data Analysis Part 9: Pandas DataFrames



* Edit Jan 2021: I recently completed a YouTube video covering the topics in this post.

Numpy's ndarrays are well-suited for performing math operations on one- and two-dimensional arrays of numeric values, but they fall short when it comes to dealing with heterogeneous data sets. To store data from an external source like an Excel workbook or a database, we need a data structure that can hold different data types. It is also desirable to be able to refer to rows and columns in the data by custom labels rather than numbered indexes.
The pandas library offers two data structures designed with this in mind: the series and the DataFrame. Series are 1-dimensional labeled arrays similar to numpy's ndarrays, while DataFrames are labeled 2-dimensional structures that essentially function as spreadsheet tables.

Pandas Series

Before we get into DataFrames, we'll take a brief detour to explore pandas series. Series are very similar to ndarrays; the main difference is that with series, you can provide custom index labels, and operations you perform on series automatically align the data based on those labels.
To create a new series, first load the numpy and pandas libraries (pandas comes preinstalled with the Anaconda Python distribution).
In [1]:
import numpy as np
import pandas as pd    
*Note: It is common practice to import pandas with the shorthand "pd".
Define a new series by passing a collection of homogeneous data, such as an ndarray or list, along with a list of associated indexes, to pd.Series():
In [2]:
my_series = pd.Series( data = [2,3,5,4],             # Data
                       index= ['a', 'b', 'c', 'd'])  # Indexes

my_series
Out[2]:
a    2
b    3
c    5
d    4
dtype: int64
You can also create a series from a dictionary, in which case the dictionary keys act as the labels and the values act as the data:
In [3]:
my_dict = {"x": 2, "a": 5, "b": 4, "c": 8}

my_series2 = pd.Series(my_dict)

my_series2 
Out[3]:
a    5
b    4
c    8
x    2
dtype: int64
Similar to a dictionary, you can access items in a series by the labels:
In [4]:
my_series["a"]
Out[4]:
2
Numeric indexing also works:
In [5]:
my_series[0]
Out[5]:
2
If you take a slice of a series, you get both the values and the labels contained in the slice:
In [6]:
my_series[1:3]
Out[6]:
b    3
c    5
dtype: int64
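You can also slice a series by label. As a minimal sketch not shown in the original notebook, note that unlike positional slices, label-based slices include the endpoint:

my_series["b":"d"]    # Returns the values labeled b, c and d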
As mentioned earlier, operations performed on two series align by label:
In [7]:
my_series + my_series
Out[7]:
a     4
b     6
c    10
d     8
dtype: int64
If you perform an operation with two series that have different labels, the unmatched labels will return a value of NaN (not a number).
In [8]:
my_series + my_series2
Out[8]:
a     7
b     7
c    13
d   NaN
x   NaN
dtype: float64
Other than labeling, series behave much like numpy's ndarrays. A series is even a valid argument to many of the numpy array functions we covered last time:
In [9]:
np.mean(my_series)        # numpy array functions generally work on series
Out[9]:
3.5
In [10]:
np.dot(my_series, my_series) 
Out[10]:
54
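Arithmetic operators are also vectorized over a series, just as they are over an ndarray. A minimal sketch, not part of the original notebook:

my_series * 2         # Multiply each value by 2
np.sqrt(my_series)    # Take the square root of each value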

DataFrame Creation and Indexing

A DataFrame is a 2D table with labeled columns that can each hold different types of data. DataFrames are essentially a Python implementation of the types of tables you'd see in an Excel workbook or SQL database. DataFrames are the de facto standard data structure for working with tabular data in Python; we'll be using them a lot throughout the remainder of this guide.
You can create a DataFrame out of a variety of data sources, like dictionaries, 2D numpy arrays and series, using the pd.DataFrame() function. Dictionaries provide an intuitive way to create DataFrames: when passed to pd.DataFrame(), a dictionary's keys become the column labels and its values become the columns themselves:
In [11]:
# Create a dictionary with some different data types as values

my_dict = {"name" : ["Joe","Bob","Frans"],
           "age" : np.array([10,15,20]),
           "weight" : (75,123,239),
           "height" : pd.Series([4.5, 5, 6.1], 
                                index=["Joe","Bob","Frans"]),
           "siblings" : 1,
           "gender" : "M"}

df = pd.DataFrame(my_dict)   # Convert the dict to DataFrame

df                           # Show the DataFrame
Out[11]:
       age gender  height   name  siblings  weight
Joe     10      M     4.5    Joe         1      75
Bob     15      M     5.0    Bob         1     123
Frans   20      M     6.1  Frans         1     239

3 rows × 6 columns
Notice that the values in the dictionary you use to make a DataFrame can be a variety of sequence objects, including lists, ndarrays, tuples and series. If you pass in singular values like a single number or string, that value is duplicated for every row in the DataFrame (in this case, gender is set to "M" for all records and siblings is set to 1).
Also note that in the DataFrame above, the rows were automatically given indexes that align with the indexes of the series we passed in for the "height" column. If we did not use a series with index labels to create our DataFrame, it would be given numeric row index labels by default:
In [12]:
my_dict2 = {"name" : ["Joe","Bob","Frans"],
           "age" : np.array([10,15,20]),
           "weight" : (75,123,239),
           "height" :[4.5, 5, 6.1],
           "siblings" : 1,
           "gender" : "M"}

df2 = pd.DataFrame(my_dict2)   # Convert the dict to DataFrame

df2                            # Show the DataFrame
Out[12]:
   age gender  height   name  siblings  weight
0   10      M     4.5    Joe         1      75
1   15      M     5.0    Bob         1     123
2   20      M     6.1  Frans         1     239

3 rows × 6 columns
You can provide custom row labels when creating a DataFrame by adding the index argument:
In [13]:
df2 = pd.DataFrame(my_dict2,
                   index = my_dict["name"] )

df2
Out[13]:
       age gender  height   name  siblings  weight
Joe     10      M     4.5    Joe         1      75
Bob     15      M     5.0    Bob         1     123
Frans   20      M     6.1  Frans         1     239

3 rows × 6 columns
A DataFrame behaves like a dictionary of Series objects that each have the same length and indexes. This means we can get, add and delete columns in a DataFrame the same way we would when dealing with a dictionary:
In [14]:
# Get a column by name

df2["weight"]
Out[14]:
Joe       75
Bob      123
Frans    239
Name: weight, dtype: int32
Alternatively, you can get a column by label using "dot" notation:
In [15]:
df2.weight
Out[15]:
Joe       75
Bob      123
Frans    239
Name: weight, dtype: int32
In [16]:
# Delete a column

del df2['name']
In [17]:
# Add a new column

df2["IQ"] = [130, 105, 115]

df2
Out[17]:
       age gender  height  siblings  weight   IQ
Joe     10      M     4.5         1      75  130
Bob     15      M     5.0         1     123  105
Frans   20      M     6.1         1     239  115

3 rows × 6 columns
Inserting a single value into a DataFrame causes it to populate across all the rows.
In [18]:
df2["Married"] = False

df2
Out[18]:
       age gender  height  siblings  weight   IQ Married
Joe     10      M     4.5         1      75  130   False
Bob     15      M     5.0         1     123  105   False
Frans   20      M     6.1         1     239  115   False

3 rows × 7 columns
When inserting a Series into a DataFrame, rows are matched by index. Unmatched rows will be filled with NaN:
In [19]:
df2["College"] = pd.Series(["Harvard"],
                           index=["Frans"])

df2
Out[19]:
       age gender  height  siblings  weight   IQ Married  College
Joe     10      M     4.5         1      75  130   False      NaN
Bob     15      M     5.0         1     123  105   False      NaN
Frans   20      M     6.1         1     239  115   False  Harvard

3 rows × 8 columns
You can select rows and columns by label with df.loc[row, column]:
In [20]:
df2.loc["Joe"]          # Select row "Joe"
Out[20]:
age            10
gender          M
height        4.5
siblings        1
weight         75
IQ            130
Married     False
College       NaN
Name: Joe, dtype: object
In [21]:
df2.loc["Joe","IQ"]     # Select row "Joe" and column "IQ"
Out[21]:
130
In [22]:
df2.loc["Joe":"Bob" , "IQ":"College"]   # Slice by label
Out[22]:
      IQ Married College
Joe  130   False     NaN
Bob  105   False     NaN

2 rows × 3 columns
Select rows or columns by numeric index with df.iloc[row, column]:
In [23]:
df2.iloc[0]          # Get row 0
Out[23]:
age            10
gender          M
height        4.5
siblings        1
weight         75
IQ            130
Married     False
College       NaN
Name: Joe, dtype: object
In [24]:
df2.iloc[0, 5]       # Get row 0, column 5
Out[24]:
130
In [25]:
df2.iloc[0:2, 5:8]   # Slice by numeric row and column index
Out[25]:
      IQ Married College
Joe  130   False     NaN
Bob  105   False     NaN

2 rows × 3 columns
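Both df.loc and df.iloc also accept lists of labels or positions, which is handy when the rows or columns you want are not contiguous. A minimal sketch using the df2 defined above:

df2.loc[["Joe", "Frans"], ["age", "IQ"]]   # Specific rows and columns by label
df2.iloc[[0, 2], [0, 5]]                   # The same selection by position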
Select rows or columns based on a mixture of both labels and numeric indexes with df.ix[row, column]:
In [26]:
df2.ix[0]           # Get row 0
Out[26]:
age            10
gender          M
height        4.5
siblings        1
weight         75
IQ            130
Married     False
College       NaN
Name: Joe, dtype: object
In [27]:
df2.ix[0, "IQ"]     # Get row 0, column "IQ"
Out[27]:
130
In [28]:
df2.ix[0:2, ["age", "IQ", "weight"]]  # Slice rows and get specific columns
Out[28]:
     age   IQ  weight
Joe   10  130      75
Bob   15  105     123

2 rows × 3 columns
You can also select rows by passing in a sequence of boolean (True/False) values. Rows where the corresponding boolean is True are returned:
In [29]:
boolean_index = [False, True, True]  

df2[boolean_index]               
Out[29]:
       age gender  height  siblings  weight   IQ Married  College
Bob     15      M     5.0         1     123  105   False      NaN
Frans   20      M     6.1         1     239  115   False  Harvard

2 rows × 8 columns
This sort of logical True/False indexing is useful for subsetting data when combined with logical operations. For example, say we wanted to get a subset of our DataFrame with all persons who are over 12 years old. We can do it with boolean indexing:
In [30]:
# Create a boolean sequence with a logical comparison
boolean_index = df2["age"] > 12

# Use the index to get the rows where age > 12
df2[boolean_index]
Out[30]:
       age gender  height  siblings  weight   IQ Married  College
Bob     15      M     5.0         1     123  105   False      NaN
Frans   20      M     6.1         1     239  115   False  Harvard

2 rows × 8 columns
You can do this sort of indexing in a single operation, without assigning the boolean sequence to a variable:
In [31]:
df2[ df2["age"] > 12 ]
Out[31]:
       age gender  height  siblings  weight   IQ Married  College
Bob     15      M     5.0         1     123  105   False      NaN
Frans   20      M     6.1         1     239  115   False  Harvard

2 rows × 8 columns
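Multiple logical conditions can be combined with the element-wise operators & (and) and | (or); each comparison needs its own set of parentheses. A minimal sketch using the same DataFrame:

df2[ (df2["age"] > 12) & (df2["IQ"] > 110) ]   # Rows where both conditions hold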

Exploring DataFrames

Exploring data is an important first step in most data analyses. DataFrames come with a variety of functions to help you explore and summarize the data they contain.
First, let's load in a data set to explore: the mtcars data set. The mtcars data set comes with the ggplot library, a Python port of the popular R plotting library ggplot2. ggplot does not come with Anaconda, but you can install it by opening a console (cmd.exe) and running "pip install ggplot" (close Spyder and other programs before installing new libraries).
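If you would rather not install ggplot, an equivalent table could be loaded from a file instead. A minimal sketch, assuming a hypothetical mtcars.csv with the same columns saved in your working directory:

mtcars = pd.read_csv("mtcars.csv")   # Read a CSV file into a DataFrame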
Now we can import the mtcars data from ggplot:
In [32]:
from ggplot import mtcars

type(mtcars)
Out[32]:
pandas.core.frame.DataFrame
Notice that mtcars is loaded as a DataFrame. We can check the dimensions and size of a DataFrame with df.shape:
In [33]:
mtcars.shape      # Check dimensions
Out[33]:
(32, 12)
The output shows that mtcars has 32 rows and 12 columns.
We can check the first n rows of the data with the df.head() function:
In [34]:
mtcars.head(6)    # Check the first 6 rows
Out[34]:
                name   mpg  cyl  disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6   160  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6   160  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4   108   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6   258  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8   360  175  3.15  3.440  17.02   0   0     3     2
5            Valiant  18.1    6   225  105  2.76  3.460  20.22   1   0     3     1

6 rows × 12 columns
Similarly, we can check the last few rows with df.tail():
In [35]:
mtcars.tail(6)   # Check the last 6 rows
Out[35]:
              name   mpg  cyl   disp   hp  drat     wt  qsec  vs  am  gear  carb
26   Porsche 914-2  26.0    4  120.3   91  4.43  2.140  16.7   0   1     5     2
27    Lotus Europa  30.4    4   95.1  113  3.77  1.513  16.9   1   1     5     2
28  Ford Pantera L  15.8    8  351.0  264  4.22  3.170  14.5   0   1     5     4
29    Ferrari Dino  19.7    6  145.0  175  3.62  2.770  15.5   0   1     5     6
30   Maserati Bora  15.0    8  301.0  335  3.54  3.570  14.6   0   1     5     8
31      Volvo 142E  21.4    4  121.0  109  4.11  2.780  18.6   1   1     4     2

6 rows × 12 columns
With large data sets, head() and tail() are useful for getting a sense of what the data looks like without printing hundreds or thousands of rows to the screen. Since each row describes a different car, let's set the row indexes equal to the car names. You can access and assign new row indexes with df.index:
In [36]:
print(mtcars.index, "\n")      # Print original indexes

mtcars.index = mtcars["name"]  # Set index to car name
del mtcars["name"]             # Delete name column

print(mtcars.index)            # Print new indexes
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], dtype='int64') 

Index(['Mazda RX4', 'Mazda RX4 Wag', 'Datsun 710', 'Hornet 4 Drive', 'Hornet Sportabout', 'Valiant', 'Duster 360', 'Merc 240D', 'Merc 230', 'Merc 280', 'Merc 280C', 'Merc 450SE', 'Merc 450SL', 'Merc 450SLC', 'Cadillac Fleetwood', 'Lincoln Continental', 'Chrysler Imperial', 'Fiat 128', 'Honda Civic', 'Toyota Corolla', 'Toyota Corona', 'Dodge Challenger', 'AMC Javelin', 'Camaro Z28', 'Pontiac Firebird', 'Fiat X1-9', 'Porsche 914-2', 'Lotus Europa', 'Ford Pantera L', 'Ferrari Dino', 'Maserati Bora', 'Volvo 142E'], dtype='object')
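As an aside, the same result could be achieved in one step with df.set_index(), which by default moves the named column into the row index. A minimal sketch, not used in the original notebook:

mtcars = mtcars.set_index("name")   # Make the name column the row index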
You can access the column labels with df.columns:
In [37]:
mtcars.columns
Out[37]:
Index(['mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb'], dtype='object')
Use the df.describe() command to get a quick statistical summary of your data set. The summary includes the mean, median, min, max and a few key percentiles for numeric columns:
In [38]:
mtcars.ix[:,:6].describe()    # Summarize the first 6 columns
Out[38]:
             mpg        cyl        disp          hp       drat         wt
count  32.000000  32.000000   32.000000   32.000000  32.000000  32.000000
mean   20.090625   6.187500  230.721875  146.687500   3.596563   3.217250
std     6.026948   1.785922  123.938694   68.562868   0.534679   0.978457
min    10.400000   4.000000   71.100000   52.000000   2.760000   1.513000
25%    15.425000   4.000000  120.825000   96.500000   3.080000   2.581250
50%    19.200000   6.000000  196.300000  123.000000   3.695000   3.325000
75%    22.800000   8.000000  326.000000  180.000000   3.920000   3.610000
max    33.900000   8.000000  472.000000  335.000000   4.930000   5.424000

8 rows × 6 columns
Since the columns of a DataFrame are series and series are closely related to numpy's arrays, many ndarray functions work on DataFrames, operating on each column of the DataFrame:
In [39]:
np.mean(mtcars,
        axis=0)          # Get the mean of each column
Out[39]:
mpg      20.090625
cyl       6.187500
disp    230.721875
hp      146.687500
drat      3.596563
wt        3.217250
qsec     17.848750
vs        0.437500
am        0.406250
gear      3.687500
carb      2.812500
dtype: float64
In [40]:
np.sum(mtcars,
        axis=0)          # Get the sum of each column
Out[40]:
mpg      642.900
cyl      198.000
disp    7383.100
hp      4694.000
drat     115.090
wt       102.952
qsec     571.160
vs        14.000
am        13.000
gear     118.000
carb      90.000
dtype: float64
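DataFrames also have many of these summary operations built in as methods, which can be applied to the whole table or to a single column. A minimal sketch:

mtcars.mean()           # Mean of each column
mtcars["mpg"].max()     # Maximum of a single column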

Wrap Up

Pandas DataFrames are the workhorse data structure for data analysis in Python. They provide an intuitive structure that mirrors the sorts of data tables we're used to seeing in spreadsheet programs, along with indexing functionality that follows the same patterns as other Python data structures. This brief introduction only scratches the surface; DataFrames offer a host of other indexing options and functions, many of which we will see in future lessons.
