Tuesday, November 3, 2015

Python for Data Analysis Part 8: Numpy Arrays


Python's built-in data structures are great for general-purpose programming, but they lack specialized features we'd like for data analysis. For example, adding rows or columns of data in an element-wise fashion and performing math operations on two-dimensional tables (matrices) are common tasks that aren't readily available with Python's base data types. In this lesson we'll learn about ndarrays, a data structure available in Python's numpy library that implements a variety of useful functions for analyzing data.

Numpy and ndarray Basics

The numpy library is one of the core packages in Python's scientific software stack. Many other Python data analysis libraries require numpy as a prerequisite, because they use its ndarray data structure as a building block. The Anaconda Python distribution we installed in part 1 comes with numpy.

Numpy implements a data structure called the N-dimensional array or ndarray. ndarrays are similar to lists in that they contain a collection of items that can be accessed via indexes. Unlike lists, ndarrays are homogeneous, meaning they can only contain objects of the same type, and they can be multi-dimensional, making it easy to store 2-dimensional tables or matrices.

To work with ndarrays, we need to load the numpy library. It is standard practice to load numpy with the alias "np" like so:
In [1]:
import numpy as np
The "as np" after the import statement lets us access the numpy library's functions using the shorthand "np."
Create an ndarray by passing a list to the np.array() function:
In [2]:
my_list = [1, 2, 3, 4]             # Define a list

my_array = np.array(my_list)       # Pass the list to np.array()

type(my_array)                     # Check the object's type
Out[2]:
numpy.ndarray
To create an array with more than one dimension, pass a nested list to np.array():
In [3]:
second_list = [5, 6, 7, 8]

two_d_array = np.array([my_list, second_list])

print(two_d_array)
[[1 2 3 4]
 [5 6 7 8]]
An ndarray is defined by the number of dimensions it has, the size of each dimension and the type of data it holds. Check the number and size of dimensions of an ndarray with the shape attribute:
In [4]:
two_d_array.shape
Out[4]:
(2, 4)
The output above shows that this ndarray is 2-dimensional, since there are two values listed, and the dimensions have length 2 and 4. Check the total size (total number of items) in an array with the size attribute:
In [5]:
two_d_array.size
Out[5]:
8
Check the type of the data in an ndarray with the dtype attribute:
In [6]:
two_d_array.dtype
Out[6]:
dtype('int32')
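The dtype reported above depends on the platform and numpy version: it is int32 here, while many other setups default to int64. Because ndarrays are homogeneous, numpy also converts mixed input to a single common type. Here is a minimal sketch of both points, using made-up values and variable names that are not part of the lesson's running example:

```python
import numpy as np

# Mixing ints and floats: numpy upcasts every item to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)                   # float64

# Request a specific type at creation time with the dtype argument
as_float = np.array([1, 2, 3], dtype=np.float64)

# Convert an existing array with astype(), which returns a new array
as_int = as_float.astype(np.int64)
print(as_float.dtype, as_int.dtype)  # float64 int64
```

Setting the dtype explicitly is one way to get the same result regardless of the machine the code runs on.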
Numpy has a variety of special array creation functions. Some handy array creation functions include:
In [7]:
# np.identity() to create a square 2d array with 1's across the diagonal

np.identity(n = 5)      # Size of the array
Out[7]:
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])
In [8]:
# np.eye() to create a 2d array with 1's across a specified diagonal

np.eye(N = 3,  # Number of rows
       M = 5,  # Number of columns
       k = 1)  # Index of the diagonal (main diagonal (0) is default)
Out[8]:
array([[ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.]])
In [9]:
# np.ones() to create an array filled with ones:

np.ones(shape= [2,4])
Out[9]:
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])
In [10]:
# np.zeros() to create an array filled with zeros:

np.zeros(shape= [4,6])
Out[10]:
array([[ 0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])
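The creation functions above relate to each other in a few ways that are easy to check. The following is a small sketch (the variable names are illustrative only) showing that np.eye() with default arguments matches np.identity(), that a negative k picks a diagonal below the main one, and that a dtype argument can be supplied when floats aren't wanted:

```python
import numpy as np

# With default arguments, np.eye(3) produces the same array as np.identity(3)
same = np.array_equal(np.eye(3), np.identity(3))
print(same)                          # True

# A negative k selects a diagonal below the main diagonal
lower = np.eye(N=3, M=3, k=-1)
print(lower)

# Pass dtype to get integers instead of the default floats
int_ones = np.ones(shape=(2, 4), dtype=int)
print(int_ones.dtype)
```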

Array Indexing and Slicing

Numpy ndarrays offer numbered indexing and slicing syntax that mirrors the syntax for Python lists:
In [11]:
one_d_array = np.array([1,2,3,4,5,6])

one_d_array[3]        # Get the item at index 3
Out[11]:
4
In [12]:
one_d_array[3:]       # Get a slice from index 3 to the end
Out[12]:
array([4, 5, 6])
In [13]:
one_d_array[::-1]     # Slice backwards to reverse the array
Out[13]:
array([6, 5, 4, 3, 2, 1])
If an ndarray has more than one dimension, separate indexes for each dimension with a comma:
In [14]:
# Create a new 2d array
two_d_array = np.array([one_d_array, one_d_array + 6, one_d_array + 12])

print(two_d_array) 
[[ 1  2  3  4  5  6]
 [ 7  8  9 10 11 12]
 [13 14 15 16 17 18]]
In [15]:
# Get the element at row index 1, column index 4

two_d_array[1, 4]
Out[15]:
11
In [16]:
# Slice elements starting at row index 1 and column index 4

two_d_array[1:, 4:]
Out[16]:
array([[11, 12],
       [17, 18]])
In [17]:
# Reverse both dimensions (180 degree rotation)

two_d_array[::-1, ::-1]
Out[17]:
array([[18, 17, 16, 15, 14, 13],
       [12, 11, 10,  9,  8,  7],
       [ 6,  5,  4,  3,  2,  1]])
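Although the slicing syntax mirrors lists, there is one behavioral difference worth knowing: slicing a list makes a copy, while basic slicing of an ndarray returns a view onto the same data. A minimal sketch with a throwaway array (the names below are illustrative, not from the lesson):

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5, 6])

view = original[3:]                  # Basic slicing returns a view
view[0] = 100                        # Writing to the view changes the original
print(original[3])                   # 100

independent = original[3:].copy()    # copy() gives an independent array
independent[0] = -1                  # This change stays in the copy
print(original[3])                   # 100
```

If you plan to modify a slice and need the source array left alone, calling copy() first is the safer choice.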

Reshaping Arrays

Numpy has a variety of built-in functions to help you manipulate arrays quickly without having to use complicated indexing operations.
Reshape an array into a new array with the same data but different structure with np.reshape():
In [18]:
np.reshape(two_d_array,          # Array to reshape
           (6, 3))               # Dimensions of the new array
Out[18]:
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12],
       [13, 14, 15],
       [16, 17, 18]])
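Two details of reshaping are worth a quick sketch: one dimension can be given as -1 so numpy infers it, and the new shape has to hold exactly the same number of items as the old one. The array below is rebuilt with np.arange() so the example runs on its own; that helper and the variable names are choices made for this sketch:

```python
import numpy as np

arr = np.arange(1, 19).reshape(3, 6)   # Same values as two_d_array above

# Use -1 to let numpy infer one dimension from the array's size
inferred = np.reshape(arr, (-1, 3))
print(inferred.shape)                  # (6, 3)

# The new shape must hold exactly the same number of items
try:
    np.reshape(arr, (4, 5))            # 20 slots for 18 items
    reshape_ok = True
except ValueError as err:
    reshape_ok = False
    print("reshape failed:", err)
```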
Unravel a multi-dimensional array into 1 dimension with np.ravel():
In [19]:
np.ravel(a=two_d_array,
         order='C')         # Use C-style unraveling (by rows)
Out[19]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18])
In [20]:
np.ravel(a=two_d_array,
         order='F')         # Use Fortran-style unraveling (by columns)
Out[20]:
array([ 1,  7, 13,  2,  8, 14,  3,  9, 15,  4, 10, 16,  5, 11, 17,  6, 12,
       18])
Alternatively, use ndarray.flatten() to flatten a multi-dimensional array into 1 dimension and return a copy of the result:
In [21]:
two_d_array.flatten()
Out[21]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18])
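The practical difference between np.ravel() and flatten() is that ravel() returns a view when it can avoid copying, while flatten() always copies. A rough sketch of what that means, using a stand-in array rather than the lesson's two_d_array:

```python
import numpy as np

arr = np.arange(1, 19).reshape(3, 6)

raveled = np.ravel(arr)              # A view here, since no copy is needed
flattened = arr.flatten()            # Always a copy

shares_ravel = np.shares_memory(arr, raveled)
shares_flatten = np.shares_memory(arr, flattened)
print(shares_ravel, shares_flatten)  # True False

raveled[0] = 999                     # Writes through to arr
print(arr[0, 0])                     # 999
print(flattened[0])                  # 1
```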
Get the transpose of an array with ndarray.T:
In [22]:
two_d_array.T
Out[22]:
array([[ 1,  7, 13],
       [ 2,  8, 14],
       [ 3,  9, 15],
       [ 4, 10, 16],
       [ 5, 11, 17],
       [ 6, 12, 18]])
Flip an array vertically or horizontally with np.flipud() and np.fliplr() respectively:
In [23]:
np.flipud(two_d_array)
Out[23]:
array([[13, 14, 15, 16, 17, 18],
       [ 7,  8,  9, 10, 11, 12],
       [ 1,  2,  3,  4,  5,  6]])
In [24]:
np.fliplr(two_d_array)
Out[24]:
array([[ 6,  5,  4,  3,  2,  1],
       [12, 11, 10,  9,  8,  7],
       [18, 17, 16, 15, 14, 13]])
Rotate an array 90 degrees counter-clockwise with np.rot90():
In [25]:
np.rot90(two_d_array,
         k=1)             # Number of 90 degree rotations
Out[25]:
array([[ 6, 12, 18],
       [ 5, 11, 17],
       [ 4, 10, 16],
       [ 3,  9, 15],
       [ 2,  8, 14],
       [ 1,  7, 13]])
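The k argument ties np.rot90() back to operations shown earlier. As a quick check, here is a sketch (array and names are illustrative) confirming that two rotations match the double reversed slice, that a negative k rotates clockwise, and that one rotation equals a flipped transpose:

```python
import numpy as np

arr = np.arange(1, 19).reshape(3, 6)

# Two 90 degree rotations match reversing both dimensions with slices
half_turn = np.rot90(arr, k=2)
print(np.array_equal(half_turn, arr[::-1, ::-1]))       # True

# A negative k rotates clockwise instead of counter-clockwise
clockwise = np.rot90(arr, k=-1)
print(clockwise.shape)                                  # (6, 3)

# One counter-clockwise rotation is the transpose flipped vertically
quarter_turn = np.rot90(arr)
print(np.array_equal(quarter_turn, np.flipud(arr.T)))   # True
```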
Shift elements in an array along a given dimension with np.roll():
In [26]:
np.roll(a= two_d_array,
        shift = 2,        # Shift elements 2 positions
        axis = 1)         # In each row
Out[26]:
array([[ 5,  6,  1,  2,  3,  4],
       [11, 12,  7,  8,  9, 10],
       [17, 18, 13, 14, 15, 16]])
Leave the axis argument empty to shift on a flattened version of the array (shift across all dimensions):
In [27]:
np.roll(a= two_d_array,
        shift = 2)
Out[27]:
array([[17, 18,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15, 16]])
Join arrays along an axis with np.concatenate():
In [28]:
array_to_join = np.array([[10,20,30],[40,50,60],[70,80,90]])

np.concatenate( (two_d_array,array_to_join),  # Arrays to join
               axis=1)                        # Axis to join upon
Out[28]:
array([[ 1,  2,  3,  4,  5,  6, 10, 20, 30],
       [ 7,  8,  9, 10, 11, 12, 40, 50, 60],
       [13, 14, 15, 16, 17, 18, 70, 80, 90]])
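Joining along axis 1 worked above because both arrays have 3 rows. A short sketch of the other direction, with a made-up extra row, shows that axis=0 needs matching column counts and that a mismatch raises an error:

```python
import numpy as np

arr = np.arange(1, 19).reshape(3, 6)
extra_row = np.array([[19, 20, 21, 22, 23, 24]])    # Shape (1, 6)

# axis=0 stacks rows, so the column counts must match
stacked = np.concatenate((arr, extra_row), axis=0)
print(stacked.shape)                                # (4, 6)

# Joining a 3x6 and a 3x3 array along axis 0 fails: 6 columns vs 3
three_by_three = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
try:
    np.concatenate((arr, three_by_three), axis=0)
    joined = True
except ValueError as err:
    joined = False
    print("concatenate failed:", err)
```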

Array Math Operations

Creating and manipulating arrays is nice, but the true power of numpy arrays is the ability to perform mathematical operations on many values quickly and easily. Unlike built-in Python lists, ndarrays let you use math operators like +, -, / and * to perform basic math operations on every element at once:
In [29]:
two_d_array + 100    # Add 100 to each element
Out[29]:
array([[101, 102, 103, 104, 105, 106],
       [107, 108, 109, 110, 111, 112],
       [113, 114, 115, 116, 117, 118]])
In [30]:
two_d_array - 100    # Subtract 100 from each element
Out[30]:
array([[-99, -98, -97, -96, -95, -94],
       [-93, -92, -91, -90, -89, -88],
       [-87, -86, -85, -84, -83, -82]])
In [31]:
two_d_array * 2      # Multiply each element by 2
Out[31]:
array([[ 2,  4,  6,  8, 10, 12],
       [14, 16, 18, 20, 22, 24],
       [26, 28, 30, 32, 34, 36]])
In [32]:
two_d_array / 2      # Divide each element by 2
Out[32]:
array([[ 0.5,  1. ,  1.5,  2. ,  2.5,  3. ],
       [ 3.5,  4. ,  4.5,  5. ,  5.5,  6. ],
       [ 6.5,  7. ,  7.5,  8. ,  8.5,  9. ]])
In [33]:
two_d_array ** 2      # Square each element
Out[33]:
array([[  1,   4,   9,  16,  25,  36],
       [ 49,  64,  81, 100, 121, 144],
       [169, 196, 225, 256, 289, 324]])
In [34]:
two_d_array % 2       # Take modulus of each element 
Out[34]:
array([[1, 0, 1, 0, 1, 0],
       [1, 0, 1, 0, 1, 0],
       [1, 0, 1, 0, 1, 0]], dtype=int32)
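To see why this matters, it helps to compare with what the same operators do to a plain list. A minimal side-by-side sketch with a throwaway list:

```python
import numpy as np

values = [1, 2, 3]

list_times_two = values * 2          # List repetition
list_plus_list = values + values     # List concatenation
print(list_times_two)                # [1, 2, 3, 1, 2, 3]
print(list_plus_list)                # [1, 2, 3, 1, 2, 3]

arr = np.array(values)
array_times_two = arr * 2            # Element-wise multiplication
print(array_times_two)               # [2 4 6]

# values + 100 raises a TypeError for a list; the array version works
array_plus_100 = arr + 100
print(array_plus_100)                # [101 102 103]
```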
Beyond operating on each element of an array with a single scalar value, you can also use the basic math operators on two arrays with the same shape. When operating on two arrays, the basic math operators function in an element-wise fashion, returning an array with the same shape as the inputs:
In [35]:
small_array1 = np.array([[1,2],[3,4]])

small_array1 + small_array1
Out[35]:
array([[2, 4],
       [6, 8]])
In [36]:
small_array1 - small_array1
Out[36]:
array([[0, 0],
       [0, 0]])
In [37]:
small_array1 * small_array1
Out[37]:
array([[ 1,  4],
       [ 9, 16]])
In [38]:
small_array1 / small_array1
Out[38]:
array([[ 1.,  1.],
       [ 1.,  1.]])
In [39]:
small_array1 ** small_array1
Out[39]:
array([[  1,   4],
       [ 27, 256]], dtype=int32)
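The same-shape requirement is enforced at run time. The sketch below, using two small made-up arrays, shows the error raised when shapes don't line up, and also that / produces floats even for integer arrays while // keeps integers. (Numpy does relax the shape rule in some cases through broadcasting, which this lesson doesn't cover.)

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])           # Shape (2, 2)
b = np.array([[1, 2, 3], [4, 5, 6]])     # Shape (2, 3)

try:
    a + b                                # Shapes (2, 2) and (2, 3) don't match
    added = True
except ValueError as err:
    added = False
    print("addition failed:", err)

true_div = a / a                         # True division gives floats
floor_div = a // a                       # Floor division keeps integers
print(true_div.dtype, floor_div.dtype)
```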
Numpy also offers a variety of named math functions for ndarrays. There are too many to cover in detail here, so we'll just look at a selection of the most useful ones for data analysis:
In [40]:
# Get the mean of all the elements in an array with np.mean()

np.mean(two_d_array)
Out[40]:
9.5
In [41]:
# Provide an axis argument to get means across a dimension

np.mean(two_d_array,
        axis = 1)     # Get means of each row
Out[41]:
array([  3.5,   9.5,  15.5])
In [42]:
# Get the standard deviation of all the elements in an array with np.std()

np.std(two_d_array)
Out[42]:
5.1881274720911268
In [43]:
# Provide an axis argument to get standard deviations across a dimension

np.std(two_d_array,
        axis = 0)     # Get stdev for each column
Out[43]:
array([ 4.89897949,  4.89897949,  4.89897949,  4.89897949,  4.89897949,
        4.89897949])
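One detail to keep in mind for data analysis: by default np.std() computes the population standard deviation (dividing by N). Pass ddof=1 to get the sample standard deviation (dividing by N - 1). A small sketch on made-up data:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(data)            # ddof=0 (default): divide by N
sample_std = np.std(data, ddof=1)        # ddof=1: divide by N - 1

print(population_std)                    # 2.0
print(sample_std)                        # Slightly larger than 2.0
```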
In [44]:
# Sum the elements of an array across an axis with np.sum()

np.sum(two_d_array, 
       axis=1)        # Get the row sums
Out[44]:
array([21, 57, 93])
In [45]:
np.sum(two_d_array,
       axis=0)        # Get the column sums
Out[45]:
array([21, 24, 27, 30, 33, 36])
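The axis argument names the dimension that gets collapsed, which can feel backwards at first. The following sketch checks the resulting shapes on a stand-in for two_d_array and compares the row sums against a hand-written loop:

```python
import numpy as np

arr = np.arange(1, 19).reshape(3, 6)

# axis=0 collapses the rows, leaving one value per column
column_sums = np.sum(arr, axis=0)
print(column_sums.shape)                          # (6,)

# axis=1 collapses the columns, leaving one value per row
row_sums = np.sum(arr, axis=1)
print(row_sums.shape)                             # (3,)

# The row sums match summing each row by hand
row_sums_by_hand = [sum(row) for row in arr.tolist()]
print(row_sums_by_hand)                           # [21, 57, 93]

# keepdims=True keeps the collapsed axis with length 1
kept = np.sum(arr, axis=1, keepdims=True)
print(kept.shape)                                 # (3, 1)
```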
In [46]:
# Take the log of each element in an array with np.log()

np.log(two_d_array)
Out[46]:
array([[ 0.        ,  0.69314718,  1.09861229,  1.38629436,  1.60943791,
         1.79175947],
       [ 1.94591015,  2.07944154,  2.19722458,  2.30258509,  2.39789527,
         2.48490665],
       [ 2.56494936,  2.63905733,  2.7080502 ,  2.77258872,  2.83321334,
         2.89037176]])
In [47]:
# Take the square root of each element with np.sqrt()

np.sqrt(two_d_array)
Out[47]:
array([[ 1.        ,  1.41421356,  1.73205081,  2.        ,  2.23606798,
         2.44948974],
       [ 2.64575131,  2.82842712,  3.        ,  3.16227766,  3.31662479,
         3.46410162],
       [ 3.60555128,  3.74165739,  3.87298335,  4.        ,  4.12310563,
         4.24264069]])
Take the dot product of two arrays with np.dot(). This function performs an element-wise multiply and then a sum for 1-dimensional arrays (vectors) and matrix multiplication for 2-dimensional arrays.
In [48]:
# Take the vector dot product of row 0 and row 1

np.dot(two_d_array[0,0:],  # Slice row 0
       two_d_array[1,0:])  # Slice row 1
Out[48]:
217
In [49]:
# Do a matrix multiply

np.dot(small_array1, small_array1)
Out[49]:
array([[ 7, 10],
       [15, 22]])
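Both behaviors of np.dot() can be verified against simpler operations. In this sketch the two vectors are typed out by hand to match rows 0 and 1 of two_d_array, and the @ operator is shown as an equivalent way to write the matrix multiply:

```python
import numpy as np

row0 = np.array([1, 2, 3, 4, 5, 6])
row1 = np.array([7, 8, 9, 10, 11, 12])

# For vectors, the dot product is an element-wise multiply followed by a sum
dot_result = np.dot(row0, row1)
manual_result = np.sum(row0 * row1)
print(dot_result, manual_result)         # 217 217

# For 2D arrays, the @ operator performs the same matrix multiplication
small = np.array([[1, 2], [3, 4]])
matmul_result = small @ small
print(np.array_equal(np.dot(small, small), matmul_result))   # True
```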
Numpy's np.linalg module includes a variety of more advanced linear algebra functions, should you need to do things like computing eigenvectors and eigenvalues or inverting matrices.
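As a brief taste of those functions, here is a sketch using np.linalg.inv() and np.linalg.eig() on a 2x2 matrix with the same values as small_array1. The checks use np.allclose() because floating point results are only exact up to rounding:

```python
import numpy as np

small = np.array([[1, 2], [3, 4]])

inverse = np.linalg.inv(small)           # Matrix inverse
print(inverse)

# A matrix times its inverse gives the identity (up to rounding error)
is_identity = np.allclose(small @ inverse, np.identity(2))
print(is_identity)                       # True

# Each column of eigenvectors pairs with the eigenvalue at the same index
eigenvalues, eigenvectors = np.linalg.eig(small)
v = eigenvectors[:, 0]
is_eigenpair = np.allclose(small @ v, eigenvalues[0] * v)
print(is_eigenpair)                      # True
```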

Wrap Up

Numpy's ndarray data structure provides many desirable features for working with data, such as element-wise math operations and a variety of functions that work on 2D arrays. Since numpy was built with data analysis in mind, its math operations are optimized for that purpose and are generally faster than what could be achieved if you hand-coded functions to carry out similar operations on lists.
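If you'd like to check the speed claim on your own machine, a rough sketch along these lines should do. The array size and repeat count are arbitrary choices for illustration, and the timings will vary from system to system:

```python
import timeit

import numpy as np

values = list(range(100_000))
arr = np.array(values)

def double_list():
    return [x * 2 for x in values]       # Python-level loop over a list

def double_array():
    return arr * 2                       # Single vectorized numpy operation

list_seconds = timeit.timeit(double_list, number=20)
array_seconds = timeit.timeit(double_array, number=20)
print(f"list: {list_seconds:.4f}s  array: {array_seconds:.4f}s")

# Both approaches produce the same values
same_values = double_list() == double_array().tolist()
print(same_values)                       # True
```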

Numpy's arrays are great for performing calculations on numerical data, but most data sets you encounter in real life aren't homogeneous. Many data sets include a mixture of data types including numbers, text and dates, so they can't be stored in a single numpy array. In the next lesson we'll conclude our study of Python data structures with Pandas DataFrames, a powerful data container that mirrors the structure of data tables you'd find in databases and spreadsheet programs like Microsoft Excel.
