Monday, July 20, 2015

Introduction to R Part 5: Vectors


In data analysis, you typically work with large collections of related values rather than single values. As a language built for statistics and data analysis, R provides data structures designed to make it easy to operate on many values at once. R's most basic data structure is the vector. In R, a vector is a sequence of data elements of the same atomic type. You can have numeric vectors, logical vectors, character vectors and so on.
To create and store a vector with specific values, use the c() function and assign the result to a variable. c() takes a comma-separated sequence of elements as input and combines them into a vector:
In [1]:
x <- c(1,2,3)  # Create a numeric vector and assign it to x

print(x)  # Print the value of x to the screen

y <- c("Life","Is","Study")  # Create a character vector

print(y)  # Print y to the screen
[1] 1 2 3
[1] "Life"  "Is"    "Study"
You can also combine two vectors using c():
In [2]:
z <- c(x,y) # Combine vectors x and y

print(z)
[1] "1"     "2"     "3"     "Life"  "Is"    "Study"
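If you combine vectors of different types, as shown above, R automatically coerces every element to a single common type, choosing the most flexible type present. Here, the numbers were converted to their character equivalents, because any number can be written as text but not every string can be read as a number.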
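To make the conversion rule a little more concrete, here is a minimal sketch of how c() promotes mixed inputs. The values are made up for illustration, and the 2L notation (an integer literal) is an assumption about what you have seen so far:

```r
# Mixing types in c() coerces everything to the most flexible type present
typeof(c(TRUE, 2L))    # "integer": the logical is promoted to an integer
typeof(c(2L, 3.5))     # "double": the integer is promoted to a double
typeof(c(3.5, "a"))    # "character": the number is converted to text

mixed <- c(TRUE, 2L, 3.5, "a")   # hypothetical mixed input
print(mixed)                     # every element is now a string
```

The order of promotion is roughly logical, then integer, then double, then character, so a single string in the input is enough to turn the whole vector into text.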

Vector Indexing

When you create a vector, each element in the vector is assigned an index based on its position in the vector. The first element is at index position 1, the second element is at index position 2 and so on.

*Note: Unlike many other programming languages, indexes in R start at 1 instead of 0.
When you print a vector to the screen, each line starts with a number in square brackets followed by vector values. The number in square brackets indicates the index of the next value listed on that line. For large vectors, this labeling can be helpful. For instance, consider a vector consisting of 100 random numbers between 0 and 1:
In [3]:
random_data <- runif(100)  # Create a vector of 100 random numbers

print(random_data) # Print the vector
  [1] 0.98951138 0.36060598 0.52293345 0.93111059 0.37830951 0.26978045
  [7] 0.07312900 0.08493329 0.07678034 0.38847843 0.95444386 0.47645089
 [13] 0.77054965 0.84544978 0.04995671 0.41304161 0.53549779 0.80299898
 [19] 0.34913860 0.52978700 0.71755987 0.93479582 0.47863257 0.52260778
 [25] 0.17412996 0.59747831 0.96877258 0.36258665 0.76784649 0.78504509
 [31] 0.80205264 0.37319307 0.87885061 0.84263169 0.16304847 0.94930887
 [37] 0.04848707 0.43490726 0.93744907 0.17032342 0.40747934 0.58174419
 [43] 0.06049339 0.52841298 0.11072902 0.54994256 0.12636119 0.79554840
 [49] 0.61157937 0.96498210 0.18774474 0.32016232 0.40091736 0.66197116
 [55] 0.40708103 0.02948462 0.11419970 0.54868395 0.53080437 0.32640669
 [61] 0.72796294 0.15513461 0.03192753 0.59790381 0.35428897 0.89110414
 [67] 0.14578741 0.52537901 0.36899046 0.58600784 0.92900316 0.09773950
 [73] 0.38242255 0.86185601 0.60148564 0.02891465 0.02340387 0.58594287
 [79] 0.20592396 0.26822700 0.61528679 0.24261661 0.28012578 0.07183831
 [85] 0.21897548 0.41648062 0.55803412 0.43598032 0.54281299 0.02594235
 [91] 0.87916179 0.16172390 0.51564113 0.45867149 0.65932987 0.09261303
 [97] 0.47324894 0.22407506 0.08054340 0.37595069
With 100 values, the index counters on the left-hand side make it easy to gauge the vector's size and to locate a particular element.
You can access a specific value in a vector by typing the name of the vector and then wrapping the index associated with the value you want to access in square brackets:
In [4]:
random_data[7]  # Get the value at index 7
Out[4]:
0.0731289994437248
Attempting to access an index that doesn't exist returns NA. NA denotes a missing value.
In [5]:
random_data[200]
Out[5]:
[1] NA
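Since out-of-range lookups return NA instead of raising an error, it can help to check for missing values explicitly. A small sketch, using a made-up vector v and the base function is.na():

```r
v <- c(10, 20, 30)        # hypothetical three-element vector

out_of_range <- v[5]      # index 5 does not exist, so the result is NA
is.na(out_of_range)       # TRUE

v[c(1, 5)]                # 10 NA: valid and invalid indices can be mixed
```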
You can access ranges of values by placing a colon between the starting and ending indices of the range:
In [6]:
subset1 <- random_data[7:14]  # Get values from index 7 to 14

print(subset1)
[1] 0.07312900 0.08493329 0.07678034 0.38847843 0.95444386 0.47645089 0.77054965
[8] 0.84544978
You can also access an arbitrary subset of values by placing a vector of indices inside the square brackets:
In [7]:
subset2 <- random_data[c(1,10,100)]  # Get the first, tenth and 100th values

print(subset2)
[1] 0.01105221 0.10679006 0.14817441
A subset of a vector is just a shorter vector. In fact, single values are technically vectors of length 1, so all of the values we've used up to now were vectors all along! You can check the length of a vector with the length() function:
In [8]:
length(10)  # A single value is a vector of length 1

length(random_data)
Out[8]:
1
Out[8]:
100
Here are a few other useful ways to index into vectors:
In [9]:
# Adding a minus sign excludes a given index:

y <- c("Life","Is","Study")
y <- y[-2]                   # Exclude index 2
print(y)

# A minus sign can also exclude a given range of indices:

random_data <- runif(50)                # Generate 50 random numbers
random_data_sub <- random_data[-(2:49)] # Exclude the range 2 through 49
print(random_data_sub)
[1] "Life"  "Study"
[1] 0.9310588 0.9343361
You can also index a vector with a logical vector of the same length. In this case, the subset is created from each index where the corresponding logical vector is TRUE. Indexing with a logical vector is a common way to filter a numeric or character vector for values that fulfill certain criteria:
In [10]:
# Create a logical vector identifying values over 0.5 in random_data

logical_over_half <- (random_data > 0.5)
print(logical_over_half)
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE
[13]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE
[37] FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[49]  TRUE  TRUE
In [11]:
# Use the logical vector to create a subset of the values over 0.5
over_half <- random_data[logical_over_half]

print(over_half)
 [1] 0.9310588 0.6005056 0.8426753 0.6333653 0.6794359 0.6671321 0.6056376
 [8] 0.7295930 0.7253316 0.9647560 0.5116528 0.5139600 0.6368360 0.9285075
[15] 0.9497309 0.5536867 0.8586659 0.7062961 0.6980738 0.6865596 0.7924722
[22] 0.8511161 0.7837691 0.9343361
In [12]:
# Use the logical vector and the not symbol (!) to get values under 0.5

under_half <- random_data[!logical_over_half]

print(under_half)
 [1] 0.112455658 0.122097557 0.444164348 0.256208222 0.471102004 0.085661487
 [7] 0.490340223 0.374135340 0.463295560 0.008141117 0.348802592 0.356524162
[13] 0.151246005 0.389616975 0.140467283 0.428449486 0.499888984 0.229278246
[19] 0.316494073 0.392523017 0.414177036 0.449876752 0.166306581 0.219622926
[25] 0.194583106 0.360589221
In [13]:
# You can perform logical indexing all in one step:

random_data[random_data > 0.5]
Out[13]:
  1. 0.931058811955154
  2. 0.600505637004972
  3. 0.842675290303305
  4. 0.633365277899429
  5. 0.679435939760879
  6. 0.667132148053497
  7. 0.605637565720826
  8. 0.729593022493646
  9. 0.725331601221114
  10. 0.964756017550826
  11. 0.511652782326564
  12. 0.51396003481932
  13. 0.636835966492072
  14. 0.92850752780214
  15. 0.949730948777869
  16. 0.553686668165028
  17. 0.858665884938091
  18. 0.706296145915985
  19. 0.69807382975705
  20. 0.686559553490952
  21. 0.792472184635699
  22. 0.851116125239059
  23. 0.78376911743544
  24. 0.934336145874113
In [14]:
# You can also use more complicated logical expressions.
# In this case we grab all values between 0.4 and 0.6:

random_data[(random_data < 0.6) & (random_data > 0.4)]
Out[14]:
  1. 0.444164347834885
  2. 0.511652782326564
  3. 0.471102004405111
  4. 0.51396003481932
  5. 0.490340223303065
  6. 0.463295560330153
  7. 0.428449485916644
  8. 0.49988898425363
  9. 0.553686668165028
  10. 0.414177035912871
  11. 0.449876751983538
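Logical vectors are handy for more than subsetting. As a rough sketch (the scores vector and the is_high name are invented for this example), the base functions which(), sum() and mean() can tell you where the matches are and how many there are:

```r
scores <- c(0.2, 0.7, 0.55, 0.1, 0.9)   # hypothetical data
is_high <- scores > 0.5                 # FALSE TRUE TRUE FALSE TRUE

which(is_high)   # 2 3 5: the positions where the condition holds
sum(is_high)     # 3: TRUE counts as 1, so this is the number of matches
mean(is_high)    # 0.6: the proportion of values over 0.5
```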
Finally, you can use %in% to create a subset of elements that are contained within some other vector:
In [15]:
my_vector <- c("a","b","c","d","a","a","f")

my_vector[my_vector %in% c("a","c")]
Out[15]:
  1. "a"
  2. "c"
  3. "a"
  4. "a"
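It may help to see that %in% itself just returns a logical vector, which means it can be negated like any other logical index. A short sketch reusing my_vector from above (the keep variable is only for illustration):

```r
my_vector <- c("a","b","c","d","a","a","f")

keep <- my_vector %in% c("a","c")   # one TRUE/FALSE per element of my_vector
print(keep)                         # TRUE FALSE TRUE FALSE TRUE TRUE FALSE

my_vector[!keep]                    # "b" "d" "f": everything not in the set
```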

Vectorized Operations

One of the biggest benefits of R is that it is built around performing operations on vectors. Many R functions and operations behave in a "vectorized" manner, meaning they act upon each element of a vector individually and return the results in a new vector. Vectorized operations simplify the process of performing the same calculations on related data. All the basic operators and functions we've learned so far that operate on single values also work on vectors with more than one element.
In [16]:
example_vector <- c(1,2,3)

# + adds to each value in the vector
example_vector + 10

# - performs subtraction on each value
example_vector - 10
Out[16]:
  1. 11
  2. 12
  3. 13
Out[16]:
  1. -9
  2. -8
  3. -7
Other math operators like *, /, ^ and %% work the same way, as do functions like round(), floor() and ceiling():
In [17]:
example_vector2 <- c(1.6, 2.5, 3.5)

round(example_vector2)
      
floor(example_vector2)
Out[17]:
  1. 2
  2. 2
  3. 4
Out[17]:
  1. 1
  2. 2
  3. 3
Vectorized operations make it easy to transform vectors quickly without writing programming constructs like for and while loops (we'll discuss those later). Operations that involve two or more vectors are typically executed element-wise. For example, if you take two numeric vectors of the same length and add them, the result is a new vector containing the sums of the values at each index:


In [18]:
vector1 <- c(1,2,3,4)
vector2 <- c(10,20,30,40)

print( vector1+vector2 )
[1] 11 22 33 44
In [19]:
# Other math operations also work in this way:

vector1*vector2  # Element-wise multiplication

vector1/vector2  # Element-wise division

vector1 %% vector2  # Element-wise modulus
Out[19]:
  1. 10
  2. 40
  3. 90
  4. 160
Out[19]:
  1. 0.1
  2. 0.1
  3. 0.1
  4. 0.1
Out[19]:
  1. 1
  2. 2
  3. 3
  4. 4
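The examples so far use vectors of equal length. When the lengths differ, R reuses ("recycles") the shorter vector to match the longer one, which is the same mechanism that made example_vector + 10 work earlier. A minimal sketch with made-up vectors:

```r
short <- c(1, 2)
long  <- c(10, 20, 30, 40)

long + short   # 11 22 31 42: short is reused as 1, 2, 1, 2
long * 2       # 20 40 60 80: a single value is recycled across every element
```

Recycling is silent when the longer length is a multiple of the shorter one, so it is worth double-checking vector lengths when a result looks odd.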
In [20]:
# If you want a vector inner product, use %*%

vector1 %*% vector2
Out[20]:
300
*Note: An inner product is the sum of the element-wise products of two vectors, so it always reduces to a single value. In R, %*% returns that value as a 1x1 matrix rather than a plain numeric vector.
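To connect the definition to the operators we already know, this sketch computes the same inner product by hand with sum() and compares it to %*%. The variable name ip is just for illustration:

```r
vector1 <- c(1,2,3,4)
vector2 <- c(10,20,30,40)

sum(vector1 * vector2)      # 300: multiply element-wise, then add everything up

ip <- vector1 %*% vector2   # the same calculation with the matrix operator
dim(ip)                     # 1 1: the result is stored as a 1x1 matrix
as.numeric(ip)              # 300: convert back to an ordinary number
```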
Vectorized operations also work on character vectors. Consider the function paste(), which takes two or more objects as input and concatenates them into a character vector, separating the pieces with a space by default. If you pass paste() character vectors longer than length 1, it combines them element-wise:
In [21]:
x <- c("Life","Is","Study")
y <- c("Blogging","Is","Fun")

paste(x,y)
Out[21]:
  1. "Life Blogging"
  2. "Is Is"
  3. "Study Fun"
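paste() has a couple of arguments worth knowing about. The sketch below shows sep, which changes the separator, and collapse, which joins the elements of a vector into one string, along with the related base function paste0(); the "item" labels are made up:

```r
x <- c("Life","Is","Study")
y <- c("Blogging","Is","Fun")

paste(x, y, sep = "-")     # "Life-Blogging" "Is-Is" "Study-Fun"
paste(x, collapse = " ")   # "Life Is Study": one string instead of three
paste0("item", 1:3)        # "item1" "item2" "item3": no separator at all
```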
The data type conversion functions we discussed in the atomic data types section also work on longer vectors.
In [22]:
x <- c(1,2,3)
print(x)
typeof(x)

x <- as.character(x)
print(x)
typeof(x)
[1] 1 2 3
Out[22]:
"double"
[1] "1" "2" "3"
Out[22]:
"character"

Generating Vectors

Creating vectors by hand with the c() function works fine for short vectors, but it becomes cumbersome quickly when you're working with longer vectors. R includes a variety of convenience functions to generate vectors.
You can generate all whole numbers in a range using a colon:
In [23]:
x <- 1:20 
print(x)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
You can also generate sequences using the seq() function. seq() takes the arguments from, to and by, which specify the starting point, the stopping point and the size of the increment:
In [24]:
y <- seq(from = 1, to = 20, by = 1)
print(y)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
In [25]:
z <- seq(0, 100, 10)   # You can omit the argument names
print(z)
 [1]   0  10  20  30  40  50  60  70  80  90 100
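seq() can do a bit more than count upward in fixed steps. As a quick sketch, the length.out argument asks for a given number of evenly spaced values, a negative by counts down, and the related base function seq_along() generates the index positions of an existing vector:

```r
seq(0, 1, length.out = 5)     # 0.00 0.25 0.50 0.75 1.00
seq(10, 1, by = -3)           # 10 7 4 1
seq_along(c("a","b","c"))     # 1 2 3
```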
Use rep() to create a vector of the same value repeated a specified number of times:
In [26]:
r <- rep(x=1, times=20)
print(r)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
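rep() also accepts a whole vector as its first argument. This sketch shows the difference between times, which repeats the vector as a unit, and each, which repeats every element in place:

```r
rep(c("a","b"), times = 3)      # "a" "b" "a" "b" "a" "b"
rep(c("a","b"), each = 3)       # "a" "a" "a" "b" "b" "b"
rep(c(1, 2), times = c(2, 3))   # 1 1 2 2 2: a separate count per element
```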
As we saw earlier, you can use the runif() function to draw random values from a specified range:
In [27]:
x <- runif(n=20, min=0, max=100)
print(x)
 [1] 22.64281 78.29759 90.88900 15.19337 56.41127 45.45043 40.37305 70.48884
 [9] 81.19156 67.32956 71.67856 65.54791 32.59473 99.81622 65.74992 84.84781
[17] 99.12712 60.38651 70.27734 40.84362
The function runif() draws numbers from a uniform distribution, so all values within the range are equally likely. R also has functions for drawing random numbers from other types of distributions, such as rnorm() for the normal distribution, rexp() for the exponential distribution and rbinom() for the binomial distribution. We won't go into these any further right now, but suffice it to say R is very useful if you have to deal with probability distributions.
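For a taste of those other functions, here is a minimal sketch that draws five values from each distribution. The parameter values are arbitrary, and the set.seed() call is an addition not used elsewhere in this post; it fixes the random number generator so that repeated runs give the same draws:

```r
set.seed(42)   # arbitrary seed, chosen only to make the draws repeatable

normal_draws <- rnorm(n = 5, mean = 0, sd = 1)         # normal distribution
exp_draws    <- rexp(n = 5, rate = 2)                  # exponential distribution
binom_draws  <- rbinom(n = 5, size = 10, prob = 0.5)   # successes out of 10 trials

print(normal_draws)
print(exp_draws)
print(binom_draws)
```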
You can accomplish a surprising amount in R using only vectors and vector commands in the console, but real-world data is usually structured in two-dimensional tables. Next time we'll learn about R's simplest multi-dimensional object, the matrix.
