Wednesday, August 19, 2015

Introduction to R Part 21: Descriptive Statistics


Descriptive statistics are measures that summarize important features of data, often with a single number. Producing descriptive statistics is a common first step to take after cleaning and preparing a data set for analysis. We've already seen several examples of descriptive statistics in earlier lessons, such as means and medians. In this lesson, we'll review some of these functions and explore several new ones.

Measures of Center

Measures of center are statistics that give us a sense of the "middle" of a numeric variable. In other words, centrality measures give you a sense of a typical value you'd expect to see. Common measures of center include the mean, median and mode.
The mean is simply an average: the sum of the values divided by the total number of records. As we've seen before, there are several ways to get means in R:
In [1]:
cars <- mtcars      # Use the mtcars data set

mean(cars$mpg)      # mean() gets the mean for 1 variable
Out[1]:
20.090625
In [2]:
print(colMeans(cars))     # colMeans() gets the means for all columns in a data frame
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 
In [3]:
print(rowMeans(cars))     # rowMeans() gets the means for all rows in a data frame
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
           29.90727            29.98136            23.59818            38.73955 
  Hornet Sportabout             Valiant          Duster 360           Merc 240D 
           53.66455            35.04909            59.72000            24.63455 
           Merc 230            Merc 280           Merc 280C          Merc 450SE 
           27.23364            31.86000            31.78727            46.43091 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
           46.50000            46.35000            66.23273            66.05855 
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
           65.97227            19.44091            17.74227            18.81409 
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
           24.88864            47.24091            46.00773            58.75273 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
           57.37955            18.92864            24.77909            24.88027 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
           60.97182            34.50818            63.15545            26.26273 
The median of a distribution is the value where 50% of the data lies below it and 50% lies above it. In essence, the median splits the data in half. The median is also known as the 50th percentile, since 50% of the observations are found below it. As we've seen previously, you can get the median using the median() function:
In [4]:
median(cars$mpg)
Out[4]:
19.2
To get the median of every column, we can use the apply() function which takes a data object, a function to execute, and a specified margin (rows or columns):
In [5]:
colMedians <- apply(cars,            
                    MARGIN=2,        # Operate on columns
                    FUN = median)    # Use function median

print(colMedians)
    mpg     cyl    disp      hp    drat      wt    qsec      vs      am    gear 
 19.200   6.000 196.300 123.000   3.695   3.325  17.710   0.000   0.000   4.000 
   carb 
  2.000 
Although the mean and median both give us some sense of the center of a distribution, they aren't always the same. The median always gives us a value that splits the data into two halves, while the mean is a numeric average, so extreme values can have a significant impact on it. In a symmetric distribution, the mean and median will be the same. Let's investigate with a density plot:
In [6]:
norm_data <- rnorm(100000)              # Generate normally distributed data

plot(density(norm_data))                # Create a density plot

abline(v=mean(norm_data), lwd=5)        # Plot a thick black line at the mean

abline(v=median(norm_data), col="red", lwd=2 )   # Plot a red line at the median


In the plot above, the mean and median are both so close to zero that the red median line lies on top of the thicker black line drawn at the mean.
In skewed distributions, the mean tends to get pulled in the direction of the skew, while the median tends to resist the effects of skew:
In [7]:
skewed_data <- rexp(100000,1)              # Generate skewed data

plot(density(skewed_data), xlim=c(0,4))    

abline(v=mean(skewed_data), lwd=5)         

abline(v=median(skewed_data), col="red", lwd=2 )   


The mean is also influenced heavily by outliers while the median resists the influence of outliers:
In [8]:
norm_data <- rnorm(50)             # Generate 50 normally distributed points

outliers <- rnorm(3, mean=15)      # Generate 3 outliers

norm_data <- c(norm_data,outliers)      # Add outliers

plot(density(norm_data))                

abline(v=mean(norm_data), lwd=5)        

abline(v=median(norm_data), col="red", lwd=2 )   


Since the median tends to resist the effects of skewness and outliers, it is known as a "robust" statistic. The median generally gives a better sense of the typical value in a distribution with significant skew or outliers.
The mode of a variable is simply the value that appears most frequently. Unlike the mean and median, you can take the mode of a categorical variable, and it is possible to have multiple modes. R does not include a function to find the mode, since it is not always a particularly useful statistic: oftentimes all the values in a variable are unique, so the mode is essentially meaningless. You can find the mode of a variable by creating a table of counts for the variable and then getting the value with the largest count:
In [9]:
data <- c("cat","hat","cat","hat","hat","sat")   # Dummy data

data_table <- table(data)                        # Create table of counts

print(data_table)

max_index <- which.max(data_table)   # Get the index of the variable with the max count

names(data_table)[max_index]     # Use the index to get the mode from the table's names
data
cat hat sat 
  2   3   1 
Out[9]:
"hat"
If you need to repeatedly find the mode, you could wrap these steps into a user-defined function:
In [10]:
mode_function <- function(data){                         # Define new function
    data_table <- table(data)                            # Create data table
    max_index <- which.max(data_table)                   # Find max index
    if (is.numeric(data)){                               # If input is numeric data
        return(as.numeric(names(data_table)[max_index])) # Return output as numeric
    }
    names(data_table)[max_index]                # Otherwise return output as character
}

mode_function(data)
Out[10]:
"hat"
*Note: Base R contains a function called mode() but it does not find the mode of a data set: it checks the type or storage mode of an object.
Let's use our new mode function to find the modes of each column of the mtcars data set by passing it into the apply() function:
In [11]:
colModes <- apply(cars,            
                 MARGIN=2,               # operate on columns
                 FUN = mode_function)    # use function mode_function

print(colModes)
   mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
 10.40   8.00 275.80 110.00   3.07   3.44  17.02   0.00   0.00   3.00   2.00 

Measures of Spread

Measures of spread (dispersion) are statistics that describe how data varies. While measures of center give us an idea of the typical value, measures of spread give us a sense of how much the data tends to diverge from the typical value.
One of the simplest measures of spread is the range. Range is the distance between the maximum and minimum observations:
In [12]:
max(cars$mpg) - min(cars$mpg)     # Subtract min from max to get the range
Out[12]:
23.5
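As an aside, base R also has a range() function, which returns the minimum and maximum as a vector rather than the distance between them; combining it with diff() gives the same result:

```r
cars <- mtcars               # Use the mtcars data set, as above

range(cars$mpg)              # range() returns c(min, max), not the distance

diff(range(cars$mpg))        # diff() of that pair gives max - min: 23.5
```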
As noted earlier, the median represents the 50th percentile of a data set. A summary of several percentiles can be used to describe a variable's spread. We can extract the minimum value (0th percentile), first quartile (25th percentile), median, third quartile (75th percentile) and maximum value (100th percentile) using the quantile() function:
In [13]:
quantile(cars$mpg)
Out[13]:
    0%    25%    50%    75%   100% 
10.400 15.425 19.200 22.800 33.900 
Since these values are so commonly used to describe data, they are known as the "five number summary", and R has a couple of other ways to find them:
In [14]:
fivenum(cars$mpg)   # Get five number summary

summary(cars$mpg)   # Summary() shows the five number summary plus the mean
Out[14]:
[1] 10.40 15.35 19.20 22.80 33.90
Out[14]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.42   19.20   20.09   22.80   33.90 
The quantile() function also lets you check percentiles other than the common ones that make up the five number summary. To find other percentiles, pass a vector of probabilities to the probs argument:
In [15]:
quantile(cars$mpg,
        probs = c(0.1,0.9))  # get the 10th and 90th percentiles
Out[15]:
  10%   90% 
14.34 30.09 
The interquartile range (IQR) is another common measure of spread. The IQR is the distance between the 3rd quartile and the 1st quartile, which encompasses half the data. R has a built-in IQR() function:
In [16]:
IQR(cars$mpg)
Out[16]:
7.375
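Since the IQR is just the distance between the 75th and 25th percentiles, we can verify it ourselves with quantile() (a quick sanity check, not a replacement for IQR()):

```r
cars <- mtcars                              # Use the mtcars data set, as above

quartiles <- quantile(cars$mpg,             # Get the 1st and 3rd quartiles
                      probs = c(0.25, 0.75))

unname(quartiles[2] - quartiles[1])         # 22.800 - 15.425 = 7.375, matching IQR()
```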
The boxplots we learned to create in the lessons on plotting are just visual representations of the five number summary and IQR:
In [17]:
five_num <- fivenum(cars$mpg)

boxplot(cars$mpg)

text(x=five_num[1], adj=2, labels ="Minimum")
text(x=five_num[2], adj=2.3, labels ="1st Quartile")
text(x=five_num[3], adj=3, labels ="Median")
text(x=five_num[4], adj=2.3, labels ="3rd Quartile")
text(x=five_num[5], adj=2, labels ="Maximum")
text(x=five_num[3], adj=c(0.5,-8), labels ="IQR", srt=90, cex=2)


Variance and standard deviation are two other common measures of spread. The variance of a distribution is the average of the squared deviations (differences) from the mean. Use the built-in function var() to check variance:
In [18]:
var(cars$mpg)   # get variance
Out[18]:
36.3241028225806
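We can reproduce this by hand to see what var() is doing. Note that R computes the sample variance, dividing by n - 1 rather than n:

```r
cars <- mtcars                                 # Use the mtcars data set, as above

deviations <- cars$mpg - mean(cars$mpg)        # Deviations from the mean

sum(deviations^2) / (length(cars$mpg) - 1)     # Sum of squared deviations over n - 1
```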
The standard deviation is the square root of the variance. Standard deviation can be more interpretable than variance, since the standard deviation is expressed in terms of the same units as the variable in question while variance is expressed in terms of units squared. Use sd() to check the standard deviation:
In [19]:
sd(cars$mpg)    # get standard deviation
Out[19]:
6.0269480520891
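Since the standard deviation is defined as the square root of the variance, sqrt(var()) should return the same value as sd():

```r
cars <- mtcars            # Use the mtcars data set, as above

sqrt(var(cars$mpg))       # Square root of the variance equals sd(cars$mpg)
```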
Since variance and standard deviation are both derived from the mean, they are susceptible to the influence of data skew and outliers. Median absolute deviation is an alternative measure of spread based on the median, which inherits the median's robustness against the influence of skew and outliers. Use the built-in mad() function to find median absolute deviation:
In [20]:
mad(cars$mpg)    # get median absolute deviation
Out[20]:
5.41149
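Under the hood, mad() takes the median of the absolute deviations from the median and, by default, multiplies it by the constant 1.4826 so the result is comparable to the standard deviation for normally distributed data:

```r
cars <- mtcars                                        # Use the mtcars data set, as above

abs_deviations <- abs(cars$mpg - median(cars$mpg))    # Absolute deviations from the median

median(abs_deviations) * 1.4826                       # mad()'s default scale constant
```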

Skewness and Kurtosis

Beyond measures of center and spread, descriptive statistics include measures that give you a sense of the shape of a distribution. Skewness measures the skew or asymmetry of a distribution while kurtosis measures the "peakedness" of a distribution. We won't go into the exact calculations behind skewness and kurtosis, but they are essentially just statistics that take the idea of variance a step further: while variance involves squaring deviations from the mean, skewness involves cubing deviations from the mean and kurtosis involves raising deviations from the mean to the 4th power.
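To make that idea concrete, here is a rough sketch of the moment-based formulas using base R only. This is the simplest textbook version; e1071's skewness() and kurtosis() apply small-sample adjustments by default, so their results can differ slightly:

```r
cars <- mtcars                          # Use the mtcars data set, as above
x <- cars$mpg
n <- length(x)

m2 <- sum((x - mean(x))^2) / n          # 2nd central moment (squared deviations)
m3 <- sum((x - mean(x))^3) / n          # 3rd central moment (cubed deviations)
m4 <- sum((x - mean(x))^4) / n          # 4th central moment (4th-power deviations)

m3 / m2^(3/2)                           # Moment-based skewness (positive = right skew)
m4 / m2^2 - 3                           # Moment-based excess kurtosis
```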
To check skewness and kurtosis, we'll need to download the "e1071" package:
In [21]:
# install.packages("e1071") # Uncomment to install e1071

library(e1071)
After loading the package, you'll have access to the skewness() and kurtosis() functions. First let's create some dummy data and inspect it:
In [22]:
normal_data <- rnorm(100000)                       # Generate normally distributed data
skewed_data <- c(rnorm(35000,sd=2)+2,rexp(65000))  # Generate skewed data
uniform_data <- runif(100000,0,1)                  # Generate uniformly distributed data
peaked_data <- c(rexp(100000),                     # Generate data with a sharp peak
                (rexp(100000)*-1))


par(mfrow=c(2,2))                          # Make density plots of the distributions*
plot(density(normal_data))
plot(density(skewed_data),xlim=c(-5,5))
plot(density(uniform_data))
plot(density(peaked_data),xlim=c(-5,5))


*Note: par() lets you set various graphical parameters. In this case, mfrow=c(2,2) lets us combine 4 plots into a single plot with 2 rows and 2 columns.
Now let's check the skewness of each of the distributions. Since skewness measures asymmetry, we'd expect low skewness for every distribution except the second, since the others are roughly symmetric:
In [23]:
skewness(normal_data)
skewness(skewed_data)
skewness(uniform_data)
skewness(peaked_data)
Out[23]:
-0.0016740449062351
Out[23]:
1.00267137516464
Out[23]:
-0.0110650694273595
Out[23]:
-0.0335772340390091
The 3 roughly symmetric distributions have almost zero skewness, while the positively skewed distribution has positive skewness.
Now let's check kurtosis. Since kurtosis measures peakedness, we'd expect the flat (uniform) distribution to have low kurtosis while the distributions with sharper peaks should have higher kurtosis.
In [24]:
kurtosis(normal_data)
kurtosis(skewed_data)
kurtosis(uniform_data)
kurtosis(peaked_data)
Out[24]:
0.00170149627159111
Out[24]:
1.81118003887264
Out[24]:
-1.19658472311693
Out[24]:
3.01976189137365
As we can see from the output, the normally distributed data has a kurtosis near zero, the flat distribution has negative kurtosis and the two pointier distributions have positive kurtosis.

Wrap Up

Descriptive statistics help you explore features of your data, like center, spread and shape by summarizing them with numerical measurements. Descriptive statistics help inform the direction of an analysis and let you communicate your insights to others quickly and succinctly. In addition, certain values, like the mean and variance, are used in all sorts of statistical tests and predictive models.
In this lesson we generated a lot of random data to illustrate concepts, but we haven't actually learned much about the functions we've been using to generate random data. In the next lesson, we'll learn about probability distributions, including how to draw random data from them.
