Tuesday, August 18, 2015

Introduction to R Part 20: Plotting with ggplot2


Last lesson we took a whirlwind tour of R's built-in plotting functions. Built-in functions can take you a long way, but a dedicated plotting package gives you access to more advanced plotting capabilities and nicer aesthetics. The ggplot2 package is a popular graphics library in R that lets you take your plots to the next level.
First, let's install and load ggplot2:
In [1]:
# install.packages("ggplot2")   # Uncomment and run if you need ggplot2

library("ggplot2")

ggplot2 Basics and qplot()

The ggplot2 package is based on the principle that all plots consist of a few basic components: data, a coordinate system and a visual representation of the data. In ggplot2, you build plots incrementally, starting with the data and coordinates you want to use and then specifying the graphical features: lines, points, bars, color, etc.
The ggplot2 package has two plotting functions: qplot() (quick plot) and ggplot() (grammar of graphics plot). The qplot() function is similar to the base R plot() function in that it only requires a single function call and it can create several different types of plots. qplot() can be useful for quick plotting, but it doesn't allow for as much flexibility as ggplot().
We are not going to spend much time on qplot(), since the ggplot() syntax is at the heart of the package. Let's look at one qplot for illustrative purposes and then move on:
In [2]:
qplot(x = carat,                            # x variable
      y = price,                            # y variable
      data = diamonds,                      # Data set
      geom = "point",                       # Plot type
      color = clarity,                      # Color points by variable clarity
      xlab = "Carat Weight",                # x label
      ylab = "Price",                       # y label
      main = "Diamond Carat vs. Price");    # Title


Using ggplot()

The ggplot() function creates plots incrementally in layers. Every ggplot starts with the same basic syntax:
In [3]:
ggplot(data=diamonds,             # call to ggplot() and data frame to work with
      aes(x=carat, y=price))      # aesthetics to assign
Error: No layers in plot
In the code above, we specify the data we want to work with and then assign the variables of interest, carat and price, to the x and y values of the plot. "aes()" is an aesthetics wrapper used in ggplot to map variables to visual properties. When you want a visual property to change based on the value of a variable, that specification belongs inside an aes() wrapper. If you are setting a fixed value that doesn't change based on a variable, it belongs outside of aes().
Note that running the code above didn't actually produce a plot. When you use the ggplot() syntax, the call to ggplot() initializes the plot, but no data is drawn until you add a visual layer. (Recent versions of ggplot2 draw an empty panel with axes here instead of raising the error shown above.) Let's add a layer of points to the plot using geom_point():
In [4]:
ggplot(data=diamonds, aes(x=carat, y=price)) +  # Initialize plot* 
      geom_point()                           # Add a layer of points (make scatterplot)


*Note: Add a new element to a plot by putting a "+" after the preceding element.
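Because a plot is built up with "+", the initialized plot can also be stored in a variable and extended later. The snippet below is a minimal sketch of that idea; the variable names (base_plot, scatter, smoothed) are made up for illustration and are not part of the lesson's code.

```r
library(ggplot2)

# Store the initialized plot (data + aesthetic mappings) in a variable
base_plot <- ggplot(data = diamonds, aes(x = carat, y = price))

# Add different layers to the same starting point
scatter  <- base_plot + geom_point()
smoothed <- base_plot + geom_point() + geom_smooth()

# A stored plot is only drawn when it is printed
print(scatter)
```

Storing the base plot this way saves retyping the data and aes() arguments when you want to try several layer combinations.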
The layers you add determine the type of plot you create. In this case, we used geom_point() which simply draws the data as points at the specified x and y coordinates, creating a scatterplot. ggplot2 has a wide range of geoms to create different types of plots. Here is a list of geoms for all the plot types we covered in the last lesson, plus a few more:
In [5]:
geom_histogram()  # histogram
geom_density()    # density plot
geom_boxplot()    # boxplot
geom_violin()     # violin plot (combination of boxplot and density plot)
geom_bar()        # bar graph
geom_point()      # scatterplot
geom_jitter()     # scatterplot with points randomly perturbed to reduce overlap
geom_line()       # line graph
geom_errorbar()   # Add error bar
geom_smooth()     # Add a best-fit line
geom_abline()     # Add a line with specified slope and intercept
Notice the scatterplot we made above didn't have the nice coloring we had in the qplot(). We could have assigned colors to the points based on the clarity variable by adding an aesthetics mapping when we added the geom_point() layer:
In [6]:
ggplot(data=diamonds, aes(x=carat, y=price)) +  # Initialize plot 
       geom_point(aes(color = clarity))         # Add color based on clarity


Note how many of the data points overlap. One way to get a better sense of overlapping data is to make the data points partially transparent. You can specify transparency with the alpha parameter:
In [7]:
ggplot(data=diamonds, aes(x=carat, y=price)) +          # Initialize plot 
       geom_point(aes(color = clarity), alpha = 0.1)    # Add transparency


*Note: We pass alpha in as an argument outside of the aes() mapping because we are setting alpha to a fixed value instead of mapping it to a variable.
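The same inside-versus-outside rule applies to every aesthetic, not just alpha. As a rough illustration (the plot variable names are hypothetical), the sketch below contrasts setting a fixed color, mapping color to a variable, and the common mistake of putting a constant inside aes():

```r
library(ggplot2)

# Setting: color is a fixed value, so it goes outside aes()
set_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue")

# Mapping: color varies with a variable, so it goes inside aes()
mapped_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = clarity))

# Common mistake: a constant inside aes() is treated as data (a variable
# with the single value "blue"), so the points get a default scale color
# and a legend rather than actually being drawn in blue
mistake_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = "blue"))
```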
By setting alpha to 0.1, each data point has 90% transparency. At such high transparency, single data points are hard to see, but it lets us focus on high density areas. Let's focus in on the higher density areas even further by limiting the range of the x-axis to 0 through 2.5 carats.
In [8]:
ggplot(data=diamonds, aes(x=carat, y=price)) +  # Initialize plot 
       geom_point(aes(color = clarity), alpha = 0.1)  +  # Add transparency
       xlim(0,2.5)                                       # Specify x-axis range


Note that xlim() and ylim() delete any points that lie outside the specified plot range, which can result in warning messages.
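If you want to zoom in without discarding data, ggplot2 also provides coord_cartesian(), which is not used elsewhere in this lesson. The sketch below (with hypothetical variable names) shows both approaches side by side. For a simple layer of points the visible result is the same, but for layers that compute statistics, such as geom_smooth() or geom_density(), dropping rows first can change the result.

```r
library(ggplot2)

# xlim() removes values outside the range before the plot is drawn
dropped <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  xlim(0, 2.5)

# coord_cartesian() keeps every row and only zooms the visible window
zoomed <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  coord_cartesian(xlim = c(0, 2.5))
```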

More Plot Examples

Now that we know the basics of creating plots with ggplot(), let's remake some of the plots we created last time and see how they look in ggplot2, starting with a histogram:
In [9]:
# Create a histogram of carat

ggplot(data=diamonds, aes(x=carat)) +      # Initialize plot 

       geom_histogram(fill="skyblue",      # Create histogram with blue bars
                      col="black",         # Set bar outline color to black
                      binwidth = 0.05) +   # Set bin width

       xlim(0,3)                           # Add x-axis limits


The scales and background on ggplot2 look a bit nicer than the base plotting functions.
Now let's make side by side boxplots split on clarity with an extra twist: let's also include the data points as a layer behind the boxplots. We can add the points with geom_jitter(). geom_jitter() is similar to geom_point() except that it adds a little random variation (jitter) that spreads data points apart so they don't overlap as much as they would otherwise. In the case of a boxplot, jitter spreads data points horizontally so we should notice thick bands of points at common carat sizes.
In [10]:
# Create boxplot of carat split on clarity with points added

ggplot(data=diamonds, aes(x=clarity, y=carat)) +  # Initialize plot 

       geom_jitter(alpha=0.05,          # Add jittered data points with transparency
                    color="yellow") +   # Set data point color

       geom_boxplot(outlier.shape=1,     # Create boxplot and set outlier shape
                    alpha = 0  )         # Make inner boxplot area transparent


Adding jittered data points and then drawing the boxplots on top of them gives us a better sense of the distributions than boxplots alone. We can clearly see bands of data points at certain carat sizes like 1, 1.5 and 2. Let's investigate the distributions further by creating a violin plot:
In [11]:
# Create violin plot of carat split on clarity

ggplot(data=diamonds, aes(x=clarity, y=carat)) +   # Initialize plot 

       geom_violin(aes(color=clarity, fill=clarity), # Make violin plot with color
                alpha = 0.25)              # Make inner plot area partially transparent


A violin plot is a mixture of a boxplot and a density plot. The shape of each violin gives us a sense of where the bulk of the data is clustered.
Now let's make a grouped barplot that looks like the one we made last time:
In [12]:
ggplot(data=diamonds, aes(x=clarity)) +        # Initialize plot 

       geom_bar(aes(fill=color),        # Create bar plot, fill based on diamond color
                color="black",                 # Set bar outline color
                position="dodge") +            # Place bars side by side

       scale_fill_manual(values=c("#FFFFFF","#F5FCC2",     # Use custom fill colors
        "#E0ED87","#CCDE57", "#B3C732","#94A813","#718200"))


The syntax for ggplot is a little more verbose than base R plotting, but the result is a crisper plot with helpful gridlines. The logical and incremental ggplot2 syntax also gives you finer-grained control over your plots.
Now let's make a density plot of carat weight. Instead of making a simple density curve like we did last lesson, let's make a stacked density plot that sections the density curve based on diamond cut:
In [13]:
ggplot(data=diamonds, aes(x=carat)) +       # Initialize plot 
        xlim(0,2.5)                 +       # Limit the x-axis*

        geom_density(position="stack",      # Create a stacked density chart
                     aes(fill=cut),         # Fill based on cut
                     alpha = 0.5)           # Set transparency


*Note: limiting the x-axis eliminated some values.
Stacked density charts can be a little messy, but the plot gives us a sense of how diamond cuts vary based on the size of the diamond. It appears that ideal cut diamonds tend to be small while larger diamonds are more likely to have low cut grades.
Finally, let's remake the line plot we created last time using ggplot2:
In [14]:
years <- seq(1950,2015,1)                         # Create some dummy data
readings <- (years-1900) + runif(66,0,20)
data <- data.frame(years,readings)


ggplot(data=data, aes(x=years,y=readings)) +       # Initialize plot 

        geom_line(color="red",                     # Draw a line plot
                  size = 1)    +

        geom_point(shape=10,                       # Display the points
                  size=2.5)    +

        geom_smooth(method=lm) +                   # Add a linear best fit line

        xlab("Year") + ylab("Readings") +          # Change axis labels

        ggtitle("Example Time Series Plot")        # Add a title


Multidimensional Plotting and Faceting

One of the most powerful aspects of plots is the ability to visually illustrate relationships between 3 or more variables. When we create a plot, each different dimension (variable) needs to map to a different perceptual feature (aesthetic) such as x position, y position, symbol, size or color. Making use of several of these aesthetics at once lets us make plots involving many dimensions. We've already seen some examples of multidimensional plots, such as the first scatterplot in this lesson that displayed carat weight and price colored by clarity.
Faceting is another way to add an extra dimension to a plot. Faceting breaks a plot up based on a factor variable and draws a different plot for each level of the factor. You can create a faceted plot in ggplot2 by adding a facet_wrap() layer:
In [15]:
ggplot(data=diamonds, aes(x=carat, y=price)) +      # Initialize plot 

        geom_point(aes(color=color),                # Color based on diamond color
                        alpha=0.5)     +

        facet_wrap(~clarity)           +            # Facet on clarity

        geom_smooth()                  +            # Add an estimated fit line*

        theme(legend.position=c(0.85,0.16))         # Set legend position


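*Note: geom_smooth() picks a smoothing method automatically based on the size of the data: a locally weighted fit (loess) for fewer than 1,000 observations and a generalized additive model for larger data sets such as diamonds. Either one can curve to fit the data.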
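If you would rather not depend on the default, you can name the smoothing method explicitly. The following is a small sketch on a random sample of 500 diamonds; the sample size, seed and variable names are arbitrary choices for illustration only.

```r
library(ggplot2)

set.seed(1)                                   # Make the sample reproducible
diamond_sample <- diamonds[sample(nrow(diamonds), 500), ]

# Straight-line fit from a linear model
linear_fit <- ggplot(diamond_sample, aes(x = carat, y = price)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Locally weighted (loess) fit that can bend with the data
loess_fit <- ggplot(diamond_sample, aes(x = carat, y = price)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
```

Setting se = FALSE hides the confidence band so only the fitted line is drawn.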
This plot gives us some extra insight into the impact clarity has on price: at given carat weights, higher clarity diamonds tend to fetch higher prices. Also note that within each facet, diamonds with better color tend to be at the top end of the price spectrum at given carat weights.

Scales

Scales are parameters in ggplot2 that determine how a plot maps values to visual properties (aesthetics). If you don't specify a scale for an aesthetic, the plot will use a default scale. For instance, the plots we split on color all used a default color scale. You can specify custom scales by adding scale elements to your plot. Scale elements have the following structure:
scale_aesthetic_scaletype()
We already saw an example of a scale when we made the grouped barplot above. In that case we wanted to manually set the fill color scale for the bars, so the scale we used was:
scale_fill_manual()
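To make the naming pattern concrete, here is a short sketch using three other scale functions that ship with ggplot2 but are not covered in this lesson: scale_x_log10(), scale_y_continuous() and scale_color_brewer(). The particular limits and palette are arbitrary picks for illustration.

```r
library(ggplot2)

scaled_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut), alpha = 0.3) +
  scale_x_log10() +                            # aesthetic: x, scale type: log10
  scale_y_continuous(limits = c(0, 20000)) +   # aesthetic: y, scale type: continuous
  scale_color_brewer(palette = "Blues")        # aesthetic: color, scale type: brewer
```

Each function name reads as scale, then the aesthetic it controls, then the kind of scale to apply.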
Let's make a new scatterplot with several aesthetic properties and alter some of the scales:
In [16]:
ggplot(data=diamonds, aes(x=carat, y=price)) +  # Initialize plot 
  
  geom_point(aes(size = carat,          # Size points based on carat
                 color = color,         # Color based on diamond color
                 alpha = clarity)) +    # Set transparency based on diamond clarity
                           
  scale_color_manual( values=c("#FFFFFF","#F5FCC2",   # Use manual color values
                               "#E0ED87","#CCDE57", 
                               "#B3C732","#94A813",
                               "#718200")) +
  
  scale_alpha_manual(values = c(0.1,0.15,0.2,         # Use manual alpha values
                                0.3,0.4,0.6,
                                0.8,1)) + 
  
  scale_size_identity() +           # Set size values to the actual values of carat*
  
  xlim(0,2.5) +                     # Limit x-axis
  
  theme(panel.background = element_rect(fill = "#7FB2B8")) +   # Change background color
  
  theme(legend.key = element_rect(fill = '#7FB2B8'))    # Change legend background color


*Note: Scale "identity" makes ggplot use the actual values of the variable associated with the given aesthetic property to scale the property. In this case carat has been assigned to the size property, so the sizes of the points are based on the values of the carat variable.
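Identity scales are perhaps easiest to see with color. In the sketch below, a small made-up data frame (swatches, not part of the diamonds data) stores a color for each point directly, and scale_color_identity() tells ggplot2 to use those values as they are instead of mapping them through a palette.

```r
library(ggplot2)

# Hypothetical data: each row carries the exact color it should be drawn in
swatches <- data.frame(
  x = 1:3,
  y = c(2, 5, 3),
  point_color = c("red", "darkgreen", "#1F78B4"),
  stringsAsFactors = FALSE
)

identity_plot <- ggplot(swatches, aes(x = x, y = y)) +
  geom_point(aes(color = point_color), size = 5) +  # Map color to the column
  scale_color_identity()                            # Use the values as-is
```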
The plot above just scratches the surface of what is possible with scales and other plot options. A comprehensive look at plot options is outside the scope of this brief overview of ggplot2; check out this ggplot2 reference sheet for a summary of common plotting layers and options.

Saving Plots

You can save plots you make with ggplot2 in RStudio the same way you save base R plots: click "export" in the plots pane in the bottom right corner of RStudio and select your desired output format and location. Alternatively, you can run the ggsave() command to save the most recent plot to a file:
In [17]:
ggsave("my_plot.png",   # Name of file where you wish to save the plot
       width = 7,       # Width of plot image in inches
       height = 7)      # Height of plot image in inches
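ggsave() also accepts a plot argument, so you can save a specific stored plot rather than whichever one was drawn last, and it infers the file format from the file extension. The sketch below is one way this might look; the plot variable and the temporary output path are placeholders chosen for illustration.

```r
library(ggplot2)

# Build a plot and keep it in a variable
carat_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.05)

# Placeholder output location; swap in your own path
out_file <- file.path(tempdir(), "carat_hist.pdf")

ggsave(filename = out_file,   # The .pdf extension selects the PDF format
       plot = carat_hist,     # Save this plot, not just the most recent one
       width = 7,             # Width in inches
       height = 5)            # Height in inches
```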

Wrap Up

The ggplot2 package can do everything we learned to do with R's base plotting functions and then some. The scaling and axes in ggplot2 make charts easier to read, and while the syntax is more verbose, it follows a logical structure that gives you a greater level of control and makes it easier to incorporate new features into your plots. Base R's plotting functions are still useful for rapid plotting and first-brush data exploration, but ggplot2 is there if you need to take your plots to the next level.
