Last lesson we took a whirlwind tour of R's built-in plotting functions. Built-in functions can take you a long way, but a dedicated plotting package can give you access to more advanced plotting capabilities and nicer aesthetics. The ggplot2 package is a popular graphics library in R that lets you take your plots to the next level.
First, let's install and load ggplot2:
In [1]:
# install.packages("ggplot2") # Uncomment and run if you need ggplot2
library("ggplot2")
ggplot2 Basics and qplot()
The ggplot2 package is based on the principle that all plots consist of a few basic components: data, a coordinate system and a visual representation of the data. In ggplot2, you build plots incrementally, starting with the data and coordinates you want to use and then specifying the graphical features: lines, points, bars, color, etc.
The ggplot2 package has two plotting functions: qplot() (quick plot) and ggplot() (grammar of graphics plot). The qplot() function is similar to the base R plot() function in that it only requires a single function call and it can create several different types of plots. qplot() can be useful for quick plotting, but it doesn't allow for as much flexibility as ggplot().
We are not going to spend much time learning about qplot() since learning the ggplot() syntax is at the heart of the package. Let's look at one qplot for illustrative purposes and then move on:
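As a rough sketch of what a single qplot() call can look like, the example below assumes a scatterplot of carat against price, colored by clarity, using the diamonds data set that ships with ggplot2. The choice of variables is an assumption based on how the plot is described later in this lesson. Note that qplot() is deprecated in recent ggplot2 releases (3.4.0 and later), so it may print a deprecation warning.

```r
library(ggplot2)

# Assumed example: carat vs. price, with point color mapped to clarity.
# qplot() picks a scatterplot automatically when both x and y are supplied.
p <- qplot(x = carat, y = price, data = diamonds, color = clarity)
p
```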
Using ggplot()
The ggplot() function creates plots incrementally in layers. Every ggplot starts with the same basic syntax:
In [3]:
ggplot(data=diamonds, # call to ggplot() and data frame to work with
aes(x=carat, y=price)) # aesthetics to assign
In the code above, we specify the data we want to work with and then assign the variables of interest, carat and price, to the x and y values of the plot. "aes()" is an aesthetics wrapper used in ggplot to map variables to visual properties. When you want a visual property to change based on the value of a variable, that specification belongs inside an aes() wrapper. If you are setting a fixed value that doesn't change based on a variable, it belongs outside of aes().
Note that running the code above didn't actually plot any data. When you use the ggplot() syntax, the call to ggplot() initializes the plot (recent versions of ggplot2 draw an empty set of axes at this point), but no data is drawn until you add a visual layer. Let's add a layer of points to the plot using geom_point():
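A minimal sketch of that step, reusing the same data and aesthetic mapping as the call above:

```r
library(ggplot2)

p <- ggplot(data = diamonds, aes(x = carat, y = price)) +  # Initialize plot
  geom_point()                                             # Add a layer of points
p
```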
*Note: Add a new element to a plot by putting a "+" after the preceding element.
The layers you add determine the type of plot you create. In this case, we used geom_point() which simply draws the data as points at the specified x and y coordinates, creating a scatterplot. ggplot2 has a wide range of geoms to create different types of plots. Here is a list of geoms for all the plot types we covered in the last lesson, plus a few more:
In [5]:
geom_histogram() # histogram
geom_density() # density plot
geom_boxplot() # boxplot
geom_violin() # violin plot (combination of boxplot and density plot)
geom_bar() # bar graph
geom_point() # scatterplot
geom_jitter() # scatterplot with points randomly perturbed to reduce overlap
geom_line() # line graph
geom_errorbar() # Add error bar
geom_smooth() # Add a best-fit line
geom_abline() # Add a line with specified slope and intercept
Notice the scatterplot we made above didn't have the nice coloring we had in the qplot(). We could have assigned colors to the points based on the clarity variable by adding an aesthetics mapping when we added the geom_point() layer:
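One way this could look, as a sketch: the color mapping goes inside aes() in the geom_point() layer because it depends on a variable.

```r
library(ggplot2)

p <- ggplot(data = diamonds, aes(x = carat, y = price)) +  # Initialize plot
  geom_point(aes(color = clarity))                         # Color points by clarity
p
```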
Note how many of the data points overlap. One way to get a better sense of overlapping data is to make the data points partially transparent. You can specify transparency with the alpha parameter:
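A sketch of the same scatterplot with a fixed transparency level. The value 0.1 matches the figure discussed in this section.

```r
library(ggplot2)

p <- ggplot(data = diamonds, aes(x = carat, y = price)) +  # Initialize plot
  geom_point(aes(color = clarity),  # Color is mapped to a variable, so it sits inside aes()
             alpha = 0.1)           # Alpha is a fixed value, so it sits outside aes()
p
```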
*Note: We pass alpha in as an argument outside of the aes() mapping because we are setting alpha to a fixed value instead of mapping it to a variable.
By setting alpha to 0.1, each data point has 90% transparency. At such high transparency, single data points are hard to see, but it lets us focus on high density areas. Let's focus in on the higher density areas even further by limiting the range of the X axis to 2.5 carats.
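As a sketch, the x-axis limit can be added as one more element with xlim(). The lower bound of 0 is an assumption here.

```r
library(ggplot2)

p <- ggplot(data = diamonds, aes(x = carat, y = price)) +  # Initialize plot
  geom_point(aes(color = clarity), alpha = 0.1) +          # Transparent, colored points
  xlim(0, 2.5)                                             # Limit the x-axis to 0-2.5 carats
p
```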
Note that xlim and ylim delete any points that lie outside the specified plot range which can result in warning messages.
More Plot Examples
Now that we know the basics of creating plots with ggplot(), let's remake some of the plots we created last time and see how they look in ggplot2, starting with a histogram:
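A minimal histogram sketch follows. The variable (carat) and the bin width of 0.1 are assumptions chosen for illustration, not necessarily the settings used last lesson.

```r
library(ggplot2)

p <- ggplot(data = diamonds, aes(x = carat)) +  # Initialize plot
  geom_histogram(binwidth = 0.1,                # Assumed bin width of 0.1 carats
                 fill = "skyblue",              # Bar fill color
                 color = "black")               # Bar outline color
p
```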
The scales and background on ggplot2 look a bit nicer than the base plotting functions.
Now let's make side by side boxplots split on clarity with an extra twist: let's also include the data points as a layer behind the boxplots. We can add the points with geom_jitter(). geom_jitter() is similar to geom_point() except that it adds a little random variation (jitter) that spreads data points apart so they don't overlap as much as they would otherwise. In the case of a boxplot, jitter spreads data points horizontally so we should notice thick bands of points at common carat sizes.
In [10]:
# Create boxplot of carat split on clarity with points added
ggplot(data=diamonds, aes(x=clarity, y=carat)) + # Initialize plot
geom_jitter(alpha=0.05, # Add jittered data points with transparency
color="yellow") + # Set data point color
geom_boxplot(outlier.shape=1, # Create boxplot and set outlier shape
alpha = 0 ) # Make inner boxplot area transparent
Adding jittered data points and then drawing the boxplots on top of them gives us a better sense of the distributions than boxplots alone. We can clearly see bands of data points at certain carat sizes like 1, 1.5 and 2. Let's investigate the distributions further by creating a violin plot:
A violin plot is a mixture of a boxplot and a density plot. The shape of each violin gives us a sense of where the bulk of the data is clustered.
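A sketch of a violin plot for the same variables as the boxplot above (carat split on clarity). The styling is left at the defaults and is an assumption.

```r
library(ggplot2)

p <- ggplot(data = diamonds, aes(x = clarity, y = carat)) +  # Initialize plot
  geom_violin()                                              # Draw one violin per clarity level
p
```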
Now let's make a grouped barplot that looks like the one we made last time:
In [12]:
ggplot(data=diamonds, aes(x=clarity)) + # Initialize plot
geom_bar(aes(fill=color), # Create bar plot, fill based on diamond color
color="black", # Set bar outline color
position="dodge") + # Place bars side by side
scale_fill_manual(values=c("#FFFFFF","#F5FCC2", # Use custom fill colors
"#E0ED87","#CCDE57", "#B3C732","#94A813","#718200"))
The syntax for ggplot is a little more verbose than base R plotting, but the result is a plot that is crisper with helpful gridlines. The logical and incremental ggplot2 syntax also gives you finer-grained control over your plots.
Now let's make a density plot of carat weight. Instead of making a simple density curve like we did last lesson, let's make a stacked density plot that sections the density curve based on diamond cut:
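A sketch of a stacked density plot. The x-axis limit of 3 carats and the alpha value are assumptions for illustration.

```r
library(ggplot2)

p <- ggplot(data = diamonds, aes(x = carat)) +  # Initialize plot
  xlim(0, 3) +                                  # Assumed x-axis limit
  geom_density(aes(fill = cut),                 # Fill each density section by cut
               position = "stack",              # Stack the densities on top of each other
               alpha = 0.5)                     # Make the fills partially transparent
p
```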
*Note: limiting the x-axis eliminated some values.
Stacked density charts can be a little messy, but the plot gives us a sense of how diamond cuts vary based on the size of the diamond. It appears that ideal cut diamonds tend to be small while larger diamonds are more likely to have low cut grades.
Finally, let's remake the line plot we created last time using ggplot2:
In [14]:
years <- seq(1950,2015,1) # Create some dummy data
readings <- (years-1900) + runif(66,0,20)
data <- data.frame(years,readings)
ggplot(data=data, aes(x=years,y=readings)) + # Initialize plot
geom_line(color="red", # Draw a line plot
linewidth = 1) + # Line thickness (use size = 1 on ggplot2 versions before 3.4.0)
geom_point(shape=10, # Display the points
size=2.5) +
geom_smooth(method=lm) + # Add a linear best fit line
xlab("Year") + ylab("Readings") + # Change axis labels
ggtitle("Example Time Series Plot") # Add a title
Multidimensional Plotting and Faceting
One of the most powerful aspects of plots is the ability to visually illustrate relationships between 3 or more variables. When we create a plot, each different dimension (variable) needs to map to a different perceptual feature (aesthetic) such as x position, y position, symbol, size or color. Making use of several of these aesthetics at once lets us make plots involving many dimensions. We've already seen some examples of multidimensional plots, such as the first scatterplot in this lesson that displayed carat weight and price colored by clarity.
Faceting is another way to add an extra dimension to a plot. Faceting breaks a plot up based on a factor variable and draws a different plot for each level of the factor. You can create a faceted plot in ggplot2 by adding a facet_wrap() layer:
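A sketch of a faceted scatterplot, assuming carat against price with points colored by diamond color, one panel per clarity level and a smoothed trend line in each panel. The alpha value is an assumption.

```r
library(ggplot2)

p <- ggplot(data = diamonds, aes(x = carat, y = price)) +  # Initialize plot
  geom_point(aes(color = color),  # Color points by diamond color
             alpha = 0.25) +      # Assumed transparency level
  facet_wrap(~clarity) +          # Draw a separate panel for each clarity level
  geom_smooth()                   # Add a smoothed trend line to each panel
p
```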
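*Note: geom_smooth() fits a flexible smoother by default (loess, a locally weighted fit, for groups with fewer than 1,000 observations and a generalized additive model for larger groups), which can curve to fit the data.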
This plot gives us some extra insight into the impact clarity has on price: at given carat weights, higher clarity diamonds tend to fetch higher prices. Also note that within each facet, diamonds with better color tend to be at the top end of the price spectrum at given carat weights.
Scales
Scales are parameters in ggplot2 that determine how a plot maps values to visual properties (aesthetics). If you don't specify a scale for an aesthetic, the plot will use a default scale. For instance, the plots we split on color all used a default color scale. You can specify custom scales by adding scale elements to your plot. Scale elements have the following structure:
scale_aesthetic_scaletype()
We already saw an example of a scale when we made the grouped barplot above. In that case we wanted to manually set the fill color scale for the bars, so the scale we used was:
scale_fill_manual()
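To illustrate the naming pattern, here is a small sketch that uses two other scale functions from ggplot2. The choice of a log scale for price and the "Blues" palette is arbitrary and only meant to show how the aesthetic and scale type appear in the function name.

```r
library(ggplot2)

p <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = clarity), alpha = 0.3) +
  scale_y_log10() +                      # aesthetic: y,     scale type: log10
  scale_color_brewer(palette = "Blues")  # aesthetic: color, scale type: brewer
p
```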
Let's make a new scatterplot with several aesthetic properties and alter some of the scales:
In [16]:
ggplot(data=diamonds, aes(x=carat, y=price)) + # Initialize plot
geom_point(aes(size = carat, # Size points based on carat
color = color, # Color based on diamond color
alpha = clarity)) + # Set transparency based on diamond clarity
scale_color_manual( values=c("#FFFFFF","#F5FCC2", # Use manual color values
"#E0ED87","#CCDE57",
"#B3C732","#94A813",
"#718200")) +
scale_alpha_manual(values = c(0.1,0.15,0.2, # Use manual alpha values
0.3,0.4,0.6,
0.8,1)) +
scale_size_identity() + # Set size values to the actual values of carat*
xlim(0,2.5) + # Limit x-axis
theme(panel.background = element_rect(fill = "#7FB2B8")) + # Change background color
theme(legend.key = element_rect(fill = '#7FB2B8')) # Change legend background color
*Note: Scale "identity" makes ggplot use the actual values of the variable associated with the given aesthetic property to scale the property. In this case carat has been assigned to the size property, so the sizes of the points are based on the values of the carat variable.
The plot above just scratches the surface of what is possible with scales and other plot options. A comprehensive look at plot options is outside the scope of this brief overview of ggplot2; check out this ggplot2 reference sheet for a summary of common plotting layers and options.
Saving Plots
You can save plots you make with ggplot2 in RStudio the same way you save base R plots: click "export" in the plots pane in the bottom right corner of RStudio and select your desired output format and location. Alternatively, you can run the ggsave() command to save the most recent plot to a file:
In [17]:
ggsave("my_plot.png", # Name of file where you wish to save the plot
width = 7, # Width of plot image in inches
height = 7) # Height of plot image in inches
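If you would rather not rely on "the most recent plot", ggsave() also accepts the plot object explicitly through its plot argument. The sketch below assumes a simple histogram and writes to a temporary directory; the plot and the output path are placeholders for illustration.

```r
library(ggplot2)

# Store a plot in a variable so it can be passed to ggsave() directly
p <- ggplot(data = diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

out_file <- file.path(tempdir(), "my_plot.png")  # Hypothetical output location

ggsave(out_file,      # File to write; the format is inferred from the extension
       plot = p,      # Plot object to save
       width = 7,     # Width of plot image
       height = 7,    # Height of plot image
       units = "in")  # Units for width and height
```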
Wrap Up
The ggplot2 package can do everything we learned to do with R's base plotting functions and then some. The scaling and axes in ggplot2 make charts easier to read, and while the syntax is more verbose, it follows a logical structure that gives you a greater level of control and makes it easier to incorporate new features into your plots. Base R's plotting functions are still useful for rapid plotting and first-brush data exploration, but ggplot2 is there if you need to take your plots to the next level.












