Last lesson we learned that there are a lot of questions to consider when you first look at a data set, including whether you should clean or transform the data. We touched briefly on a few basic operations to prepare data for analysis, but the Titanic data set was pretty clean to begin with. Data you encounter in the wild won't always be so friendly. Character data in particular can be extremely messy and difficult to work with because it can contain all sorts of characters and symbols that may have little meaning for your analysis. This lesson will cover some basic techniques and functions for working with text data in R.
To start off we'll need some text data that is a little messier than the names in the Titanic data set. As it happens, Kaggle launched a new data exploration competition this week, giving users access to a database of comments made on Reddit.com during the month of May 2015. Since the Minnesota Timberwolves are my favorite basketball team, I extracted the comments from the team's fan subreddit from the database. You can get the data file (comments.csv) here.
Let's start by loading the data and checking its structure and a few of the comments:
In [1]:
setwd("C:/Users/Greg/Desktop/Kaggle")
comments <- read.csv("t_wolves_reddit_may2015.csv", stringsAsFactors = FALSE)
str(comments)
print(head(comments, 8))
The text in these comments is pretty messy. We see everything from long paragraphs to web links to text emoticons. Before we try working with the entire comment vector, let's go over a few basic character functions.
Character Functions
You can check the length of a character string with the nchar() function:
In [2]:
first_comment <- comments[1,1] # Get the first comment
nchar(first_comment) # Check the number of characters
Out[2]:
*Note: length() returns the number of elements in an entire vector, while nchar() returns the number of characters in each character string.
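To make the difference concrete, here is a minimal sketch on a small made-up vector (the `teams` vector is just for illustration and is not part of the Reddit data):

```r
# A small made-up character vector (not from the Reddit data)
teams <- c("Timberwolves", "Wild", "Twins")

n_elements <- length(teams)   # Number of elements in the vector: 3
n_chars    <- nchar(teams)    # Characters in each element: 12 4 5

print(n_elements)
print(n_chars)
```

Note that nchar() is vectorized: it returns one count per element rather than a single total.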
We saw in an earlier lesson that you can combine two or more character strings with the paste() function. Let's combine comments 7, 8 and 9:
In [3]:
paste(comments[7,1], comments[8,1], comments[9,1],
sep = "||") # Separate comments with two pipes (||)
Out[3]:
We also saw that you can get a substring of a character string using the substr() function:
In [4]:
substr(first_comment, 100, 150) # Get a substring from index 100 to 150
Out[4]:
You can convert all letters to uppercase or lowercase with the toupper() and tolower() functions respectively:
In [5]:
toupper( substr(first_comment, 100, 150) )
tolower( substr(first_comment, 100, 150) )
Out[5]:
Out[5]:
To split a character string use the strsplit() function:
In [6]:
words <- strsplit(first_comment, # Character to split
split= " ") # String specifying where to make the split
words
Out[6]:
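One thing that can be surprising is that strsplit() returns a list, even when you give it a single string, because it is built to handle whole character vectors. A small sketch on a made-up sentence (the variable names here are only for illustration):

```r
sentence <- "Wiggins is the best"           # Made-up example string
pieces <- strsplit(sentence, split = " ")   # Split on single spaces

print(class(pieces))          # "list"
first_words <- pieces[[1]]    # Use [[1]] to pull out the character vector
print(first_words)
print(length(first_words))    # 4 words
```

If you split a single string and just want the words, indexing with [[1]] or calling unlist() on the result gets you a plain character vector.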
You can replace part of a character string with a different substring using the sub() and gsub() functions. sub() replaces the first occurrence of the specified substring while gsub() replaces all occurrences:
In [7]:
new_string <- "I like Wiggins. Wiggins is the best."
sub(pattern = "Wiggins", # String to replace
replacement = "Towns", # String to use as replacement
x = new_string) # Character vector to search
gsub(pattern = "Wiggins", # String to replace
replacement = "Towns", # String to use as replacement
x = new_string) # Character vector to search
Out[7]:
Out[7]:
To find which elements of a character vector contain a given substring or pattern, use the grep() function, which returns the indexes of the matching elements. Let's use it to find the comments that mention Andrew Wiggins:
In [8]:
wiggins_indexes <- grep(pattern = "wigg|drew", # Substring or pattern to match
x = tolower(comments$body)) # Vector to search
print(wiggins_indexes) # Check the indexes
length(wiggins_indexes)/nrow(comments) # Get the ratio of comments about Wiggins
Out[8]:
It looks like about 6.6% of comments make mention of Wiggins. Notice that this time the pattern argument we supplied wasn't just a simple substring. Posts about Andrew Wiggins could use any number of different names to refer to him--Wiggins, Andrew, Wigg, Drew--so we needed something a little more flexible than a single substring to match the posts we're interested in. The pattern we supplied is a simple example of a regular expression.
R Regular Expressions
A regular expression (regex) is a sequence of characters and special metacharacters used to match a set of character strings. Regular expressions allow you to be more expressive with your matching than just providing a simple substring. A regular expression lets you define a "pattern" that can match strings of different lengths, made up of different characters.
In the grep() example above, we supplied the regular expression: "wigg|drew". In this case, the vertical bar | is a metacharacter that acts as the "or" operator, so this regular expression matches any string that contains the substring "wigg" or "drew".
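Here is the same "or" pattern applied to a few made-up posts, so you can see exactly what it matches without loading the comment data. The `posts` vector is invented for illustration:

```r
# Made-up posts for illustration (not from the Reddit data)
posts <- c("Wiggins had a great game",
           "Andrew is the ROY",
           "Who do we take in the draft?",
           "Rubio drew a foul")

# Lowercase first so the match is not case sensitive
matches <- grep(pattern = "wigg|drew", x = tolower(posts))
print(matches)   # 1 2 4
```

The fourth post matches even though it has nothing to do with Wiggins, because "drew" is also an ordinary English word. A loose pattern like this is fine for a rough count, but it can pick up some false positives.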
When you provide a regular expression that contains no metacharacters, it simply matches the exact substring. For instance, "Wiggins" would only match strings containing the exact substring "Wiggins". Metacharacters let you change how you make matches. Here is a list of basic metacharacters and what they do:
"." - The period is a metacharacter that matches any character other than a newline:
In [9]:
my_char <- c("will","bill","Till","still","gull")
grepl(pattern = ".ill", # Matches any 4 character substring that ends with "ill"
my_char)
Out[9]:
*Note: grepl() searches for matches just like grep() but returns TRUE/FALSE values indicating whether each element matched, instead of the indexes of the matches.
"[ ]" - Square brackets specify a set of characters to match:
In [10]:
grepl(pattern = "[Tt]ill", # Matches T or t followed by "ill"
my_char)
Out[10]:
Regular expressions include several special character sets that allow you to quickly specify certain common character types. They include:
In [11]:
[a-z] - match any lowercase letter
[A-Z] - match any uppercase letter
[0-9] - match any digit
[a-zA-Z0-9] - match any letter or digit
Adding the "^" symbol inside the square brackets matches any characters NOT in the set:
[^a-z] - match any character that is not a lowercase letter
[^A-Z] - match any character that is not an uppercase letter
[^0-9] - match any character that is not a digit
[^a-zA-Z0-9] - match any character that is not a letter or digit
R regular expressions also include a shorthand for specifying common sequences:
\\d - match any digit
\\D - match any non digit
\\w - match a word character (same as [a-zA-Z0-9_])
\\W - match a non-word character
\\s - match whitespace (spaces, tabs, newlines, etc.)
\\S - match non-whitespace
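As a quick sketch of how these sets and shorthands behave, here they are used with gsub() on a made-up string (the `messy` string is just an example):

```r
messy <- "Game 7: Wolves win 101-98!!"   # Made-up example string

no_digits   <- gsub("[0-9]", "", messy)       # Remove every digit
only_digits <- gsub("\\D", "", messy)         # Remove every non-digit
letters_sp  <- gsub("[^a-zA-Z ]", "", messy)  # Keep only letters and spaces

# Collapse runs of whitespace into a single space
squeezed <- gsub("\\s+", " ", "too   many    spaces")

print(no_digits)
print(only_digits)   # "710198"
print(letters_sp)
print(squeezed)      # "too many spaces"
```

Negated sets are handy for cleaning: instead of listing everything you want to remove, you list what you want to keep and delete the rest.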
"^" - outside of a sequence, the caret symbol searches for matches at the beginning of a string:
In [12]:
ex_str1 <- c("Where did he go?", "He went to the Cool Store.", "he is so cool")
grepl(pattern = "^(He|he)", # Matches He or he but only at the start of a string
ex_str1)
Out[12]:
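The parentheses in the pattern above matter. Without them, the "^" only applies to the first alternative, so the pattern means "He at the start, or he anywhere". A small sketch with made-up strings shows the difference:

```r
ex <- c("Where did he go?", "He went home.", "he is so cool")  # Made-up strings

# "^He" OR "he" anywhere in the string (note that "Where" contains "he")
without_group <- grepl("^He|he", ex)

# "He" or "he", but only at the start of the string
with_group <- grepl("^(He|he)", ex)

print(without_group)   # TRUE TRUE TRUE
print(with_group)      # FALSE TRUE TRUE
```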
"$" - searches for matches at the end of a string:
In [13]:
grepl(pattern = "(Cool|cool)$", # Matches Cool or cool at the end of string
ex_str1)
Out[13]:
"( )" - parentheses in regular expressions are used for grouping and to enforce the proper order of operations just like they are in math and logical expressions. In the examples above, the parentheses let us group the or expressions so that the "^" and "$" symbols operate on the entire or statement.
"*" - an asterisk matches zero or more copies of the preceding character
"?" - a question mark matches zero or 1 copy of the preceding character
"+" - a plus matches 1 or more copies of the preceding character
In [14]:
ex_str2 <- c("abdominal","b","aa","abbcc","aba")
grepl(pattern = "a*b.+", # Match 0 or more a's, a single b, then 1 or more characters
ex_str2)
grepl(pattern = "a+b?a+", # Match 1 or more a's, an optional b, then 1 or more a's
ex_str2)
Out[14]:
Out[14]:
"{ }" - curly braces match a preceding character for a specified number of repetitions:
"{m}" - the preceding element is matched m times
"{m,}" - the preceding element is matched m times or more
"{m,n}" - the preceding element is matched between m and n times
In [15]:
ex_str3 <- c("aabcbcb","abbb","abbaab","aabb")
grepl(pattern = "a{2}b{2,}", # match 2 a's then 2 or more b's
ex_str3)
Out[15]:
"\\" - double backslashes let you "escape" metacharacters. You must escape metacharacters when you actually want to match the metacharacter symbol itself. For instance, if you want to match periods you can't use "." because it is a metacharacter that matches anything. Instead, you'd use "\\." to escape the period's metacharacter behavior and match the period itself.
In [16]:
ex_str4 <- c("Mr. Ed","Dr. Mario","Miss\\Mrs Granger.")
grepl(pattern = "\\. ", # Match a single period and then a space
ex_str4)
Out[16]:
*Note: if you want to match the escape character backslash itself, you have to use four backslashes "\\\\".
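The four backslashes make more sense once you remember that there are two layers of escaping: R's string syntax turns "\\\\" into two actual backslash characters, and the regex engine then reads those two as one escaped (literal) backslash. A minimal sketch, using a made-up Windows-style path:

```r
path <- "C:\\Users\\Greg"   # The string itself contains single backslashes
cat(path, "\n")             # Prints C:\Users\Greg

has_backslash <- grepl("\\\\", path)       # Regex match on a literal backslash
fixed_path    <- gsub("\\\\", "/", path)   # Replace each backslash with /

# With fixed = TRUE the pattern is not a regex, so two backslashes suffice
has_backslash_fixed <- grepl("\\", path, fixed = TRUE)

print(has_backslash)
print(fixed_path)           # "C:/Users/Greg"
print(has_backslash_fixed)
```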
There are more regular expression intricacies we won't cover here, but combinations of the few symbols we've covered give you a great amount of expressive power. Regular expressions are commonly used to perform tasks like matching phone numbers, email addresses and web addresses in blocks of text.
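As a rough sketch of that kind of task, here is a deliberately simple pattern for phone numbers written as 3 digits, 3 digits and 4 digits separated by hyphens. The `contacts` strings and the number format are assumptions made for this example; real phone numbers come in many more formats. It uses the base R functions regexpr() and regmatches(), which we haven't covered above, to pull out the matched text:

```r
# Made-up strings; the number format is an assumption for illustration
contacts <- c("Call 612-555-0199 for tickets",
              "No number here",
              "Fax: 612-555-01")

phone_pattern <- "[0-9]{3}-[0-9]{3}-[0-9]{4}"

has_phone <- grepl(phone_pattern, contacts)
print(has_phone)   # TRUE FALSE FALSE

# regexpr() locates the first match in each string;
# regmatches() extracts the matched text (only for strings that matched)
phones <- regmatches(contacts, regexpr(phone_pattern, contacts))
print(phones)      # "612-555-0199"
```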
It turns out that several of the R functions we've discussed treat the pattern you supply as a regular expression by default, rather than as a fixed substring. These functions include: grep(), grepl(), sub(), gsub() and strsplit(). The simple substrings we used earlier worked as expected only because they contained no metacharacters. Remember, you can always get more details of a function by checking the docs with help().
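This default can bite you when the string you want to match happens to contain a metacharacter. Here is a small sketch with strsplit() on made-up strings; the fixed = TRUE argument tells these functions to treat the pattern as a plain substring:

```r
# Splitting on a regex: one or more digits act as the separator
by_digits <- strsplit("a1b22c333d", split = "[0-9]+")[[1]]
print(by_digits)   # "a" "b" "c" "d"

# Splitting on "." as a regex treats EVERY character as a separator,
# which is almost never what you want
print(strsplit("a.b.c", split = ".")[[1]])

# Two ways to split on a literal period
by_fixed  <- strsplit("a.b.c", split = ".", fixed = TRUE)[[1]]
by_escape <- strsplit("a.b.c", split = "\\.")[[1]]
print(by_fixed)    # "a" "b" "c"
print(by_escape)   # "a" "b" "c"
```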
Getting Posts with Web Links
Now it's time to use some of the new tools we have in our toolbox on the Reddit comment data. Let's say we are only interested in posts that contain web links. If we want to narrow down comments to only those with web links, we'll need to match comments that agree with some pattern that expresses the textual form of a web link. Let's try using a simple regular expression to find posts with web links.
Web links begin with "http:" or "https:" so let's make a regular expression that matches those substrings:
In [17]:
indexes_of_links <- grep("https?:",
comments$body)
posts_with_links <- comments$body[indexes_of_links]
print( length( posts_with_links ) ) # Check number of posts matched
print(head(posts_with_links, 5)) # Check a few of the posts
It appears the comments we've returned all contain web links. It is possible that a post could contain the string "http:" without actually having a web link. If we wanted to reduce this possibility, we'd have to be more specific with our regular expression pattern, but in the case of a basketball-themed forum, it is pretty unlikely.
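If we did want to be stricter, one option is to require the "//" that follows the colon in a web address, plus at least one more character. This is only a sketch of the idea on made-up posts (example.com is a placeholder domain), not a complete URL validator:

```r
# Made-up posts for illustration
sample_posts <- c("Watch this: https://example.com/video",
                  "Is the prefix http: or https: these days?",
                  "No links at all")

loose  <- grepl("https?:", sample_posts)         # Our original pattern
strict <- grepl("https?://[^ ]+", sample_posts)  # Require // and more text

print(loose)    # TRUE TRUE FALSE
print(strict)   # TRUE FALSE FALSE
```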
Now that we've identified posts that contain web links, let's extract the links themselves. Many of the posts contain both web links and a bunch of text the user wrote. We want to get rid of the text and keep the web links. It is possible to do this with R's base regular expression functions, but in this case it's easier to use a package. Let's install and load the "stringr" package, which provides a variety of convenience functions for working with character data in R:
In [18]:
# Run install.packages to install stringr:
# install.packages("stringr")
library(stringr) # Then load the package
Now we can use the stringr function str_extract_all(). This function takes a regex pattern and a character vector and returns a list where each list item is a vector of all the matches found within the corresponding string. If some comments include multiple web links, some of the vectors in our list should have more than 1 element:
In [19]:
only_links <- str_extract_all(pattern = "https?:[^ \n\\)]+", # A pattern to match
string = posts_with_links)
print( head(only_links,10)) # Check the head of the list returned
The pattern we used to match web links may look confusing, so let's go over it step by step.
First the pattern matches the exact characters "http", an optional "s" and then ":".
Next, with [^ \n\\)], we create a set of characters to match. Since our set starts with "^", we are actually matching the negation of the set. In this case, the set is the space character, the newline character "\n" and the closing parenthesis character ")". Notice we escaped the closing parenthesis character by writing "\\)". Inside square brackets most metacharacters, including ")", lose their special meaning, so the escape is not strictly necessary here, but in stringr it is harmless. Since we are matching the negation, this set matches any character that is NOT a space, newline or closing parenthesis. Finally, the "+" at the end matches this set 1 or more times.
To summarize, the regex matches http: or https: at the start and then any number of characters until it encounters a space, newline or closing parenthesis. This regex isn't perfect: a web address could contain parentheses and a space, newline or closing parenthesis might not be the only characters that mark the end of a web link in a comment. It is good enough for this small data set, but for a serious project we would probably want something a little more specific to handle such corner cases.
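For comparison, here is a rough sketch of how the same extraction could be done with base R instead of stringr, using gregexpr() to locate all the matches and regmatches() to pull them out. The `posts` vector is made up (example.com and example.org are placeholder domains), and the pattern is the same one as above except that the closing parenthesis is left unescaped inside the brackets:

```r
# Made-up posts for illustration
posts <- c("See (http://example.com/a) and https://example.org/b now",
           "Just one: https://example.com/c")

link_pattern <- "https?:[^ \n)]+"

# gregexpr() finds every match in each string;
# regmatches() returns a list with one character vector per string
links <- regmatches(posts, gregexpr(link_pattern, posts))
print(links)

links_per_post <- lengths(links)   # Number of links found in each post
all_links <- unlist(links)         # Flatten the list into a single vector
print(links_per_post)              # 2 1
print(all_links)
```

Like str_extract_all(), this returns a list, so unlist() is a convenient way to get a single vector of links when you don't care which post each one came from.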
Complex regular expressions can be difficult to write and confusing to read. Sometimes it is easiest to simply search the web for a regular expression to perform a common task instead of writing one from scratch.
*Note: if you copy a regex written for another language it might not work in R without some modifications.
Wrap Up
In this lesson we learned several functions for dealing with text data in R and introduced regular expressions, a powerful tool for matching substrings in text. Regular expressions are used in many programming languages, and although the syntax for regex varies a bit from one language to another, the basic constructs are similar across languages.
Next time we'll turn our attention to cleaning and preparing numeric data.