Thursday, August 6, 2015

Introduction to R Part 14: Working With Character Data


Last lesson we learned that there are a lot of questions to consider when you first look at a data set, including whether you should clean or transform the data. We touched briefly on a few basic operations to prepare data for analysis, but the Titanic data set was pretty clean to begin with. Data you encounter in the wild won't always be so friendly. Character data in particular can be extremely messy and difficult to work with because it can contain all sorts of characters and symbols that may have little meaning for your analysis. This lesson will cover some basic techniques and functions for working with text data in R.
To start off we'll need some text data that is a little messier than the names in the Titanic data set. As it happens, Kaggle launched a new data exploration competition this week, giving users access to a database of comments made on Reddit.com during the month of May 2015. Since the Minnesota Timberwolves are my favorite basketball team, I extracted the comments from the team's fan subreddit from the database. You can get the data file (comments.csv) here.
Let's start by loading the data and checking its structure and a few of the comments:
In [1]:
setwd("C:/Users/Greg/Desktop/Kaggle")  

comments <- read.csv("t_wolves_reddit_may2015.csv", stringsAsFactors = FALSE)

str(comments)

print(head(comments, 8))
'data.frame': 4166 obs. of  1 variable:
 $ body: chr  "Strongly encouraging sign for us.  The T-Wolves management better not screw this up and they better surround Wiggins with a cha"| __truncated__ "[My reaction.](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPmtsN8NdRoU4whICZARz7JAD5lC33JqOhFZLiZbqHTrbau23VJG6E5lTdvKdnDigfomvb3zozn6U9x_e4rfx86Vb2KVNsskGA9s4DHCRIWhvi5qZZDXamkmHZpNgJZb2QEKbJqV-OXbLW/s1600/2.gif)" "http://imgur.com/gallery/Zch2AWw" "Wolves have more talent than they ever had right now." ...
                                                                                                                                                                                                                                                                                                                                         body
1 Strongly encouraging sign for us.  The T-Wolves management better not screw this up and they better surround Wiggins with a championship caliber team to support his superstar potential or else I wouldn't want him to sour his prime years here in Minnesota just like how I felt with Garnett.\n\nTL;DR: Wolves better not fuck this up.
2                                                                                                                                                                                                                                       [My reaction.](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPmtsN8NdRoU4whICZARz7JAD5lC33JqOhFZLiZbqHTrbau23VJG6E5lTdvKdnDigfomvb3zozn6U9x_e4rfx86Vb2KVNsskGA9s4DHCRIWhvi5qZZDXamkmHZpNgJZb2QEKbJqV-OXbLW/s1600/2.gif)
3                                                                                                                                                                                                                                                                                                            http://imgur.com/gallery/Zch2AWw
4                                                                                                                                                                                                                                                                                       Wolves have more talent than they ever had right now.
5                                                                                                                                                                                           Nah. Wigg is on the level of KG but where's our Steph? And our Gugliotta? Neither Bazz or Rubio are as good as Googs or as promising as Starbury.
6                                                                                                                                                                                                                                                                                                  2004 was a pretty damn talented team dude.
7                                                                                                                                                                                                                                                                                                                                         :')
8                                                                                                                                                                                                                                                                                                                                     *swoon*
The text in these comments is pretty messy. We see everything from long paragraphs to web links to text emoticons. Before we try working with the entire comment vector, let's go over a few basic character functions.

Character Functions

You can check the length of a character string with the nchar() function:
In [2]:
first_comment <- comments[1,1]      # Get the first comment

nchar(first_comment)                # Check the number of characters
Out[2]:
329
*Note: length() returns the number of elements in a vector, while nchar() returns the number of characters in each character string.
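To make the difference concrete, here is a minimal sketch using a small made-up vector (example_vec is not part of the Reddit data):

```r
# A small made-up vector, used only for illustration
example_vec <- c("Wolves", "KG", "")

n_elements <- length(example_vec)   # Number of elements in the vector
n_chars <- nchar(example_vec)       # Number of characters in each element

print(n_elements)
print(n_chars)
```

length() gives a single number for the whole vector, while nchar() gives one count per element, including 0 for the empty string.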
We saw in an earlier lesson that you can combine two or more character strings with the paste() function. Let's combine comments 7, 8 and 9:
In [3]:
paste(comments[7,1], comments[8,1], comments[9,1],
      sep = "||")                          # Separate comments with two pipes (||)
Out[3]:
":')||*swoon*||Is Joe Smith available..?"
We also saw that you can get a substring of a character using the substr() function:
In [4]:
substr(first_comment, 100, 150)           # Get a substring from index 100 to 150
Out[4]:
" surround Wiggins with a championship caliber team "
You can convert all letters to uppercase or lowercase with the toupper() and tolower() functions respectively:
In [5]:
toupper( substr(first_comment, 100, 150)  )

tolower( substr(first_comment, 100, 150)  )
Out[5]:
" SURROUND WIGGINS WITH A CHAMPIONSHIP CALIBER TEAM "
Out[5]:
" surround wiggins with a championship caliber team "
To split a character string use the strsplit() function:
In [6]:
words <- strsplit(first_comment,          # Character to split
                  split= " ")             # String specifying where to make the split
words
Out[6]:
    1. "Strongly"
    2. "encouraging"
    3. "sign"
    4. "for"
    5. "us."
    6. ""
    7. "The"
    8. "T-Wolves"
    9. "management"
    10. "better"
    11. "not"
    12. "screw"
    13. "this"
    14. "up"
    15. "and"
    16. "they"
    17. "better"
    18. "surround"
    19. "Wiggins"
    20. "with"
    21. "a"
    22. "championship"
    23. "caliber"
    24. "team"
    25. "to"
    26. "support"
    27. "his"
    28. "superstar"
    29. "potential"
    30. "or"
    31. "else"
    32. "I"
    33. "wouldn't"
    34. "want"
    35. "him"
    36. "to"
    37. "sour"
    38. "his"
    39. "prime"
    40. "years"
    41. "here"
    42. "in"
    43. "Minnesota"
    44. "just"
    45. "like"
    46. "how"
    47. "I"
    48. "felt"
    49. "with"
    50. "Garnett. TL;DR:"
    51. "Wolves"
    52. "better"
    53. "not"
    54. "fuck"
    55. "this"
    56. "up."
You can replace part of a character string with a different substring using the sub() and gsub() functions. sub() replaces the first occurrence of the specified substring while gsub() replaces all occurrences:
In [7]:
new_string <- "I like Wiggins. Wiggins is the best."

sub(pattern = "Wiggins",            # String to replace
    replacement = "Towns",          # String to use as replacement
    x = new_string)                 # Character vector to search

gsub(pattern = "Wiggins",           # String to replace
    replacement = "Towns",          # String to use as replacement
    x = new_string)                 # Character vector to search
Out[7]:
"I like Towns. Wiggins is the best."
Out[7]:
"I like Towns. Towns is the best."


A common operation when working with text data is to test whether character strings contain a certain substring or pattern of characters. For instance, if we were only interested in posts about Andrew Wiggins, we'd need to match all posts that make mention of him and avoid matching posts that don't mention him. Use the grep() function to get the indexes of elements in a character vector that match a given substring:
In [8]:
wiggins_indexes <- grep(pattern = "wigg|drew",           # Substring or pattern to match
                        x = tolower(comments$body))      # Vector to search

print(wiggins_indexes)                                   # Check the indexes

length(wiggins_indexes)/nrow(comments)       # Get the ratio of comments about Wiggins
  [1]    1    5   10   11   34   45   46   63   64   65   76   79   82   85   86
 [16]  112  114  160  185  200  260  265  272  281  283  315  316  318  354  409
 [31]  450  465  496  555  561  580  582  584  592  627  634  656  662  673  674
 [46]  693  715  718  740  775  784  796  817  822  832  859  865  869  871  892
 [61]  899  916  971 1017 1027 1030 1111 1125 1129 1132 1142 1157 1223 1242 1251
 [76] 1296 1325 1328 1334 1354 1375 1377 1380 1384 1391 1394 1518 1587 1621 1640
 [91] 1646 1648 1677 1692 1852 1874 1910 1929 1951 1970 1986 1998 2000 2022 2035
[106] 2056 2075 2081 2098 2100 2106 2151 2180 2188 2194 2248 2263 2269 2299 2302
[121] 2306 2308 2311 2333 2346 2374 2397 2400 2407 2410 2432 2443 2447 2474 2480
[136] 2529 2531 2532 2533 2539 2565 2566 2573 2586 2596 2597 2628 2652 2659 2663
[151] 2673 2725 2736 2768 2778 2779 2788 2811 2876 2891 2903 2934 2963 2969 2971
[166] 2988 3003 3018 3038 3082 3086 3095 3097 3098 3108 3120 3130 3136 3169 3171
[181] 3174 3177 3184 3214 3219 3228 3239 3261 3292 3322 3326 3328 3332 3334 3360
[196] 3384 3386 3448 3454 3459 3474 3491 3510 3524 3530 3538 3540 3555 3573 3574
[211] 3577 3580 3584 3613 3620 3631 3642 3645 3646 3647 3648 3651 3656 3657 3658
[226] 3663 3674 3681 3686 3699 3704 3706 3708 3714 3733 3739 3741 3745 3747 3753
[241] 3783 3791 3792 3806 3825 3833 3837 3839 3842 3854 3860 3864 3868 3884 3900
[256] 3923 3930 3935 3941 3948 3949 3951 3971 3973 3987 3995 4025 4042 4047 4054
[271] 4057 4060 4069 4133 4151 4161 4162
Out[8]:
0.0664906385021603
It looks like about 6.6% of comments make mention of Wiggins. Notice that this time the pattern argument we supplied wasn't just a simple substring. Posts about Andrew Wiggins could use any number of different names to refer to him--Wiggins, Andrew, Wigg, Drew--so we needed something a little more flexible than a single substring to match the posts we're interested in. The pattern we supplied is a simple example of a regular expression.
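As a side note, here is a hedged sketch of another way to get the same kind of ratio. It uses a few made-up comments (toy_comments is hypothetical, not the real data) and the ignore.case argument, which lets us skip the tolower() step:

```r
# Hypothetical comments, made up for illustration
toy_comments <- c("Wiggins for ROY!", "Rubio looked sharp",
                  "ANDREW is a star", "Go Wolves")

# ignore.case = TRUE makes the match case-insensitive, so tolower() isn't needed
toy_matches <- grepl(pattern = "wigg|drew",
                     x = toy_comments,
                     ignore.case = TRUE)

toy_indexes <- which(toy_matches)   # The same indexes grep() would return
toy_ratio <- mean(toy_matches)      # Mean of TRUE/FALSE values = share of matches

print(toy_indexes)
print(toy_ratio)
```

Taking the mean of a logical vector is a common R shortcut for a proportion, since TRUE counts as 1 and FALSE as 0.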

R Regular Expressions

A regular expression (regex) is a sequence of characters and special metacharacters used to match a set of character strings. Regular expressions allow you to be more expressive with your matching than just providing a simple substring. A regular expression lets you define a "pattern" that can match strings of different lengths, made up of different characters.
In the grep() example above, we supplied the regular expression: "wigg|drew". In this case, the vertical bar | is a metacharacter that acts as the "or" operator, so this regular expression matches any string that contains the substring "wigg" or "drew".
When you provide a regular expression that contains no metacharacters, it simply matches the exact substring. For instance, "Wiggins" would only match strings containing the exact substring "Wiggins". Metacharacters let you change how you make matches. Here is a list of basic metacharacters and what they do:
"." - The period is a metacharacter that matches any character other than a newline:
In [9]:
my_char <- c("will","bill","Till","still","gull")
 
grepl(pattern = ".ill",       # Matches any 4 character substring that ends with "ill"
      my_char)
Out[9]:
  1. TRUE
  2. TRUE
  3. TRUE
  4. TRUE
  5. FALSE
*Note: grepl() searches for matches just like grep(), but instead of the indexes of the matches it returns a TRUE/FALSE value for each element indicating whether a match was found.
"[ ]" - Square brackets specify a set of characters to match:
In [10]:
grepl(pattern = "[Tt]ill",      # Matches T or t followed by "ill"
      my_char)
Out[10]:
  1. FALSE
  2. FALSE
  3. TRUE
  4. TRUE
  5. FALSE
Regular expressions include several special character sets that allow you to quickly specify certain common character types. They include:
In [11]:
 [a-z] - match any lowercase letter
 [A-Z] - match any uppercase letter
 [0-9] - match any digit
 [a-zA-Z0-9] - match any letter or digit

 Adding the "^" symbol inside the square brackets matches any characters NOT in the set:

 [^a-z] - match any character that is not a lowercase letter
 [^A-Z] - match any character that is not an uppercase letter
 [^0-9] - match any character that is not a digit
 [^a-zA-Z0-9] - match any character that is not a letter or digit

 R regular expressions also include a shorthand for specifying common sequences:

 \\d - match any digit
 \\D - match any non digit
 \\w - match a word character (same as [a-zA-Z0-9_])
 \\W - match a non-word character
 \\s - match whitespace (spaces, tabs, newlines, etc.)
 \\S - match non-whitespace
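Here is a short sketch of a few of these sets and shorthands in action. The strings in toy_strings are made up for illustration:

```r
# Made-up strings, used only for illustration
toy_strings <- c("Game 7", "2004", "no digits here", "tab\there")

has_digit <- grepl(pattern = "\\d", toy_strings)            # Contains any digit
has_space <- grepl(pattern = "\\s", toy_strings)            # Contains whitespace (a tab counts)
has_other <- grepl(pattern = "[^a-zA-Z0-9]", toy_strings)   # Contains a non-letter, non-digit

# gsub() with \\D strips every non-digit character
digits_only <- gsub(pattern = "\\D", replacement = "", x = toy_strings)

print(has_digit)
print(has_space)
print(has_other)
print(digits_only)
```

Strings with no digits at all come back from the gsub() call as empty strings.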
"^" - outside of a sequence, the caret symbol searches for matches at the beginning of a string:
In [12]:
ex_str1 <- c("Where did he go?", "He went to the Cool Store.", "he is so cool")

grepl(pattern = "^(He|he)",      # Matches He or he but only at the start of a string
      ex_str1)
Out[12]:
  1. FALSE
  2. TRUE
  3. TRUE
"$" - searches for matches at the end of a string:
In [13]:
grepl(pattern = "(Cool|cool)$",   # Matches Cool or cool at the end of string
      ex_str1)
Out[13]:
  1. FALSE
  2. FALSE
  3. TRUE
"( )" - parentheses in regular expressions are used for grouping and to enforce the proper order of operations just like they are in math and logical expressions. In the examples above, the parentheses let us group the or expressions so that the "^" and "$" symbols operate on the entire or statement.
"*" - an asterisk matches zero or more copies of the preceding character

"?" - a question mark matches zero or 1 copy of the preceding character

"+" - a plus matches 1 more more copies of the preceding character
In [14]:
ex_str2 <- c("abdominal","b","aa","abbcc","aba")

grepl(pattern = "a*b.+",    # Match 0 or more a's, a single b, then 1 or more characters
      ex_str2)

grepl(pattern = "a+b?a+",   # Match 1 or more a's, an optional b, then 1 or more a's
      ex_str2)
Out[14]:
  1. TRUE
  2. FALSE
  3. FALSE
  4. TRUE
  5. TRUE
Out[14]:
  1. FALSE
  2. FALSE
  3. TRUE
  4. FALSE
  5. TRUE
"{ }" - curly braces match a preceding character for a specified number of repetitions:
"{m}" - the preceding element is matched m times
"{m,}" - the preceding element is matched m times or more
"{m,n}" - the preceding element is matched between m and n times
In [15]:
ex_str3 <- c("aabcbcb","abbb","abbaab","aabb")

grepl(pattern = "a{2}b{2,}",    # match 2 a's then 2 or more b's
      ex_str3)
Out[15]:
  1. FALSE
  2. FALSE
  3. FALSE
  4. TRUE
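The three curly brace forms can be compared side by side. This is a minimal sketch on made-up strings (toy_cheers is hypothetical); the "^" and "$" anchors force the pattern to cover the whole string so the repetition counts are easy to see:

```r
# Made-up strings, used only for illustration
toy_cheers <- c("go", "goo", "gooo", "goooo")

exactly_two  <- grepl(pattern = "^go{2}$", toy_cheers)     # g, then exactly 2 o's
two_or_more  <- grepl(pattern = "^go{2,}$", toy_cheers)    # g, then 2 or more o's
two_to_three <- grepl(pattern = "^go{2,3}$", toy_cheers)   # g, then 2 to 3 o's

print(exactly_two)
print(two_or_more)
print(two_to_three)
```

Without the anchors, "go{2}" would also match "gooo" and "goooo", since both contain the substring "goo".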
"\\" - double backslashes let you "escape" metacharacters. You must escape metacharacters when you actually want to match the metacharacter symbol itself. For instance, if you want to match periods you can't use "." because it is a metacharacter that matches anything. Instead, you'd use "\\." to escape the period's metacharacter behavior and match the period itself.
In [16]:
ex_str4 <- c("Mr. Ed","Dr. Mario","Miss\\Mrs Granger.")

grepl(pattern = "\\. ",    # Match a single period and then a space
      ex_str4)
Out[16]:
  1. TRUE
  2. TRUE
  3. FALSE
*Note: if you want to match the escape character backslash itself, you have to use four backslashes "\\\\".
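Here is a quick sketch of matching a literal backslash. It also shows the fixed argument of grepl(), which treats the pattern as a plain substring rather than a regular expression and so can save some escaping. The vector is the same as ex_str4 above, copied under a new name (esc_str) so the example is self-contained:

```r
esc_str <- c("Mr. Ed", "Dr. Mario", "Miss\\Mrs Granger.")

# As a regular expression, a literal backslash takes four backslashes
regex_backslash <- grepl(pattern = "\\\\", esc_str)

# With fixed = TRUE the pattern is not a regex, so "\\" (a single
# backslash once R reads the string) matches a literal backslash
fixed_backslash <- grepl(pattern = "\\", esc_str, fixed = TRUE)

# Likewise, "." is just a period when fixed = TRUE
fixed_period <- grepl(pattern = ".", esc_str, fixed = TRUE)

print(regex_backslash)
print(fixed_backslash)
print(fixed_period)
```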
There are more regular expression intricacies we won't cover here, but combinations of the few symbols we've covered give you a great amount of expressive power. Regular expressions are commonly used to perform tasks like matching phone numbers, email addresses and web addresses in blocks of text.
It turns out that several of the R functions we've discussed accept regular expression patterns instead of the fixed substrings we've been using. These functions include: grep(), grepl(), sub(), gsub() and strsplit(). Remember, you can always get more details of a function by checking the docs with help().
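For example, the empty string "" that showed up in our earlier strsplit() output came from the double space after "us." in the first comment. A rough sketch of how a regex fixes that, using a shortened copy of the comment (toy_comment is just for illustration):

```r
# Shortened copy of the first comment; note the double space after "us."
toy_comment <- "Strongly encouraging sign for us.  The T-Wolves"

# Splitting on a single literal space leaves an empty string behind
split_fixed <- strsplit(toy_comment, split = " ")[[1]]

# Splitting on the regex " +" (1 or more spaces) does not
split_regex <- strsplit(toy_comment, split = " +")[[1]]

# gsub() with a negated set removes everything except letters and spaces
letters_only <- gsub(pattern = "[^a-zA-Z ]", replacement = "", x = toy_comment)

print(split_fixed)
print(split_regex)
print(letters_only)
```

strsplit() returns a list with one element per input string, so [[1]] pulls out the character vector for our single comment.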
Now it's time to use some of the new tools we have in our toolbox on the Reddit comment data. Let's say we are only interested in posts that contain web links. If we want to narrow down comments to only those with web links, we'll need to match comments that agree with some pattern that expresses the textual form of a web link. Let's try using a simple regular expression to find posts with web links.
Web links begin with "http:" or "https:" so let's make a regular expression that matches those substrings:
In [17]:
indexes_of_links <- grep("https?:",
                        comments$body)

posts_with_links <- comments$body[indexes_of_links]

print( length( posts_with_links ) )                # Check number of posts matched

print(head(posts_with_links, 5))                   # Check a few of the posts
[1] 216
[1] "[My reaction.](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPmtsN8NdRoU4whICZARz7JAD5lC33JqOhFZLiZbqHTrbau23VJG6E5lTdvKdnDigfomvb3zozn6U9x_e4rfx86Vb2KVNsskGA9s4DHCRIWhvi5qZZDXamkmHZpNgJZb2QEKbJqV-OXbLW/s1600/2.gif)"                              
[2] "http://imgur.com/gallery/Zch2AWw"                                                                                                   
[3] "[January 4th, 2005 - 47 Pts, 17 Rebs](https://www.youtube.com/watch?v=iLRsJ9gcW0Y)\n\n[bonus gif](http://i.imgur.com/8bKXHT9.gifv)" 
[4] "[You're right.](http://espn.go.com/nba/notebook/_/page/ROY1415/2014-15-rookie-year-predictions) This article was posted October 27."
[5] "https://www.youtube.com/watch?v=K1VtZht_8t4\n\nGame 7 of the Western Conference semi final vs. the Kings.  2004."                   
It appears the comments we've returned all contain web links. It is possible that a post could contain the string "http:" without actually having a web link. If we wanted to reduce this possibility, we'd have to be more specific with our regular expression pattern, but in the case of a basketball-themed forum, it is pretty unlikely.
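As a rough sketch of what "more specific" could look like, the pattern below also requires "//" and something that looks like a host name with a dot in it. Both the stricter pattern and the toy_posts vector are assumptions made up for illustration, not a complete URL matcher:

```r
# Made-up posts: two real-looking links and one false positive
toy_posts <- c("http://imgur.com/gallery/Zch2AWw",
               "I typed http: by mistake",
               "see https://www.youtube.com/watch?v=K1VtZht_8t4")

loose_match <- grepl(pattern = "https?:", toy_posts)

# Stricter (hypothetical) pattern: "//", a run of host characters,
# a literal period, then at least 2 lowercase letters
strict_match <- grepl(pattern = "https?://[a-zA-Z0-9.-]+\\.[a-z]{2,}", toy_posts)

print(loose_match)
print(strict_match)
```

The loose pattern flags all three posts, while the stricter one skips the post that only mentions "http:".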
Now that we've identified posts that contain web links, let's extract the links themselves. Many of the posts contain both web links and a bunch of text the user wrote. We want to get rid of the text and keep the web links. It is possible to do this with R's base regular expression functions, but in this case it's easier to use a package. Let's install and load the "stringr" package, which provides a variety of convenience functions for working with character data in R:
In [18]:
# Run install.packages to install stringr:
# install.packages("stringr")

library(stringr)             # Then load the package
Now we can use the stringr function str_extract_all(). This function takes a regex pattern and a character vector and returns a list where each list item is a vector of all the matches found within the corresponding string. If some comments include multiple web links, some of the vectors in our list should have more than 1 element:
In [19]:
only_links <- str_extract_all(pattern = "https?:[^ \n\\)]+",  # A pattern to match
                              string = posts_with_links)

print( head(only_links,10))               # Check the head of the list returned
[[1]]
[1] "https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPmtsN8NdRoU4whICZARz7JAD5lC33JqOhFZLiZbqHTrbau23VJG6E5lTdvKdnDigfomvb3zozn6U9x_e4rfx86Vb2KVNsskGA9s4DHCRIWhvi5qZZDXamkmHZpNgJZb2QEKbJqV-OXbLW/s1600/2.gif"

[[2]]
[1] "http://imgur.com/gallery/Zch2AWw"

[[3]]
[1] "https://www.youtube.com/watch?v=iLRsJ9gcW0Y"
[2] "http://i.imgur.com/8bKXHT9.gifv"            

[[4]]
[1] "http://espn.go.com/nba/notebook/_/page/ROY1415/2014-15-rookie-year-predictions"

[[5]]
[1] "https://www.youtube.com/watch?v=K1VtZht_8t4"

[[6]]
[1] "https://www.youtube.com/watch?v=mFEzW1Z6TRM"

[[7]]
[1] "https://instagram.com/p/2HWfB3o8rK/"

[[8]]
[1] "https://www.youtube.com/watch?v=524h48CWlMc&amp;feature=iv&amp;src_vid=vNqmpbMPX9k&amp;annotation_id=annotation_1856196647"

[[9]]
[1] "http://i.imgur.com/OrjShZv.jpg"

[[10]]
[1] "http://content.sportslogos.net/logos/6/232/full/923_minnesota-timberwolves-stadium-2012.png"

The pattern we used to match web links may look confusing, so let's go over it step by step.
First the pattern matches the exact characters "http", an optional "s" and then ":".
Next, with [^ \n\\)], we create a set of characters to match. Since our set starts with "^", we are actually matching the negation of the set. In this case, the set is the space character, the newline character "\n" and the closing parenthesis character ")". Notice we escaped the closing parenthesis character by writing "\\)". Most metacharacters lose their special meaning inside square brackets, so the escape here is a precaution rather than a strict requirement. Since we are matching the negation, this set matches any character that is NOT a space, newline or closing parenthesis. Finally, the "+" at the end matches this set 1 or more times.
To summarize, the regex matches "http:" or "https:" and then keeps matching characters until it encounters a space, newline or closing parenthesis. This regex isn't perfect: a web address could itself contain parentheses, and a space, newline or closing parenthesis might not be the only characters that mark the end of a web link in a comment. It is good enough for this small data set, but for a serious project we would probably want something a little more specific to handle such corner cases.
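We mentioned that base R can do the same extraction without a package. Here is a hedged sketch of one way to do it with gregexpr() and regmatches(), run on two made-up posts (toy_link_posts is hypothetical). The pattern drops the backslashes before ")" to keep the set unambiguous for base R's regex engine:

```r
# Made-up posts: one with two links, one with none
toy_link_posts <- c("[bonus gif](http://i.imgur.com/8bKXHT9.gifv) and https://www.youtube.com/watch?v=iLRsJ9gcW0Y",
                    "no links here")

link_pattern <- "https?:[^ \n)]+"

# gregexpr() finds the position of every match in each string;
# regmatches() uses those positions to pull out the matched text
toy_links <- regmatches(toy_link_posts,
                        gregexpr(pattern = link_pattern, text = toy_link_posts))

links_per_post <- lengths(toy_links)   # Number of links found in each post
all_links <- unlist(toy_links)         # Flatten the list into one vector

print(links_per_post)
print(all_links)
```

Like str_extract_all(), this returns a list with one vector of matches per post, and posts without links get an empty vector.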
Complex regular expressions can be difficult to write and confusing to read. Sometimes it is easiest to simply search the web for a regular expression to perform a common task instead of writing one from scratch.

*Note: if you copy a regex written for another language it might not work in R without some modifications.

Wrap Up

In this lesson we learned several functions for dealing with text data in R and introduced regular expressions, a powerful tool for matching substrings in text. Regular expressions are used in many programming languages, and although the syntax for regex varies a bit from one language to another, the basic constructs are similar across languages.
Next time we'll turn our attention to cleaning and preparing numeric data.
