Wednesday, July 29, 2015

Introduction to R Part 10: Reading and Writing Data


Reading data into your R environment is the first step in conducting data analysis. Data comes in many different forms and although R is equipped to deal with most data formats, this lesson will focus on reading common data formats like comma separated values files (CSV) and Microsoft Excel files.

R Working Directory and File Paths

Before we can jump in and starting loading data, we need to learn a little bit about R's working directory and file paths. When you run R, it starts in a default location in your computer's file system called the working directory. You can check your working directory with the getwd() function:
In [1]:
getwd()                                 # Get the current working directory
Out[1]:
"C:/Users/Greg"
The working directory acts as your starting point for accessing other files on your computer. To load data into R from your hard disk, you either need to put the data file in your working directory, change your working directory to the folder containing the data or supply the data's file path to the data reading function.
You can change your working directory by supplying a new file path in quotes to the setwd() function:
In [2]:
setwd("C:/Users/Greg/Desktop")         # Set a new working directory

getwd()                                # Check the working directory again
Out[2]:
"C:/Users/Greg/Desktop"
*Note: you can use forward slashes for your file path even in Windows which normally uses backslashes. If you want to use backslashes for file paths in Windows you should use double backslashes (\\)
Instead of worrying about slashes in filepaths, you can have R construct file paths for you using the file.path() function. It takes a comma separated sequence of character strings and then uses them to construct a file path string for you:
In [3]:
my_path <- file.path("C:","Users","Greg","Desktop","Kaggle")  # Construct path

print (my_path )                # Check the path

setwd(my_path)                  # Set the working directory to the path

getwd()                         # Check the working directory again
[1] "C:/Users/Greg/Desktop/Kaggle"
Out[3]:
"C:/Users/Greg/Desktop/Kaggle"
In RStudio you can also change the working directory under the "Session" dropdown menu. Under session select "Set working directory", "Choose Directory", navigate to the folder you want to set as your working directory and click "Select folder."
You can list the files and folders in the current working directory using the list.files() function:
In [4]:
list.files()       # A list of files and folders in my Kaggle directory
Out[4]:
  1. "~$draft2015.xlsx"
  2.  
  3. "africa_soil_properties"
  4.  
  5. "analytics_edge"
  6.  
  7. "analytics_edge2"
  8. "analytics_edge3"
  9.  
  10. "Applied Predictive Modeling notes.R"
  11.  
  12. "applied_logreg1.R"
  13. "avito_ad_clicks"
  14.  
  15. "BB_data.csv"
  16.  
  17. "bike_sharing"
  18.  
  19. "caltech6.R"
  20.  
  21. "caltech7.R"
  22. "caltechfinal.R"
  23.  
  24. "caltechfinalwork.R"
  25.  
  26. "cat_tubing"
  27.  
  28. "crowd_flower"
  29. "denoising_dirty_docs"
  30.  
  31. "digit_recognizer"
  32.  
  33. "dplyr.R"
  34.  
  35. "draft2015.csv"
  36.  
  37. "draft2015.tsv"
  38. "draft2015.xlsx"
  39.  
  40. "facebook_bots"
  41.  
  42. "finding_elo"
  43.  
  44. "forest_ground_cover"
  45. "helping_santa"
  46.  
  47. "liberty_mutual"
  48.  
  49. "march_madness2015"
  50.  
  51. "nba_analysis"
  52. "nba_by_numbers.html"
  53.  
  54. "nba_by_numbers.md"
  55.  
  56. "nba_by_numbers.Rmd"
  57. "otto_group"
  58.  
  59. "poker_rule_induction"
  60.  
  61. "poker_writeup.html"
  62.  
  63. "poker_writeup.md"
  64. "R_ intro.R"
  65.  
  66. "rcaltech.R"
  67.  
  68. "sf_crime"
  69.  
  70. "stufftolearn.txt"
  71.  
  72. "tab_food"
  73.  
  74. "taxi_prediction"
  75. "testing_python_in_knitr.html"
  76.  
  77. "testing_python_in_knitr.md"
  78. "testing_python_in_knitr.Rmd"
  79.  
  80. "titanic"
  81.  
  82. "tradeshift_text_classification"
  83.  
  84. "UTWeek7.R"
  85. "walmart_sales"
  86.  
  87. "west_nile"
  88.  
  89. "word2vec"

Read CSV and TSV Files

Data is commonly stored in simple text files consisting of values delimited (separated) by a special character. For instance, CSV files use commas as the delimiter and tab separated value files (TSV) use tabs as the delimiter.
You can use the read.csv() function to read CSV files into R:
In [5]:
draft <- read.csv(file ="draft2015.csv",      # Path to the file
                  stringsAsFactors = FALSE)   # Encode characters as factors?

print(head(draft,15))
                Player Draft_Express CBS CBS_2 CBS_3 BleacherReport SI
1   Karl-Anthony Towns             1   1     1     1              1  1
2        Jahlil Okafor             2   2     2     2              2  2
3      Emmanuel Mudiay             7   6     6     6              7  6
4     D'Angelo Russell             3   3     4     4              3  3
5   Kristaps Porzingis             6   5     3     3              4  4
6        Mario Hezonja             4   7     8     7              6  7
7      Justise Winslow             5   4     5     5              5  5
8  Willie Cauley-Stein            13   9     7    11              9 11
9      Stanley Johnson             8   8    12     8              8 10
10        Myles Turner            12  10    13    12             11 12
11        Bobby Portis            17  15    17    20             17 15
12        Devin Booker            10  11     9    13             13  8
13      Frank Kaminsky             9  12    10     9             12  9
14     Kelly Oubre Jr.            15  13    14    10             21 16
15          Trey Lyles            11  17    20    15             15 17
Data loaded into R via read.csv() becomes data frame.
To load tab separated values, include the sep argument and set it to the tab character "\t":
In [6]:
draft2 <- read.csv(file="draft2015.tsv",   # Path to the TSV file
          sep = "\t",                      # Use tabs as the delimiting character
          stringsAsFactors = FALSE)

print(head(draft2,15))
                Player Draft_Express CBS CBS_2 CBS_3 BleacherReport SI
1   Karl-Anthony Towns             1   1     1     1              1  1
2        Jahlil Okafor             2   2     2     2              2  2
3      Emmanuel Mudiay             7   6     6     6              7  6
4     D'Angelo Russell             3   3     4     4              3  3
5   Kristaps Porzingis             6   5     3     3              4  4
6        Mario Hezonja             4   7     8     7              6  7
7      Justise Winslow             5   4     5     5              5  5
8  Willie Cauley-Stein            13   9     7    11              9 11
9      Stanley Johnson             8   8    12     8              8 10
10        Myles Turner            12  10    13    12             11 12
11        Bobby Portis            17  15    17    20             17 15
12        Devin Booker            10  11     9    13             13  8
13      Frank Kaminsky             9  12    10     9             12  9
14     Kelly Oubre Jr.            15  13    14    10             21 16
15          Trey Lyles            11  17    20    15             15 17
The read.csv() function is an extension of a more general data reading function called read.table(). read.csv() just sets a few arguments of read.table() to values suitable for reading CSV and TSV files. The read.table() function has numerous additional arguments that have various effects on reading data; there are too many arguments to cover them all in detail here but you can always get more information by checking the function documents with ?read.table or help(read.table).

Read Excel Files

Microsoft Excel is a ubiquitous enterprise spreadsheet program that stores data in its own format with the extension .xls or .xlsx.
One simple way to read Excel data into R is to open an Excel workbook using Excel, save the data in CSV format or as a tab-delimited text file and then use the read.csv() function to load the data into R.
If you want to read data from a .xls or .xlsx file directly into R, you'll need to download a package. Packages are extensions to the base R software library that give you access to additional functions. You can install packages from CRAN by supplying the name of the package to the install.packages() function. To read Excel Files, we need the "xlsx" package. When you attempt to install a package in RStudio you will be prompted to select a web mirror; choose one close you.
In [7]:
install.packages("xlsx",  repos='http://cran.us.r-project.org')
package 'xlsx' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
 C:\Users\Greg\AppData\Local\Temp\RtmpOwcCd4\downloaded_packages
*Note: I had to supply a CRAN mirror manually since I'm using a program that makes it easy to export text and code to a web friendly format instead of RStudio.
*Note: when you install a package, it may have dependencies that have to be installed first.
After installing a package, you can load it into your R environment with the library() function:
In [8]:
library(xlsx)       # library() loads in a package and its dependencies
Loading required package: rJava
Loading required package: xlsxjars
With our new package in hand, we can use its read.xlsx() function to read Excel files directly:
In [9]:
draft3 <- read.xlsx("draft2015.xlsx", 1)  # Reads the first worksheet in the file

print(head(draft3))
              Player Draft_Express CBS CBS_2 CBS_3 BleacherReport SI
1 Karl-Anthony Towns             1   1     1     1              1  1
2      Jahlil Okafor             2   2     2     2              2  2
3    Emmanuel Mudiay             7   6     6     6              7  6
4   D'Angelo Russell             3   3     4     4              3  3
5 Kristaps Porzingis             6   5     3     3              4  4
6      Mario Hezonja             4   7     8     7              6  7
If you want to read a specific worksheet in an excel workbook, supply the sheetName argument:
In [10]:
dummy_data <- read.xlsx("draft2015.xlsx", 
                        sheetName="dummy_data") # Loads in the specified worksheet

print(dummy_data)
        This Is Dummy Data
1  sometimes  2     0 fast
2    missing  4     0 fast
3       <NA>  7     1 slow
4       data  5     1 slow
5         is  3     0 fast
6       <NA>  4     0 slow
7  sometimes  6     0 fast
8  sometimes  5    NA slow
9       <NA>  5    NA fast
10   missing  4     0 fast

Reading Web Data

The Internet gives you access to more data than you could ever hope to analyze. Data analysis often begins with getting data from the web and loading it into R. Websites that offer data for download usually let you download data as CSV, TSV or excel files.
The easiest way to use web data in R, is to simply download data to your hard drive in CSV, TSV or an excel file format and then use the functions we discussed earlier to load the data into R. You can supply a url to read.csv() or read.table() to read data directly from the web, but doing so can be problematic since web data isn't always formatted nicely. It can be helpful to do a little data cleaning, such as deleting unnecessary titles, images or other oddities in excel or a text editor to prepare data for use in R. In addition, large data sets often come in compressed formats like .zip and need to be decompressed before loading them into R so they aren't always easy loaded directly from the web.
Reading from the clipboard is another option for reading web data and other tabular data. To read in data from the clipboard, highlight the data you want to copy and use the appropriate copy function as if you were going to copy and paste the data. Next, use the read.csv() or read.table() function with the the first argument set to "clipboard":
In [11]:
# Go to http://www.basketball-reference.com/leagues/NBA_2015_totals.html
# click the CSV button to format data and then copy some data to the clipboard

BB_reference_data <- read.csv("clipboard")       # Read data from the clipboard

print ( head(BB_reference_data, 10) )            # Check the data
   Rk         Player Pos Age  Tm  G GS   MP  FG FGA   FG. X3P X3PA X3P.1 X2P
1   1     Quincy Acy  PF  24 NYK 68 22 1287 152 331 0.459  18   60 0.300 134
2   2   Jordan Adams  SG  20 MEM 30  0  248  35  86 0.407  10   25 0.400  25
3   3   Steven Adams   C  21 OKC 70 67 1771 217 399 0.544   0    2 0.000 217
4   4    Jeff Adrien  PF  28 MIN 17  0  215  19  44 0.432   0    0    NA  19
5   5  Arron Afflalo  SG  29 TOT 78 72 2502 375 884 0.424 118  333 0.354 257
6   5  Arron Afflalo  SG  29 DEN 53 53 1750 281 657 0.428  82  243 0.337 199
7   5  Arron Afflalo  SG  29 POR 25 19  752  94 227 0.414  36   90 0.400  58
8   6  Alexis Ajinca   C  26 NOP 68  8  957 181 329 0.550   0    0    NA 181
9   7 Furkan Aldemir  PF  23 PHI 41  9  540  40  78 0.513   0    5 0.000  40
10  8   Cole Aldrich   C  26 NYK 61 16  976 144 301 0.478   0    0    NA 144
   X2PA X2P.1  eFG.  FT FTA   FT. ORB DRB TRB AST STL BLK TOV  PF  PTS
1   271 0.494 0.486  76  97 0.784  79 222 301  68  27  22  60 147  398
2    61 0.410 0.465  14  23 0.609   9  19  28  16  16   7  14  24   94
3   397 0.547 0.544 103 205 0.502 199 324 523  66  38  86  99 222  537
4    44 0.432 0.432  22  38 0.579  23  54  77  15   4   9   9  30   60
5   551 0.466 0.491 167 198 0.843  27 220 247 129  41   7 116 167 1035
6   414 0.481 0.490 127 151 0.841  21 159 180 101  32   5  83 108  771
7   137 0.423 0.493  40  47 0.851   6  61  67  28   9   2  33  59  264
8   329 0.550 0.550  81  99 0.818 104 211 315  47  21  51  69 151  443
9    73 0.548 0.513  13  27 0.481  78  98 176  28  17  16  17  96   93
10  301 0.478 0.478  50  64 0.781 101 237 338  75  37  65  59 122  338
Data comes in all sorts of formats other than the friendly ones we've discussed thus far. R has functions and packages for working with data in other common data formats like SAS, SPSS and Stata files, json, xml, html and databases. We won't cover how to deal with all the different data sources you might encounter in this lesson, but rest assured that there is probably a way to work with your data in R if you do some digging. If you encounter a data source you don't know how to work with, a little bit of Googling will usually reveal how to convert it into a more familiar format or use an R package to deal with it directly.

Writing Data To CSV

In the course of cleaning data, data analysis and predictive modeling, you'll generate new data. You can write data in an R data frame to CSV using the write.csv() function:
In [12]:
write.csv(BB_reference_data,      # Name of variable assigned to the data       
         "BB_data.csv",           # Name of the file to create to store the data
          row.names = FALSE,)     # Whether to include row names in the file
Data is written to your current working directory. It's a good idea to save data after long, computationally expensive operations so that you don't lose progress or results.
Now that we know the basics of reading and writing data, we are almost ready to start exploring data, but before diving in we will spend a couple lessons learning basic R programming constructs. 

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.