Reading data into your R environment is the first step in conducting data analysis. Data comes in many different forms, and although R is equipped to deal with most data formats, this lesson focuses on reading common ones: comma-separated values (CSV) files and Microsoft Excel files.
R Working Directory and File Paths
Before we can jump in and start loading data, we need to learn a little bit about R's working directory and file paths. When you run R, it starts in a default location in your computer's file system called the working directory. You can check your working directory with the getwd() function:
In [1]:
getwd() # Get the current working directory
Out[1]:
The working directory acts as your starting point for accessing other files on your computer. To load data into R from your hard disk, you either need to put the data file in your working directory, change your working directory to the folder containing the data or supply the data's file path to the data reading function.
You can change your working directory by supplying a new file path in quotes to the setwd() function:
In [2]:
setwd("C:/Users/Greg/Desktop") # Set a new working directory
getwd() # Check the working directory again
Out[2]:
*Note: you can use forward slashes in file paths even on Windows, which normally uses backslashes. If you want to use backslashes in Windows file paths, write them as double backslashes (\\), because a single backslash starts an escape sequence in an R string.
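A small sketch of a related habit: setwd() invisibly returns the directory you were in before the change, so you can store it and switch back later. The temporary folder from tempdir() is only a stand-in for a real project folder, and the variable names are made up for illustration.

```r
start_wd <- getwd()            # Where we are before changing anything
old_wd <- setwd(tempdir())     # Move to a temporary folder; keep the previous directory
print(getwd())                 # Confirm the change
setwd(old_wd)                  # Go back to where we started
print(getwd())                 # Check that we are back
```

Restoring the directory this way keeps a script from leaving your session pointed at an unexpected folder.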
Instead of worrying about slashes in file paths, you can have R construct them for you using the file.path() function. It takes a comma-separated sequence of character strings and joins them into a single file path string:
In [3]:
my_path <- file.path("C:","Users","Greg","Desktop","Kaggle") # Construct path
print(my_path) # Check the path
setwd(my_path) # Set the working directory to the path
getwd() # Check the working directory again
Out[3]:
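file.path() works just as well for relative paths, and base R has companion functions for taking a path apart again. The sketch below uses a hypothetical folder layout (data/raw) that is not part of this lesson's files:

```r
csv_path <- file.path("data", "raw", "draft2015.csv")  # Hypothetical relative path
print(csv_path)               # The assembled path string
print(basename(csv_path))     # The file name portion only
print(dirname(csv_path))      # The folder portion only
print(file.exists(csv_path))  # TRUE only if that file really exists on your machine
```

Checking file.exists() before reading a file is a cheap way to catch a wrong working directory early.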
In RStudio you can also change the working directory from the "Session" dropdown menu: select "Set working directory", then "Choose Directory", navigate to the folder you want to use as your working directory and click "Select folder."
You can list the files and folders in the current working directory using the list.files() function:
In [4]:
list.files() # A list of files and folders in my Kaggle directory
Out[4]:
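If a folder holds many files, the pattern argument of list.files() narrows the listing with a regular expression. This sketch builds a scratch folder with made-up file names so it can run anywhere:

```r
demo_dir <- file.path(tempdir(), "list_files_demo")  # Hypothetical scratch folder
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.csv", "b.csv", "notes.txt")))  # Empty demo files

csv_files <- list.files(demo_dir, pattern = "\\.csv$")  # Only names ending in .csv
print(csv_files)

# full.names = TRUE returns paths you can pass straight to a reading function
print(list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE))
```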
Read CSV and TSV Files
Data is commonly stored in simple text files consisting of values delimited (separated) by a special character. For instance, CSV files use commas as the delimiter and tab separated value files (TSV) use tabs as the delimiter.
You can use the read.csv() function to read CSV files into R:
In [5]:
draft <- read.csv(file ="draft2015.csv", # Path to the file
stringsAsFactors = FALSE) # Encode characters as factors?
print(head(draft,15))
Data loaded into R via read.csv() becomes a data frame.
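To see this for yourself without the draft file, here is a minimal sketch that writes a tiny made-up CSV to a temporary file and inspects what read.csv() returns. The column names and values are invented for illustration:

```r
csv_file <- tempfile(fileext = ".csv")   # Temporary file for the demo
writeLines(c("Player,Pick,College",
             "Player A,1,College X",
             "Player B,2,College Y"), csv_file)

demo <- read.csv(csv_file, stringsAsFactors = FALSE)
print(class(demo))   # The type of object read.csv() returns
str(demo)            # Column names and the type R chose for each column
print(dim(demo))     # Number of rows and columns
```

str() is a quick way to confirm that numbers were read as numbers and text as character strings.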
To load tab separated values, include the sep argument and set it to the tab character "\t":
In [6]:
draft2 <- read.csv(file="draft2015.tsv", # Path to the TSV file
sep = "\t", # Use tabs as the delimiting character
stringsAsFactors = FALSE)
print(head(draft2,15))
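Base R also ships read.delim(), which uses the tab character as its default separator, so you can skip the sep argument. The sketch below uses a small made-up TSV file to compare the two approaches:

```r
tsv_file <- tempfile(fileext = ".tsv")   # Temporary file for the demo
writeLines(c("Player\tPick", "Player A\t1", "Player B\t2"), tsv_file)

via_csv   <- read.csv(tsv_file, sep = "\t", stringsAsFactors = FALSE)
via_delim <- read.delim(tsv_file, stringsAsFactors = FALSE)  # Tab is the default separator

print(via_delim)
print(identical(via_csv, via_delim))   # Compare the two results
```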
The read.csv() function is a wrapper around a more general data reading function called read.table(). read.csv() just sets a few arguments of read.table() to defaults suitable for reading CSV files. The read.table() function has numerous additional arguments that affect how data is read; there are too many to cover in detail here, but you can always get more information by checking the function documentation with ?read.table or help(read.table).
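As a rough illustration of that relationship, the sketch below reads the same made-up file twice: once with read.csv() and once with read.table(), spelling out the defaults that read.csv() sets according to its help page.

```r
csv_file <- tempfile(fileext = ".csv")   # Temporary file for the demo
writeLines(c("name,score", "a,1.5", "b,2.5"), csv_file)

with_csv <- read.csv(csv_file)

# read.table() with the defaults that read.csv() supplies on your behalf
with_table <- read.table(csv_file, header = TRUE, sep = ",", quote = "\"",
                         dec = ".", fill = TRUE, comment.char = "")

print(all.equal(with_csv, with_table))   # Compare the two data frames
```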
Read Excel Files
Microsoft Excel is a ubiquitous enterprise spreadsheet program that stores data in its own format with the extension .xls or .xlsx.
One simple way to read Excel data into R is to open an Excel workbook using Excel, save the data in CSV format or as a tab-delimited text file and then use the read.csv() function to load the data into R.
If you want to read data from a .xls or .xlsx file directly into R, you'll need to download a package. Packages are extensions to the base R software library that give you access to additional functions. You can install packages from CRAN by supplying the name of the package to the install.packages() function. To read Excel files, we need the "xlsx" package. When you attempt to install a package in RStudio you may be prompted to select a web mirror; choose one close to you.
In [7]:
install.packages("xlsx", repos='http://cran.us.r-project.org')
*Note: I had to supply a CRAN mirror manually since I'm using a program that makes it easy to export text and code to a web friendly format instead of RStudio.
*Note: when you install a package, it may have dependencies that have to be installed first.
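Since a package only needs to be installed once, it can be handy to check whether it is already available before installing again. A minimal sketch using base R's requireNamespace(), which returns TRUE or FALSE instead of stopping with an error:

```r
pkg <- "xlsx"   # The package this lesson uses
installed <- requireNamespace(pkg, quietly = TRUE)   # TRUE if the package can be loaded

if (installed) {
  message("Package '", pkg, "' is already installed.")
} else {
  message("Package '", pkg, "' is not installed yet; install.packages() would fetch it.")
}
```

This sketch only reports the status; it deliberately does not install anything on its own.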
After installing a package, you can load it into your R environment with the library() function:
In [8]:
library(xlsx) # library() loads in a package and its dependencies
With our new package in hand, we can use its read.xlsx() function to read Excel files directly:
In [9]:
draft3 <- read.xlsx("draft2015.xlsx", 1) # Reads the first worksheet in the file
print(head(draft3))
If you want to read a specific worksheet in an Excel workbook, supply the sheetName argument:
In [10]:
dummy_data <- read.xlsx("draft2015.xlsx",
sheetName="dummy_data") # Loads in the specified worksheet
print(dummy_data)
Reading Web Data
The Internet gives you access to more data than you could ever hope to analyze. Data analysis often begins with getting data from the web and loading it into R. Websites that offer data for download usually let you download it as CSV, TSV or Excel files.
The easiest way to use web data in R is to download it to your hard drive as a CSV, TSV or Excel file and then use the functions we discussed earlier to load it into R. You can supply a URL to read.csv() or read.table() to read data directly from the web, but doing so can be problematic since web data isn't always formatted nicely. It can be helpful to do a little data cleaning first, such as deleting unnecessary titles, images or other oddities in Excel or a text editor, to prepare the data for use in R. In addition, large data sets often come in compressed formats like .zip and need to be decompressed before loading them into R, so they aren't always easily loaded directly from the web.
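For compressed files, base R's connection functions can sometimes save you the manual decompression step. The sketch below uses gzip (.gz) rather than .zip because it can be created and read with base R alone; for .zip archives, base R offers unzip() and unz(). The data is made up for illustration.

```r
scores <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))  # Made-up data

gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")                 # Open a gzip-compressed file for writing
write.csv(scores, con, row.names = FALSE)   # Write CSV text through the connection
close(con)

scores_back <- read.csv(gzfile(gz_path))    # Read the compressed file back in
print(scores_back)
```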
Reading from the clipboard is another option for reading web data and other tabular data. To read in data from the clipboard, highlight the data you want and copy it as if you were going to paste it somewhere. Next, use the read.csv() or read.table() function with the first argument set to "clipboard". Note that the "clipboard" name is platform-dependent: it works on Windows, while other systems may need a different approach.
In [11]:
# Go to http://www.basketball-reference.com/leagues/NBA_2015_totals.html
# click the CSV button to format data and then copy some data to the clipboard
BB_reference_data <- read.csv("clipboard") # Read data from the clipboard
print(head(BB_reference_data, 10)) # Check the data
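If the clipboard route is not available on your system, a similar trick is to paste the copied text into a character string and hand it to read.csv() through its text argument. The rows below are made-up placeholders, not real statistics:

```r
# Text pasted into the script; the values are invented for illustration
pasted <- "Player,Pos,PTS
Player A,PG,1000
Player B,C,850"

pasted_data <- read.csv(text = pasted, stringsAsFactors = FALSE)
print(pasted_data)   # Check the data
```

This also makes the script reproducible, since the data travels with the code instead of living on the clipboard.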
Data comes in all sorts of formats other than the friendly ones we've discussed thus far. R has functions and packages for working with other common formats like SAS, SPSS and Stata files, JSON, XML, HTML and databases. We won't cover every data source you might encounter in this lesson, but rest assured that there is probably a way to work with your data in R if you do some digging. If you encounter a data source you don't know how to work with, a little bit of Googling will usually reveal how to convert it into a more familiar format or use an R package to deal with it directly.
Writing Data To CSV
In the course of cleaning data, data analysis and predictive modeling, you'll generate new data. You can write data in an R data frame to CSV using the write.csv() function:
In [12]:
write.csv(BB_reference_data, # Name of variable assigned to the data
"BB_data.csv", # Name of the file to create to store the data
row.names = FALSE) # Whether to include row names in the file
Because the file name above is a relative path, the file is written to your current working directory. It's a good idea to save data after long, computationally expensive operations so that you don't lose progress or results.
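A quick way to convince yourself that a file was saved correctly is to read it back and compare it with the original. This sketch uses a small made-up data frame and a temporary folder so it does not touch your working directory:

```r
results <- data.frame(model = c("baseline", "tuned"),   # Made-up results
                      accuracy = c(0.81, 0.86),
                      stringsAsFactors = FALSE)

out_file <- file.path(tempdir(), "results.csv")   # Hypothetical output location
write.csv(results, out_file, row.names = FALSE)
print(file.exists(out_file))                      # Confirm the file was created

reloaded <- read.csv(out_file, stringsAsFactors = FALSE)
print(all.equal(results, reloaded))               # Compare saved and reloaded data
```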
Now that we know the basics of reading and writing data, we are almost ready to start exploring data, but before diving in we will spend a couple of lessons learning basic R programming constructs.