In the last two lessons we learned a variety of methods to prepare character and numeric data, but many data sets also contain dates that don't fit nicely into either category. Common date formats contain numbers and sometimes characters to specify months and days. Getting dates into a friendly format and extracting features of dates like month and year into new variables can be useful preprocessing steps.
For this lesson I've created some dummy date data in a few different formats. To read the data, copy the table of dates below and then run read.csv("clipboard", sep="\t", stringsAsFactors=FALSE):
In [1]:
dates <- read.csv("clipboard", sep="\t", stringsAsFactors=FALSE) # Load dates
In [2]:
dates # Check dates
Out[2]:
*Note: Your date data will contain an extra variable called "X" for the copied row names. Remove it with dates$X <- NULL
When you load data with date columns into R, each date column is typically stored as a character vector:
In [3]:
dates[1,1]
typeof(dates[1,1])
Out[3]:
Out[3]:
To work with dates in R, you need to convert them from character format to a date format. R contains a built-in function, as.Date(), that converts strings to dates:
In [4]:
first_col <- as.Date(dates$month_day_year, # Character vector to convert
format= "%m/%d/%y") # Format of the dates to convert
first_col # Check the new dates
typeof(first_col) # Check their type
Out[4]:
Out[4]:
When you use as.Date() you have to provide the format of the dates in the character data you are trying to convert. In the example above, the dates were in the month, day, year format with each number separated by a slash, so we had to provide the format string "%m/%d/%y". The default format for as.Date() is year, month, day separated by slashes or hyphens. The final column in our data set is in the default format, so we could convert it without supplying a custom format:
In [5]:
forth_col <- as.Date(dates$year_month_day)
forth_col
typeof(forth_col)
Out[5]:
Out[5]:
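As a minimal sketch of the difference between a custom format and the default format, the snippet below converts two made-up character vectors (they are illustrative assumptions, not the lesson's data) and shows what happens when the format string does not match the text:

```r
# Hypothetical example strings, not the lesson's data
us_style  <- c("04/22/96", "12/31/05")      # month/day/2-digit year
iso_style <- c("1996-04-22", "2005-12-31")  # year-month-day (default format)

d1 <- as.Date(us_style, format = "%m/%d/%y")  # custom format required
d2 <- as.Date(iso_style)                      # no format needed

d1 == d2   # both vectors now hold the same dates

# A format that does not match the strings yields NA rather than an error
bad <- as.Date(us_style, format = "%Y-%m-%d")
is.na(bad)
```

Because a mismatched format silently produces NA, it can be worth checking for missing values with is.na() after every conversion.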
The following is a list of date formatting codes:
In [6]:
# %d -> Day
# %m -> Numeric Month
# %b -> Abbreviated Month
# %B -> Full Month
# %y -> 2-digit year
# %Y -> 4-digit year
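The same codes also work in the other direction: format() turns a Date back into text. The following is a small sketch using a made-up date (an assumption for illustration). Codes that produce names, such as %b and %B, depend on your locale, so their output is not shown.

```r
d <- as.Date("2015-03-05")   # hypothetical date for illustration

format(d, "%d/%m/%Y")   # "05/03/2015"
format(d, "%m-%d-%y")   # "03-05-15"
format(d, "%d %B %Y")   # month name depends on the locale

# Parsing with a matching format string recovers the same date
round_trip <- as.Date("05/03/2015", format = "%d/%m/%Y")
round_trip == d         # TRUE
```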
The dates we've printed to the screen might still look like character strings, but internally they are stored as numbers. (Note that the type has changed to "double".) R stores dates internally as the number of days since the first day of 1970, with dates before 1970 being stored as negative numbers. You can check the underlying numeric representation of a date with as.numeric():
In [7]:
as.numeric(forth_col)
Out[7]:
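To make the day-count representation concrete, here is a short sketch using dates around the start of 1970 (chosen for illustration; they are not from the lesson's data). It also shows that typeof() reports the storage type while class() reports that the object is a Date:

```r
epoch <- as.Date("1970-01-01")

as.numeric(epoch)                   # 0
as.numeric(as.Date("1970-01-02"))   # 1
as.numeric(as.Date("1969-12-31"))   # -1

typeof(epoch)   # "double": how the value is stored
class(epoch)    # "Date":   how R treats and prints it

# Converting a day count back to a date requires an origin
one_year_on <- as.Date(365, origin = "1970-01-01")
one_year_on     # "1971-01-01"
```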
Date objects let you perform subtraction to check how many days passed between two dates:
In [8]:
forth_col[2]-forth_col[1]
Out[8]:
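The result of subtracting two dates is a "difftime" object rather than a plain number. A brief sketch with two made-up dates (assumptions for illustration) shows how to get at the number and how to add days to a date:

```r
start <- as.Date("2015-02-01")   # hypothetical dates
end   <- as.Date("2015-03-01")

gap <- end - start
gap               # Time difference of 28 days
class(gap)        # "difftime"
as.numeric(gap)   # 28, as a plain number

# Adding a number to a Date moves it forward by that many days
later <- start + 30
later             # "2015-03-03"
```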
You can also extract the day of the week and month with weekdays() and months() respectively:
In [9]:
weekdays(forth_col)
months(forth_col)
Out[9]:
Out[9]:
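Extracting date features into new variables, as mentioned at the start of the lesson, could look something like the sketch below. The data frame and its column names are hypothetical. Numeric features are pulled out with format() and converted to integers, while weekdays() and months() return names that depend on your locale.

```r
# Hypothetical data frame with a single Date column
df <- data.frame(when = as.Date(c("2015-01-15", "2016-07-04")))

df$year  <- as.integer(format(df$when, "%Y"))   # 4-digit year
df$month <- as.integer(format(df$when, "%m"))   # numeric month
df$day   <- as.integer(format(df$when, "%d"))   # day of the month
df$weekday <- weekdays(df$when)                 # name; locale-dependent

df
```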
You can check the current date using Sys.Date():
In [10]:
Sys.Date()
Out[10]:
And the current date/time with date():
In [11]:
date()
Out[11]:
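Note that these two functions return different kinds of objects. The short sketch below checks their classes; Sys.time() is not used elsewhere in this lesson and is included here as an extra, since it returns the current date/time as a date/time object rather than as text:

```r
today <- Sys.Date()
class(today)     # "Date"

stamp <- date()
class(stamp)     # "character": a formatted string, not a date object

now_ct <- Sys.time()
class(now_ct)    # "POSIXct" "POSIXt"
```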
Date-Time Conversion
The as.Date() function is a basic tool for dealing with dates, but it does not handle data that includes both dates and times. Date/time data includes the date as well as finer-grained time information like hours, minutes and seconds. R contains two classes, 'POSIXct' and 'POSIXlt', to deal with date/time data. POSIXct encodes a date/time as the number of seconds since the first day of 1970. POSIXlt stores date/time information as a list with items like year, month, day, hour, minute and second. You can convert dates in string format to POSIX date types using as.POSIXct() and as.POSIXlt():
In [12]:
third_col_ct <- as.POSIXct(dates$date_time, # Date/time to convert
format = "%a %b %d %H:%M:%S %Y") # Date/time format
third_col_ct # Check dates
typeof(third_col_ct) # Check type
Out[12]:
Out[12]:
In [13]:
third_col_lt <- as.POSIXlt(dates$date_time, # Date/time to convert
format = "%a %b %d %H:%M:%S %Y") # Date/time format*
third_col_lt # Check dates
typeof(third_col_lt) # Check type
Out[13]:
Out[13]:
*Note: check the documentation for the strftime function with ?strftime for more information on date/time formatting codes.
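The difference between the two classes is easiest to see on a single value. The sketch below uses a made-up timestamp and fixes the time zone to UTC (both are assumptions for illustration) so that the results do not depend on your machine's settings:

```r
stamp <- "1970-01-02 00:00:30"   # hypothetical timestamp

ct <- as.POSIXct(stamp, tz = "UTC")
lt <- as.POSIXlt(stamp, tz = "UTC")

as.numeric(ct)   # 86430: seconds since 1970-01-01 00:00:00 UTC
typeof(ct)       # "double"
typeof(lt)       # "list"

lt$mday          # 2
lt$sec           # 30

# The two classes can be converted into one another
as.numeric(as.POSIXct(lt)) == as.numeric(ct)   # TRUE
```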
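Both POSIXct and POSIXlt support subtraction to get the amount of time between two date/times. R picks the units of the result (seconds, minutes, hours or days) automatically based on the size of the difference: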
In [14]:
third_col_ct[2]-third_col_ct[1]
third_col_lt[2]-third_col_lt[1]
Out[14]:
Out[14]:
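If you need the difference in a specific unit, difftime() lets you request it. A minimal sketch with made-up timestamps in UTC (assumptions for illustration):

```r
t1 <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC")
t2 <- as.POSIXct("2015-01-01 06:00:00", tz = "UTC")
t3 <- as.POSIXct("2015-01-04 00:00:00", tz = "UTC")

units(t2 - t1)   # "hours": units are chosen automatically
units(t3 - t1)   # "days"

# Ask for specific units with difftime()
in_days <- difftime(t2, t1, units = "days")
as.numeric(in_days)                            # 0.25
as.numeric(difftime(t2, t1, units = "mins"))   # 360
```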
You can extract various features of a date/time encoded as POSIXlt:
In [15]:
third_col_lt$year # years since 1900
third_col_lt$mon # numeric month (0-11, January is 0)
third_col_lt$wday # day of the week (0-6, Sunday is 0)
third_col_lt$mday # day of the month (1-31)
third_col_lt$yday # day of the year (0-365)
third_col_lt$hour # hours
third_col_lt$min # minutes
third_col_lt$sec # seconds
Out[15]:
Out[15]:
Out[15]:
Out[15]:
Out[15]:
Out[15]:
Out[15]:
Out[15]:
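Several POSIXlt fields are offsets rather than calendar values: year counts from 1900, while mon, wday and yday start at zero. The sketch below uses a made-up timestamp in UTC (an assumption for illustration) to show the adjustments needed to get familiar calendar numbers:

```r
lt <- as.POSIXlt("2015-03-15 13:45:30", tz = "UTC")   # a Sunday

lt$year          # 115
lt$year + 1900   # 2015: calendar year
lt$mon           # 2
lt$mon + 1       # 3: calendar month (March)
lt$wday          # 0: Sunday
lt$mday          # 15
lt$yday + 1      # 74: day of the year, counting from 1
```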
Lubridate
Lubridate is an R package designed to make it easy to work with dates. Lubridate contains a variety of functions that operate on dates stored in the POSIXct and POSIXlt formats.
Let's install and load lubridate and then go through some of its functions:
In [16]:
# install.packages("lubridate") # Uncomment this line to install
library(lubridate)
Lubridate has some useful functions for dealing with POSIX dates:
In [17]:
am(third_col_lt) # Check if date/time occurred in am(TRUE) or pm(FALSE)
Out[17]:
In [18]:
decimal_date(third_col_lt) # Get decimal version of date/time in years
Out[18]:
In [19]:
leap_year(third_col_lt) # Is it a leap year?
Out[19]:
In [20]:
round_date(third_col_lt,
unit = c("year")) # Round date/time based on specified unit
Out[20]:
In [21]:
ceiling_date(third_col_lt,
unit = c("year")) # Round date/time up based on specified unit
Out[21]:
In [22]:
floor_date(third_col_lt,
unit = c("year")) # Round date/time down based on specified unit
Out[22]:
In [23]:
hour(third_col_lt) # Get hours
Out[23]:
In [24]:
minute(third_col_lt) # Get minutes
Out[24]:
In [25]:
second(third_col_lt) # get seconds
Out[25]:
In [26]:
month(third_col_lt) # Get month
Out[26]:
In [27]:
year(third_col_lt) # get year
Out[27]:
In [28]:
mday(third_col_lt) # Get day of month
Out[28]:
In [29]:
wday(third_col_lt) # Get day of week
Out[29]:
In [30]:
yday(third_col_lt) # Get day of year
Out[30]:
In [31]:
now() # Get the current date/time
Out[31]:
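To see several of these functions on a value with known answers, here is a small sketch using a made-up timestamp in UTC (an assumption for illustration). It assumes lubridate is installed and skips itself otherwise. Unlike the raw POSIXlt fields, lubridate's year() and month() return ordinary calendar numbers.

```r
# Runs only if lubridate is installed
if (requireNamespace("lubridate", quietly = TRUE)) {
  x <- as.POSIXct("2016-02-29 13:45:30", tz = "UTC")   # hypothetical timestamp

  parts <- c(year   = lubridate::year(x),     # 2016
             month  = lubridate::month(x),    # 2
             mday   = lubridate::mday(x),     # 29
             hour   = lubridate::hour(x),     # 13
             minute = lubridate::minute(x))   # 45
  print(parts)

  print(lubridate::am(x))          # FALSE: 13:45 is in the afternoon
  print(lubridate::leap_year(x))   # TRUE: 2016 is a leap year

  month_start <- lubridate::floor_date(x, unit = "month")
  next_year   <- lubridate::ceiling_date(x, unit = "year")
  print(month_start)               # 2016-02-01 UTC
  print(next_year)                 # 2017-01-01 UTC
}
```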
Lubridate also contains some more advanced functions, such as functions for specifying time periods and checking whether dates lie within time periods. We won't get into all the advanced functionality Lubridate offers, but it may be worth your time to dig into the package further if you need to perform some fancy operations with dates.
Wrap Up
Date data often requires some preprocessing before you can use it effectively. Base R has most of the tools you need to deal with dates, but the Lubridate package adds some convenience functions and extra functionality that can make dates a little easier to use.
Cleaning and preprocessing numeric, character and date data is sometimes all you need to do before you start a project. In some cases, however, your data may be split across several tables, such as different worksheets in an Excel file or different tables in a database. In these cases, you might have to combine two tables before proceeding with your project.