Tuesday, August 11, 2015

Introduction to R Part 16: Dealing With Dates


In the last two lessons we learned a variety of methods to prepare character and numeric data, but many data sets also contain dates that don't fit nicely into either category. Common date formats contain numbers and sometimes characters to specify months and days. Getting dates into a friendly format and extracting features of dates like month and year into new variables can be useful preprocessing steps.
For this lesson I've created some dummy date data in a few different formats. To read the data, copy the table of dates below and then use read.csv("clipboard", sep="\t", stringsAsFactors=FALSE)
In [1]:
dates <- read.csv("clipboard", sep="\t", stringsAsFactors=FALSE)   # Load dates
In [2]:
dates         # Check dates
Out[2]:
month_day_yearday_month_yeardate_timeyear_month_day
14/22/199622-Apr-96Tue Aug 11 09:50:35 19962007-06-22
24/23/199623-Apr-96Tue May 12 19:50:35 20162017-01-09
35/14/199614-May-96Mon Oct 14 09:50:35 20171998-04-12
45/15/199615-May-96Tue Jan 11 09:50:35 20182027-07-22
55/16/200116-May-01Fri Mar 11 07:30:36 20191945-11-15
65/17/200217-May-02Tue Aug 11 09:50:35 20201942-06-22
75/18/200318-May-03Wed Dec 21 09:50:35 20211887-06-13
85/19/200419-May-04Tue Jan 11 09:50:35 20221912-01-25
95/20/200520-May-05Sun Jul 10 19:40:25 20232007-06-22
*Note: Your date data will contain an extra variable called "X" for the copied row names. Remove it with dates$X <- NULL
When you load data with date columns into R, is typically stored as a character vector:
In [3]:
dates[1,1]

typeof(dates[1,1])
Out[3]:
"4/22/1996"
Out[3]:
"character"
To work with dates in R, you need to convert them from character format to a date format. R contains a built in function as.Date() that converts strings to dates:
In [4]:
first_col <- as.Date(dates$month_day_year,  # Character vector to convert
             format= "%m/%d/%y")            # Format of the dates to convert

first_col                                   # Check the new dates

typeof(first_col)                           # Check their type
Out[4]:
[1] "2019-04-22" "2019-04-23" "2019-05-14" "2019-05-15" "2020-05-16"
[6] "2020-05-17" "2020-05-18" "2020-05-19" "2020-05-20"
Out[4]:
"double"
When you use as.Date() you have to provide the format of the dates in the character data you are trying to convert. In the example above, the dates were in the month, day, year format with each number separated by a slash, so we had to provide the format string "%m/%d/%y". The default format for as.Date() is year, month, day separated by slashes or hyphens. The final column in our data set is in the default format, so we could convert it without supplying a custom format:
In [5]:
forth_col <- as.Date(dates$year_month_day)

forth_col

typeof(first_col)
Out[5]:
[1] "2007-06-22" "2017-01-09" "1998-04-12" "2027-07-22" "1945-11-15"
[6] "1942-06-22" "1887-06-13" "1912-01-25" "2007-06-22"
Out[5]:
"double"
The following is a list of date formatting codes:
In [6]:
# %d -> Day
# %m -> Numeric Month 
# %b -> Abbreviated Month
# %B -> Full Month
# %y -> 2-digit year
# %Y -> 4-digit year
The dates we've printed to the screen might still look like character strings, but internally they are stored as numbers. (Note that the type has changed to "double".). R stores dates internally as the number of days since the first day of 1970, with dates before 1970 being stored as negative numbers. You can check the underlying numeric representation of a date with as.numeric():
In [7]:
as.numeric(forth_col)
Out[7]:
  1. 13686
  2.  
  3. 17175
  4.  
  5. 10328
  6.  
  7. 21021
  8.  
  9. -8813
  10.  
  11. -10055
  12.  
  13. -30152
  14.  
  15. -21161
  16.  
  17. 13686
Date objects let you perform subtraction to check how many days passed between two dates:
In [8]:
forth_col[2]-forth_col[1]
Out[8]:
Time difference of 3489 days
You can also extract the day of the week and month with weekdays() and months() respectively:
In [9]:
weekdays(forth_col)

months(forth_col)
Out[9]:
  1. "Friday"
  2.  
  3. "Monday"
  4.  
  5. "Sunday"
  6.  
  7. "Thursday"
  8.  
  9. "Thursday"
  10.  
  11. "Monday"
  12.  
  13. "Monday"
  14.  
  15. "Thursday"
  16. "Friday"
Out[9]:
  1. "June"
  2.  
  3. "January"
  4.  
  5. "April"
  6.  
  7. "July"
  8.  
  9. "November"
  10.  
  11. "June"
  12.  
  13. "June"
  14.  
  15. "January"
  16.  
  17. "June"
You can check the current date using Sys.Date():
In [10]:
Sys.Date()
Out[10]:
[1] "2015-08-11"
And the current date/time with date():
In [11]:
date()
Out[11]:
"Tue Aug 11 14:56:36 2015"

Date-Time Conversion

The as.Date() function is a basic tool for dealing with dates, but it does not handle data that includes both dates and times. Date/time data includes the date as well as finer-grained time information like hours, minutes and seconds. R contains a couple of data classes, 'POSIXct' and'POSIXlt' to deal with date/time data. POSIXct encodes a date/time as the number of seconds since the first day of 1970. POSIXlt stores date/time information as a list with items like year, month, day, hour, minute and second. You can convert dates in string format to POSIX date types using as.POSIXct() and as.POSIXlt():
In [12]:
third_col_ct <- as.POSIXct(dates$date_time,                 # Date/time to convert
                          format = "%a %b %d %H:%M:%S %Y")  # Date/time format

third_col_ct                                                # Check dates

typeof(third_col_ct)                                        # Check type
Out[12]:
[1] "1996-08-11 09:50:35 CDT" "2016-05-12 19:50:35 CDT"
[3] "2017-10-14 09:50:35 CDT" "2018-01-11 09:50:35 CST"
[5] "2019-03-11 07:30:36 CDT" "2020-08-11 09:50:35 CDT"
[7] "2021-12-21 09:50:35 CST" "2022-01-11 09:50:35 CST"
[9] "2023-07-10 19:40:25 CDT"
Out[12]:
"double"
In [13]:
third_col_lt <- as.POSIXlt(dates$date_time,                 # Date/time to convert
                          format = "%a %b %d %H:%M:%S %Y")  # Date/time format*

third_col_lt                                                # Check dates

typeof(third_col_lt)                                        # Check type
Out[13]:
[1] "1996-08-11 09:50:35 CDT" "2016-05-12 19:50:35 CDT"
[3] "2017-10-14 09:50:35 CDT" "2018-01-11 09:50:35 CST"
[5] "2019-03-11 07:30:36 CDT" "2020-08-11 09:50:35 CDT"
[7] "2021-12-21 09:50:35 CST" "2022-01-11 09:50:35 CST"
[9] "2023-07-10 19:40:25 CDT"
Out[13]:
"list"
*Note: check the documents for the strftime function with ?strftime for more information on date/time formatting codes.
Both POSIXct and POSIXlt support subtraction to get the amount of time between two dates in days:
In [14]:
third_col_ct[2]-third_col_ct[1]

third_col_lt[2]-third_col_lt[1]
Out[14]:
Time difference of 7214.417 days
Out[14]:
Time difference of 7214.417 days
You can extract various features of a date/time encoded as POSIXlt:
In [15]:
third_col_lt$year   # years

third_col_lt$mon    # numeric month

third_col_lt$wday   # day of the week

third_col_lt$mday   # day of the month

third_col_lt$yday   # day of the year

third_col_lt$hour   # hours

third_col_lt$min    # minutes

third_col_lt$sec    # seconds
Out[15]:
  1. 96
  2.  
  3. 116
  4.  
  5. 117
  6.  
  7. 118
  8.  
  9. 119
  10.  
  11. 120
  12.  
  13. 121
  14.  
  15. 122
  16.  
  17. 123
Out[15]:
  1. 7
  2.  
  3. 4
  4.  
  5. 9
  6.  
  7. 0
  8.  
  9. 2
  10.  
  11. 7
  12.  
  13. 11
  14.  
  15. 0
  16.  
  17. 6
Out[15]:
  1. 0
  2.  
  3. 4
  4.  
  5. 6
  6.  
  7. 4
  8.  
  9. 1
  10.  
  11. 2
  12.  
  13. 2
  14.  
  15. 2
  16.  
  17. 1
Out[15]:
  1. 11
  2.  
  3. 12
  4.  
  5. 14
  6.  
  7. 11
  8.  
  9. 11
  10.  
  11. 11
  12.  
  13. 21
  14.  
  15. 11
  16.  
  17. 10
Out[15]:
  1. 223
  2.  
  3. 132
  4.  
  5. 286
  6.  
  7. 10
  8.  
  9. 69
  10.  
  11. 223
  12.  
  13. 354
  14.  
  15. 10
  16.  
  17. 190
Out[15]:
  1. 9
  2.  
  3. 19
  4.  
  5. 9
  6.  
  7. 9
  8.  
  9. 7
  10.  
  11. 9
  12.  
  13. 9
  14.  
  15. 9
  16.  
  17. 19
Out[15]:
  1. 50
  2.  
  3. 50
  4.  
  5. 50
  6.  
  7. 50
  8.  
  9. 30
  10.  
  11. 50
  12.  
  13. 50
  14.  
  15. 50
  16.  
  17. 40
Out[15]:
  1. 35
  2.  
  3. 35
  4.  
  5. 35
  6.  
  7. 35
  8.  
  9. 36
  10.  
  11. 35
  12.  
  13. 35
  14.  
  15. 35
  16.  
  17. 25

Lubridate

Lubridate is an R package designed to make it easy to work with dates. Lubridate contains a variety of functions that operate on dates stored in the POSIXct and POSIXlt formats.
Let's install and load lubridate and then go through some if its functions:
In [16]:
# install.packages("lubridate")     # Uncomment this line to install

library(lubridate)
Lubridate has some useful for functions for dealing with POSIX dates:
In [17]:
am(third_col_lt)        # Check if date/time occurred in am(TRUE) or pm(FALSE)
Out[17]:
  1. TRUE
  2.  
  3. FALSE
  4.  
  5. TRUE
  6.  
  7. TRUE
  8.  
  9. TRUE
  10.  
  11. TRUE
  12.  
  13. TRUE
  14.  
  15. TRUE
  16.  
  17. FALSE
In [18]:
decimal_date(third_col_lt)   # Get decimal version of date/time in years
Out[18]:
  1. 1996.61029634057
  2.  
  3. 2016.3628008943
  4.  
  5. 2017.78457112506
  6.  
  7. 2018.02852089675
  8. 2019.18978424658
  9.  
  10. 2020.61029634057
  11.  
  12. 2021.97098665018
  13.  
  14. 2022.02852089675
  15. 2023.52267963597
In [19]:
leap_year(third_col_lt)   # Is it a leap year?
Out[19]:
  1. TRUE
  2.  
  3. TRUE
  4.  
  5. FALSE
  6.  
  7. FALSE
  8.  
  9. FALSE
  10.  
  11. TRUE
  12.  
  13. FALSE
  14.  
  15. FALSE
  16.  
  17. FALSE
In [20]:
round_date(third_col_lt, 
          unit = c("year"))  # Round date/time based on specified unit
Out[20]:
[1] "1997-01-01 CST" "2016-01-01 CST" "2018-01-01 CST" "2018-01-01 CST"
[5] "2019-01-01 CST" "2021-01-01 CST" "2022-01-01 CST" "2022-01-01 CST"
[9] "2024-01-01 CST"
In [21]:
ceiling_date(third_col_lt,
          unit = c("year"))  # Round date/time up based on specified unit
Out[21]:
[1] "1997-01-01 CST" "2017-01-01 CST" "2018-01-01 CST" "2019-01-01 CST"
[5] "2020-01-01 CST" "2021-01-01 CST" "2022-01-01 CST" "2023-01-01 CST"
[9] "2024-01-01 CST"
In [22]:
floor_date(third_col_lt,
          unit = c("year"))  # Round date/time down based on specified unit
Out[22]:
[1] "1996-01-01 CST" "2016-01-01 CST" "2017-01-01 CST" "2018-01-01 CST"
[5] "2019-01-01 CST" "2020-01-01 CST" "2021-01-01 CST" "2022-01-01 CST"
[9] "2023-01-01 CST"
In [23]:
hour(third_col_lt)        # Get hours
Out[23]:
  1. 9
  2.  
  3. 19
  4.  
  5. 9
  6.  
  7. 9
  8.  
  9. 7
  10.  
  11. 9
  12.  
  13. 9
  14.  
  15. 9
  16.  
  17. 19
In [24]:
minute(third_col_lt)     # Get minutes
Out[24]:
  1. 50
  2.  
  3. 50
  4.  
  5. 50
  6.  
  7. 50
  8.  
  9. 30
  10.  
  11. 50
  12.  
  13. 50
  14.  
  15. 50
  16.  
  17. 40
In [25]:
second(third_col_lt)     # get seconds
Out[25]:
  1. 35
  2.  
  3. 35
  4.  
  5. 35
  6.  
  7. 35
  8.  
  9. 36
  10.  
  11. 35
  12.  
  13. 35
  14.  
  15. 35
  16.  
  17. 25
In [26]:
month(third_col_lt)      # Get month
Out[26]:
  1. 8
  2.  
  3. 5
  4.  
  5. 10
  6.  
  7. 1
  8.  
  9. 3
  10.  
  11. 8
  12.  
  13. 12
  14.  
  15. 1
  16.  
  17. 7
In [27]:
year(third_col_lt)       # get year
Out[27]:
  1. 1996
  2.  
  3. 2016
  4.  
  5. 2017
  6.  
  7. 2018
  8.  
  9. 2019
  10.  
  11. 2020
  12.  
  13. 2021
  14.  
  15. 2022
  16.  
  17. 2023
In [28]:
mday(third_col_lt)      # Get day of month
Out[28]:
  1. 11
  2.  
  3. 12
  4.  
  5. 14
  6.  
  7. 11
  8.  
  9. 11
  10.  
  11. 11
  12.  
  13. 21
  14.  
  15. 11
  16.  
  17. 10
In [29]:
wday(third_col_lt)      # Get day of week
Out[29]:
  1. 1
  2.  
  3. 5
  4.  
  5. 7
  6.  
  7. 5
  8.  
  9. 2
  10.  
  11. 3
  12.  
  13. 3
  14.  
  15. 3
  16.  
  17. 2
In [30]:
yday(third_col_lt)      # Get day of year
Out[30]:
  1. 224
  2.  
  3. 133
  4.  
  5. 287
  6.  
  7. 11
  8.  
  9. 70
  10.  
  11. 224
  12.  
  13. 355
  14.  
  15. 11
  16.  
  17. 191
In [31]:
now()                   # Get the current date/time
Out[31]:
[1] "2015-08-11 14:56:38 CDT"
Lubridate also contains some more advanced functions, such as functions for specifying time periods and checking whether dates lie within time periods. We won't get into all the advanced functionality Lubridate offers, but it may be worth your time to dig into the package further if you need to perform some fancy operations with dates.

Wrap Up

Date data often requires some preprocessing before you can use it effectively. Base R has most of the tools you need to deal with dates, but the Lubridate package adds some convenience functions and extra functionality that can make dates a little easier to use.
Cleaning and prepocessing numeric, character and date data is sometimes all you need to do before you start a project. In some cases, however, your data may be split across several tables such as different worksheets in an excel file or different tables in a database. In these cases, you might have combine two tables together before proceeding with your project.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.