# Datetime gotchas in R
# About time in R
# M. Johnston
# Thu Aug 10 13:06:35 2023 America/Los_Angeles ------------------------------
#---------------------------------------------------------#
# Takeaway: For date-times in R (POSIXt objects), there are differences between what
# your R session references, stores, and displays. It's very important to be aware
# of these differences to avoid datetime gotchas.
#---------------------------------------------------------#
# R follows the Unix (really POSIX) conventions for datetimes.
# R references your system's time zone setting when displaying time zone information on POSIXct objects, and when converting local times to and from UTC under the hood. You can find out what your system's timezone is with:
Sys.timezone()
# When you do not specify a timezone, R will still *DISPLAY* your local timezone when you print your POSIXct object, because it is referencing your system setting for the display. For example:
x = as.POSIXct("2022-08-18 12:05:00")
x
attributes(x)
# Once a datetime is converted to POSIXct in R, it is stored in memory as a single number (a double, not strictly an integer) throughout your R session. (POSIXlt, the other POSIXt class, instead stores a list of date and time components.)
# The value of that number is the number of seconds elapsed from 1970-01-01 00:00:00 UTC (the Unix origin) to our datetime. All calculations with datetimes use these underlying numbers, and then R converts the result back to a POSIXt datetime for display in the console. as.integer() is used below as a convenient way to look at the whole-second value.
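# A minimal sketch of the storage described above. The name z is illustrative and not part of the original script. One minute after the Unix origin should be stored as the number 60:
z = as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
typeof(z)           # "double": the storage type is a double, not an integer
as.numeric(z)       # 60 seconds since the Unix origin
as.numeric(z + 30)  # 90: arithmetic happens on the underlying seconds
as.POSIXct(60, origin = "1970-01-01", tz = "UTC") == z  # TRUE: the number converts back to the same instant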
### Examples:
# We make a POSIXct object called x, specifying the timezone. In California we're on Pacific time, which is PDT (UTC-7) in August, so let's use the Olson name for that zone:
x = as.POSIXct("2022-08-18 12:05:00", tz = "America/Los_Angeles")
x # R displays the time with an abbreviation of the specified timezone attached, PDT. NOTE: these abbreviations are not the same as the accepted timezone character strings (see ?OlsonNames).
attributes(x) # NOTE: the argument in as.POSIXct() is called "tz", but the timezone *attribute* of the POSIXt-class object created is called "tzone".
as.integer(x) # Even though our time is displayed in PDT, this is the number of seconds elapsed from the Unix origin (which is defined in UTC) to the instant x represents. R does the conversion from UTC to PDT behind the scenes when it displays the equivalent datetime in your console.
# Let's convert x to UTC by changing its tzone attribute:
attr(x, "tzone") = "UTC" # By doing this, we are telling R that the instant in time "x" is the same as it was before, but we would like to see that instant displayed in UTC, not PDT. This is essentially what lubridate::with_tz() does under the hood. It changes how the datetime is displayed in your console, but it does not change the actual instant in time that x represents.
x # We have converted the time that DISPLAYS (remember, originally it was 12:05, now it's displaying as 19:05. This is the same instant in time, displayed in different time zones), and...
attributes(x) # ...we have re-assigned the tzone attribute; it's now UTC instead of America/Los_Angeles.
as.integer(x) # But even though we changed the timezone, the integer has not changed. This is expected. Converting times back and forth between timezones does not change the instant in time the datetime stamp corresponds to, so it should not change the underlying integer.
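# A short sketch of the same idea with fresh objects, so x above is left alone. The names a and b are illustrative, and Asia/Tokyo is just an arbitrary third zone chosen for the demonstration:
a = as.POSIXct("2022-08-18 12:05:00", tz = "America/Los_Angeles")
b = a
attr(b, "tzone") = "UTC"        # change only how b is displayed
a == b                          # TRUE: same instant, different display
as.numeric(a) == as.numeric(b)  # TRUE: same underlying number
format(a)                       # "2022-08-18 12:05:00"
format(b)                       # "2022-08-18 19:05:00"
format(a, tz = "Asia/Tokyo")    # "2022-08-19 04:05:00": a one-off display in another zone, without touching the tzone attribute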
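# These next two times look the same if you ignore the timezone, but they are not the same instant in time, so they do not have the same underlying integer. NOTE: the sign in "Etc/GMT+8" is inverted relative to the usual convention, so "Etc/GMT+8" means UTC-8 (PST all year). The name is also case-sensitive on some systems, so write "Etc", not "etc":
as.integer(as.POSIXct("2022-08-18 12:05:00", tz = "UTC"))
as.integer(as.POSIXct("2022-08-18 12:05:00", tz = "Etc/GMT+8"))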
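# To make the gap between those two instants concrete, here is a small sketch (t_utc and t_pst are illustrative names). 12:05 at UTC-8 happens 8 hours after 12:05 UTC, so the stored values should differ by 8 * 3600 seconds:
t_utc = as.POSIXct("2022-08-18 12:05:00", tz = "UTC")
t_pst = as.POSIXct("2022-08-18 12:05:00", tz = "Etc/GMT+8")
as.integer(t_pst) - as.integer(t_utc)    # 28800 seconds
difftime(t_pst, t_utc, units = "hours")  # 8 hours
"Etc/GMT+8" %in% OlsonNames()            # TRUE on a standard tz database
"etc/GMT+8" %in% OlsonNames()            # FALSE: the lowercase spelling is not a listed name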
# So: all POSIXct time values in R are stored as numbers that correspond to the difference in seconds from the Unix origin. As mentioned above, separate from this number, R will *display* datetimes differently based on the timezone attribute. If no timezone is provided, R will use the local time zone for displaying, as defined by your R session's settings and, ultimately, your operating system:
Sys.timezone()
## Gotcha #1
#------------------
# Let's say we convert a character datetime with as.POSIXct(), but we don't specify a timezone:
x = as.POSIXct("2022-08-18 12:05:00")
x # Because we didn't specify a timezone, R assumes the clock time we typed is in our R session's timezone, and displays it that way. Behind the scenes, R has already converted it to the equivalent instant in UTC and stored the resulting number. THIS IS IMPORTANT, because if I'm collaborating with someone in a different time zone and neither of us sets the timezone, we are working with different underlying integers from the start, because our offsets from UTC are not the same.
attributes(x) # The tzone attribute is blank (an empty string). Under the hood, R has assumed x belongs in your local timezone, calculated the offset from UTC, and stored the number of seconds from the Unix origin. It then happily *displays* the datetime in our Sys.timezone() when we print x in the console. If you're in a different timezone than us right now, or if your computer is set to a different timezone, it will display that timezone.
# Solution: always specify the timezone when you're converting from character with as.POSIXct().
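# A sketch of the collaboration problem. Rather than changing the session timezone, this passes each collaborator's zone explicitly, which is assumed here to be equivalent to what each session would do implicitly. The names mine and theirs, and the choice of America/New_York, are illustrative:
mine   = as.POSIXct("2022-08-18 12:05:00", tz = "America/Los_Angeles")  # what a California session would store
theirs = as.POSIXct("2022-08-18 12:05:00", tz = "America/New_York")     # what a New York session would store
format(mine) == format(theirs)         # TRUE: the clock times look identical
as.integer(mine) - as.integer(theirs)  # 10800 seconds: the instants are 3 hours apart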
## Gotcha #2
#-----------------
# Because my local timezone has been displayed to me, I might think, oops, R thinks these times are in PDT. I actually want R to recognize them in PST, because that's how they were recorded in the real world. This is dangerous, because I may then try to "get R to recognize them as PST" by changing how the time zone displays rather than changing the actual underlying integer. Let's say I use with_tz() to "assign the right timezone", doing the same thing we did above when we set the tzone attribute to UTC:
x2 = lubridate::with_tz(x, tzone = "Etc/GMT+8") # this assigns the timezone attribute, converting the displayed timestamp from my local timezone to the newly assigned one.
attributes(x2)
x2 # I may not even notice that it has shifted the displayed time by one hour, if I'm just looking for the timezone. If I do see the shift, maybe now I'm even more confused: I wanted the timestamp to stay the same (12:05), I just wanted the timezone to change. The function I actually wanted to use here was lubridate::force_tz(), not lubridate::with_tz().
# Because I used lubridate::with_tz():
as.integer(x2)
as.integer(x) # the stored integers are the same.
# Solution: know the difference between with_tz() and force_tz(); the former does not change the underlying integer, the latter does.
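# A side-by-side sketch of the two functions, assuming lubridate is installed. The names p, shown, and forced are illustrative. p is pinned to America/Los_Angeles so the result does not depend on your session timezone:
p      = as.POSIXct("2022-08-18 12:05:00", tz = "America/Los_Angeles")
shown  = lubridate::with_tz(p, tzone = "Etc/GMT+8")   # same instant, new display
forced = lubridate::force_tz(p, tzone = "Etc/GMT+8")  # same clock time, new instant
format(shown)                       # "2022-08-18 11:05:00": the display shifted by an hour
format(forced)                      # "2022-08-18 12:05:00": the display is unchanged
as.integer(shown) - as.integer(p)   # 0: with_tz() kept the stored value
as.integer(forced) - as.integer(p)  # 3600: force_tz() moved the stored value by one hour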
## Gotcha #3
#----------------
# Let's say we read in a time that we know is in PST, but we read it in from an .xlsx using readxl::read_excel(). Our datetimes are now in POSIXct, but the wrong timezone (UTC) is assumed, because readxl::read_excel() assumes *all* datetimes are in UTC. This means that our underlying integer is wrong on read-in, because it represents a different instant in time than the one that occurred in reality (12:00:00 UTC != 12:00:00 PST).
# Solution(s): Here, the "UTC" timezone attribute of our datetimes needs to be forcibly replaced with the correct timezone, overwriting the stored number with the correct value so that our calculations are sound, while leaving the displayed clock time unchanged. Two ways to do this: you can convert the datetimes to character (which strips the datetime attributes) and then convert back to POSIXct with the correct timezone specified using the "tz" argument; or, you can use lubridate::force_tz(), which does essentially the same thing.
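# A sketch of both fixes, assuming lubridate is installed. No spreadsheet is read here: from_excel is a hypothetical stand-in for a datetime as it would arrive tagged as UTC, and fixed_chr and fixed_ftz are illustrative names:
from_excel = as.POSIXct("2022-08-18 12:00:00", tz = "UTC")              # recorded in PST, but tagged UTC on read-in
fixed_chr  = as.POSIXct(format(from_excel), tz = "Etc/GMT+8")           # fix 1: round-trip through character
fixed_ftz  = lubridate::force_tz(from_excel, tzone = "Etc/GMT+8")       # fix 2: force_tz()
format(fixed_chr)                               # "2022-08-18 12:00:00": the clock time is unchanged
as.numeric(fixed_chr) == as.numeric(fixed_ftz)  # TRUE: both fixes agree
as.integer(fixed_chr) - as.integer(from_excel)  # 28800: the stored value moved by 8 hours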
## HOW TO AVOID GOTCHAS IN GENERAL
# Specify the timezone from the get-go, either with as.POSIXct() or force_tz(). A good practice is also to name our DateTime columns with the time zone they were recorded in, e.g. "DateTimePDT" or "DateTimeUTC" (but if we're collaborating internationally, know that some abbreviations are shared by time zones in very different places on Earth).
### Example:
x = as.POSIXct("2022-08-18 12:05:00", tz = "America/Los_Angeles") # re-create x with an explicit timezone, so the comparison below does not depend on your session's timezone.
y = as.POSIXct("2022-08-18 12:05:00", tz = "Etc/GMT+8") # By providing a timezone from the start, I decide the way my datetime is displayed, and R stores the correct underlying integer.
as.integer(x) # seconds since origin for 2022-08-18 12:05 PDT
as.integer(y) # seconds since origin for 2022-08-18 12:05 PST
as.integer(y) - as.integer(x) # 3600 seconds - this is one hour, the difference we'd expect. Yay
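# A sketch of both habits applied on read-in. The CSV text and the column name DateTimePST are made up for illustration; a real file path would replace the text argument:
csv_text = "id,DateTimePST\n1,2022-08-18 12:05:00\n2,2022-08-18 13:05:00"
d = read.csv(text = csv_text, stringsAsFactors = FALSE)
class(d$DateTimePST)                                         # "character": nothing has been assumed about the timezone yet
d$DateTimePST = as.POSIXct(d$DateTimePST, tz = "Etc/GMT+8")  # the timezone named in the column is the one we specify
attr(d$DateTimePST, "tzone")                                 # "Etc/GMT+8"
diff(d$DateTimePST)                                          # a difftime of 1 hour between the two rows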
## Final NOTEs
# readxl::read_excel() assumes the read-in datetime values are in UTC.
# read.csv() reads datetime columns in as character (or as factor if stringsAsFactors = TRUE). This is nice because you can then just convert the character with as.POSIXct() and specify the timezone.
#*CAUTION: R lets us set arbitrary attributes, i.e. if we wrote "tz" instead of "tzone" when we reset the timezone attribute of x, we wouldn't be reassigning the real timezone attribute, we'd just be creating a new attribute called "tz". Flexible but dangerous.
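# A small sketch of that mistake, using an illustrative object w so nothing above is modified:
w = as.POSIXct("2022-08-18 12:05:00", tz = "America/Los_Angeles")
attr(w, "tz") = "UTC"                              # typo: this adds a new attribute named "tz"
c("tzone", "tz") %in% names(attributes(w))         # TRUE TRUE: both attributes now exist
attr(w, "tzone")                                   # still "America/Los_Angeles"
format(w)                                          # still "2022-08-18 12:05:00": the display did not change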