
Only Load Data If Not Already Open in R

I often find it beneficial to check to see whether or not a dataset is already loaded into R at the beginning of a file. This is particularly helpful when I'm dealing with a large file that I don't want to load repeatedly, and when I might be using the same dataset with multiple R scripts or re-running the same script while making changes to the code.

To check whether an object with that name is already loaded, we can use the exists function from the base package. We can then wrap our read.csv command in an if statement so that the file loads only when an object with that name does not already exist.

if(!exists("largeData")) {
    largeData <- read.csv("huge-file.csv",
        header = TRUE)
}
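As a quick illustration of how exists behaves (a minimal sketch; the object name is arbitrary):

```r
exists("largeData")    # FALSE if nothing by that name is loaded yet
largeData <- 1:10
exists("largeData")    # TRUE now that the object exists
```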

You will probably also find it useful to use the "colClasses" option of read.csv or read.table to help the file load faster and make sure your data are in the right format. For example:

if(!exists("largeData")) {
    largeData <- read.csv("huge-file.csv",
        header = TRUE,
        colClasses = c("factor", "integer", "character", "integer", 
            "integer", "character"))
}

This post is one part of my series on dealing with large datasets.


Using colClasses to Load Data More Quickly in R

Specifying a colClasses argument to read.table or read.csv can save time on importing data, while also saving you the step of specifying the class of each variable later.

For example, an 893 MB file took 441 seconds to load without colClasses, but only 268 seconds with it. The system.time function in base R can help you check your own load times.

Without specifying colClasses:

   user  system elapsed 
441.224   8.200 454.155

When specifying colClasses:

   user  system elapsed 
268.036   6.096 284.099

Dates that are in the form %Y-%m-%d or %Y/%m/%d will import correctly when given the class "Date". Dates in other formats should be imported as character and converted with as.Date (see "Date Formats in R" below).

system.time(largeData <- read.csv("huge-file.csv",
    header = TRUE,
    colClasses = c("character", "character", "complex", 
        "factor", "factor", "character", "integer", 
        "integer", "numeric", "character", "character",
        "Date", "integer", "logical")))

If there aren't any classes that you want to change from their defaults, you can read in the first few rows, determine the classes from that, and then import the rest of the file:

sampleData <- read.csv("huge-file.csv", header = TRUE, nrows = 5)
classes <- sapply(sampleData, class)
largeData <- read.csv("huge-file.csv", header = TRUE, colClasses = classes)
str(largeData)

If you aren't concerned about the time it takes to read the data file, but instead just want the classes to be correct on import, you have the option of only specifying certain classes:

smallData <- read.csv("small-file.csv", 
    header = TRUE,
    colClasses=c("variableName"="character"))

> class(smallData$variableName)
[1] "character"

This post is one part of my series on dealing with large datasets.

Citations and Further Reading

In a comment, Michael pointed out that if you don't need all the columns in your dataset, setting their colClasses entries to "NULL" (as a string) will exclude them from being loaded.
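To illustrate that point, here is a minimal sketch (the inline text= data is made up): giving a column the class "NULL" drops it on import.

```r
# "NULL" in colClasses excludes that column from the result
csvText <- "id,name,junk
1,alpha,x
2,beta,y"
smallData <- read.csv(text = csvText,
    colClasses = c("integer", "character", "NULL"))
names(smallData)  # "id" "name" -- the junk column is gone
```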


Plot Weekly or Monthly Totals in R

When plotting time series data, you might want to bin the values so that each data point corresponds to the sum for a given month or week. This post will show an easy way to use cut and ggplot2's stat_summary to plot month totals in R without needing to reorganize the data into a second data frame.

Let's start with a simple sample data set with a series of dates and quantities:

library(ggplot2)
library(scales)

# load data:
log <- data.frame(Date = c("2013/05/25","2013/05/28","2013/05/31","2013/06/01","2013/06/02","2013/06/05","2013/06/07"), 
  Quantity = c(9,1,15,4,5,17,18))
log
str(log)


> log
        Date Quantity
1 2013/05/25        9
2 2013/05/28        1
3 2013/05/31       15
4 2013/06/01        4
5 2013/06/02        5
6 2013/06/05       17
7 2013/06/07       18

> str(log)
'data.frame': 7 obs. of  2 variables:
 $ Date    : Factor w/ 7 levels "2013/05/25","2013/05/28",..: 1 2 3 4 5 6 7
 $ Quantity: num  9 1 15 4 5 17 18

Next, if the dates are not already stored in R's Date format, we'll need to convert them:

# convert date variable from factor to date format:
log$Date <- as.Date(log$Date,
    "%Y/%m/%d") # format matching how the dates are stored
str(log)


> str(log)
'data.frame': 7 obs. of  2 variables:
 $ Date    : Date, format: "2013-05-25" "2013-05-28" ...
 $ Quantity: num  9 1 15 4 5 17 18

Next we need to create variables stating the week and month of each observation. For week, cut has an option that allows you to break weeks as you'd like, beginning weeks on either Sunday or Monday.

# create variables of the week and month of each observation:
log$Month <- as.Date(cut(log$Date,
  breaks = "month"))
log$Week <- as.Date(cut(log$Date,
  breaks = "week",
  start.on.monday = FALSE)) # changes weekly break point to Sunday
log

> log
        Date Quantity      Month       Week
1 2013-05-25        9 2013-05-01 2013-05-19
2 2013-05-28        1 2013-05-01 2013-05-26
3 2013-05-31       15 2013-05-01 2013-05-26
4 2013-06-01        4 2013-06-01 2013-05-26
5 2013-06-02        5 2013-06-01 2013-06-02
6 2013-06-05       17 2013-06-01 2013-06-02
7 2013-06-07       18 2013-06-01 2013-06-02

Finally, we can create either a line or bar plot of the data by month and by week, using stat_summary to sum up the values associated with each week or month:

# graph by month:
ggplot(data = log,
    aes(Month, Quantity)) +
    stat_summary(fun.y = sum, # adds up all observations for the month
        geom = "bar") + # or "line"
    scale_x_date(
        labels = date_format("%Y-%m"),
        breaks = "1 month") # custom x-axis labels

Time series plot, binned by month

# graph by week:
ggplot(data = log,
    aes(Week, Quantity)) +
    stat_summary(fun.y = sum, # adds up all observations for the week
        geom = "bar") + # or "line"
    scale_x_date(
        labels = date_format("%Y-%m-%d"),
        breaks = "1 week") # custom x-axis labels

Time series plot, totaled by week

The full code is available in a gist.

In a comment, Achim Zeileis pointed out that the aggregation part can be more easily handled using time series packages like zoo or xts.
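If you only need the totals rather than a plot, the same binning can also be done in base R with tapply; a minimal sketch using the sample data above:

```r
# month totals without ggplot: bin with cut, then sum each bin
log <- data.frame(
  Date = as.Date(c("2013/05/25","2013/05/28","2013/05/31","2013/06/01",
                   "2013/06/02","2013/06/05","2013/06/07"), "%Y/%m/%d"),
  Quantity = c(9, 1, 15, 4, 5, 17, 18))
tapply(log$Quantity, cut(log$Date, breaks = "month"), sum)
# 2013-05-01 2013-06-01
#         25         44
```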


Date Formats in R

Importing Dates

Dates can be imported from character, numeric, POSIXlt, and POSIXct formats using the as.Date function from the base package.

If your data were exported from Excel, they will possibly be in numeric format. Otherwise, they will most likely be stored in character format.

Importing Dates from Character Format

If your dates are stored as characters, you simply need to provide as.Date with your vector of dates and the format they are currently stored in. The possible date segment formats are listed in a table below.

For example,

"05/27/84" is in the format %m/%d/%y, while "May 27 1984" is in the format %B %d %Y.

To import those dates, you would simply provide your dates and their format (if not specified, it tries %Y-%m-%d and then %Y/%m/%d):

dates <- c("05/27/84", "07/07/05")
betterDates <- as.Date(dates,
    format = "%m/%d/%y")
> betterDates
[1] "1984-05-27" "2005-07-07"

This outputs the dates in the ISO 8601 international standard format %Y-%m-%d. If you would like to use dates in a different format, read "Changing Date Formats" below.

Importing Dates from Numeric Format

If you are importing data from Excel, you may have dates that are in a numeric format. We can still use as.Date to import these; we simply need to know the origin date that Excel starts counting from and provide it to as.Date.

For Excel on Windows, the origin date is December 30, 1899 for dates after 1900. (Excel's designer thought 1900 was a leap year, but it was not.) For Excel on Mac, the origin date is January 1, 1904.

# from Windows Excel:
    dates <- c(30829, 38540)
    betterDates <- as.Date(dates,
        origin = "1899-12-30")

>   betterDates
[1] "1984-05-27" "2005-07-07"

# from Mac Excel:
    dates <- c(29367, 37078)
    betterDates <- as.Date(dates,
        origin = "1904-01-01")

>   betterDates
[1] "1984-05-27" "2005-07-07"

This outputs the dates in the ISO 8601 international standard format %Y-%m-%d. If you would like to use dates in a different format, read the next step:

Changing Date Formats

If you would like to use dates in a format other than the standard %Y-%m-%d, you can do that using the format function from the base package.

For example,

format(betterDates,
    "%a %b %d")

[1] "Sun May 27" "Thu Jul 07"

Correct Centuries

If you are importing data with only two digits for the years, you will find that R assumes that years 69 to 99 are 1969 to 1999, while years 00 to 68 are 2000 to 2068 (subject to change in future versions of R).

Often, this is not what you intend to have happen. This page gives a good explanation of several ways to fix this depending on your preference of centuries. One solution it provides is to assume all dates R is placing in the future are really from the previous century. That solution is as follows:

dates <- c("05/27/84", "07/07/05", "08/17/20")
betterDates <- as.Date(dates, "%m/%d/%y")

> betterDates
[1] "1984-05-27" "2005-07-07" "2020-08-17"

correctCentury <- as.Date(ifelse(betterDates > Sys.Date(), 
    format(betterDates, "19%y-%m-%d"), 
    format(betterDates)))

> correctCentury
[1] "1984-05-27" "2005-07-07" "1920-08-17"

Purpose of Proper Formatting

Having your dates in the proper format allows R to know that they are dates, and as such knows what calculations it should and should not perform on them. For one example, see my post on plotting weekly or monthly totals. Here are a few more examples:

>   mean(betterDates)
[1] "1994-12-16"

>   max(betterDates)
[1] "2005-07-07"

>   min(betterDates)
[1] "1984-05-27"

The code is available in a gist.

Date Formats

Conversion specification  Description  Example
%a  Abbreviated weekday  Sun, Thu
%A  Full weekday  Sunday, Thursday
%b or %h  Abbreviated month  May, Jul
%B  Full month  May, July
%d  Day of the month (01-31)  27, 07
%j  Day of the year (001-366)  148, 188
%m  Month (01-12)  05, 07
%U  Week (01-53), with Sunday as the first day of the week  22, 27
%w  Weekday (0-6), Sunday is 0  0, 4
%W  Week (00-53), with Monday as the first day of the week  21, 27
%x  Date, locale-specific
%y  Year without century (00-99)  84, 05
%Y  Year with century (on input, 00 to 68 are prefixed by 20, 69 to 99 by 19)  1984, 2005
%C  Century  19, 20
%D  Date formatted %m/%d/%y  05/27/84, 07/07/05
%u  Weekday (1-7), Monday is 1  7, 4
%n  Newline on output, arbitrary whitespace on input
%t  Tab on output, arbitrary whitespace on input
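A few of the specifications above in action (a quick sketch; the weekday and month names depend on your locale):

```r
d <- as.Date("1984-05-27")
format(d, "%A, %B %d, %Y")  # "Sunday, May 27, 1984" in an English locale
format(d, "%j")             # "148" -- day of the year
format(d, "%U")             # "22" -- week of the year, Sunday first
```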


geom_point Legend with Custom Colors in ggplot

Previously, I showed how to make line segments using ggplot.

Working from that previous example, there are only a few things we need to change to add custom colors to our plot and legend in ggplot.

First, we'll add the colors of our choice. I'll do this using RColorBrewer, but you can choose whatever method you'd like.

library(RColorBrewer)
colors = brewer.pal(8, "Dark2")

The next section will be exactly the same as the previous example, except for removing the scale_color_discrete line to make way for the scale_color_manual we'll be adding later.

library(ggplot2)

data <- as.data.frame(USPersonalExpenditure) # data from package datasets
data$Category <- as.character(rownames(USPersonalExpenditure)) # this makes things simpler later

ggplot(data,
    aes(x = Expenditure,
        y = Category)) +
labs(x = "Expenditure",
    y = "Category") +
geom_segment(aes(x = data$"1940",
        y = Category,
        xend = data$"1960",
        yend = Category),
    size = 1) +
geom_point(aes(x = data$"1940",
        color = "1940"), # these can be any string, they just need to be unique identifiers
    size = 4,
    shape = 15) +
geom_point(aes(x = data$"1960",
        color = "1960"),
    size = 4,
    shape = 15) +
theme(legend.position = "bottom") +

And finally, we'll add a scale_color_manual line to our plot. We need to define the name, labels, and colors of the plot.

scale_color_manual(name = "Year", # or name = element_blank()
    labels = c(1940, 1960),
    values = colors)

And here's our final plot, complete with whatever custom colors we've chosen in both the plot and legend:

geom_point in ggplot with custom colors in the graph and legend

I've updated the gist from the previous post to also include a file that has custom colors.


Shapefiles in R

Let's learn how to use shapefiles in R. This will allow us to map data for complicated areas or jurisdictions like zip codes or school districts. For the United States, many shapefiles are available from the [Census Bureau](http://www.census.gov/geo/www/tiger/tgrshp2010/tgrshp2010.html). Our example will map U.S. national parks.

First, download the U.S. Parks and Protected Lands shape files from Natural Earth. We'll be using the ne_10m_parks_and_protected_lands_area.shp file.

Next, start working in R. First, we'll load the shapefile and maptools:

# load up area shape file:
library(maptools)
area <- readShapePoly("ne_10m_parks_and_protected_lands_area.shp")

# # or file.choose:
# area <- readShapePoly(file.choose())

Next, we'll set the colors we want to use and then set up our basemap.

library(RColorBrewer)
colors <- brewer.pal(9, "BuGn")

library(ggmap)
mapImage <- get_map(location = c(lon = -118, lat = 37.5),
    color = "color",
    source = "osm",
    # maptype = "terrain",
    zoom = 6)

Next, we can use the fortify function from the ggplot2 package. This converts the crazy shape file with all its nested attributes into a data frame that ggmap will know what to do with.

area.points <- fortify(area)

Finally, we can map our shape files!

ggmap(mapImage) +
    geom_polygon(aes(x = long,
            y = lat,
            group = group),
        data = area.points,
        color = colors[9],
        fill = colors[6],
        alpha = 0.5) +
labs(x = "Longitude",
    y = "Latitude")

National Parks and Protected Lands in California and Nevada

Same figure, with a Stamen terrain basemap with ColorBrewer palette "RdPu"


Elevation Profiles in R

First, let's load up our data. The data are available in a gist. You can convert your own GPS data to .csv by following the instructions here, using gpsbabel.

gps <- read.csv("callan.csv",
    header = TRUE)

Next, we can use the function SMA from the package TTR to calculate a moving average of the altitude or elevation data, if we want to smooth out the curve. We can define a constant for the number of data points we want to average to create each moving average value.

If you don't want to convert meters to feet, a metric version of the code is available in the gist (callanMetric.R).

library(TTR)
movingN <- 5 # define the n for the moving average calculations
gps$Altitude <- gps$Altitude * 3.281 # convert m to ft
gps$SMA <- SMA(gps$Altitude,
    n = movingN)
gps <- gps[movingN:length(gps$SMA), ] # remove first n-1 points

Next, we want to calculate the distance of each point. You can skip this step if your dataset already includes distances.

library(sp)
Dist <- 0
for(i in 2:length(gps$Longitude)) {
    Dist[i] = spDistsN1(as.matrix(gps[i,c("Longitude", "Latitude")]),
    c(gps$Longitude[i-1], gps$Latitude[i-1]),
    longlat = TRUE) / 1.609 # longlat so distances will be in km, then divide to convert to miles
}
gps$Dist <- Dist

DistTotal <- 0
for(i in 2:length(gps$Longitude)) {
    DistTotal[i] = Dist[i] + DistTotal[i-1]
}
gps$DistTotal <- DistTotal
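The running-total loop can also be replaced with base R's cumsum, which computes the same cumulative distances in one vectorized call (a sketch with made-up per-segment miles):

```r
Dist <- c(0, 2.5, 1.0, 3.2)   # hypothetical per-segment distances in miles
DistTotal <- cumsum(Dist)     # running total at each point
DistTotal                     # 0.0 2.5 3.5 6.7
```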

And finally, we can plot our elevation data using geom_ribbons and ggplot:

library(ggplot2)
ggplot(gps, aes(x = DistTotal)) +
geom_ribbon(aes(ymin = 600, # change this to match your min below
        ymax = SMA),
    fill = "#1B9E77") + # put your altitude variable here if not using moving averages
labs(x = "Miles",
    y = "Elevation") +
scale_y_continuous(limits = c(600,1200)) # change this to limits appropriate for your region

Elevation profile in ggplot2

Code and data available in a gist.


GPS Basemaps in R Using get_map

There are many different maps you can use as a background map for your GPS or other latitude/longitude data (e.g., any time you're using geom_path, geom_segment, or geom_point).

get_map

Helpfully, there's just one function that will allow you to query Google Maps, OpenStreetMap, Stamen Maps, or CloudMade maps: get_map in the ggmap package. You could call get_googlemap, get_openstreetmap, get_stamenmap, or get_cloudmademap individually, but get_map combines the functionality of all four. This makes it easy to try out different basemaps for your data.

You need to supply get_map with your location data and the color, source, maptype, and zoom of the base map.

Let's go ahead and map the trails in Elwyn John Wildlife Sanctuary here in Atlanta. The csv data and R file are available in a gist.

gps <- read.csv("elwyn.csv",
    header = TRUE)

library(ggmap)
mapImageData <- get_map(location = c(lon = mean(gps$Longitude),
    lat = 33.824),
    color = "color", # or bw
    source = "google",
    maptype = "satellite",
    # api_key = "your_api_key", # only needed for source = "cloudmade"
    zoom = 17)

pathcolor <- "#F8971F"

ggmap(mapImageData,
    extent = "device", # "panel" keeps in axes, etc.
    ylab = "Latitude",
    xlab = "Longitude",
    legend = "right") +
    geom_path(aes(x = Longitude, # path outline
    y = Latitude),
    data = gps,
    colour = "black",
    size = 2) +
    geom_path(aes(x = Longitude, # path
    y = Latitude),
    colour = pathcolor,
    data = gps,
    size = 1.4) # +
# labs(x = "Longitude",
#   y = "Latitude") # if you do extent = "panel"

We'll be changing the four arguments above (color, source, maptype, and zoom) to change which basemap is used.

source = "google"

get_map option source = "google" (or using get_googlemap) downloads a map from the Google Maps API. The basemaps are © Google. Google Maps have four different maptype options: terrain, satellite, roadmap, and hybrid.

source = "google", maptype = "terrain"

source = "google", maptype = "terrain", zoom = 14

Max zoom: 14

source = "google", maptype = "satellite"

source = "google", maptype = "satellite", zoom = 17

Max zoom: 20

source = "google", maptype = "roadmap"

source = "google", maptype = "roadmap", zoom = 17

source = "google", maptype = "hybrid"

Hybrid combines roadmap and satellite. source = "google", maptype = "hybrid", zoom = 17

Max zoom: 14

source = "osm"

get_map option source = "osm" (or using get_openstreetmap) downloads a map from OpenStreetMap. These maps are Creative Commons licensed, specifically Attribution-ShareAlike 2.0 (CC-BY-SA). This means you are free to use the maps for commercial purposes, as long as you release your final product under the same Creative Commons license. OpenStreetMap has no maptype options.

source = "osm" (no maptype needed)

source = "osm", zoom = 17

Max zoom: 20

source = "stamen"

get_map option source = "stamen" (or using get_stamenmap) downloads a map from Stamen Maps. The map tiles are by Stamen Design, licensed under CC BY 3.0. The data for Stamen Maps is by OpenStreetMap, licensed under CC BY SA. Stamen has three different maptype options: terrain, watercolor, and toner.

source = "stamen", maptype = "terrain"

source = "stamen", maptype = "terrain", zoom = 17

Max zoom: 18

source = "stamen", maptype = "watercolor"

source = "stamen", maptype = "watercolor", zoom = 17

Max zoom: 18

source = "stamen", maptype = "toner"

source = "stamen", maptype = "toner", zoom = 17

Max zoom: 18

source = "cloudmade"

N.B. As of March 2014, CloudMade no longer provides this API service.

CloudMade styles build on top of OpenStreetMap data. Thousands of CloudMade styles are available. You can browse them on the CloudMade site. You can also make your own styles.

To use CloudMade map styles in R, you will first need to get an API key to insert into your R code so it can access the maps. You can get an API key from the CloudMade site.

Here are just a couple examples of CloudMade basemaps:

source = "cloudmade", maptype = "1", api_key = "your_api_key_here", zoom = 17

source = "cloudmade", maptype = "67367", api_key = "your_api_key_here", zoom = 17

Max zoom: 18

The code and data are available in a gist.


Using Line Segments to Compare Values in R

Sometimes you want to create a graph that will allow the viewer to see in one glance:

  • The original value of a variable
  • The new value of the variable
  • The change between old and new

One method I like to use to do this is using geom_segment and geom_point in the ggplot2 package.

First, let's load ggplot2 and our data:

library(ggplot2)

data <- as.data.frame(USPersonalExpenditure) # data from package datasets
data$Category <- as.character(rownames(USPersonalExpenditure)) # this makes things simpler later

Next, we'll set up our plot and axes:

ggplot(data,
    aes(y = Category)) +
labs(x = "Expenditure",
    y = "Category") +

For geom_segment, we need to provide four variables. (Sometimes two of the four will be the same, like in this case.) x and y provide the start points, and xend and yend provide the endpoints.

In this case, we want to show the change between 1940 and 1960 for each category. Therefore our variables are the following:

  • x: "1940"
  • y: Category
  • xend: "1960"
  • yend: Category
geom_segment(aes(x = data$"1940",
  y = Category,
  xend = data$"1960",
  yend = Category),
 size = 1) +

Next, we want to plot points for the 1940 and 1960 values. We could do the same for the 1945, 1950, and 1955 values, if we wanted to.

geom_point(aes(x = data$"1940",
    color = "1940"),
    size = 4, shape = 15) +
geom_point(aes(x = data$"1960",
    color = "1960"),
    size = 4, shape = 15) +

Finally, we'll finish up by touching up the legend for the plot:

scale_color_discrete(name = "Year") +
theme(legend.position = "bottom")

geom_segment, then geom_point

The order of geom_segment and the geom_points matters. The first geom line in the code will get plotted first. Therefore, if you want the points displayed over the segments, put the segments first in the code. Likewise, if you want the segments displayed over the points, put the points first in the code.

For example, we could change the middle section of the code to:

geom_point(aes(x = data$"1940",
  color = "1940"),
  size = 4, shape = 15) +
geom_point(aes(x = data$"1960",
  color = "1960"),
  size = 4, shape = 15) +

geom_segment(aes(x = data$"1940",
    y = Category,
    xend = data$"1960",
    yend = Category),
  size = 1) +

And the output would look like:

geom_point, then geom_segment

Similarly, if you have points that will be overlapping, make sure you think about which of the point lines you want R to plot first.

The code is available in a gist.


Storing a Function in a Separate File in R

If you're going to be using a function across several different R files, you might want to store the function in its own file.

If you want to name the function in its own file

This is probably the best option in general, if only because you may want to put more than one function in a single file.

First, let's make our function in the file fun.R:

mult <- function(x, y) {
    x*y
}

If you get the warning message "In readLines(file) : incomplete final line found on 'fun.R'", just insert a line break at the end of the fun.R file.

If you want to name the function in the file running it

First let's make the same function (but this time unnamed) in the file times.R:

function(x, y) {
    x*y
}

Calling the functions

And finally we'll make a file file.R to call our functions:

times <- dget("times.R")
times(-4:4, 2)

source("fun.R")
mult(-4:4, 2)

Note: if you are used to using source to run your R code, be aware that here we are calling source from within another file.
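To try the dget pattern without creating the files by hand, here is a self-contained sketch that writes the unnamed function to a temporary file first:

```r
tf <- tempfile(fileext = ".R")           # stands in for times.R
writeLines("function(x, y) {\n    x*y\n}", tf)
times <- dget(tf)                        # parse the file and get the function back
times(-4:4, 2)                           # -8 -6 -4 -2  0  2  4  6  8
```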

All files are available as a gist.
