Tagged with ggplot2

ggplot Fit Line and Lattice Fit Line in R

Let's add a fit line to a scatterplot!

Fit Line in Base Graphics

Here's how to do it in base graphics:

ols <- lm(Temp ~ Solar.R,
    data = airquality)

summary(ols)

plot(Temp ~ Solar.R,
    data = airquality)
abline(ols)

Fit line in base graphics in R

Fit Line in ggplot

And here's how to do it in ggplot:

library(ggplot2)
ggplot(data = airquality,
        aes(Solar.R, Temp)) + 
    geom_point(pch = 19) + 
    geom_abline(intercept = ols$coefficients[1],
        slope = ols$coefficients[2])

You can access the info from your regression results through ols$coefficients.
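As a small aside, coef() is the standard extractor for the same values, and it lets you pull them out by name rather than by position. A minimal sketch (the names fit, intercept, and slope are just illustrative):

```r
ols <- lm(Temp ~ Solar.R, data = airquality)

# coef() returns the same named vector as ols$coefficients
fit <- coef(ols)
intercept <- fit[["(Intercept)"]]
slope <- fit[["Solar.R"]]
```

These two values could then be passed to geom_abline in place of ols$coefficients[1] and ols$coefficients[2].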

Edit: Thanks to an anonymous commenter, I have learned that you can simplify this by using geom_smooth. This way you don't have to specify the intercept and slope of the fit line.

ggplot(data = airquality,
        aes(Solar.R, Temp)) + 
    geom_point(pch = 19) + 
    geom_smooth(method = lm,
        se = FALSE)

Fit line in ggplot in R

Fit Line in Lattice

In lattice, it's even easier. You don't even need to run a regression; you can just add to the type option.

library(lattice)

xyplot(Temp ~ Solar.R,
    data = airquality,
    type = c("p", "r"))

Fit Line in Lattice in R

The code is available in a gist.

Table as an Image in R

Usually, it's best to keep tables as text, but if you're making a lot of graphics, it can be helpful to be able to create images of tables.

PNG table

Creating the Table

After loading the data, let's first use this trick to put line breaks between the levels of the effect variable. Depending on your data, you may or may not need or want to do this.

library(OIdata)
data(birds)
library(gridExtra)

# line breaks between words for levels of birds$effect:
levels(birds$effect) <- gsub(" ", "\n", levels(birds$effect))

Next let's make our table:

xyTable <- table(birds$sky, birds$effect)

Now we can create an empty plot, center our table in it, and use the grid.table function from the gridExtra package to display the table and choose a font size.

plot.new()
grid.table(xyTable,
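    # change font sizes (gpar.coltext and gpar.rowtext are gridExtra < 2.0
    # arguments; later versions set fonts via theme = ttheme_default(...)):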
    gpar.coltext = gpar(cex = 1.2),
    gpar.rowtext = gpar(cex = 1.2))

Now you can view and save the image just like any other plot.

The code is available in a gist.

Line Breaks Between Words in Axis Labels in ggplot in R

Sometimes when plotting factor variables in R, the graphics can look pretty messy thanks to long factor levels. If the level attributes have multiple words, there is an easy fix to this that often makes the axis labels look much cleaner.

Without Line Breaks

Here's the messy looking example:

No line breaks in axis labels

And here's the code for the messy looking example:

library(OIdata)
data(birds)
library(ggplot2)

ggplot(birds,
    aes(x = effect,
        y = speed)) +
geom_boxplot()

With Line Breaks

We can use regular expressions to add line breaks to the factor levels by substituting any spaces with line breaks:

library(OIdata)
data(birds)
library(ggplot2)

levels(birds$effect) <- gsub(" ", "\n", levels(birds$effect))
ggplot(birds,
    aes(x = effect,
        y = speed)) +
geom_boxplot()

Line breaks in axis labels

Just one line made the plot look much better, and it will carry over to other plots you make as well. For example, you could create a table with the same variable.
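If you'd like to see what the substitution does without loading OIdata, here's a minimal sketch on a made-up factor (the level names below are invented for illustration, not taken from the birds data):

```r
# made-up factor with multi-word levels (not the birds data)
effect <- factor(c("No damage", "Caused damage", "No damage", "Engine shut down"))

# replace every space in the level names with a line break
levels(effect) <- gsub(" ", "\n", levels(effect))

levels(effect)
```

Only the level names change; the underlying observations and their order stay the same.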

Horizontal Boxes

Here we can see the difference in a box plot with horizontal boxes. It's up to you to decide which style looks better:

No line breaks in axis labels

Line breaks in axis labels

library(OIdata)
data(birds)
library(ggplot2)

levels(birds$effect) <- gsub(" ", "\n", levels(birds$effect))
ggplot(birds,
    aes(x = effect,
        y = speed)) +
geom_boxplot() + 
coord_flip()

Just a note: if you're not using ggplot, the multi-line axis labels might overflow into the graph.

The code is available in a gist.

Citations and Further Reading

In a comment, Jason Bryer mentioned that you can also break the lines by using a set character width instead of breaking at every space. Here's the code he suggested:

sapply(strwrap(as.character(value), width=25, simplify=FALSE), paste, collapse="\n")
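To try the width-based approach on its own, here's a small sketch with made-up labels and a narrower width of 10, chosen only so that these short labels actually wrap:

```r
# hypothetical labels, just for illustration
value <- c("Engine shut down", "Precautionary landing")

# wrap each label at a width of 10 characters,
# then rejoin the pieces of each label with line breaks
wrapped <- sapply(strwrap(as.character(value), width = 10, simplify = FALSE),
    paste, collapse = "\n")

cat(wrapped, sep = "\n---\n")
```

Note that strwrap only breaks at whitespace, so a single word longer than the width stays on one line.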

Plot Weekly or Monthly Totals in R

When plotting time series data, you might want to bin the values so that each data point corresponds to the sum for a given month or week. This post will show an easy way to use cut and ggplot2's stat_summary to plot month totals in R without needing to reorganize the data into a second data frame.

Let's start with a simple sample data set with a series of dates and quantities:

library(ggplot2)
library(scales)

# load data:
log <- data.frame(Date = c("2013/05/25","2013/05/28","2013/05/31","2013/06/01","2013/06/02","2013/06/05","2013/06/07"), 
  Quantity = c(9,1,15,4,5,17,18))
log
str(log)


> log
        Date Quantity
1 2013/05/25        9
2 2013/05/28        1
3 2013/05/31       15
4 2013/06/01        4
5 2013/06/02        5
6 2013/06/05       17
7 2013/06/07       18

> str(log)
'data.frame': 7 obs. of  2 variables:
 $ Date    : Factor w/ 7 levels "2013/05/25","2013/05/28",..: 1 2 3 4 5 6 7
 $ Quantity: num  9 1 15 4 5 17 18

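Next, if the dates are not already stored in a date format, we'll need to convert them. (Here they are a factor; in R 4.0 and later, data.frame keeps strings as character by default, and the same conversion applies.)

# convert date variable from factor (or character) to date format:
log$Date <- as.Date(log$Date,
    "%Y/%m/%d") # format string describing how the input dates are written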
str(log)


> str(log)
'data.frame': 7 obs. of  2 variables:
 $ Date    : Date, format: "2013-05-25" "2013-05-28" ...
 $ Quantity: num  9 1 15 4 5 17 18
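The format string has to describe how your dates are actually written. As a rough sketch, here is the same day parsed from three made-up input styles (the variable names d1 to d3 are just illustrative):

```r
# the same day written three different ways
d1 <- as.Date("2013/05/25", format = "%Y/%m/%d") # 4-digit year first, slashes
d2 <- as.Date("25.05.2013", format = "%d.%m.%Y") # day first, dots
d3 <- as.Date("05-25-13", format = "%m-%d-%y")   # month first, 2-digit year

d1 == d2 && d2 == d3
```

See ?strptime for the full list of format codes.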

Next we need to create variables stating the week and month of each observation. For week, cut has an option that allows you to break weeks as you'd like, beginning weeks on either Sunday or Monday.

# create variables of the week and month of each observation:
log$Month <- as.Date(cut(log$Date,
  breaks = "month"))
log$Week <- as.Date(cut(log$Date,
  breaks = "week",
  start.on.monday = FALSE)) # changes weekly break point to Sunday
log

> log
        Date Quantity      Month       Week
1 2013-05-25        9 2013-05-01 2013-05-19
2 2013-05-28        1 2013-05-01 2013-05-26
3 2013-05-31       15 2013-05-01 2013-05-26
4 2013-06-01        4 2013-06-01 2013-05-26
5 2013-06-02        5 2013-06-01 2013-06-02
6 2013-06-05       17 2013-06-01 2013-06-02
7 2013-06-07       18 2013-06-01 2013-06-02

Finally, we can create either a line or bar plot of the data by month and by week, using stat_summary to sum up the values associated with each week or month:

# graph by month:
ggplot(data = log,
    aes(Month, Quantity)) +
    stat_summary(fun = sum, # adds up all observations for the month (fun.y in ggplot2 before 3.3.0)
        geom = "bar") + # or "line"
    scale_x_date(
        labels = date_format("%Y-%m"),
        breaks = "1 month") # custom x-axis labels

Time series plot, binned by month

# graph by week:
ggplot(data = log,
    aes(Week, Quantity)) +
    stat_summary(fun = sum, # adds up all observations for the week (fun.y in ggplot2 before 3.3.0)
        geom = "bar") + # or "line"
    scale_x_date(
        labels = date_format("%Y-%m-%d"),
        breaks = "1 week") # custom x-axis labels

Time series plot, totaled by week
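If you also want the numbers behind the bars, one option is base R's aggregate, which can compute the same sums from the Month and Week variables. This is a sketch that rebuilds the sample data so it runs on its own; the names monthly and weekly are just illustrative:

```r
log <- data.frame(
    Date = as.Date(c("2013/05/25", "2013/05/28", "2013/05/31", "2013/06/01",
        "2013/06/02", "2013/06/05", "2013/06/07"), "%Y/%m/%d"),
    Quantity = c(9, 1, 15, 4, 5, 17, 18))
log$Month <- as.Date(cut(log$Date, breaks = "month"))
log$Week <- as.Date(cut(log$Date, breaks = "week", start.on.monday = FALSE))

# one row per month or week, with the summed Quantity
monthly <- aggregate(Quantity ~ Month, data = log, FUN = sum)
weekly <- aggregate(Quantity ~ Week, data = log, FUN = sum)

monthly
weekly
```

This does create small extra data frames, so it complements the stat_summary approach rather than replacing it.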

The full code is available in a gist.

In a comment, Achim Zeileis pointed out that the aggregation part can be more easily handled using time series packages like zoo or xts.

geom_point Legend with Custom Colors in ggplot

Previously, I showed how to make line segments using ggplot.

Working from that previous example, there are only a few things we need to change to add custom colors to our plot and legend in ggplot.

First, we'll add the colors of our choice. I'll do this using RColorBrewer, but you can choose whatever method you'd like.

library(RColorBrewer)
colors = brewer.pal(8, "Dark2")

The next section will be exactly the same as the previous example, except for removing the scale_color_discrete line to make way for the scale_color_manual we'll be adding later.

library(ggplot2)

data <- as.data.frame(USPersonalExpenditure) # data from package datasets
data$Category <- as.character(rownames(USPersonalExpenditure)) # this makes things simpler later

ggplot(data,
    aes(y = Category)) + # x is set in each layer; the data has no Expenditure column
labs(x = "Expenditure",
    y = "Category") +
geom_segment(aes(x = data$"1940",
        y = Category,
        xend = data$"1960",
        yend = Category),
    size = 1) +
geom_point(aes(x = data$"1940",
        color = "1940"), # these can be any string, they just need to be unique identifiers
    size = 4,
    shape = 15) +
geom_point(aes(x = data$"1960",
        color = "1960"),
    size = 4,
    shape = 15) +
theme(legend.position = "bottom") +

And finally, we'll add a scale_color_manual line to our plot. We need to define the name, labels, and colors of the plot.

scale_color_manual(name = "Year", # or name = element_blank()
    labels = c(1940, 1960),
    values = colors)
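A small variation you might consider: scale_color_manual also accepts a named vector for values, which ties each color to its identifier string explicitly instead of relying on order. A sketch, assuming the same Dark2 palette (year_colors is just an illustrative name):

```r
library(RColorBrewer)

colors <- brewer.pal(8, "Dark2")

# tie the first two palette colors to the strings used in aes(color = ...)
year_colors <- c("1940" = colors[1], "1960" = colors[2])
year_colors

# then, in the plot:
# scale_color_manual(name = "Year", values = year_colors)
```

With a named vector, the match is by name, so the colors stay attached to the right year even if you reorder or add layers.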

And here's our final plot, complete with whatever custom colors we've chosen in both the plot and legend:

geom_point in ggplot with custom colors in the graph and legend

I've updated the gist from the previous post to also include a file that has custom colors.

Elevation Profiles in R

First, let's load up our data. The data are available in a gist. You can convert your own GPS data to .csv by following the instructions here, using gpsbabel.

gps <- read.csv("callan.csv",
    header = TRUE)

Next, we can use the function SMA from the package TTR to calculate a moving average of the altitude or elevation data, if we want to smooth out the curve. We can define a constant for the number of data points we want to average to create each moving average value.

If you don't want to convert meters to feet, a metric version of the code is available in the gist (callanMetric.R).

library(TTR)
movingN <- 5 # define the n for the moving average calculations
gps$Altitude <- gps$Altitude * 3.281 # convert m to ft
gps$SMA <- SMA(gps$Altitude,
    n = movingN)
gps <- gps[movingN:length(gps$SMA), ] # remove first n-1 points
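To make the moving average concrete, here's a base R sketch of a trailing simple moving average on made-up numbers. It uses stats::filter rather than TTR, so treat it as an illustration of the idea rather than of SMA's internals:

```r
x <- c(1, 2, 3, 4, 5, 6) # made-up altitudes
n <- 3                   # window size

# average of each value and the n - 1 values before it;
# the first n - 1 results are NA because the window is incomplete
sma <- as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
sma
```

Those leading NA values are why the code above drops the first n - 1 rows.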

Next, we want to calculate the distance between each point and the one before it, and then the running total. You can skip this step if your dataset already includes distances.

library(sp)
Dist <- 0
for(i in 2:length(gps$Longitude)) {
    Dist[i] = spDistsN1(as.matrix(gps[i,c("Longitude", "Latitude")]),
    c(gps$Longitude[i-1], gps$Latitude[i-1]),
    longlat = TRUE) / 1.609 # longlat so distances will be in km, then divide to convert to miles
}
gps$Dist <- Dist

DistTotal <- 0
for(i in 2:length(gps$Longitude)) {
    DistTotal[i] = Dist[i] + DistTotal[i-1]
}
gps$DistTotal <- DistTotal
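As an alternative to the second loop, base R's cumsum computes a running total in one call. A minimal sketch on made-up segment lengths:

```r
# made-up segment lengths in miles; the first point has no previous point, so 0
Dist <- c(0, 0.2, 0.5, 0.1)

# running total of the distances so far
DistTotal <- cumsum(Dist)
DistTotal
```

With the real data, this would be gps$DistTotal <- cumsum(gps$Dist).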

And finally, we can plot our elevation data using geom_ribbon and ggplot:

library(ggplot2)
ggplot(gps, aes(x = DistTotal)) +
geom_ribbon(aes(ymin = 600, # change this to match your min below
        ymax = SMA),
    fill = "#1B9E77") + # put your altitude variable here if not using moving averages
labs(x = "Miles",
    y = "Elevation") +
scale_y_continuous(limits = c(600,1200)) # change this to limits appropriate for your region

Elevation profile in ggplot2

Code and data available in a gist.

Using Line Segments to Compare Values in R

Sometimes you want to create a graph that will allow the viewer to see in one glance:

  • The original value of a variable
  • The new value of the variable
  • The change between old and new

One method I like for this uses geom_segment and geom_point from the ggplot2 package.

First, let's load ggplot2 and our data:

library(ggplot2)

data <- as.data.frame(USPersonalExpenditure) # data from package datasets
data$Category <- as.character(rownames(USPersonalExpenditure)) # this makes things simpler later

Next, we'll set up our plot and axes:

ggplot(data,
    aes(y = Category)) +
labs(x = "Expenditure",
    y = "Category") +

For geom_segment, we need to provide four variables. (Sometimes two of the four will be the same, like in this case.) x and y provide the start points, and xend and yend provide the endpoints.

In this case, we want to show the change between 1940 and 1960 for each category. Therefore our variables are the following:

  • x: "1940"
  • y: Category
  • xend: "1960"
  • yend: Category

geom_segment(aes(x = data$"1940",
  y = Category,
  xend = data$"1960",
  yend = Category),
 size = 1) +

Next, we want to plot points for the 1940 and 1960 values. We could do the same for the 1945, 1950, and 1955 values, if we wanted to.

geom_point(aes(x = data$"1940",
    color = "1940"),
    size = 4, shape = 15) +
geom_point(aes(x = data$"1960",
    color = "1960"),
    size = 4, shape = 15) +

Finally, we'll finish up by touching up the legend for the plot:

scale_color_discrete(name = "Year") +
theme(legend.position = "bottom")

geom_segment, then geom_point
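Since the code above is split across several steps, here's one way the pieces might be assembled into a single runnable block. This sketch makes two changes of its own: it refers to the year columns with backticks inside aes instead of data$"1940", and it leaves the segment width at its default:

```r
library(ggplot2)

data <- as.data.frame(USPersonalExpenditure) # data from package datasets
data$Category <- rownames(USPersonalExpenditure)

p <- ggplot(data, aes(y = Category)) +
    labs(x = "Expenditure", y = "Category") +
    # segments first, so the points are drawn on top of them
    geom_segment(aes(x = `1940`, xend = `1960`, yend = Category)) +
    geom_point(aes(x = `1940`, color = "1940"), size = 4, shape = 15) +
    geom_point(aes(x = `1960`, color = "1960"), size = 4, shape = 15) +
    scale_color_discrete(name = "Year") +
    theme(legend.position = "bottom")
p
```

Backticks let aes look the columns up in the data frame passed to ggplot, which tends to behave better than data$ if you later facet or subset the data.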

The order of geom_segment and the geom_points matters. The first geom line in the code will get plotted first. Therefore, if you want the points displayed over the segments, put the segments first in the code. Likewise, if you want the segments displayed over the points, put the points first in the code.

For example, we could change the middle section of the code to:

geom_point(aes(x = data$"1940",
  color = "1940"),
  size = 4, shape = 15) +
geom_point(aes(x = data$"1960",
  color = "1960"),
  size = 4, shape = 15) +

geom_segment(aes(x = data$"1940",
    y = Category,
    xend = data$"1960",
    yend = Category),
  size = 1) +

And the output would look like:

geom_point, then geom_segment

Similarly, if you have points that will be overlapping, make sure you think about which of the point lines you want R to plot first.

The code is available in a gist.

Calculating Gini Coefficients for a Number of Locales at Once in R

The Gini coefficient is a measure of the inequality of a distribution, most commonly used to compare inequality in income or wealth among countries.

Let's first generate some random data to analyze. You can download my random data or use the code below to generate your own. Of course, if you generate your own, your graphs and results will be different from those shown below.

city <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
income <- sample(1:100000,
    100,
    replace = TRUE)
cities <- data.frame(city, income)

Next, let's graph our data:

library(ggplot2)
ggplot(cities,
    aes(income)) +
    stat_density(geom = "path",
        position = "identity") +
    facet_wrap(~ city, ncol = 2)

Density plot of each city's incomes // Your results will differ if using random data

The Gini coefficient is easy enough to calculate in R for a single locale using the gini function from the reldist package.

library(reldist)
gini(cities[which(cities$city == "A"), ]$income)
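For intuition about what the number means, here's a sketch of one common textbook formula for the Gini coefficient. The helper gini_simple is hypothetical and is not reldist's implementation, so its results may not match gini exactly in every case:

```r
# hypothetical helper: textbook Gini formula for non-negative values
gini_simple <- function(x) {
    x <- sort(x)
    n <- length(x)
    sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

gini_simple(c(5, 5, 5, 5)) # everyone has the same income: 0
gini_simple(c(0, 0, 0, 1)) # one person has everything: 0.75, i.e. (n - 1) / n
```

Values near 0 indicate an even distribution, and values near 1 indicate that a few observations hold most of the total.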

But we don't want to replicate this code over and over to calculate the Gini coefficient for a large number of locales. We also want the coefficients to be in a data frame for easy use in R or for export for use in another program.

There are many ways to automate a function to run over many subsets of a data frame. The most straightforward in our particular case is aggregate:

ginicities <- aggregate(income ~ city,
    data = cities,
    FUN = "gini")
names(ginicities) <- c("city", "gini")

> ginicities
   city      gini
1     A 0.2856827
2     B 0.3639070
3     C 0.3288934
4     D 0.1863783
5     E 0.3565739
6     F 0.2587475
7     G 0.3022642
8     H 0.3795288
9     I 0.3311034
10    J 0.2496933

And finally, let's go ahead and export our data using write.csv:

write.csv(ginicities, "gini.csv",
    row.names = FALSE)

While you're at it, you might want to try using other functions on your dataset, such as mean, median, and length.
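As a rough sketch of that idea, several summary functions can be applied per city in one go with sapply and tapply. The data below is regenerated with a seed so the block runs on its own, and the name stats is just illustrative:

```r
set.seed(1) # only so the made-up data is reproducible
cities <- data.frame(
    city = rep(LETTERS[1:10], times = 10),
    income = sample(1:100000, 100, replace = TRUE))

# one column per summary function, one row per city
stats <- sapply(list(mean = mean, median = median, n = length),
    function(f) tapply(cities$income, cities$city, f))
stats
```

The result is a matrix, so wrap it in as.data.frame if you want to export it with write.csv.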

The full code is available in a gist.
