Tagged with datasets

Using Line Segments to Compare Values in R

Sometimes you want to create a graph that lets the viewer see at a glance:

  • The original value of a variable
  • The new value of the variable
  • The change between old and new

One method I like for this combines geom_segment and geom_point from the ggplot2 package.

First, let's load ggplot2 and our data:

library(ggplot2)

data <- as.data.frame(USPersonalExpenditure) # data from package datasets
data$Category <- as.character(rownames(USPersonalExpenditure)) # this makes things simpler later
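Before plotting, it can help to look at what the conversion produced. The following is a small sketch using only base R; the name expenditure and the Change column are my own additions for illustration, not part of the original code. It shows that the year columns keep names that start with a digit, which is why they need quotes or backticks when referenced.

```r
# Same conversion as above, stored under a hypothetical name to avoid masking data()
expenditure <- as.data.frame(USPersonalExpenditure)
expenditure$Category <- rownames(USPersonalExpenditure)

# The year columns keep their original names, which are not valid R identifiers
names(expenditure)

# Backticks (or quotes after $) are needed to reference such columns
expenditure$`1940`

# The change between old and new, which the segments will show (illustrative column)
expenditure$Change <- expenditure$`1960` - expenditure$`1940`
expenditure[, c("Category", "1940", "1960", "Change")]
```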

Next, we'll set up our plot and axes:

ggplot(data,
    aes(y = Category)) +
labs(x = "Expenditure",
    y = "Category") +

For geom_segment, we need to provide four variables: x and y give the start point of each segment, and xend and yend give the endpoint. (Sometimes two of the four will be the same, like in this case, where each segment is horizontal.)

In this case, we want to show the change between 1940 and 1960 for each category. Therefore our variables are the following:

  • x: "1940"
  • y: Category
  • xend: "1960"
  • yend: Category

geom_segment(aes(x = `1940`,
  y = Category,
  xend = `1960`,
  yend = Category),
 size = 1) +

Next, we want to plot points for the 1940 and 1960 values. We could do the same for the 1945, 1950, and 1955 values, if we wanted to.

geom_point(aes(x = `1940`,
    color = "1940"),
    size = 4, shape = 15) +
geom_point(aes(x = `1960`,
    color = "1960"),
    size = 4, shape = 15) +

Finally, we'll finish up by touching up the legend for the plot:

scale_color_discrete(name = "Year") +
theme(legend.position = "bottom")
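Because the plot was built up in fragments, here is a sketch of how the pieces might look assembled into a single call. A few things are my own choices rather than the original code: the names expenditure and p, referring to the year columns with backticks inside aes(), and linewidth in place of size for the segment (linewidth assumes ggplot2 3.4.0 or later; on older versions use size = 1 as in the fragments).

```r
library(ggplot2)

# Data from the datasets package, with the row names kept as a column
expenditure <- as.data.frame(USPersonalExpenditure)
expenditure$Category <- rownames(USPersonalExpenditure)

p <- ggplot(expenditure, aes(y = Category)) +
  labs(x = "Expenditure", y = "Category") +
  # One horizontal segment per category, from the 1940 value to the 1960 value
  geom_segment(aes(x = `1940`, xend = `1960`, yend = Category),
               linewidth = 1) +  # assumption: ggplot2 >= 3.4.0
  # Square points at each end; the constant color labels feed the legend
  geom_point(aes(x = `1940`, color = "1940"), size = 4, shape = 15) +
  geom_point(aes(x = `1960`, color = "1960"), size = 4, shape = 15) +
  scale_color_discrete(name = "Year") +
  theme(legend.position = "bottom")

print(p)
```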

geom_segment, then geom_point

The order of geom_segment and the geom_point calls matters. ggplot2 draws layers in the order they appear in the code, so each layer is drawn on top of the ones before it. Therefore, if you want the points displayed over the segments, put the segments first in the code. Likewise, if you want the segments displayed over the points, put the points first.

For example, we could change the middle section of the code to:

geom_point(aes(x = `1940`,
  color = "1940"),
  size = 4, shape = 15) +
geom_point(aes(x = `1960`,
  color = "1960"),
  size = 4, shape = 15) +

geom_segment(aes(x = `1940`,
    y = Category,
    xend = `1960`,
    yend = Category),
  size = 1) +

And the output would look like:

geom_point, then geom_segment

Similarly, if you have points that will overlap, think about which of the geom_point lines you want R to plot first, since the later one will be drawn over it.
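One way to check the drawing order without rendering anything is to look at the layers stored in the plot object. This is only a sketch: layer_geoms is a hypothetical helper of mine, and it assumes the plot object exposes its layers as p$layers, which is how ggplot2 has stored them in the versions I am aware of.

```r
library(ggplot2)

expenditure <- as.data.frame(USPersonalExpenditure)
expenditure$Category <- rownames(USPersonalExpenditure)

# Build the two layers once so they can be added in either order
segments <- geom_segment(aes(x = `1940`, xend = `1960`, yend = Category))
points   <- geom_point(aes(x = `1960`), size = 4, shape = 15)

base <- ggplot(expenditure, aes(y = Category))
points_on_top   <- base + segments + points  # segments drawn first
segments_on_top <- base + points + segments  # points drawn first

# Hypothetical helper: geom class of each layer, in drawing order
layer_geoms <- function(p) {
  vapply(p$layers, function(layer) class(layer$geom)[1], character(1))
}

layer_geoms(points_on_top)
layer_geoms(segments_on_top)
```

The last entry in each result is the layer that ends up on top.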

The code is available in a gist.


Stacked Bar Charts in R

Reshape Long to Wide

Let's use the Loblolly dataset from the datasets package. These data track the growth of some loblolly pine trees.

> Loblolly[1:10,]
   height age Seed
1    4.51   3  301
15  10.89   5  301
29  28.72  10  301
43  41.74  15  301
57  52.70  20  301
71  60.92  25  301
2    4.55   3  303
16  10.92   5  303
30  29.07  10  303
44  42.83  15  303

First, we need to convert the data to wide form, so each age (i.e. 3, 5, 10, 15, 20, 25) will have its own variable.

wide <- reshape(Loblolly,
    v.names = "height",
    timevar = "age",
    idvar = "Seed",
    direction = "wide")

> wide[1:5,]
  Seed height.3 height.5 height.10 height.15 height.20 height.25
1  301     4.51    10.89     28.72     41.74     52.70     60.92
2  303     4.55    10.92     29.07     42.83     53.88     63.39
3  305     4.79    11.37     30.21     44.40     55.82     64.10
4  307     3.91     9.48     25.66     39.07     50.78     59.07
5  309     4.81    11.20     28.66     41.66     53.31     63.05
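A quick sanity check on the reshape can be sketched as follows. The names loblolly and wide_check are mine, and the as.data.frame() call is a precaution I added to drop the extra classes Loblolly carries; it is not required by the original code. The wide table should have one row per seed and one height column per age.

```r
# Plain data frame copy of the data (precaution, see note above)
loblolly <- as.data.frame(Loblolly)

wide_check <- reshape(loblolly,
    v.names = "height",
    timevar = "age",
    idvar = "Seed",
    direction = "wide")

ages <- sort(unique(loblolly$age))

# Long form: one row per seed and age
nrow(loblolly)

# Wide form: one row per seed, one column per age plus the Seed column
dim(wide_check)
length(unique(loblolly$Seed))

# Column names are built from v.names and each value of timevar
names(wide_check)
```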

Create Variables

Then we want to create new columns showing how much each tree has grown between data points. For example, instead of knowing a tree's height at age 10, we want to know how much it's grown between age 5 and age 10, so that can be a bar in our graph.

wide$h0.3 <- wide$height.3
wide$h3.5 <- wide$height.5 - wide$height.3
wide$h5.10 <- wide$height.10 - wide$height.5
wide$h10.15 <- wide$height.15 - wide$height.10
wide$h15.20 <- wide$height.20 - wide$height.15
wide$h20.25 <- wide$height.25 - wide$height.20
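The six assignments above can also be written with diff(), which takes differences between consecutive values. This is a sketch of that alternative, not the original method; loblolly, wide_check, heights, and growth are names I made up for it. The final check confirms that the increments for each tree add back up to its height at age 25, which is what makes them stackable.

```r
loblolly <- as.data.frame(Loblolly)
wide_check <- reshape(loblolly,
    v.names = "height",
    timevar = "age",
    idvar = "Seed",
    direction = "wide")

ages <- c(3, 5, 10, 15, 20, 25)
heights <- as.matrix(wide_check[, paste0("height.", ages)])

# First column: height at age 3. Remaining columns: growth between consecutive ages.
# apply() returns one column per tree, so t() puts trees back in rows.
growth <- cbind(heights[, 1], t(apply(heights, 1, diff)))
colnames(growth) <- c("h0.3", "h3.5", "h5.10", "h10.15", "h15.20", "h20.25")

# The stacked segments should sum to each tree's final height
all.equal(unname(rowSums(growth)), unname(heights[, "height.25"]))
```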

Plot Stacked Bar Chart

Finally, we want to plot all the new data points:

library(RColorBrewer)
sequential <- brewer.pal(6, "BuGn")
barplot(t(wide[,8:13]), # columns 8 to 13 are the growth variables created above
    names.arg = wide$Seed, # x-axis labels
    cex.names = 0.7, # makes x-axis labels small enough to show all
    col = sequential, # colors
    xlab = "Seed Source",
    ylab = "Height, Feet",
    xlim = c(0,20), # xlim and width together leave space for the legend
    width = 1)
legend("bottomright",
    legend = c("20-25", "15-20", "10-15", "5-10", "3-5", "0-3"), # in order from top to bottom
    fill = sequential[6:1], # 6:1 reorders so legend order matches graph
    title = "Years")

Stacked bar chart

If you decide you'd rather have clustered bars instead of stacked bars, you can just add the option beside = TRUE to the barplot call.
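Here is a minimal sketch of the clustered version, self-contained so it can be run on its own. The names loblolly, wide_check, heights, increments, and bar_mids are mine, and terrain.colors() is a base R stand-in for the RColorBrewer palette so the sketch has no extra dependencies.

```r
loblolly <- as.data.frame(Loblolly)
wide_check <- reshape(loblolly,
    v.names = "height",
    timevar = "age",
    idvar = "Seed",
    direction = "wide")

# One column per tree, one row per age
heights <- t(as.matrix(wide_check[, -1]))

# Height at age 3, then growth between consecutive ages (diff works down each column)
increments <- rbind(heights[1, ], diff(heights))

# beside = TRUE draws the six bars for each tree side by side instead of stacked
bar_mids <- barplot(increments,
    beside = TRUE,
    names.arg = as.character(wide_check$Seed),
    cex.names = 0.7,
    col = terrain.colors(6), # stand-in palette
    xlab = "Seed Source",
    ylab = "Growth, Feet")
legend("topright",
    legend = c("0-3", "3-5", "5-10", "10-15", "15-20", "20-25"),
    fill = terrain.colors(6),
    title = "Years")
```

With beside = TRUE, barplot returns a matrix of bar midpoints with one row per age interval and one column per tree, which is handy if you want to add labels afterwards.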

The full code is available in a gist.


Descriptive Statistics of Groups in R

The sleep data set, provided by the datasets package, shows the effects of two different drugs on ten patients. extra is the increase in hours of sleep; group is the drug given, 1 or 2; and ID is the patient ID, 1 to 10.

I'll be using this data set to show how to perform descriptive statistics of groups within a data set, when the data set is long (as opposed to wide).

First, we'll need to load up the psych package. The datasets package containing our data is probably already loaded.

library(psych)

The describe.by function in the psych package does the work for us here. (Newer releases of psych name this function describeBy and deprecate the describe.by name.) It groups our data by a variable we give it and outputs descriptive statistics for each of the groups.

> describe.by(sleep, sleep$group)
group: 1
       var  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
extra    1 10 0.75 1.79   0.35    0.68 1.56 -1.6  3.7   5.3 0.42    -1.30 0.57
group*   2 10 1.00 0.00   1.00    1.00 0.00  1.0  1.0   0.0  NaN      NaN 0.00
ID*      3 10 5.50 3.03   5.50    5.50 3.71  1.0 10.0   9.0 0.00    -1.56 0.96
------------------------------------------------------------ 
group: 2
       var  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
extra    1 10 2.33 2.00   1.75    2.24 2.45 -0.1  5.5   5.6 0.28    -1.66 0.63
group*   2 10 2.00 0.00   2.00    2.00 0.00  2.0  2.0   0.0  NaN      NaN 0.00
ID*      3 10 5.50 3.03   5.50    5.50 3.71  1.0 10.0   9.0 0.00    -1.56 0.96

Of course, there are other ways to find the descriptive statistics of groups. Since you'll probably be doing further analysis on the groups, and may be splitting the whole data set into subsets by group anyway, it may be easiest to just use describe on each subset. But that's a topic for another post. This is an easy way to look at many groups at once, and a quick look is particularly useful for descriptive statistics.
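As one of those other ways, here is a sketch that uses only base R, with no psych required. It splits extra by group and computes a few of the same statistics; group_stats is a name I chose for illustration. Rounded to two decimals, the means, standard deviations, and medians should agree with the extra rows in the output above.

```r
# Split the sleep increase by drug group, then summarise each piece
by_group <- split(sleep$extra, sleep$group)

group_stats <- sapply(by_group, function(x) {
  c(n = length(x),
    mean = mean(x),
    sd = sd(x),
    median = median(x),
    min = min(x),
    max = max(x))
})

# One column per group, one row per statistic
round(group_stats, 2)
```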
