Tagged with analysis

Perform a Function on Each File in R

Sometimes you might have several data files and want to use R to perform the same function across all of them. Or maybe you have multiple files and want to systematically combine them into one file without having to open each file and manually copy the data out.

Fortunately, it's not complicated to use R to systematically iterate across files.

Finding or Choosing the Names of Data Files

There are multiple ways to find or choose the names of the files you want to analyze.

You can explicitly state the file names or you can get R to find any files with a particular extension.

Explicitly Stating File Names

fileNames <- c("sample1.csv", "sample2.csv")

Finding Files with a Specific Extension

In this case, we use Sys.glob from the base package to find all files in the working directory whose names match the wildcard pattern "*.csv".

fileNames <- Sys.glob("*.csv")
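
If the files live somewhere other than the working directory, the directory can be made part of the pattern. As a rough sketch, the example below also shows list.files from the base package, which takes a regular expression rather than a wildcard. The temporary directory and the file names in it are made up for illustration.

```r
# create a throwaway directory with two CSV files and one text file
# (hypothetical names, for illustration only)
dataDir <- file.path(tempdir(), "eachFile-demo")
dir.create(dataDir, showWarnings = FALSE)
for (name in c("sample1.csv", "sample2.csv", "notes.txt")) {
    writeLines("Widgets", file.path(dataDir, name))
}

# wildcard match, with the directory as part of the pattern:
globNames <- Sys.glob(file.path(dataDir, "*.csv"))

# regular-expression match; "\\.csv$" means "ends in .csv":
listNames <- list.files(dataDir, pattern = "\\.csv$", full.names = TRUE)

basename(globNames)
basename(listNames)
```

Both calls should pick up the two CSV files and skip notes.txt.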

Iterating Across All Files

We'll start with a loop and then we can add whatever functions we want to the inside of the loop:

for (fileName in fileNames) {

    # read data:
    sample <- read.csv(fileName,
        header = TRUE,
        sep = ",")

    # add more stuff here

}

For example, we could add one to every value in the "Widgets" column of each file and overwrite the old data with the new data:

for (fileName in fileNames) {

    # read old data:
    sample <- read.csv(fileName,
        header = TRUE,
        sep = ",")

    # add one to every widget value in every file:
    sample$Widgets <- sample$Widgets + 1

    # overwrite old data with new data:
    write.table(sample, 
        fileName,
        append = FALSE,
        quote = FALSE,
        sep = ",",
        row.names = FALSE,
        col.names = TRUE)

}

Or we could do the same thing, but create a new copy of each file:

extension <- "csv"

fileNames <- Sys.glob(paste("*.", extension, sep = ""))

fileNumbers <- seq_along(fileNames)

for (fileNumber in fileNumbers) {

    # build the new file name, e.g. "sample1.csv" becomes "new-sample1.csv":
    newFileName <- paste("new-",
        sub(paste("\\.", extension, "$", sep = ""), "", fileNames[fileNumber]),
        ".", extension, sep = "")

    # read old data:
    sample <- read.csv(fileNames[fileNumber],
        header = TRUE,
        sep = ",")

    # add one to every widget value in every file:
    sample$Widgets <- sample$Widgets + 1

    # write new data to new files:
    write.table(sample, 
        newFileName,
        append = FALSE,
        quote = FALSE,
        sep = ",",
        row.names = FALSE,
        col.names = TRUE)

}

In the above example, we used the paste and sub functions from the base package to automatically create new file names based on the old file names.
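
The same renaming can be sketched with the file-name helpers in the tools package, which ships with R. Here makeNewName is a hypothetical helper written for this example, not part of any package, and it returns only the file name, without any directory.

```r
# build "new-<name>.<ext>" from an existing file name
# (makeNewName is a hypothetical helper for illustration)
makeNewName <- function(fileName, prefix = "new-") {
    stem <- tools::file_path_sans_ext(basename(fileName))  # name without extension
    ext <- tools::file_ext(fileName)                       # extension without the dot
    paste0(prefix, stem, ".", ext)
}

makeNewName(c("sample1.csv", "sample2.csv"))
```

Because only the last extension is stripped, a name like "my.data.csv" keeps its inner dot.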

Or we could instead use each dataset to create an entirely new dataset, where each row is based on data from one file:

fileNames <- Sys.glob("*.csv")

for (fileName in fileNames) {

    # read original data:
    sample <- read.csv(fileName,
        header = TRUE,
        sep = ",")

    # create new data based on contents of original file:
    allWidgets <- data.frame(
        File = fileName,
        Widgets = sum(sample$Widgets))

    # write new data to separate file:
    write.table(allWidgets, 
        "Output/sample-allSamples.csv",
        append = TRUE,
        sep = ",",
        row.names = FALSE,
        col.names = FALSE)

}

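In the above example, data.frame is used to create a new one-row data frame from each data file. The append option of write.table is set to TRUE so that each row is added below the rows created from the other data files, and col.names is set to FALSE so that the header is not repeated for every row. Note that the Output directory must already exist, and that running the script a second time appends the same rows again.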
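
One way to sidestep duplicate rows on a rerun is to collect the rows in memory and write the file once. This is only a sketch under made-up assumptions: the temporary directory, the two small input files, and the output name allSamples-summary.csv are all invented for the example.

```r
# set up two small input files in a temporary directory (made-up data)
dataDir <- file.path(tempdir(), "append-demo")
dir.create(dataDir, showWarnings = FALSE)
write.csv(data.frame(Widgets = c(1, 2, 3)),
    file.path(dataDir, "sample1.csv"), row.names = FALSE)
write.csv(data.frame(Widgets = c(10, 20)),
    file.path(dataDir, "sample2.csv"), row.names = FALSE)

fileNames <- Sys.glob(file.path(dataDir, "sample*.csv"))

# build one summary row per file and keep the rows in a list:
rows <- lapply(fileNames, function(fileName) {
    sample <- read.csv(fileName, header = TRUE)
    data.frame(File = basename(fileName),
        Widgets = sum(sample$Widgets))
})

# stack the rows and write them once, with a header:
allWidgets <- do.call(rbind, rows)
outFile <- file.path(dataDir, "allSamples-summary.csv")
write.csv(allWidgets, outFile, row.names = FALSE)
```

Since the output file is overwritten rather than appended to, running this twice leaves the same two rows.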

Those are just a few examples of how you can use R to perform the same function(s) on a large number of files without having to manually run each one. I'm sure you can think of more uses.

All the files are available on GitHub. You can see how eachFile.R, eachfile-newNames.R, and eachFile-append.R each do something different to the sample datasets.

An anonymous commenter pointed out that for big data files, he or she often uses:

setwd("my_working_directory_path")
dat <- lapply(dir(pattern = ".csv"), function(file) {
    dat.i <- read.csv(file)
    # ... data formatting, subsetting, ...
    return(dat.i)
})
dat <- do.call("rbind", dat)

I haven't tested it, but it is likely more efficient than using for loops in R.
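
A self-contained variation on the commenter's idea, for trying it out without touching real data, might look like the sketch below. The temporary directory, the sample files, and the extra File column recording where each row came from are my own assumptions, not part of the commenter's code.

```r
# made-up input files in a temporary directory
dataDir <- file.path(tempdir(), "combine-demo")
dir.create(dataDir, showWarnings = FALSE)
write.csv(data.frame(Widgets = 1:3),
    file.path(dataDir, "sample1.csv"), row.names = FALSE)
write.csv(data.frame(Widgets = 4:5),
    file.path(dataDir, "sample2.csv"), row.names = FALSE)

fileNames <- list.files(dataDir, pattern = "\\.csv$", full.names = TRUE)

# read every file, tag its rows with the source file, then stack them:
combined <- do.call(rbind, lapply(fileNames, function(fileName) {
    dat <- read.csv(fileName)
    dat$File <- basename(fileName)
    dat
}))

combined
```

This only works as written when every file has the same columns, since rbind needs matching column names.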


Calculating Gini Coefficients for a Number of Locales at Once in R

The Gini coefficient is a measure of the inequality of a distribution, most commonly used to compare inequality in income or wealth among countries.

Let's first generate some random data to analyze. You can download my random data or use the code below to generate your own. Of course, if you generate your own, your graphs and results will be different from those shown below.

city <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
income <- sample(1:100000,
    100,
    replace = TRUE)
cities <- data.frame(city, income)
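
Note that data.frame recycles the ten city labels across the 100 incomes, so each city ends up with ten rows. If you want the same random data every time you run the script, one option is to set the random seed first. In the sketch below, makeCities is a hypothetical helper and the seed value 42 is arbitrary.

```r
# generate the random city data reproducibly
# (makeCities is a hypothetical helper; the seed value is arbitrary)
makeCities <- function(seed) {
    set.seed(seed)
    data.frame(
        city = rep(LETTERS[1:10], times = 10),  # the recycling, made explicit
        income = sample(1:100000, 100, replace = TRUE))
}

first <- makeCities(42)
second <- makeCities(42)

# the two data frames are the same because the seed is the same:
identical(first, second)
```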

Next, let's graph our data:

library(ggplot2)
ggplot(cities,
    aes(income)) +
    stat_density(geom = "path",
        position = "identity") +
    facet_wrap(~ city, ncol = 2)

Density plot of each city's incomes // Your results will differ if using random data

The Gini coefficient is easy enough to calculate in R for a single locale using the gini function from the reldist package.

library(reldist)
gini(cities[which(cities$city == "A"), ]$income)
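
To make the calculation less of a black box, here is a minimal base R sketch of one common sample formula for the Gini coefficient: with the values sorted in ascending order, G = sum((2i - n - 1) * x_i) / (n * sum(x)). The giniManual function is a hypothetical helper for illustration. It ignores weights and applies no small-sample correction, so it should only be compared with reldist's gini on unweighted data.

```r
# unweighted sample Gini coefficient, no small-sample correction
# (giniManual is a hypothetical helper, not part of reldist)
giniManual <- function(x) {
    x <- sort(x)               # ascending order
    n <- length(x)
    i <- seq_len(n)            # rank of each value
    sum((2 * i - n - 1) * x) / (n * sum(x))
}

giniManual(c(1, 3))           # 0.25
giniManual(c(5, 5, 5))        # 0: everyone has the same income
giniManual(c(0, 0, 0, 10))    # 0.75: one person has everything
```

With this formula, the most unequal case for n values is (n - 1) / n rather than exactly 1, which is why the last example gives 0.75.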

But we don't want to replicate this code over and over to calculate the Gini coefficient for a large number of locales. We also want the coefficients to be in a data frame for easy use in R or for export for use in another program.

There are many ways to automate a function to run over many subsets of a data frame. The most straightforward in our particular case is aggregate:

ginicities <- aggregate(income ~ city,
    data = cities,
    FUN = "gini")
names(ginicities) <- c("city", "gini")

> ginicities
   city      gini
1     A 0.2856827
2     B 0.3639070
3     C 0.3288934
4     D 0.1863783
5     E 0.3565739
6     F 0.2587475
7     G 0.3022642
8     H 0.3795288
9     I 0.3311034
10    J 0.2496933

And finally, let's go ahead and export our data using write.csv:

write.csv(ginicities, "gini.csv",
    row.names = FALSE)

While you're at it, you might want to try using other functions on your dataset, such as mean, median, and length.
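
As a sketch of that idea, aggregate can also take a function that returns several statistics at once. The three-city data set below is made up so that the example runs on its own. The do.call step is there because aggregate stores the statistics in a single matrix column.

```r
# small made-up data set: three cities, ten incomes each
set.seed(1)
cities <- data.frame(
    city = rep(c("A", "B", "C"), each = 10),
    income = sample(1:100000, 30, replace = TRUE))

# compute several statistics per city in one call:
summaries <- aggregate(income ~ city,
    data = cities,
    FUN = function(x) c(mean = mean(x), median = median(x), n = length(x)))

# aggregate returns the three statistics as one matrix column;
# flatten it into ordinary columns:
summaries <- do.call(data.frame, summaries)

summaries
```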

The full code is available in a gist.



Descriptive Statistics of Groups in R

The sleep data set, provided by the datasets package, shows the effects of two different drugs on ten patients. The extra variable is the increase in hours of sleep; group is the drug given, 1 or 2; and ID is the patient ID, 1 to 10.

I'll be using this data set to show how to perform descriptive statistics of groups within a data set, when the data set is long (as opposed to wide).

First, we'll need to load up the psych package. The datasets package containing our data is probably already loaded.

library(psych)

The describeBy function in the psych package (called describe.by in older versions of the package) is what does the magic for us here. It groups our data by a variable we give it and outputs descriptive statistics for each of the groups.

> describeBy(sleep, sleep$group)
group: 1
       var  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
extra    1 10 0.75 1.79   0.35    0.68 1.56 -1.6  3.7   5.3 0.42    -1.30 0.57
group*   2 10 1.00 0.00   1.00    1.00 0.00  1.0  1.0   0.0  NaN      NaN 0.00
ID*      3 10 5.50 3.03   5.50    5.50 3.71  1.0 10.0   9.0 0.00    -1.56 0.96
------------------------------------------------------------ 
group: 2
       var  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
extra    1 10 2.33 2.00   1.75    2.24 2.45 -0.1  5.5   5.6 0.28    -1.66 0.63
group*   2 10 2.00 0.00   2.00    2.00 0.00  2.0  2.0   0.0  NaN      NaN 0.00
ID*      3 10 5.50 3.03   5.50    5.50 3.71  1.0 10.0   9.0 0.00    -1.56 0.96

Of course, there are other ways to find the descriptive statistics of groups. Since you'll probably be doing further analysis on the groups, and may be splitting the data into subsets by group anyway, it may be easiest to just use describe on each subset. But that's a topic for another post. This is an easy way to get a quick look at many groups, and a quick look is much of the point of descriptive statistics.
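
For a quick look without the psych package, base R can produce a smaller per-group summary. This is a minimal sketch using split and sapply on the extra column only; the choice of statistics is mine and covers just part of what the psych output reports.

```r
# per-group statistics for the extra column of the sleep data set,
# using only base R (sleep comes from the datasets package)
groupStats <- sapply(split(sleep$extra, sleep$group), function(x) {
    c(n = length(x),
        mean = mean(x),
        sd = sd(x),
        median = median(x))
})

# one column per group, one row per statistic:
round(groupStats, 2)
```

The means and medians here should line up with the extra rows in the output above.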
