Tagged with setup

Truncate by Delimiter in R

Sometimes, you only need to analyze part of the data stored as a vector. In this example, there is a list of patents. Each patent has been assigned to one or more patent classes. Let's say that we want to analyze the dataset based on only the first patent class listed for each patent.

patents <- data.frame(
    patent = 1:30,
    class = c("405", "33/209", "549/514", "110", "540", "43", 
    "315/327", "540", "536/514", "523/522", "315", 
    "138/248/285", "24", "365", "73/116/137", "73/200", 
    "252/508", "96/261", "327/318", "426/424/512", 
    "75/423", "430", "416", "536/423/530", "381/181", "4", 
    "340/187", "423/75", "360/392/G9B", "524/106/423"))

We can use regular expressions to truncate each element of the vector just before the first "/".

grep,grepl,sub,gsub,regexpr,gregexpr, and regexec are all functions in the base package that allow you to use regular expressions within each element of a character vector. sub and gsub allow you to replace within each element of the vector. sub replaces the first match within each element, while gsub replaces all matches within each element. In this case, we want to remove everything from the first "/" on, and we want to replace it with nothing. Here's how we can use sub to do that:

patents$primaryClass <- sub("/.*", "", patents$class)

> table(patents$primaryClass)

110 138  24 252 315 327  33 340 360 365 381   4 405 416 423 426  43 430 523 524 
  1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
536 540 549  73  75  96 
  2   2   1   2   1   1

This post is one part of my series on Text to Columns.

Citations and Further Reading

Tagged , , , , ,

Only Load Data If Not Already Open in R

I often find it beneficial to check to see whether or not a dataset is already loaded into R at the beginning of a file. This is particularly helpful when I'm dealing with a large file that I don't want to load repeatedly, and when I might be using the same dataset with multiple R scripts or re-running the same script while making changes to the code.

To check to see if an object with that name is already loaded, we can use the exists function from the base package. We can then wrap our read.csv command with an if statement to cause the file to only load if an object with that name is not already loaded.

if(!exists("largeData")) {
    largeData <- read.csv("huge-file.csv",
        header = TRUE)
}

You will probably also find it useful to use the "colClasses" option of read.csv or read.table to help the file load faster and make sure your data are in the right format. For example:

if(!exists("largeData")) {
    largeData <- read.csv("huge-file.csv",
        header = TRUE,
        colClasses = c("factor", "integer", "character", "integer", 
            "integer", "character"))
}

This post is one part of my series on dealing with large datasets.

Tagged , , , , , ,

Using colClasses to Load Data More Quickly in R

Specifying a colClasses argument to read.table or read.csv can save time on importing data, while also saving steps to specify classes for each variable later.

For example, loading a 893 MB took 441 seconds to load when not using colClasses, but only 268 seconds to load when using colClasses. The system.time function in base can help you check your own times.

Without specifying colClasses:

   user  system elapsed 
441.224   8.200 454.155

When specifying colClasses:

   user  system elapsed 
268.036   6.096 284.099

Dates that are in the form %Y-%m-%d or Y/%m/%d will import correctly. This tip allows you to import dates properly for dates in other formats.

system.time(largeData <- read.csv("huge-file.csv",
    header = TRUE,
    colClasses = c("character", "character", "complex", 
        "factor", "factor", "character", "integer", 
        "integer", "numeric", "character", "character",
        "Date", "integer", "logical")))

If there aren't any classes that you want to change from their defaults, you can read in the first few rows, determine the classes from that, and then import the rest of the file:

sampleData <- read.csv("huge-file.csv", header = TRUE, nrows = 5)
classes <- sapply(sampleData, class)
largeData <- read.csv("huge-file.csv", header = TRUE, colClasses = classes)
str(largeData)

If you aren't concerned about the time it takes to read the data file, but instead just want the classes to be correct on import, you have the option of only specifying certain classes:

smallData <- read.csv("small-file.csv", 
    header = TRUE,
    colClasses=c("variableName"="character"))

> class(smallData$variableName)
[1] "character"

This post is one part of my series on dealing with large datasets.

Citations and Further Reading

In a comment, Michael pointed out that if you don't need all the columns in your dataset, providing their colClass as NULL will exclude them from being loaded.

Tagged , , , , , ,

Date Formats in R

Importing Dates

Dates can be imported from character, numeric, POSIXlt, and POSIXct formats using the as.Date function from the base package.

If your data were exported from Excel, they will possibly be in numeric format. Otherwise, they will most likely be stored in character format.

Importing Dates from Character Format

If your dates are stored as characters, you simply need to provide as.Date with your vector of dates and the format they are currently stored in. The possible date segment formats are listed in a table below.

For example,

"05/27/84" is in the format %m/%d/%y, while "May 27 1984" is in the format %B %d %Y.

To import those dates, you would simply provide your dates and their format (if not specified, it tries %Y-%m-%d and then %Y/%m/%d):

dates <- c("05/27/84", "07/07/05")
betterDates <- as.Date(dates,
    format = "%m/%d/%y")
> betterDates
[1] "1984-05-27" "2005-07-07"

This outputs the dates in the ISO 8601 international standard format %Y-%m-%d. If you would like to use dates in a different format, read "Changing Date Formats" below.

Importing Dates from Numeric Format

If you are importing data from Excel, you may have dates that are in a numeric format. We can still use as.Date to import these, we simply need to know the origin date that Excel starts counting from, and provide that to as.Date.

For Excel on Windows, the origin date is December 30, 1899 for dates after 1900. (Excel's designer thought 1900 was a leap year, but it was not.) For Excel on Mac, the origin date is January 1, 1904.

# from Windows Excel:
    dates <- c(30829, 38540)
    betterDates <- as.Date(dates,
        origin = "1899-12-30")

>   betterDates
[1] "1984-05-27" "2005-07-07"

# from Mac Excel:
    dates <- c(29367, 37078)
    betterDates <- as.Date(dates,
        origin = "1904-01-01")

>   betterDates
[1] "1984-05-27" "2005-07-07"

This outputs the dates in the ISO 8601 international standard format %Y-%m-%d. If you would like to use dates in a different format, read the next step:

Changing Date Formats

If you would like to use dates in a format other than the standard %Y-%m-%d, you can do that using the format function from the base package.

For example,

format(betterDates,
    "%a %b %d")

[1] "Sun May 27" "Thu Jul 07"

Correct Centuries

If you are importing data with only two digits for the years, you will find that it assumes that years 69 to 99 are 1969-1999, while years 00 to 68 are 2000--2068 (subject to change in future versions of R).

Often, this is not what you intend to have happen. This page gives a good explanation of several ways to fix this depending on your preference of centuries. One solution it provides is to assume all dates R is placing in the future are really from the previous century. That solution is as follows:

dates <- c("05/27/84", "07/07/05", "08/17/20")
betterDates <- as.Date(dates, "%m/%d/%y")

> betterDates
[1] "1984-05-27" "2005-07-07" "2020-08-17"

correctCentury <- as.Date(ifelse(betterDates > Sys.Date(), 
    format(betterDates, "19%y-%m-%d"), 
    format(betterDates)))

> correctCentury
[1] "1984-05-27" "2005-07-07" "1920-08-17"

Purpose of Proper Formatting

Having your dates in the proper format allows R to know that they are dates, and as such knows what calculations it should and should not perform on them. For one example, see my post on plotting weekly or monthly totals. Here are a few more examples:

>   mean(betterDates)
[1] "1994-12-16"

>   max(betterDates)
[1] "2005-07-07"

>   min(betterDates)
[1] "1984-05-27"

The code is available in a gist.

Date Formats

Conversion specification Description Example
%a Abbreviated weekday Sun, Thu
%A Full weekday Sunday, Thursday
%b or %h Abbreviated month May, Jul
%B Full month May, July
%d Day of the month 27, 07
01-31
%j Day of the year 148, 188
001-366
%m Month 05, 07
01-12
%U Week 22, 27
01-53
with Sunday as first day of the week
%w Weekday 0, 4
0-6
Sunday is 0
%W Week 21, 27
00-53
with Monday as first day of the week
%x Date, locale-specific
%y Year without century 84, 05
00-99
%Y Year with century 1984, 2005
on input:
00 to 68 prefixed by 20
69 to 99 prefixed by 19
%C Century 19, 20
%D Date formatted %m/%d/%y 05/27/84, 07/07/05
%u Weekday 7, 4
1-7
Monday is 1
%n Newline on output or
Arbitrary whitespace on input
%t Tab on output or
Arbitrary whitespace on input

References

Tagged , , , ,

Text to Columns in Stata

If you've ever used Excel's text to columns feature, you know how valuable it can be. If you haven't ever used text to columns, it allows you to take one column of data and separate it into multiple columns using delimiters that you provide. One time this is helpful is when converting data from other formats.

If you're learning Stata, you may wonder how you can gain this useful functionality. There are a few different ways, but for now we'll be discussing split.

For the following example, I have imported some patent data where the four most relevant primary patent classes for each observation are listed in a single column. These are delimited by a "/" as can be seen below.

Data before transformation

I would like each of these classes to be included in its own column. To do this, I give Stata the following command:

split class, parse(/) generate(class)

Stata command and feedback

In this command, the first class is the name of the variable I want to transform, / is the delimiter, and generate(class) lets Stata to know that I would like the names of the new variables to each be class followed by an integer. In the example, the most /'s there were in class was two, so three class[n] variables are created.

Data after transformation

I can then drop class if I want to remove the original class variable.

I could have also used the option destring if I wanted to treat the patent classes as numbers.

Stata documentation excerpt

More use cases are shown in the split documentation. One example they provide allows you to use multiple delimiters. In this instance showing how to separate the names of court cases even if some are delimited by "v." and some by "vs."

For more complex situations, you can also use regular expressions.

Tagged , , ,