Tagged with cleaning

Truncate by Delimiter in R

Sometimes, you only need to analyze part of the data stored as a vector. In this example, there is a list of patents. Each patent has been assigned to one or more patent classes. Let's say that we want to analyze the dataset based on only the first patent class listed for each patent.

patents <- data.frame(
    patent = 1:30,
    class = c("405", "33/209", "549/514", "110", "540", "43", 
    "315/327", "540", "536/514", "523/522", "315", 
    "138/248/285", "24", "365", "73/116/137", "73/200", 
    "252/508", "96/261", "327/318", "426/424/512", 
    "75/423", "430", "416", "536/423/530", "381/181", "4", 
    "340/187", "423/75", "360/392/G9B", "524/106/423"))

We can use regular expressions to truncate each element of the vector just before the first "/".

grep,grepl,sub,gsub,regexpr,gregexpr, and regexec are all functions in the base package that allow you to use regular expressions within each element of a character vector. sub and gsub allow you to replace within each element of the vector. sub replaces the first match within each element, while gsub replaces all matches within each element. In this case, we want to remove everything from the first "/" on, and we want to replace it with nothing. Here's how we can use sub to do that:

patents$primaryClass <- sub("/.*", "", patents$class)

> table(patents$primaryClass)

110 138  24 252 315 327  33 340 360 365 381   4 405 416 423 426  43 430 523 524 
  1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
536 540 549  73  75  96 
  2   2   1   2   1   1

This post is one part of my series on Text to Columns.

Citations and Further Reading

Tagged , , , , ,

Text to Columns in Stata

If you've ever used Excel's text to columns feature, you know how valuable it can be. If you haven't ever used text to columns, it allows you to take one column of data and separate it into multiple columns using delimiters that you provide. One time this is helpful is when converting data from other formats.

If you're learning Stata, you may wonder how you can gain this useful functionality. There are a few different ways, but for now we'll be discussing split.

For the following example, I have imported some patent data where the four most relevant primary patent classes for each observation are listed in a single column. These are delimited by a "/" as can be seen below.

Data before transformation

I would like each of these classes to be included in its own column. To do this, I give Stata the following command:

split class, parse(/) generate(class)

Stata command and feedback

In this command, the first class is the name of the variable I want to transform, / is the delimiter, and generate(class) lets Stata to know that I would like the names of the new variables to each be class followed by an integer. In the example, the most /'s there were in class was two, so three class[n] variables are created.

Data after transformation

I can then drop class if I want to remove the original class variable.

I could have also used the option destring if I wanted to treat the patent classes as numbers.

Stata documentation excerpt

More use cases are shown in the split documentation. One example they provide allows you to use multiple delimiters. In this instance showing how to separate the names of court cases even if some are delimited by "v." and some by "vs."

For more complex situations, you can also use regular expressions.

Tagged , , ,