Comparing Transformation Styles: attach, transform, mutate and within
Posted on January 22, 2013 by Bob Muenchen
There are several ways to perform data transformations in R. Each has its own set of advantages and disadvantages. Let’s take one variable, square it and add 100. How many ways might an R beginner screw up such a simple computation? Quite a few!
Here’s a data frame with one variable:
> mydata <- data.frame(x = 1:5)
> mydata
x
1 1
2 2
3 3
4 4
5 5
Since the variable x exists only in mydata, to transform x, I must somehow tell R it is stored in mydata. The simplest way to do that is using dollar format: mydata$x. I’ll make a copy of the data first so we can do the transformation several ways:
> mydata.new <- mydata
> mydata.new$x2 <- mydata.new$x ^ 2
> mydata.new$x3 <- mydata.new$x2 + 100
> mydata.new
x x2 x3
1 1 1 101
2 2 4 104
3 3 9 109
4 4 16 116
5 5 25 125
That works, but I had to type more characters for the “mydata.new” part than I did for the transformation itself. So let’s look at approaches that save us that trouble. One widely used approach is to use the attach function. This function makes a copy of a data frame’s variables in a temporary area that is attached to your search path as separate variables or vectors. That’s nice because you can refer to them simply by their names like “x” instead of “mydata$x”. However, the attach function is tricky to use. Here’s the most common mistake made by beginners:
> mydata.new <- mydata
> attach(mydata.new)
> x2 <- x ^ 2
> x3 <- x2 + 100
> mydata.new
x
1 1
2 2
3 3
4 4
5 5
There are no error messages, but the variables are not in the data frame! The attach function allows you to use short names to refer to variables in a data frame, but it does not change where new variables are written. So x2 and x3 are simply in my workspace:
> ls()
[1] "mydata" "mydata.new" "x2" "x3"
> x2; x3
[1] 1 4 9 16 25
[1] 101 104 109 116 125
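The same check can be scripted rather than read off the console. This is only a minimal sketch: the data frame name df and the flags in_data_frame and in_workspace are hypothetical names chosen for illustration, and it assumes the code is run at the top level of a fresh session.

```r
# Sketch: where does a variable created after attach() end up?
df <- data.frame(x = 1:5)   # hypothetical stand-in for mydata.new

attach(df)
x2 <- x ^ 2                 # x is found through the search path
detach(df)

# The new variable never reached the data frame...
in_data_frame <- "x2" %in% names(df)
# ...it sits in the current workspace instead.
in_workspace <- "x2" %in% ls()

in_data_frame   # FALSE
in_workspace    # TRUE
```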
I’ll fix that, but first I’ll remove x2 and x3 from the workspace and detach mydata.new so we can start fresh.
> rm(x2, x3)
> detach(mydata.new)
We can fix this problem by directing new variables into the data frame using dollar format. So here’s the next thing a beginner is likely to try:
> mydata.new <- mydata
> attach(mydata.new)
> mydata.new$x2 <- x ^ 2
> mydata.new$x3 <- x2 + 100
Error: object 'x2' not found
> detach(mydata.new)
The variable x2 got created and put into mydata.new. However, when the attempt to create x3 was run, variable x2 could not be found. That is because the attached version of the data is a copy made when attach was called; it is not a live connection. Therefore, to refer to simply “x2” you would have to attach mydata.new again. You could also get around this problem by using dollar format in the second equation:
> attach(mydata.new)
> mydata.new$x2 <- x ^ 2
> mydata.new$x3 <- mydata.new$x2 + 100
> mydata.new
x x2 x3
1 1 1 101
2 2 4 104
3 3 9 109
4 4 16 116
5 5 25 125
> detach(mydata.new)
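To see that the attached data really is a snapshot, one can change the data frame after attaching it and then compare the two versions. The following is a small sketch under that assumption; df and attached_x are hypothetical names used only for this illustration.

```r
# Sketch: the attached copy does not follow later changes to the data frame.
df <- data.frame(x = 1:5)   # hypothetical example data

attach(df)
df$x <- df$x * 10           # modify the data frame after attaching it
attached_x <- x             # the short name still resolves to the attached copy
detach(df)

attached_x   # 1 2 3 4 5
df$x         # 10 20 30 40 50
```

The short name x keeps returning the values as they were when attach was called, which is why x2 could not be found in the earlier example.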
That worked, but having to keep track of when you do and don’t need dollar format seems more trouble than it’s worth. In addition, the fact that attach actually makes a copy of the data means that it wastes both time and memory.
The transform function lets you use short variable names on both sides of the equation, and it does not need to make a copy of the data set. Let’s just square x to see how it works.
> mydata.new <- transform(mydata, x2 = x ^ 2)
> mydata.new
x x2
1 1 1
2 2 4
3 3 9
4 4 16
5 5 25
Notice that when calling the transform function, new variable names like x2 are actually the names of arguments, and the formulas are the values of those arguments. As a result, the equals sign is used instead of the assignment operator “<-”.
Eliminating the tedious repetition of “mydata$…” makes the formulas easier to enter, read and debug. However, the transform function has a problem: it is unable to use a variable that it just created. For example:
> mydata.new <- transform(mydata,
+ x2 = x ^ 2,
+ x3 = x2 + 100 )
Error in eval(expr, envir, enclos) : object 'x2' not found
We see that when attempting to create x3 from x2, the variable x2 is not found. It will not exist until the call to transform is complete. In our simple example, x2 may be merely an intermediate step, and we could avoid this problem by calculating x3 directly with one formula: x3 = (x ^ 2) + 100. However, if we really need x2 to exist later as a variable, we would have to run transform twice, once to create x2 and again to create x3 from it.
In the above code, note the comma between the two equations. Since transform takes the equations as the values of its arguments, every equation must be followed by a comma except the last one, which is followed by the final close parenthesis.
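The two-step workaround mentioned earlier, running transform once to create x2 and again to create x3, might look like the sketch below. The intermediate name step1 is a hypothetical choice; the rest uses only the data and variable names from the examples above.

```r
# Sketch: call transform twice so that x2 exists before x3 is computed.
mydata <- data.frame(x = 1:5)

step1      <- transform(mydata, x2 = x ^ 2)      # first pass creates x2
mydata.new <- transform(step1,  x3 = x2 + 100)   # second pass can now see x2

mydata.new
```

Both x2 and x3 survive in the result, which is the point of splitting the call when x2 is needed later.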
Hadley Wickham’s dplyr package has a very useful function, mutate. It’s very similar to the base transform function but it can use variables that it just created:
> library("dplyr")
> mydata.new <- mutate(mydata,
+ x2 = x ^ 2,
+ x3 = x2 + 100
+ )
> mydata.new
x x2 x3
1 1 1 101
2 2 4 104
3 3 9 109
4 4 16 116
5 5 25 125
However, mutate does have a limitation, at least in the version that produced the output below: it cannot re-create a variable that it just created. So you can use its new variables only on the right-hand side of your equations. In this next example, rather than create x3, I’ll continue to use the name x2:
> mydata.new <- mutate(mydata,
+ x2 = x ^ 2,
+ x2 = x2 + 100)
> mydata.new
x x2
1 1 1
2 2 4
3 3 9
4 4 16
5 5 25
As you can see, mutate kept only the first transformation to x2, ignoring the addition of 100. You might think that reusing the same variable name would be a rare occurrence, but if you recode a variable with a series of separate ifelse calls, each one overwriting the result of the last (an inefficient approach), this situation can arise often. (Avoid that by nesting multiple calls to ifelse, which is also more efficient.)
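As an illustration of the nested ifelse advice, here is a minimal sketch of a recode done in a single expression, so the new variable is assigned only once. The variable name size, the cut points and the labels are all hypothetical; base transform is used so the example does not depend on any package.

```r
# Sketch: recode x into three labels with one nested ifelse call,
# instead of overwriting the same variable in several steps.
mydata <- data.frame(x = 1:5)

mydata.new <- transform(mydata,
  size = ifelse(x <= 2, "low",
         ifelse(x <= 4, "mid", "high")))

mydata.new
```

Because size appears only once on the left-hand side, the question of reusing a variable that was just created never comes up.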
Finally, we come to the within function. It uses variables by their short names, saves new variables inside the data frame using short names, and it allows you to use new variables anywhere in calculations. It is built into base R, and it works like this:
> mydata.new <- within(mydata, {
+ x2 <- x ^ 2
+ x3 <- x2 + 100
+ } )
> mydata.new
x x3 x2
1 1 101 1
2 2 104 4
3 3 109 9
4 4 116 16
5 5 125 25
Notice that we’re back to using the assignment operator “<-” and commas are not used between formulas. Multiple formulas must be enclosed in {braces}. Also note that the variables appear in the data frame in reverse order. Variable x3 appears before x2, even though the formula for x2 appeared first.
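If the reversed column order is a nuisance, one possible fix is to reorder the columns by name afterwards. This is just a sketch of that idea using ordinary data frame indexing; the original post does not prescribe it.

```r
# Sketch: put the columns created by within() back in the order they were written.
mydata <- data.frame(x = 1:5)

mydata.new <- within(mydata, {
  x2 <- x ^ 2
  x3 <- x2 + 100
})
order_before <- names(mydata.new)   # "x" "x3" "x2"

# Select the columns by name in the desired order.
mydata.new <- mydata.new[, c("x", "x2", "x3")]
names(mydata.new)                   # "x" "x2" "x3"
```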
When I reuse the variable name x2 rather than create a new variable, x3, I still get the right answer:
> mydata.new <- within(mydata, {
+ x2 <- x ^ 2
+ x2 <- x2 + 100
+ } )
> mydata.new
x x2
1 1 101
2 2 104
3 3 109
4 4 116
5 5 125
Since the within function does this example so well, why use anything else? The mutate function shares syntax with dplyr’s summarise function, and their combination provides great flexibility when doing transformations or getting summary statistics by groups. Because of this, I use mutate for this type of task and remember not to transform a variable that I just created!
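To make the by-group point concrete, here is a hedged sketch of mutate and summarise applied to the same grouped data. It assumes the dplyr package is installed; the data frame scores, its group column and the new variable names are hypothetical and not from the original post.

```r
# Sketch: mutate and summarise share the same name = expression syntax
# and both respect groups set up with group_by.
library(dplyr)

scores <- data.frame(group = c("a", "a", "b", "b", "b"),
                     x     = 1:5)   # hypothetical example data

by_group <- group_by(scores, group)

# mutate keeps every row and adds a group-wise transformation.
centered <- mutate(by_group, x_centered = x - mean(x))

# summarise collapses each group to one row of summary statistics.
group_means <- summarise(by_group, mean_x = mean(x))

centered
group_means
```

Each new variable is given its own name here, in keeping with the habit of not transforming a variable that was just created.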
That covers the main ways to transform variables in R. I hope that by understanding the limitations of each, you’ll avoid common pitfalls and be a more productive R user.
Reposted from: http://r4stats.com/2013/01/22/comparing-tranformation-styles/