最近有个培训班的小伙伴遇到了这样的问题,他想对字符串变量进行分组求和与分组累加,但是他不知道该如何实现,今天我们就来一起看一下。
Stata 的实现
首先我们先生成一份示例数据:
* 生成示例数据
clear
input int group int x str1 y
1 12 "A"
1 2 "B"
1 35 "C"
1 56 "D"
1 34 "E"
1 24 "F"
1 87 "G"
2 1 "a"
2 36 "b"
2 78 "c"
2 34 "d"
2 12 "e"
2 3 "f"
2 0 "g"
end
list, sep(7)
*> +----------------+
*> | group x y |
*> |----------------|
*> 1. | 1 12 A |
*> 2. | 1 2 B |
*> 3. | 1 35 C |
*> 4. | 1 56 D |
*> 5. | 1 34 E |
*> 6. | 1 24 F |
*> 7. | 1 87 G |
*> |----------------|
*> 8. | 2 1 a |
*> 9. | 2 36 b |
*> 10. | 2 78 c |
*> 11. | 2 34 d |
*> 12. | 2 12 e |
*> 13. | 2 3 f |
*> 14. | 2 0 g |
*> +----------------+
其中 group 变量是分组变量,x 是数值型变量,y 是字符型变量。
对于数值型变量,egen 的 sum() 函数可以实现分组求和的功能:
bysort group: egen xsum = sum(x)
list, sep(7)
*> +-----------------------+
*> | group x y xsum |
*> |-----------------------|
*> 1. | 1 12 A 250 |
*> 2. | 1 2 B 250 |
*> 3. | 1 35 C 250 |
*> 4. | 1 56 D 250 |
*> 5. | 1 34 E 250 |
*> 6. | 1 24 F 250 |
*> 7. | 1 87 G 250 |
*> |-----------------------|
*> 8. | 2 1 a 164 |
*> 9. | 2 36 b 164 |
*> 10. | 2 78 c 164 |
*> 11. | 2 34 d 164 |
*> 12. | 2 12 e 164 |
*> 13. | 2 3 f 164 |
*> 14. | 2 0 g 164 |
*> +-----------------------+
gen 的 sum() 函数可以实现分组累加的功能:
bysort group: gen xcumsum = sum(x)
list, sep(7)
*> +---------------------------------+
*> | group x y xsum xcumsum |
*> |---------------------------------|
*> 1. | 1 12 A 250 12 |
*> 2. | 1 2 B 250 14 |
*> 3. | 1 35 C 250 49 |
*> 4. | 1 56 D 250 105 |
*> 5. | 1 34 E 250 139 |
*> 6. | 1 24 F 250 163 |
*> 7. | 1 87 G 250 250 |
*> |---------------------------------|
*> 8. | 2 1 a 164 1 |
*> 9. | 2 36 b 164 37 |
*> 10. | 2 78 c 164 115 |
*> 11. | 2 34 d 164 149 |
*> 12. | 2 12 e 164 161 |
*> 13. | 2 3 f 164 164 |
*> 14. | 2 0 g 164 164 |
*> +---------------------------------+
但是上面两个都不能应用与字符串变量,但是字符串变量的加法在 Stata 中是可行的,例如:
gen yy = y + y
list, sep(7)
*> +--------------------------------------+
*> | group x y xsum xcumsum yy |
*> |--------------------------------------|
*> 1. | 1 12 A 250 12 AA |
*> 2. | 1 2 B 250 14 BB |
*> 3. | 1 35 C 250 49 CC |
*> 4. | 1 56 D 250 105 DD |
*> 5. | 1 34 E 250 139 EE |
*> 6. | 1 24 F 250 163 FF |
*> 7. | 1 87 G 250 250 GG |
*> |--------------------------------------|
*> 8. | 2 1 a 164 1 aa |
*> 9. | 2 36 b 164 37 bb |
*> 10. | 2 78 c 164 115 cc |
*> 11. | 2 34 d 164 149 dd |
*> 12. | 2 12 e 164 161 ee |
*> 13. | 2 3 f 164 164 ff |
*> 14. | 2 0 g 164 164 gg |
*> +--------------------------------------+
那我们如何在字符串变量上实现分组求和以及分组累加的功能呢?
可以通过设计下面的命令实现:
- 实现 egen 的 sum() 功能,也就是分组求和:
gen ysum = ""
levelsof group, local(group)
foreach i in `group' {
di "`i'"
local `i' = ""
forval j = 1/`=_N'{
if `=group[`j']' == `i'{
local `i' = "``i''`=y[`j']'"
}
}
replace ysum = "``i''" if group == `i'
}
list, sep(7)
*> +------------------------------------------------+
*> | group x y xsum xcumsum yy ysum |
*> |------------------------------------------------|
*> 1. | 1 12 A 250 12 AA ABCDEFG |
*> 2. | 1 2 B 250 14 BB ABCDEFG |
*> 3. | 1 35 C 250 49 CC ABCDEFG |
*> 4. | 1 56 D 250 105 DD ABCDEFG |
*> 5. | 1 34 E 250 139 EE ABCDEFG |
*> 6. | 1 24 F 250 163 FF ABCDEFG |
*> 7. | 1 87 G 250 250 GG ABCDEFG |
*> |------------------------------------------------|
*> 8. | 2 1 a 164 1 aa abcdefg |
*> 9. | 2 36 b 164 37 bb abcdefg |
*> 10. | 2 78 c 164 115 cc abcdefg |
*> 11. | 2 34 d 164 149 dd abcdefg |
*> 12. | 2 12 e 164 161 ee abcdefg |
*> 13. | 2 3 f 164 164 ff abcdefg |
*> 14. | 2 0 g 164 164 gg abcdefg |
*> +------------------------------------------------+
- 实现 gen 的 sum() 功能,也就是分组累加:
gen ycumsum = ""
levelsof group, local(group)
foreach i in `group' {
local `i' = ""
forval j = 1/`=_N'{
if `=group[`j']' == `i'{
local `i' = "``i''`=y[`j']'"
}
replace ycumsum = "``i''" if group == `i' & _n == `j'
}
}
list, sep(7)
*> +----------------------------------------------------------+
*> | group x y xsum xcumsum yy ysum ycumsum |
*> |----------------------------------------------------------|
*> 1. | 1 12 A 250 12 AA ABCDEFG A |
*> 2. | 1 2 B 250 14 BB ABCDEFG AB |
*> 3. | 1 35 C 250 49 CC ABCDEFG ABC |
*> 4. | 1 56 D 250 105 DD ABCDEFG ABCD |
*> 5. | 1 34 E 250 139 EE ABCDEFG ABCDE |
*> 6. | 1 24 F 250 163 FF ABCDEFG ABCDEF |
*> 7. | 1 87 G 250 250 GG ABCDEFG ABCDEFG |
*> |----------------------------------------------------------|
*> 8. | 2 1 a 164 1 aa abcdefg a |
*> 9. | 2 36 b 164 37 bb abcdefg ab |
*> 10. | 2 78 c 164 115 cc abcdefg abc |
*> 11. | 2 34 d 164 149 dd abcdefg abcd |
*> 12. | 2 12 e 164 161 ee abcdefg abcde |
*> 13. | 2 3 f 164 164 ff abcdefg abcdef |
*> 14. | 2 0 g 164 164 gg abcdefg abcdefg |
*> +----------------------------------------------------------+
这样就完成了!
R 语言的实现
使用 R 语言完成这个工作更容易些,首先我们读取上面生成的示例数据:
library(tidyverse)
haven::read_dta('strsum.dta') -> df
df
#> # A tibble: 14 x 3
#> group x y
#> <dbl> <dbl> <chr>
#> 1 1 12 A
#> 2 1 2 B
#> 3 1 35 C
#> 4 1 56 D
#> 5 1 34 E
#> 6 1 24 F
#> 7 1 87 G
#> 8 2 1 a
#> 9 2 36 b
#> 10 2 78 c
#> 11 2 34 d
#> 12 2 12 e
#> 13 2 3 f
#> 14 2 0 g
x 变量是数值型变量,分组求和或累加都很容易:
df %>%
group_by(group) %>%
mutate(xsum = sum(x)) %>%
mutate(xcumsum = cumsum(x)) %>%
ungroup()
#> # A tibble: 14 x 5
#> group x y xsum xcumsum
#> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 12 A 250 12
#> 2 1 2 B 250 14
#> 3 1 35 C 250 49
#> 4 1 56 D 250 105
#> 5 1 34 E 250 139
#> 6 1 24 F 250 163
#> 7 1 87 G 250 250
#> 8 2 1 a 164 1
#> 9 2 36 b 164 37
#> 10 2 78 c 164 115
#> 11 2 34 d 164 149
#> 12 2 12 e 164 161
#> 13 2 3 f 164 164
#> 14 2 0 g 164 164
那么对于字符串变量 y 该如何分组求和(连接)或者累加(逐次连接)呢?
根据我的经验,分组求和很简单:
df %>%
group_by(group) %>%
mutate(ysum = paste0(y, collapse = ""))
#> # A tibble: 14 x 4
#> # Groups: group [2]
#> group x y ysum
#> <dbl> <dbl> <chr> <chr>
#> 1 1 12 A ABCDEFG
#> 2 1 2 B ABCDEFG
#> 3 1 35 C ABCDEFG
#> 4 1 56 D ABCDEFG
#> 5 1 34 E ABCDEFG
#> 6 1 24 F ABCDEFG
#> 7 1 87 G ABCDEFG
#> 8 2 1 a abcdefg
#> 9 2 36 b abcdefg
#> 10 2 78 c abcdefg
#> 11 2 34 d abcdefg
#> 12 2 12 e abcdefg
#> 13 2 3 f abcdefg
#> 14 2 0 g abcdefg
分组累加我们先找一个向量试试:
letters
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r"
#> [19] "s" "t" "u" "v" "w" "x" "y" "z"
letters2 <- c()
for (i in 1:length(letters)) {
letters2[i] <- paste0(letters[1:i], collapse = "")
}
letters2
#> [1] "a" "ab"
#> [3] "abc" "abcd"
#> [5] "abcde" "abcdef"
#> [7] "abcdefg" "abcdefgh"
#> [9] "abcdefghi" "abcdefghij"
#> [11] "abcdefghijk" "abcdefghijkl"
#> [13] "abcdefghijklm" "abcdefghijklmn"
#> [15] "abcdefghijklmno" "abcdefghijklmnop"
#> [17] "abcdefghijklmnopq" "abcdefghijklmnopqr"
#> [19] "abcdefghijklmnopqrs" "abcdefghijklmnopqrst"
#> [21] "abcdefghijklmnopqrstu" "abcdefghijklmnopqrstuv"
#> [23] "abcdefghijklmnopqrstuvw" "abcdefghijklmnopqrstuvwx"
#> [25] "abcdefghijklmnopqrstuvwxy" "abcdefghijklmnopqrstuvwxyz"
这样我们就可以自己编写一个函数:
strcumsum <- function(x){
c <- c()
for (i in 1:length(x)) {
c[i] <- paste0(x[1:i], collapse = "")
}
return(c)
}
strcumsum(letters)
然后就可以用这个函数对字符串变量进行分组累加了:
df %>%
group_by(group) %>%
mutate(ysum = paste0(y, collapse = "")) %>%
mutate(ycumsum = strcumsum(y))
#> # A tibble: 14 x 5
#> # Groups: group [2]
#> group x y ysum ycumsum
#> <dbl> <dbl> <chr> <chr> <chr>
#> 1 1 12 A ABCDEFG A
#> 2 1 2 B ABCDEFG AB
#> 3 1 35 C ABCDEFG ABC
#> 4 1 56 D ABCDEFG ABCD
#> 5 1 34 E ABCDEFG ABCDE
#> 6 1 24 F ABCDEFG ABCDEF
#> 7 1 87 G ABCDEFG ABCDEFG
#> 8 2 1 a abcdefg a
#> 9 2 36 b abcdefg ab
#> 10 2 78 c abcdefg abc
#> 11 2 34 d abcdefg abcd
#> 12 2 12 e abcdefg abcde
#> 13 2 3 f abcdefg abcdef
#> 14 2 0 g abcdefg abcdefg