What to do? Why? How?

在做研究的过程中，我们应该把更多的时间和精力放在高水平的认知加工上（例如，理解他人的研究内容、探索数据的特征、表达自己的研究发现），而一些机械性的、重复性的工作，则可以在电脑的协助下完成。

例如现在微软的Copilot（一个嵌入到Office软件的类似ChatGPT的AI语言模型）已经能够按照用户的要求自动地完成表格的制作、总结数据特征等等。这似乎已经暗示了未来的研究者做学术研究的模式：只要是电脑能做的就交给电脑来做，只有需要运用到人类智慧的事情才由人来完成。

不过在这类技术彻底普及之前，我们可以先学习使用编程工具来快速、高效地完成一些事情，逐渐地将自己的工作模式转变过来。作为例子，本文将介绍如何通过R语言中的table1函数自动地生成APA格式的描述统计表。

1 示例数据

先导入需要的包。

library(psych)
library(dplyr)
library(car)
library(table1)
library(flextable)

这里我们用到了psych包的名为bfi的数据集，该数据包括25道测量大五人格的题目，以及性别、年龄和教育水平。

data(bfi)

我们首先整理一下数据，包括：(1)去掉存在缺失值的数据（正常情况下应该进行插补，这里是为了方便），(2)计算每个维度的分数，(3)将性别、教育编码为分类变量，并对三个人口学变量进行重新命名，(4)只保留整理好的变量。

考虑到大多数研究会涉及到分组变量（例如，多类人群），我们也可以定义一个分组变量作为例子。这个数据集很有趣的一点是，数据中有不少还在读书的老年人，因此我们可以把年龄大于等于60岁的人划分为“elderly students”，18-60岁的人划分为“adult students”，18岁以下为“adolescent students”，作为一个分组变量。代码如下。

bfi <- na.omit(bfi)

bfi$Agreeableness <- bfi$A1 + bfi$A2 + bfi$A3 + bfi$A4 + bfi$A5
bfi$Conscientiousness <- bfi$C1 + bfi$C2 + bfi$C3 + bfi$C4 + bfi$C5
bfi$Extraversion <- bfi$E1 + bfi$E2 + bfi$E3 + bfi$E4 + bfi$E5
bfi$Neuroticism <- bfi$N1 + bfi$N2 + bfi$N3 + bfi$N4 + bfi$N5
bfi$Openness <- bfi$O1 + bfi$O2 + bfi$O3 + bfi$O4 + bfi$O5

bfi <- bfi %>% mutate(Gender = case_when(gender == 1 ~ "Male", TRUE ~ "Female"))
bfi <- bfi %>% mutate(Education = case_when(
  education == 1 ~ "HS", 
  education == 2 ~ "finished HS", 
  education == 3 ~ "some college", 
  education == 4 ~ "college graduate", 
  education == 5 ~ "graduate degree", 
  TRUE ~ "Error"))
bfi$Gender <- factor(bfi$Gender)
bfi$Education <- factor(bfi$Education)
bfi$Age <- bfi$age

bfi <- bfi %>% mutate(Group = case_when(
  Age >= 60 ~ "Elderly students",
  Age >= 18 & Age < 60  ~ "Adult students",
  Age < 18 ~ "Adolescent students",
  TRUE ~ "Error"))
bfi$Group <- factor(bfi$Group)

bfi <- subset(bfi, select=c(Agreeableness, 
                            Conscientiousness,
                            Extraversion,
                            Neuroticism,
                            Openness,
                            Gender,
                            Education,
                            Age,
                            Group))

整理好的数据如下。

2 生成描述统计表格

接下来通过table1包中的table1函数生成描述统计表格。需要报告描述统计结果的变量放在~右边，通过+分开，分组变量则放在|之后。其实这个写法类似于我们用R语言做各种统计模型时的formula。

tab <- table1(~ 
                Age +
                Gender +
                Education +
                Agreeableness + 
                Conscientiousness + 
                Extraversion + 
                Neuroticism +
                Openness 
              | Group, 
              data=bfi)
              
tab

然后你就会在RStudio的Viewer界面看到生成的表格，如下。

3 美化表格

到这一步基本就没问题了。不过做研究总是要精益求精，仔细看看生成的表格，就会发现有不少地方需要修改：(1)小数点的位数不一致，有的是保留一位小数，有的则是保留两位小数；(2)除了均值和标准差，还存在中位数和最大最小值，这些额外的指标可以考虑去掉；(3)一般描述统计表中都会有组间比较的结果，但目前的表格没有。

前两个问题可以通过添加自定义的fucntions来解决。其中my.render.cate负责保留分类变量（category variable）的两位小数，my.render.cont负责保留连续变量（continuous variable）的两位小数。去掉中位数和最大最小值的方法就是在用my.render.cont来render的时候只设置MEAN和SD。

# a function to round the values of category variables
my.render.cate <- function(x) {
  sub('.', '.', c("", sapply(stats.default(x),
                             function(y) with(y, sprintf("%d (%0.2f %%)", 
                                                         FREQ, PCT)))), fixed=T)
}

# a function to round the values of continuous variables
my.render.cont <- function(x) {
  with(stats.default(x),
       sprintf("%0.2f (%0.2f)", MEAN, SD))
}

我们可以再自定义一个function，用来计算组间差异的p值，其中分类变量用卡方检验，连续变量用方差分析，计算完成后提取p值并进行格式化（保留三位小数，小于0.001时用<0.0001替换原始值，并标注星号以突出显著的结果）。

# a function to provide p-values that indicate differences among groups
pvalue <- function(x, ...) {
  x <- x[-length(x)]  # Remove "overall" group
  # Construct vectors of data y, and groups (strata) g
  y <- unlist(x)
  g <- factor(rep(1:length(x), times=sapply(x, length)))
  if (is.numeric(y)) {
    # For numeric variables, perform an ANOVA
    p <- summary(aov(y ~ g))[[1]][["Pr(>F)"]][1]
  } else {
    # For categorical variables, perform a chi-squared test of independence
    p <- chisq.test(table(y, g))$p.value
  }
  # Format the p-value
  if (p < 0.001) {
    p = '< 0.001***'
  } else if (p < 0.01) {
    p = sprintf("%.3f**", p)
  } else if (p < 0.05) {
    p = sprintf("%.3f*", p)
  } else {
    p = sprintf("%.3f", p)
  }
  
  return(p)
}

然后只需要在table1函数中引用我们自定义的三个函数。

tab <- table1(~ 
                Age +
                Gender +
                Education +
                Agreeableness + 
                Conscientiousness + 
                Extraversion + 
                Neuroticism +
                Openness 
              | Group, 
              data=bfi, 
              render.continuous=my.render.cont,
              render.categorical=my.render.cate,
              extra.col=list(`P-value`=pvalue))
              
tab

修改后的表格如下。

4 导出表格

其实在Viewer里就可以导出。这里提供一个通过代码导出表格的方法，主要是通过flextable包中的t1flex函数。我一般喜欢导出为html格式，因为看起来和Viewer里的表格一致，并且这个html文件也可以右键选择用Word打开进行复制或编辑。

t1flex(tab) %>% save_as_html(path="description_table.html")

5 小结

所有代码如下。

library(psych)
library(dplyr)
library(car)
library(table1)
library(flextable)

# import data ------------------------------------------------------------------
data(bfi)

# re-code ----------------------------------------------------------------------
bfi <- na.omit(bfi)

bfi$Agreeableness <- bfi$A1 + bfi$A2 + bfi$A3 + bfi$A4 + bfi$A5
bfi$Conscientiousness <- bfi$C1 + bfi$C2 + bfi$C3 + bfi$C4 + bfi$C5
bfi$Extraversion <- bfi$E1 + bfi$E2 + bfi$E3 + bfi$E4 + bfi$E5
bfi$Neuroticism <- bfi$N1 + bfi$N2 + bfi$N3 + bfi$N4 + bfi$N5
bfi$Openness <- bfi$O1 + bfi$O2 + bfi$O3 + bfi$O4 + bfi$O5

bfi <- bfi %>% mutate(Gender = case_when(gender == 1 ~ "Male", TRUE ~ "Female"))
bfi <- bfi %>% mutate(Education = case_when(
  education == 1 ~ "HS", 
  education == 2 ~ "finished HS", 
  education == 3 ~ "some college", 
  education == 4 ~ "college graduate", 
  education == 5 ~ "graduate degree", 
  TRUE ~ "Error"))
bfi$Gender <- factor(bfi$Gender)
bfi$Education <- factor(bfi$Education)
bfi$Age <- bfi$age

bfi <- bfi %>% mutate(Group = case_when(
  Age >= 60 ~ "Elderly students",
  Age >= 18 & Age < 60  ~ "Adult students",
  Age < 18 ~ "Adolescent students",
  TRUE ~ "Error"))
bfi$Group <- factor(bfi$Group)

bfi <- subset(bfi, select=c(Agreeableness, 
                            Conscientiousness,
                            Extraversion,
                            Neuroticism,
                            Openness,
                            Gender,
                            Education,
                            Age,
                            Group))

# functions --------------------------------------------------------------------
# a function to round the values of category variables
my.render.cate <- function(x) {
  sub('.', '.', c("", sapply(stats.default(x),
                             function(y) with(y, sprintf("%d (%0.2f %%)", 
                                                         FREQ, PCT)))), fixed=T)
}

# a function to round the values of continuous variables
my.render.cont <- function(x) {
  with(stats.default(x),
       sprintf("%0.2f (%0.2f)", MEAN, SD))
}

# a function to provide p-values that indicate differences among groups
pvalue <- function(x, ...) {
  x <- x[-length(x)]  # Remove "overall" group
  # Construct vectors of data y, and groups (strata) g
  y <- unlist(x)
  g <- factor(rep(1:length(x), times=sapply(x, length)))
  if (is.numeric(y)) {
    # For numeric variables, perform an ANOVA
    p <- summary(aov(y ~ g))[[1]][["Pr(>F)"]][1]
  } else {
    # For categorical variables, perform a chi-squared test of independence
    p <- chisq.test(table(y, g))$p.value
  }
  # Format the p-value
  if (p < 0.001) {
    p = '< 0.001***'
  } else if (p < 0.01) {
    p = sprintf("%.3f**", p)
  } else if (p < 0.05) {
    p = sprintf("%.3f*", p)
  } else {
    p = sprintf("%.3f", p)
  }
  
  return(p)
}

# generate the table
tab <- table1(~ 
                Age +
                Gender +
                Education +
                Agreeableness + 
                Conscientiousness + 
                Extraversion + 
                Neuroticism +
                Openness 
              | Group, 
              data=bfi, 
              render.continuous=my.render.cont,
              render.categorical=my.render.cate,
              extra.col=list(`P-value`=pvalue))
              
tab

t1flex(tab) %>% save_as_html(path="description_table.html")

如果你将代码应用到自己的数据但是出错了，其中一个可能的原因是变量类型没有定义好，因为三个自定义函数都涉及到特定的变量类型。如果是其它原因或者你想要进一步修改代码，可以自己去学习table1的tutorial，或者问问ChatGPT。

总结

当然，SPSS也可以将结果表设置为APA格式的三线表（虽然似乎很多人不知道），但如果主要通过R语言完成数据分析，则table1其实是不错的选择。

table1的优点包括：(1)表格的自定义程度较高，调整好参数的表格甚至可以直接放到论文中使用，(3)只需要稍微修改代码就可以快速地应用到其它数据。

table1的缺点包括：(1)需要具备R语言基础，(2)实在是太高效了，所以要小心不能被老板看到，不然就不能安心摸鱼了。

【R】通过table1函数自动生成APA格式的描述统计表