写在前面。
新手小白,疏漏在所难免,如果您看完有所指正,那我会很感谢您。如果您看完有所收获,那是我的荣幸。
这个宏和之前MeanT宏在编写思路上没有什么本质上的区别。因此本文的很多叙述会省略简化。
(长文慎入)SAS编程实践---宏:从头手搓一个生成”基本描述性统计量表格“的宏
MeanT
只有一个试验组的分组变量
,而本文的MeanbyT
宏有多个分组变量
,因此需要进行分组宏变量的处理
,同时目标表格的样式
也不同,在统计后步骤也需要处理。
还是坚持编写宏程序的总体结构
,按照这个框架来组织代码。
* _1. pre-processing;
* _2. main statistical steps;
* _3. processing step of stat;
* _4. output steps;
- 预处理(宏变量的处理和输入数据集的预处理)
- 主要的统计步骤
- 统计步骤后的处理
- 输出步骤
准备步骤
目标表格拆解分析
先对目标表格进行拆解,在图中对于各部分对应的宏参数
进行了标注。
可以和后面注释中对于宏参数的说明
进行对应。
模拟数据生成
分析数据集
生成分析数据集模拟数据
的代码如下,
%let seed1 = 666666;
data ad0 ;
do i = 1 to 3;
armn = i;
do j = 1 to 50;
usubjid = cats("X","-", put(i, best.) ,"-", put(j, z3.) );
do k = 1 to 3;
avisitn = k;
avisit = cats("访视",strip(put(k, best.)));
do h = 1 to 2;
parcat1n = h;
parcat1 = cats("分类",strip(put(h, best.)));
do m = 1 to 5;
PARAMN = input( cats( put(PARCAT1n, best.) , put( m, z2.) ) , best.);
param = cats("项目",strip( cats(put(h, best.), put(m, Z2.))) );
aval = round(ranuni(&seed1.)*100 , 0.01);
output;
end;
end;
end;
end;
end;
run;
data ad;
set ad0;
if _N_ in (456 35 1982 12 2345 1117 2345 673 2676 290) then delete;
if _N_ in (1456 325 1282 127 347 2645 3676 4290) then aval = .;
run;
同时人为地构造了缺失值
和缺失观测
,这些缺失情况在实际数据中都是有可能存在,同时也是为了体现adsl
在计算nmiss
时产生的作用。
adsl数据集
data adsl;
do i = 1 to 3;
armn = i;
do j = 1 to 50;
usubjid = cats("X","-", put(i, best.) ,"-", put(j, z3.) );
output;
end;
end;
run;
宏编写步骤
本文的写作顺序
和宏代码的顺序一致
,线型叙述。
注释
之前的文章中分享的宏都没有展示注释
部分,其实,比较好的习惯的还是要有注释的,且在其中把宏的作用
和宏参数的形式作用
说清楚的。
Purpose:统计多个分组变量下的某一数值型变量的描述性统计量;
libin:输入数据集所在逻辑库,如work;
dtin:用于分析的输入数据集,可以跟筛选条件,也可以提前处理好,如adlb(where = FASFL = "是");
adsl:adsl数据集,用于确定受试者数量,主要用于计算nmiss;
var:待分析的数值型变量;
grpvarn:受试者实验分组变量,数值型;
byvarc:多个分组变量,字符型,中间用|隔开,如AVISIT|PARAM;
byvarn:多个分组变量,数值型,中间用|隔开,需要与byvarc一一对应,且顺序一致,如AVISITN|PARAMN;
rowsumyn:指定是否计算合计列,取值限定为Y或者N;
libout:输出数据集所在逻辑库,如work;
dtout:输出数据集名称,如Table1;
宏变量的处理
我会把输入数据集
和adsl
数据集输入宏之后都复制给新的临时数据集
,只是个人习惯,要说好处,更多是为了溯源出错
的地方。
* _1. pre-processing;
data _stdt0;
set &libin..&dtin.;
run;
data _adsl;
set &adsl.;
run;
小数点位数处理
小数点位数处理的原则
和上文原则一样,
-
n
和nmissing
都是整数部分;
min
和max
指标保留位数,和原始数据最大小数位数一致
;- 当原始数据最大小数位数
(decmax)
小于等于2
时:mean、median、q1
和q3
保留decmax + 1
位;std
保留decmax + 2
位;- 当原始数据最大小数位数
(decmax)
等于3
时:mean、median、q1
和q3
保留decmax + 1
位;std
保留4
位;- 当原始数据最大小数位数
(decmax)
大于3
时:mean、median、q1、q3
和std
均保留4
位;
* _1.1 decimals;
data _dec;
set _stdt0;
dec = length(scan(strip(put(&var., best.)),2,"."));
if not index(&var.,".") then
dec = 0;
run;
proc sql;
select max(dec) into: decmax
from _dec
;
quit;
%put 最大小数位数:&decmax.;
分层统计宏变量的处理
分层变量
对应目标表格
前几列的那些指标,如avisitn
,但是输入时是|
分割的,无法直接用于排序或者统计步骤中,需要进行处理。
我的常规做法是进行替换
,将|
替换为空格
。
* _1.2 byvar;
%let byvcn = %sysfunc(countw( &byvarc. , %str(|)));
%let byvnn = %sysfunc(countw( &byvarn. , %str(|)));
%put &byvcn. &byvnn.;
%let _byvarc = %sysfunc( tranwrd( &byvarc. , %str(|) , %str( )) );
%let _byvarn = %sysfunc( tranwrd( &byvarn. , %str(|) , %str( )) );
%put &_byvarc. &_byvarn.;
输入数据的预处理
分组变量在进行排序和统计时我更习惯使用数值型变量,但是可以看到,目标表格使用的则是字符型变量,因此需要构建一个两者一一对应的数据集,用于在统计步骤之后,把数值型替换为字符型用于展示。
* _1.2 input data processing;
* _1.2.1 group variable datasets ;
data _byvar;
set _stdt0;
keep &_byvarc. &_byvarn.;
proc sort nodupkey; by &_byvarn. ;
run;
分析数据集
和adsl
数据集构建用于确定计算合计列
,本宏没有使用arm
字符型分组变量,我写个arm = "合计"
只是为了方便报错溯源
。
* _1.2.2 sum datasets;
data _adsum0;
set _stdt0;
&grpvarn. = 999;
arm = "合计";
run;
data _adslall;
set _adsl;
&grpvarn. = 999;
arm = "合计";
run;
核心统计步骤
分组统计
- 使用
proc means
根据分组变量统计待分析变量&var.
;
* _2. main statistical step;
* _2.1 by group;
* _2.1.1 stat by group;
proc sort data=_stdt0;
by &_byvarn.;
run;
proc means data=_stdt0 noprint;
var &var.;
by &_byvarn.;
class &grpvarn.;
output out=_bygrpm0(where = (not missing(&grpvarn.)) ) n=n mean=mean std= std median= mid q1=q1 q3=q3 min=min max=max;
proc sort;
by &grpvarn. &_byvarn.;
run;
注意,我这里对于试验分组变量
使用了class
语句,使用by
也可以,但是需要在前面的排序步骤中把&grpvarn.
加上。
我这么写,只是为了看起来有所区别,思路更清晰些。
- 分组统计
adsl
数据集中受试者的数量,为了计算nmissing
指标;
* _2.1.2 number caculation by group;
proc freq data=_adsl noprint;
table &grpvarn. / out=_bygrpnum0(drop = percent);
proc sort;
by &grpvarn.;
run;
- 合并上述两个数据集
* _2.1.3 combine;
data _bygrp;
merge _bygrpm0
_bygrpnum0
;
by &grpvarn.;
nmiss = count - n;
run;
合计列
- 合计列统计量的计算
* _2.2 sum;
* _2.2.1 stat of all subject;
proc sort data=_adsum0;
by &_byvarn.;
run;
proc means data=_adsum0 noprint;
var &var.;
by &_byvarn.;
class &grpvarn.;
output out=_summ0(where = (not missing(&grpvarn.)) ) n=n mean=mean std= std median =mid q1=q1 q3=q3 min=min max=max;
proc sort;
by &grpvarn. &_byvarn.;
run;
- 受试者总数,为了构建nmissing
* _2.2.2 number caculation of all subject;
proc freq data=_adslall noprint;
table &grpvarn. / out=_sumnum0(drop = percent);
proc sort;
by &grpvarn.;
run;
- 合并上述两个数据集
* _2.2.3 combine;
data _sum;
merge _summ0
_sumnum0
;
by &grpvarn.;
nmiss = count - n;
run;
统计后处理步骤
是否计算合计列
这里使用了条件判断
,rowsumyn
的赋值用于确定合计列的数据
和分组计算的数据
要不要set
在一起。
* _3 processing step of stat;
* _3.1 processing step of stat;
%if %sysfunc(upcase(&rowsumyn.)) = %str(Y) %then
%do;
%put WARNING: 将计算合计列;
data _dt4comb;
set _bygrp
_sum;
run;
%end;
%else
%do;
%put WARNING: 不计算合计列;
data _dt4comb;
set _bygrp;
run;
%end;
统计量的构造
根据目标表格
中的样式,以及小数点位数保留规则
,进行展示用统计指标的构建。
* _3.2 Variable construction of the target format;
data _combdt;
length _1nnmiss _2meansd _3median _4q1q3 _5minmax $200.;
set _dt4comb;
_1nnmiss =cats( strip(put(n ,8.0)) , "(",strip(put(count - n , 8.0)), ")");
_5minmax = cats( strip(put( round(min , 10**-( &decmax.) ) , 8.%eval( &decmax.)) ) , "~" ,strip(put( round(max , 10**-( &decmax.) ) , 8.%eval( &decmax.)) ) );
%if &decmax. <= 2 %then
%do;
_2meansd = cats( strip(put( round(mean , 10**-( &decmax. + 1) ), 8.%eval( &decmax. + 1) ) ) , "(" ,strip(put( round(std, 10**-( &decmax. + 2)), 8.%eval( &decmax. + 2)) ) , ")");
_3median = cats( strip(put( round(mid , 10**-( &decmax. + 1) ) , 8.%eval( &decmax. + 1) ) ) );
_4q1q3 = cats( strip(put( round(q1 , 10**-( &decmax. + 1) ) , 8.%eval( &decmax. + 1) ) ) , "~" ,strip(put( round(q3 , 10**-( &decmax. + 1) ) , 8.%eval( &decmax. + 1) ) ) );
%end;
%if &decmax. = 3 %then
%do;
_2meansd = cats( strip(put( round(mean , 10**-( &decmax. + 1) ), 8.%eval( &decmax. + 1) ) ) , "(" ,strip(put( round(std, 0.0001), 8.4) ) , ")");
_3median = cats( strip(put( round(mid , 10**-( &decmax. + 1) ) , 8.%eval( &decmax. + 1) ) ) );
_4q1q3 = cats( strip(put( round(q1 , 10**-( &decmax. + 1) ) , 8.%eval( &decmax. + 1) ) ) , "~" ,strip(put( round(q3 , 10**-( &decmax. + 1) ) , 8.%eval( &decmax. + 1) ) ) );
%end;
%if &decmax. >3 %then
%do;
_2meansd = cats( strip(put( round(mean , 0.0001 ), 8.4 ) ) , "(" ,strip(put( round(std, 0.0001), 8.4) ) , ")");
_3median = cats( strip(put( round(mid , 0.0001 ) , 8.4 ) ) );
_4q1q3 = cats( strip(put( round(q1 ,0.0001 ) , 8.4 ) ) , "~" ,strip(put( round(q3 , 0.0001 ) , 8.4 ) ) );
%end;
keep &grpvarn. &_byvarn. _1nnmiss _2meansd _3median _4q1q3 _5minmax;
proc sort;
by &grpvarn. &_byvarn.;
run;
数据塑性
proc transpose
数据塑形的能力还是很强的,很实用。
* _3.3 transpose;
proc transpose data=_combdt out=_trans1;
var _1nnmiss _2meansd _3median _4q1q3 _5minmax;
by &grpvarn. &_byvarn.;
run;
data _trans1_;
length _NAME $200.;
set _trans1;
_NAME= _NAME_;
drop _NAME_;
proc sort;
by &_byvarn. _NAME;
run;
proc transpose data=_trans1_ out=_trans2 prefix= _grp;
id &grpvarn.;
var col1;
by &_byvarn. _NAME;
proc sort;
by &_byvarn. _NAME;
run;
和分组变量联接获得字符型变量
* _3.4 merge with the group dataset;
data _mergegrp;
length &_byvarc. $200.;
merge _trans2(in = ina)
_byvar;
by &_byvarn.;
if ina;
proc sort;
by &_byvarn. _NAME;
run;
目标表格中空白行和列的构造
我在构建统计指标时考虑到了这一步,因此命名也比较固定,所以借此构造了两个flag
变量。
* _3.5 the blank row and the blank column making;
data _mkblank0;
set _mergegrp(drop=_NAME_);
by &_byvarn. _NAME;
if _NAME = '_1nnmiss' then
flag1 = 1;
else flag1 = 2;
output;
if _NAME = '_1nnmiss' then
do;
flag2 = 2;
output;
end;
proc sort;
by &_byvarn. _NAME;
run;
data _mkblank1;
set _mkblank0;
if flag1 = 2 and flag2 ^= 2 then
do;
array temp1 ( &byvnn. ) &_byvarn.;
do cc = 1 to &byvnn.;
call missing(temp1[cc]);
end;
array temp2 ( &byvcn. ) &_byvarc.;
do dd = 1 to &byvcn.;
call missing(temp2[dd]);
end;
end;
if flag1 = 1 and flag2 ^= 2 then
do;
call missing( of _numeric_);
call missing( of _character_);
end;
drop flag1 flag2 cc dd &_byvarn.;
run;
统计指标名称修改
* _3.6 modify the name of items;
data _NAME;
set _mkblank1;
if _NAME = "_1nnmiss" then
_NAME = "n(nmisssing)";
if _NAME = "_2meansd" then
_NAME = "Mean(SD)";
if _NAME = "_3median" then
_NAME = "Median";
if _NAME = "_4q1q3" then
_NAME = "Q1~Q3";
if _NAME = "_5minmax" then
_NAME = "Min~Max";
drop &_byvarn.;
run;
列名的修改
列名
修改为C0,C1,C2,...
* _3.6 rename column;
proc contents data=_NAME out=_info noprint;
proc sort;
by varnum;
run;
proc sql noprint;
select count(distinct NAME) , NAME into: varn, :vnam1 -:vnam99 from _info;
quit;
data _renamed;
length C: $200.;
set _NAME;
%do ii = 1 %to &varn.;
%let jj = %eval(&ii. - 1);
rename &&vnam&ii. = C&jj.;
%end;
run;
数据集输出和清除临时数据集
data _&dtout.;
set _renamed;
if _N_ = 1 then
delete;
run;
* _4 output data ;
* _4.1 output ;
data &libout..&dtout.;
set &dtout.;
run;
* _4.2 remove temp datasets ;
proc datasets lib=work;
delete _:;
run;
总结
以上就是我写这个宏的最主要代码,完整代码如下:
%let path=E:\SAS\macro\macro ongoing\MeanbyT\;
%include "&path.sampleMeanbyT.sas";
%macro MeanbyT(
libin = ,
dtin = ,
adsl=,
var =,
grpvarn = ,
byvarc=,
byvarn=,
rowsumyn=,
libout=,
dtout=
);
* _1. pre-processing;
data _stdt0;
set &libin..&dtin.;
run;
data _adsl;
set &adsl.;
run;
* _1.1 decimals;
data _dec;
set _stdt0;
dec = length(scan(strip(put(&var., best.)),2,"."));
if not index(&var.,".") then
dec = 0;
run;
proc sql;
select max(dec) into: decmax
from _dec
;
quit;
%put 最大小数位数:&decmax.;
* _1.2 byvar;
%let byvcn = %sysfunc(countw( &byvarc. , %str(|)));
%let byvnn = %sysfunc(countw( &byvarn. , %str(|)));
%put &byvcn. &byvnn.;
%let _byvarc = %sysfunc( tranwrd( &byvarc. , %str(|) , %str( )) );
%let _byvarn = %sysfunc( tranwrd( &byvarn. , %str(|) , %str( )) );
%put &_byvarc. &_byvarn.;
* _1.2 input data processing;
* _1.2.1 group variable datasets;
data _byvar;
set _stdt0;
keep &_byvarc. &_byvarn.;
proc sort nodupkey;
by &_byvarn.;
run;
* _1.2.2 sum datasets;
data _adsum0;
set _stdt0;
&grpvarn. = 999;
arm = "合计";
run;
data _adslall;
set _adsl;
&grpvarn. = 999;
arm = "合计";
run;
* _2. main statistical step;
* _2.1 by group;
* _2.1.1 stat by group;
proc sort data=_stdt0;
by &_byvarn.;
run;
proc means data=_stdt0 noprint;
var &var.;
by &_byvarn.;
class &grpvarn.;
output out=_bygrpm0(where = (not missing(&grpvarn.)) ) n=n mean=mean std= std median= mid q1=q1 q3=q3 min=min max=max;
proc sort;
by &grpvarn. &_byvarn.;
run;
* _2.1.2 number caculation by group;
proc freq data=_adsl noprint;
table &grpvarn. / out=_bygrpnum0(drop = percent);
proc sort;
by &grpvarn.;
run;
* _2.1.3 combine;
data _bygrp;
merge _bygrpm0
_bygrpnum0
;
by &grpvarn.;
nmiss = count - n;
run;
* _2.2 sum;
* _2.2.1 stat of all subject;
proc sort data=_adsum0;
by &_byvarn.;
run;
proc means data=_adsum0 noprint;
var &var.;
by &_byvarn.;
class &grpvarn.;
output out=_summ0(where = (not missing(&grpvarn.)) ) n=n mean=mean std= std median =mid q1=q1 q3=q3 min=min max=max;
proc sort;
by &grpvarn. &_byvarn.;
run;
* _2.2.2 number caculation of all subject;
proc freq data=_adslall noprint;
table &grpvarn. / out=_sumnum0(drop = percent);
proc sort;
by &grpvarn.;
run;
* _2.2.3 combine;
data _sum;
merge _summ0
_sumnum0
;
by &grpvarn.;
nmiss = count - n;
run;
* _3 processing step of stat;
* _3.1 processing step of stat;
%if %sysfunc(upcase(&rowsumyn.)) = %str(Y) %then
%do;
%put WARNING: 将计算合计列;
data _dt4comb;
set _bygrp
_sum;
run;
%end;
%else
%do;
%put WARNING: 不计算合计列;
data _dt4comb;
set _bygrp;
run;
%end;
* _3.2 Variable construction of the target format;
data _combdt;
length _1nnmiss _2meansd _3median _4q1q3 _5minmax $200.;
set _dt4comb;
_1nnmiss =cats( strip(put(n ,8.0)) , "(",strip(put(count - n , 8.0)), ")");
_5minmax = cats( strip(put( round(min , 10**-( &decmax.) ) , 8.%eval( &decmax.)) ) , "~" ,strip(put( round(max , 10**-( &decmax.) ) , 8.%eval( &decmax.)) ) );
%if &decmax. <= 2 %then
%do;
_2meansd = cats( strip(put( round(mean , 10**-( &decmax. + 1) ), 8.%eval( &decmax. + 1) ) ) , "(" ,strip(put( round(std, 10**-( &decmax. + 2)), 8.%eval( &decmax. + 2)) ) , ")");
_3median = cats( strip(put( round(mid , 10**-( &decmax. + 1) ) , 8.%eval( &decmax. + 1) ) ) );
_4q1q3 = cats( strip(put( round(q1 , 10**-( &decmax. + 1) ) , 8.%eval( &decmax. + 1) ) ) , "~" ,strip(put( round(q3 , 10**-( &decmax. + 1) ) , 8.%eval( &decmax. + 1) ) ) );
%end;
%if &decmax. = 3 %then
%do;
_2meansd = cats( strip(put( round(mean , 10**-( &decmax. + 1) ), 8.%eval( &decmax. + 1) ) ) , "(" ,strip(put( round(std, 0.0001), 8.4) ) , ")");
_3median = cats( strip(put( round(mid , 10**-( &decmax. + 1) ) , 8.%eval( &decmax. + 1) ) ) );
_4q1q3 = cats( strip(put( round(q1 , 10**-( &decmax. + 1) ) , 8.%eval( &decmax. + 1) ) ) , "~" ,strip(put( round(q3 , 10**-( &decmax. + 1) ) , 8.%eval( &decmax. + 1) ) ) );
%end;
%if &decmax. >3 %then
%do;
_2meansd = cats( strip(put( round(mean , 0.0001 ), 8.4 ) ) , "(" ,strip(put( round(std, 0.0001), 8.4) ) , ")");
_3median = cats( strip(put( round(mid , 0.0001 ) , 8.4 ) ) );
_4q1q3 = cats( strip(put( round(q1 ,0.0001 ) , 8.4 ) ) , "~" ,strip(put( round(q3 , 0.0001 ) , 8.4 ) ) );
%end;
keep &grpvarn. &_byvarn. _1nnmiss _2meansd _3median _4q1q3 _5minmax;
proc sort;
by &grpvarn. &_byvarn.;
run;
* _3.3 transpose;
proc transpose data=_combdt out=_trans1;
var _1nnmiss _2meansd _3median _4q1q3 _5minmax;
by &grpvarn. &_byvarn.;
run;
data _trans1_;
length _NAME $200.;
set _trans1;
_NAME= _NAME_;
drop _NAME_;
proc sort;
by &_byvarn. _NAME;
run;
proc transpose data=_trans1_ out=_trans2 prefix= _grp;
id &grpvarn.;
var col1;
by &_byvarn. _NAME;
proc sort;
by &_byvarn. _NAME;
run;
* _3.4 merge with the group dataset;
data _mergegrp;
length &_byvarc. $200.;
merge _trans2(in = ina)
_byvar;
by &_byvarn.;
if ina;
proc sort;
by &_byvarn. _NAME;
run;
* _3.5 the blank row and the blank column making;
data _mkblank0;
set _mergegrp(drop=_NAME_);
by &_byvarn. _NAME;
if _NAME = '_1nnmiss' then
flag1 = 1;
else flag1 = 2;
output;
if _NAME = '_1nnmiss' then
do;
flag2 = 2;
output;
end;
proc sort;
by &_byvarn. _NAME;
run;
data _mkblank1;
set _mkblank0;
if flag1 = 2 and flag2 ^= 2 then
do;
array temp1 ( &byvnn. ) &_byvarn.;
do cc = 1 to &byvnn.;
call missing(temp1[cc]);
end;
array temp2 ( &byvcn. ) &_byvarc.;
do dd = 1 to &byvcn.;
call missing(temp2[dd]);
end;
end;
if flag1 = 1 and flag2 ^= 2 then
do;
call missing( of _numeric_);
call missing( of _character_);
end;
drop flag1 flag2 cc dd &_byvarn.;
run;
* _3.6 modify the name of items;
data _NAME;
set _mkblank1;
if _NAME = "_1nnmiss" then
_NAME = "n(nmisssing)";
if _NAME = "_2meansd" then
_NAME = "Mean(SD)";
if _NAME = "_3median" then
_NAME = "Median";
if _NAME = "_4q1q3" then
_NAME = "Q1~Q3";
if _NAME = "_5minmax" then
_NAME = "Min~Max";
drop &_byvarn.;
run;
* _3.6 rename column;
proc contents data=_NAME out=_info noprint;
proc sort;
by varnum;
run;
proc sql noprint;
select count(distinct NAME) , NAME into: varn, :vnam1 -:vnam99 from _info;
quit;
data _renamed;
length C: $200.;
set _NAME;
%do ii = 1 %to &varn.;
%let jj = %eval(&ii. - 1);
rename &&vnam&ii. = C&jj.;
%end;
run;
data _&dtout.;
set _renamed;
if _N_ = 1 then
delete;
run;
* _4 output data;
* _4.1 output;
data &libout..&dtout.;
set _&dtout.;
run;
* _4.2 remove temp datasets;
proc datasets lib=work noprint;
delete _:;
run;
%mend;
%MeanbyT(
libin =work ,
dtin = ad,
adsl= adsl,
var =aval,
grpvarn =armn ,
byvarc=avisit|parcat1|param,
byvarn=avisitn|parcat1n|paramn,
rowsumyn=Y,
libout=work,
dtout=TT2
);
新手小白,疏漏在所难免,如果您看完有所指正,那我会很感谢您。如果您看完有所收获,那是我的荣幸。