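Variable names
The name must be at most 32 bytes long (one letter is 1 byte; one Chinese character is 2 bytes in a double-byte encoding).
It must begin with a letter or an underscore.
It may contain letters, digits, and underscores, but not special characters such as %$!*&#@.
Letters may be lowercase or uppercase; names are not case-sensitive.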
Missing numeric data are represented by a single period (.) and missing character data are represented by blanks.
library name
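1 to 8 characters; begins with a letter or an underscore; the remaining characters are letters, digits, or underscores.
Comments
Begin with an asterisk (*) and end with a semicolon: * comment text ;
Begin with slash-asterisk and end with asterisk-slash: /* comment text */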
Difference between DATA steps and PROC steps
The DATA statement does three things
- Tells SAS that a DATA step is starting.
- Names the SAS dataset being created.
- Set variables used in the DATA step to missing values
three default windows
1.program editor window
2.log window
3.output window
The basics of using SAS
- Prepare the SAS program
- Submit it for analysis
- Review the resulting log for errors
- Examine the output files to view the results of your analysis
Executing the program
- Pull down the Locals menu and select Submit.
- Click on the run icon on taskbar, which is a picture of a man running.
- Push F8.
- Highlight text and click on run symbol
- Note: DATA or PROC step is not executed until next DATA and PROC. Use RUN; statement to force execution.
Reading a .dat file
DATA NAME;
* Column input reads fixed positions, so no DLM option is needed here;
INFILE 'E:\data\a.dat' FIRSTOBS=4;
INPUT V1 1-5 V2 6-10 V3 $ 15;
RUN;
PROC PRINT DATA=NAME; RUN;
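INFILE options
Format
INFILE 'AAAAA.DAT' XXX;
FIRSTOBS=n  the line at which to start reading data
OBS=n  the last line to read
MISSOVER  when an INPUT statement reaches the end of a line before every variable has a value, SAS does not move to the next line; the remaining variables are set to missing. Without it (the default FLOWOVER behavior), SAS continues reading from the next line.
TRUNCOVER  for column or formatted input, when a line is shorter than the field being read, SAS assigns whatever characters are present to the variable instead of moving to the next line.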
INPUT Notes
(1) Duplicate formats can be used when variables have the same format. The examples below represent the same formats of variables x1-x5.
INPUT x1 4. x2 4. x3 4. x4 4. x5 4.;
INPUT (x1 x2 x3 x4 x5) (4. 4. 4. 4. 4.);
INPUT (x1-x5) (5*4.);
(2) @@ tells SAS to hold the line of raw data and use it when processing the next observation. The @@ must be the last entry in the INPUT statement.
(3) @ tells SAS to hold this line of data for possible use by INPUT statements later in the DATA step. The @ must be the last entry in the INPUT statement.
(4) / tells SAS to move to the next line of the raw dataset.
(5) #n tells SAS to skip to the nth line of the raw data for the observation.
(6) @n tells SAS to move to the nth column.
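Special characters
@40 moves to column 40; @'aa' moves to just after the string aa
Slash / moves to the next line of raw data
#2 moves to the second line of raw data for the observation
For several observations on one line, put @@ at the end of the INPUT statement
A single @ at the end of the INPUT statement (trailing at) can be used to select part of the data; see the example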
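A minimal sketch of the two line-hold specifiers, using made-up inline data (the dataset names, variable names, and values are illustrative and not from these notes). The first step uses @@ to read several observations from each line; the second uses a trailing @ to test one value before deciding whether to read the rest of the line.

```sas
/* Double trailing @@: several observations per raw data line */
DATA scores;
    INPUT name $ score @@;
    DATALINES;
Ann 90 Bob 85 Cal 77
Dee 68 Eve 95
;
RUN;

/* Single trailing @: read one field, test it, then read the rest */
DATA females;
    INPUT sex $ @;
    IF sex NE 'F' THEN DELETE;   /* skip the line without reading further */
    INPUT name $ score;
    DATALINES;
F Alice 90
M Brian 85
F Carol 77
;
RUN;
```

The second step keeps only the lines whose first field is F, which is the usual reason to hold a line with a trailing @.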
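Reading delimited files in a DATA step
DLM=',' specifies a comma delimiter; DLM='09'x specifies a tab delimiter
DSD ignores delimiters inside quoted strings. For example, in the observation Joseph,76,"Red Racers, Washington" the commas outside the quotes are treated as delimiters, but the comma inside the quotes is not. DSD also removes the quotation marks from the string and treats two adjacent delimiters as a missing value.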
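The DSD behavior can be tried with inline data. This is a sketch only: the first data line is the example from the notes, the second line is made up to show two adjacent delimiters, and the dataset and variable names are assumptions.

```sas
DATA players;
    INFILE DATALINES DSD;          /* DSD uses a comma delimiter by default */
    LENGTH name $ 10 team $ 30;    /* long enough to hold the quoted team string */
    INPUT name $ score team $;
    DATALINES;
Joseph,76,"Red Racers, Washington"
Maria,,"Blue Sox, Oregon"
;
RUN;

PROC PRINT DATA=players; RUN;
```

In the second line the two adjacent commas leave score missing, and in both lines the quoted team name is read as one value with the quotation marks removed.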
Reading Excel data
PROC IMPORT DATAFILE='D:\A.XLS' OUT=A REPLACE DBMS=XLS; GETNAMES=YES; SHEET="Sheet1"; RUN;
PROC PRINT DATA=A; RUN;
OUT= name of the output dataset
DBMS= file type: XLS or XLSX
Reading a sas7bdat file (a file on the desktop)
data new; set 'C:\Users\sdkyc\Desktop\hsb2.sas7bdat'; run;
proc print data=new; run;
Temporary vs. permanent datasets
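A short sketch of the distinction, assuming a default SAS session: a one-level dataset name is stored in the WORK library and is deleted when the session ends, while a two-level name (libref.dataset) is stored in the folder assigned to the libref. The libref mylib and the folder path are hypothetical; sashelp.class is a sample dataset shipped with SAS.

```sas
/* Temporary: a one-level name goes to the WORK library */
DATA temp_copy;
    SET sashelp.class;
RUN;

/* Permanent: a two-level name, libref.dataset */
LIBNAME mylib 'C:\mydata';   /* hypothetical folder; it must already exist */
DATA mylib.perm_copy;
    SET sashelp.class;
RUN;
```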
Variable assignment and operations
IF-THEN DO IF-ELSE
- DO and END work as a pair; every action between them is executed
DATA A;
INFILE 'C:\A.DAT';
INPUT V1 $ V2 V3;
IF V2 = . THEN V4='MISSING';
ELSE IF V2<100 THEN V4='LOW';
ELSE IF V2<1000 THEN V4='MEDIUM';
ELSE V4 = 'HIGH';
RUN;
- IF can also be used to build a subset of the data
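The two points above (DO ... END grouping and subsetting) can be sketched as follows, assuming the dataset A created by the step above. The output dataset B and the variables status and flag are hypothetical names introduced for illustration.

```sas
DATA b;
    SET a;
    LENGTH status $ 7;
    /* DO ... END groups several actions under one condition */
    IF V2 = . THEN DO;
        status = 'MISSING';
        flag = 1;
    END;
    ELSE DO;
        status = 'PRESENT';
        flag = 0;
    END;
    /* Subsetting IF: keep only observations with V2 of at least 100 */
    IF V2 >= 100;
RUN;
```

A subsetting IF has no THEN clause; observations that fail the condition are simply not written to the output dataset. Note that SAS treats a missing numeric value as smaller than any number, so a condition such as V2 < 100 would also keep the missing values.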
使用数组简化程序 ARRAY
ARRAY array-name <{n}> <$> <length> <elements> <(initialvalues)>;
array-name - is the name of the array.
{n} - is either the dimension of the array, or an asterisk (*) to indicate that the dimension is determined from the number of array elements or initial values.
$ indicates that the array type is character.
length - is the maximum length of elements in the array. For character arrays, the maximum length cannot exceed 32,767 (the limit was 200 in SAS version 6).
elements - are the variables that make up the array and they exist in a dataset or are created before the array definition.
initial-values - are the values to use to initialize some or all of the array elements. Separate these values with commas or blanks.
ARRAY rain {5} janr febr marr aprr mayr;
ARRAY days{7} d1-d7;
ARRAY month{*} jan feb jul oct nov;
ARRAY x{*} _NUMERIC_;
ARRAY qbx{10};
ARRAY meal{3};
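An array is usually paired with a DO loop so that one statement is applied to every element. The sketch below reuses the rain array from the list above; the data values, the dataset name, and the inches-to-centimeters conversion are made up for illustration.

```sas
DATA rain_cm;
    INPUT janr febr marr aprr mayr;
    ARRAY rain{5} janr febr marr aprr mayr;
    /* Apply the same conversion to each of the five variables */
    DO i = 1 TO DIM(rain);
        rain{i} = rain{i} * 2.54;
    END;
    DROP i;   /* the loop counter is not needed in the output */
    DATALINES;
1.2 0.8 2.5 3.1 1.9
;
RUN;
```

DIM(rain) returns the number of elements, so the loop does not need to change if the array is later given more variables.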
Link to annotated output notes for each PROC
https://stats.idre.ucla.edu/other/annotatedoutput/
PROC CONTENTS prints the descriptor portion of a dataset, not the data itself
PROC MEANS
Prints descriptive statistics; overlaps with PROC UNIVARIATE
MAXDEC= number of decimal places
proc means data=a N NMISS MEAN STD STDERR MAXDEC=4; run;
PROC UNIVARIATE t-test sample mean mu0
The Tests for Location table includes a two-tailed t-test; look at the Student's t value. If p < α, the mean of write is significantly different from 30.
proc univariate data = "D:\hsb2" plots normal mu0=30; var write; run;
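Also used to test normality: draw the plots and find the Shapiro-Wilk p-value. If it is greater than α, the data are consistent with a normal distribution.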
proc univariate data=a normal plot; var write; run;
1.These tests check the assumption that the data is distributed as a normal distribution.
2.Null hypothesis: data is normal vs Alternate hypothesis: data not normal.
3.A large p-value (e.g. > 0.05) indicates the data are consistent with a normal distribution (we fail to reject the null hypothesis).
4.If 6 < sample size < 2001 use Shapiro-Wilk.
5.Sample size > 2000 use Kolmogorov-Smirnov test.
6.Within the appropriate sample size range Shapiro-Wilk is more powerful than Kolmogorov-Smirnov test.
7.A clear departure from skewness = 0 and kurtosis = 0 suggests non-normality.
PROC FREQ TABLES chisq
Tests whether two variables are associated or independent. Find the chi-square value in the output: a large chi-square statistic corresponds to a small p-value. If the p-value is small enough (say < 0.05), we reject the null hypothesis that the two variables are independent and conclude that there is an association between the row and the column variables.
PROC FREQ DATA=CLASSFIT2; TABLES SEX*HT/CHISQ; RUN;
PROC REG
Assumption
a.Normality of errors: The error distribution is normal.
b.Normality of errors is checked by doing residual analysis. In residual analysis we first calculate the residuals ($r = y - \hat{y}$, where $\hat{y}$ is the predicted value), then verify the normality of the residuals using proc univariate or Q-Q plots.
c.Independence: The errors or observations are independent of each other. Example: apple stock price recorded on 10 consecutive days. Here the 10 observations are not independent
d.The variables must be numeric.
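The residual check described in (b) can be sketched as below. It assumes a dataset a with numeric variables write and read; the output dataset resid_out and the variables resid and pred are hypothetical names chosen for illustration.

```sas
/* Fit the regression and save residuals and predicted values */
PROC REG DATA=a;
    MODEL write = read;
    OUTPUT OUT=resid_out R=resid P=pred;
RUN;
QUIT;

/* Check the normality of the residuals */
PROC UNIVARIATE DATA=resid_out NORMAL PLOT;
    VAR resid;
RUN;
```

The NORMAL option adds the normality tests (including Shapiro-Wilk) to the output, so the residuals are judged the same way as any other variable.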
PROC ANOVA
Assumption: the sampled populations are normally distributed.
One-way ANOVA: only one factor (a single variable, which may have several levels)
See the slides
PROC GLM contrast
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#glm_toc.htm
1.Question: is the mean height the same at every age? μ1=μ2=μ3=μ4
proc glm data=a; class age; model height=age; run;
2.Question: does the mean height of 11- and 12-year-olds differ from the mean height of 13- to 16-year-olds?
proc glm data=a; class age;
model height=age;
contrast '11&12 vs. rest'
age 2 2 -1 -1 -1 -1; run; quit;
PROC CORR
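Shows the Pearson correlation coefficients between variables: a negative value means negative correlation; a positive value means positive correlation.
NOSIMPLE suppresses the descriptive statistics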
proc corr data = "D:\hsb2" pearson nosimple; var read write; run;
PROC TTEST t-test
Assumption: all variables are normally distributed.
- Single sample t-test. Example: test whether the mean of score equals 50; if p is less than α, it is significantly different.
proc ttest data="D:\hsb2" H0=50; var score; run;
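- Dependent group t-test (paired t-test). Example: a group of students each took two exams; is the mean write score the same as the mean read score? If p is less than α, they are significantly different.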
proc ttest data="D:\hsb2"; paired write*read; run;
- Independent group t-test. Example: does sex affect the write score?
If Pr > F in the Equality of Variances table is less than α, the two sex groups have different variances, so use the Satterthwaite (unequal) row and read its Pr > |t|; otherwise use the Pooled row.
proc ttest data="D:\hsb2"; class sex; var write; run;
PROC NPAR1WAY
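PROC NPAR1WAY with the WILCOXON option runs the Wilcoxon rank-sum test for independent groups. Paired questions such as the ones below call for the Wilcoxon signed-rank test, which PROC UNIVARIATE reports for a variable of paired differences. Example questions: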
Are test scores different from 4th grade to 5th grade on the same students?
Does a particular diet drug have an effect on BMI when tested on the same individuals?
Assumptions of the test:
Data comes from two matched, or dependent, populations.
The data is continuous.
Because it is a non-parametric test, it does not require a special distribution of the dependent variable in the analysis.
It is especially suitable for small sample sizes.
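A sketch of both Wilcoxon tests in SAS. For matched data, the signed-rank test appears in the Tests for Location table of PROC UNIVARIATE when it is run on the paired differences; for two independent groups, PROC NPAR1WAY with the WILCOXON option gives the rank-sum test. The dataset scores and the variables grade4, grade5, and diff are hypothetical; the second step assumes a dataset a with variables sex and write.

```sas
/* Matched data: signed-rank test on the paired differences */
DATA diffs;
    SET scores;               /* hypothetical: one row per student */
    diff = grade5 - grade4;
RUN;

PROC UNIVARIATE DATA=diffs;
    VAR diff;                 /* see the Signed Rank row in Tests for Location */
RUN;

/* Independent groups: Wilcoxon rank-sum test */
PROC NPAR1WAY DATA=a WILCOXON;
    CLASS sex;
    VAR write;
RUN;
```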
one- and two-tail test
P value
If the test is H0 = 0 and the result is p < α, reject H0: the mean is significantly different from 0.
Code template
proc print data= ; run;