1.CENSORED DATA
Many researchers consider survival data analysis to be merely the application of two conventional statistical methods to a special type of problem: parametric if the distribution of survival times is known to be normal and nonparametric if the distribution is unknown.
This assumption would be true if the survival times of all the subjects were exact and known; however, some survival times are not.
Further, the survival distribution is often skewed, or far from being normal. Thus there is a need for new statistical techniques.
2.Three Types of Censoring
2.1.Type I Censoring
Scenario: Because of time and/or cost limitations, the researcher often cannot wait for the death of all the animals. One option is to observe for a fixed period of time, say six months, after which the surviving animals are sacrificed.
Definition:
Survival times recorded for the animals that died during the study period are the times from the start of the experiment to their death. These are called exact or uncensored observations. The survival times of the sacrificed animals are not known exactly but are recorded as at least the length of the study period. These are called censored observations. Some animals could be lost or die accidentally. Their survival times, from the start of the experiment to loss or death, are also censored observations. In type I censoring, if there are no accidental losses, all censored observations equal the length of the study period.
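A minimal sketch of how type I censored observations might be recorded, assuming hypothetical death times in months and a six-month study period. The names `type1_censor`, `death_times`, and `study_length` are illustrative and do not come from the source.

```python
def type1_censor(death_times, study_length):
    """Return (observed_time, is_exact) pairs under type I censoring.

    Animals still alive at `study_length` are sacrificed, so their
    survival time is recorded as `study_length` and flagged as censored.
    """
    records = []
    for t in death_times:
        if t <= study_length:
            records.append((t, True))              # exact (uncensored)
        else:
            records.append((study_length, False))  # censored at study end
    return records


# Hypothetical true death times in months (not data from the source).
death_times = [2.5, 4.0, 7.3, 5.1, 9.8]
records = type1_censor(death_times, study_length=6.0)
print(records)
```

With no accidental losses, every censored record carries the same value, the length of the study period.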
2.2.Type II Censoring
Scenario: Another option in animal studies is to wait until a fixed portion of the animals have died, say 80 of 100, after which the surviving animals are sacrificed.
Definition:
In this case, type II censoring, if there are no accidental losses, the censored observations equal the largest uncensored observation.
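A small sketch of type II censoring under the same hypothetical setup: the study stops at a fixed number of deaths rather than at a fixed time. The function name `type2_censor` is illustrative, and the sketch assumes there are no tied death times at the cutoff.

```python
def type2_censor(death_times, n_deaths):
    """Return (observed_time, is_exact) pairs under type II censoring.

    The study ends at the n_deaths-th death; survivors are sacrificed and
    recorded as censored at that time. Assumes no ties at the cutoff.
    """
    cutoff = sorted(death_times)[n_deaths - 1]  # largest uncensored observation
    records = []
    for t in death_times:
        if t <= cutoff:
            records.append((t, True))        # died before the study stopped
        else:
            records.append((cutoff, False))  # censored at the cutoff
    return records


# Hypothetical death times in months; stop after the third death.
death_times_2 = [2.5, 4.0, 7.3, 5.1, 9.8]
records_2 = type2_censor(death_times_2, n_deaths=3)
print(records_2)
```

Here the censored values all equal 5.1, the largest uncensored observation, which matches the definition above.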
2.3.Type III Censoring
Scenario: In most clinical and epidemiologic studies the period of study is fixed and patients enter the study at different times during that period. Some may die before the end of the study; their exact survival times are known. Others may withdraw before the end of the study and are lost to follow-up. Still others may be alive at the end of the study.
Definition:
For "lost" patients, survival times are at least from their entrance to the last contact. For patients still alive, survival times are at least from entry to the end of the study. The latter two kinds of observations are censored observations. Since the entry times are not simultaneous, the censored times are also different. This is type III censoring.
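A sketch of how type III (random) censoring might be encoded when patients enter at different calendar times. The record layout (dicts with `entry`, and optional `death` or `lost` keys) and all values are assumptions made for illustration.

```python
def type3_censor(patients, study_end):
    """Return (survival_time, is_exact) pairs with staggered entry.

    Each patient is a dict with an 'entry' calendar time and, optionally,
    a 'death' or 'lost' calendar time. All times share the same unit.
    """
    records = []
    for p in patients:
        death = p.get("death")
        lost = p.get("lost")
        if death is not None and death <= study_end:
            records.append((death - p["entry"], True))       # exact
        elif lost is not None and lost <= study_end:
            records.append((lost - p["entry"], False))       # lost to follow-up
        else:
            records.append((study_end - p["entry"], False))  # alive at study end
    return records


# Hypothetical patients, calendar time in months since the study opened.
patients = [
    {"entry": 0, "death": 8},
    {"entry": 2, "lost": 7},
    {"entry": 5},
    {"entry": 3, "death": 15},  # dies after the study has ended
]
records_3 = type3_censor(patients, study_end=12)
print(records_3)
```

Because the entry times differ, the censored times differ as well (5, 7, and 9 months here), unlike in types I and II.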
2.4.Supplementary Notes
Type I and type II censored observations are also called singly censored data, and type III, progressively censored data, by Cohen (1965). Another commonly used name for type III censoring is random censoring. All of these types of censoring are right censoring or censoring to the right. There are also left censoring and interval censoring cases. Left censoring occurs when it is known that the event of interest occurred prior to a certain time t, but the exact time of occurrence is unknown.
Interval censoring occurs when the event of interest is known to have occurred between times a and b.
3.Three Main Functions
3.1.Survival Function
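Definition:
Let $T$ denote the survival time. The survival function $S(t)$ is defined as the probability that an individual survives longer than $t$:
$$S(t) = P(T > t)$$
For the cumulative distribution function $F(t)$ of $T$,
$$S(t) = 1 - F(t)$$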
Estimation:
In practice, if there are no censored observations, the survivorship function is estimated as the proportion of patients surviving longer than $t$:
$$\hat{S}(t) = \frac{\text{number of patients surviving longer than } t}{\text{total number of patients}}$$
where the circumflex denotes an estimate of the function. When censored observations are present, the numerator cannot always be determined.
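The uncensored estimate described above, the proportion of patients surviving longer than a given time, can be sketched in a few lines. The function name `survival_estimate` and the sample times are hypothetical.

```python
def survival_estimate(times, t):
    """Proportion of patients surviving longer than t (no censoring)."""
    return sum(1 for x in times if x > t) / len(times)


# Hypothetical exact survival times in months.
times_s = [1, 3, 3, 6, 8, 9, 12, 15, 20, 24]
for t in (0, 6, 24):
    print(t, survival_estimate(times_s, t))
```

The estimate starts at 1 for t = 0 and falls to 0 once t reaches the largest observed time.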
3.2.Density Function
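Definition:
The survival time $T$ has a probability density function $f(t)$ defined as the limit of the probability that an individual fails in the short interval $t$ to $t + \Delta t$ per unit width $\Delta t$, or simply the probability of failure in a small interval per unit time:
$$f(t) = \lim_{\Delta t \to 0} \frac{P[\text{an individual dying in the interval } (t, t + \Delta t)]}{\Delta t}$$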
Properties:
$f(t)$ is a nonnegative function.
The area between the density curve and the $t$ axis is equal to 1.
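Estimation:
In practice, if there are no censored observations, the probability density function is estimated as the proportion of patients dying in an interval per unit width:
$$\hat{f}(t) = \frac{\text{number of patients dying in the interval beginning at time } t}{(\text{total number of patients}) \times (\text{interval width})}$$
When censored observations are present, this estimate is not applicable. The density function is also known as the unconditional failure rate.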
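A sketch of the uncensored density estimate, the proportion of patients dying in an interval per unit width. The half-open interval convention `[start, start + width)` and the sample data are assumptions for illustration.

```python
def density_estimate(times, start, width):
    """Proportion of patients dying in [start, start + width) per unit width."""
    dying = sum(1 for x in times if start <= x < start + width)
    return dying / (len(times) * width)


# Hypothetical exact survival times in months, grouped into 5-month intervals.
times_f = [1, 3, 3, 6, 8, 9, 12, 15, 20, 24]
width = 5
estimates = [density_estimate(times_f, s, width) for s in range(0, 25, width)]
print(estimates)

# The estimated density times the interval width sums to 1 over all intervals.
total_area = sum(e * width for e in estimates)
```

The last line mirrors the property that the area under the density curve equals 1.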
3.3.Hazard Function
Definition:
The hazard function $h(t)$ of survival time $T$ gives the conditional failure rate. This is defined as the probability of failure during a very small time interval, assuming that the individual has survived to the beginning of the interval, or as the limit of the probability that an individual fails in a very short interval, $t$ to $t + \Delta t$, given that the individual has survived to time $t$:
$$h(t) = \lim_{\Delta t \to 0} \frac{P[t \le T < t + \Delta t \mid T \ge t]}{\Delta t}$$
The hazard function can also be defined in terms of the cumulative distribution function $F(t)$ and the probability density function $f(t)$:
$$h(t) = \frac{f(t)}{1 - F(t)}$$
The hazard function is also known as the instantaneous failure rate, force of mortality, conditional mortality rate, and age-specific failure rate. It gives the risk of failure per unit time during the aging process.
The cumulative hazard function is defined as:
$$H(t) = \int_0^t h(x)\,dx$$
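Estimation:
In practice, when there are no censored observations the hazard function is estimated as the proportion of patients dying in an interval per unit time, given that they have survived to the beginning of the interval:
$$\hat{h}(t) = \frac{\text{number of patients dying in the interval beginning at time } t}{(\text{number of patients surviving at } t) \times (\text{interval width})}$$
Actuaries usually use the average hazard rate of the interval, in which the number of patients dying per unit time in the interval is divided by the average number of survivors at the midpoint of the interval:
$$\hat{h}(t) = \frac{\text{number of patients dying per unit time in the interval}}{(\text{number of patients surviving at } t) - (\text{number of deaths in the interval})/2}$$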
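Both hazard estimates described above can be sketched side by side for uncensored data. The function names, the half-open interval convention, and the sample data are assumptions. The actuarial version takes the "average number of survivors at the midpoint" to be the number alive at the start minus half the deaths in the interval.

```python
def interval_counts(times, start, width):
    """Count patients alive at `start` and those dying in [start, start + width)."""
    at_risk = sum(1 for x in times if x >= start)
    dying = sum(1 for x in times if start <= x < start + width)
    return at_risk, dying


def hazard_estimate(times, start, width):
    """Deaths per unit time divided by the number surviving at the interval start."""
    at_risk, dying = interval_counts(times, start, width)
    if at_risk == 0:
        raise ValueError("no patients at risk at the start of the interval")
    return dying / (at_risk * width)


def actuarial_hazard(times, start, width):
    """Deaths per unit time divided by the average number alive at the midpoint."""
    at_risk, dying = interval_counts(times, start, width)
    if at_risk == 0:
        raise ValueError("no patients at risk at the start of the interval")
    return (dying / width) / (at_risk - dying / 2)


# Hypothetical exact survival times in months.
times_h = [1, 3, 3, 6, 8, 9, 12, 15, 20, 24]
print(hazard_estimate(times_h, start=5, width=5))
print(actuarial_hazard(times_h, start=5, width=5))
```

The actuarial estimate is never smaller than the simple one, because its denominator subtracts half of the deaths in the interval.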
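3.4.Relationships Among the Three Main Functions
1. $h(t) = \frac{f(t)}{S(t)}$
2. $f(t) = \frac{d}{dt}[1 - S(t)] = -S'(t)$
3. $h(t) = -\frac{S'(t)}{S(t)} = -\frac{d}{dt} \log S(t)$
4. $S(t) = \exp[-H(t)] = \exp\left[-\int_0^t h(x)\,dx\right]$
5. $f(t) = h(t) \exp[-H(t)]$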
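The standard identities linking the survival, density, and hazard functions can be checked numerically. The sketch below uses a Weibull distribution purely as an illustrative assumption (the notes do not name a distribution); `SHAPE`, `SCALE`, and the evaluation point are arbitrary choices.

```python
import math

SHAPE, SCALE = 1.5, 10.0  # illustrative Weibull parameters (assumed)


def S(t):
    """Survival function of the Weibull distribution."""
    return math.exp(-((t / SCALE) ** SHAPE))


def f(t):
    """Probability density function of the Weibull distribution."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1) * math.exp(-((t / SCALE) ** SHAPE))


def h(t):
    """Hazard function of the Weibull distribution."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1)


def H(t):
    """Cumulative hazard function of the Weibull distribution."""
    return (t / SCALE) ** SHAPE


def derivative(g, t, eps=1e-6):
    """Central-difference approximation of g'(t)."""
    return (g(t + eps) - g(t - eps)) / (2 * eps)


t0 = 4.0
checks = {
    "h = f / S": math.isclose(h(t0), f(t0) / S(t0)),
    "f = -S'": math.isclose(f(t0), -derivative(S, t0), rel_tol=1e-6),
    "h = -(log S)'": math.isclose(
        h(t0), -derivative(lambda x: math.log(S(x)), t0), rel_tol=1e-6
    ),
    "S = exp(-H)": math.isclose(S(t0), math.exp(-H(t0))),
    "f = h * exp(-H)": math.isclose(f(t0), h(t0) * math.exp(-H(t0))),
}
for name, ok in checks.items():
    print(name, ok)
```

The looser tolerance on the two derivative checks only reflects the finite-difference approximation.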
4.Cox Proportional Hazards Model
4.1.Background
In practice, the exact form of the underlying survival distribution is usually unknown and we may not be able to find an appropriate model. Therefore, the use of parametric methods in identifying significant prognostic factors is somewhat limited.
The Cox (1972) proportional hazards model does not require knowledge of the underlying distribution. The hazard function in this model can take on any form, including that of a step function, but the hazard functions of different individuals are assumed to be proportional, with a ratio that is independent of time. The usual likelihood function is replaced by the partial likelihood function. The important fact is that the statistical inference based on the partial likelihood function is similar to that based on the likelihood function.
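4.2.Method
The proportionality assumption implies that the hazard function given a set of covariates $x = (x_1, \ldots, x_p)$ can be written as a function of an underlying hazard function and a function, say $g(x)$, of only the covariates, that is,
$$h(t \mid x) = h_0(t)\, g(x)$$
The underlying hazard function, $h_0(t)$, represents how the risk changes with time, and $g(x)$ represents the effect of covariates. $h_0(t)$ can be interpreted as the hazard function when all covariates are ignored or when $g(x) = 1$, and is also called the baseline hazard function. The hazard ratio of two individuals with different covariates $x$ and $x^*$ is
$$\frac{h(t \mid x)}{h(t \mid x^*)} = \frac{h_0(t)\, g(x)}{h_0(t)\, g(x^*)} = \frac{g(x)}{g(x^*)}$$
which is a constant, independent of time.
The Cox (1972) proportional hazards model assumes that $g(x)$ is an exponential function of the covariates, that is,
$$g(x) = \exp\left(\sum_{j=1}^{p} b_j x_j\right) = \exp(b'x)$$
The hazard function is:
$$h(t \mid x) = h_0(t) \exp\left(\sum_{j=1}^{p} b_j x_j\right)$$
where $b = (b_1, \ldots, b_p)$ denotes the coefficients of covariates. These coefficients can be estimated from the data observed and indicate the magnitude of the effects of their corresponding covariates.
For example, suppose $p = 1$ and $x_1$ is a treatment indicator, with $x_1 = 1$ for the experimental drug and $x_1 = 0$ for placebo. The hazard ratio of the patient receiving the experimental drug and the one receiving placebo is
$$\frac{h(t \mid x_1 = 1)}{h(t \mid x_1 = 0)} = \exp(b_1)$$
Thus, the two treatments are equally effective if $b_1 = 0$, and the experimental drug introduces lower (higher) risk for survival than placebo if $b_1 < 0$ ($b_1 > 0$).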
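A small sketch of the hazard ratio under the Cox model. Because the baseline hazard cancels, the ratio depends only on the coefficients and the difference in covariates. The coefficient values and the covariate layout (treatment indicator, age) are hypothetical, not estimates from any data.

```python
import math


def hazard_ratio(b, x_a, x_b):
    """Hazard ratio h(t | x_a) / h(t | x_b) under the Cox model.

    The baseline hazard cancels, leaving exp(sum_j b_j * (x_a_j - x_b_j)).
    """
    return math.exp(sum(bj * (u - v) for bj, u, v in zip(b, x_a, x_b)))


# Hypothetical coefficients for (treatment indicator, age in years).
b = [-0.5, 0.03]
treated = [1, 60]   # experimental drug, age 60
placebo = [0, 60]   # placebo, age 60

hr = hazard_ratio(b, treated, placebo)
print(hr)  # equals exp(b[0]) because only the treatment indicator differs
```

With a negative treatment coefficient the ratio is below 1, which corresponds to the lower-risk case described above.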
5.Tasks
5.1.Clustering
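The dataset has fewer than 500 samples, and its dimensionality is roughly 20 or fewer.
The goal is to find features that groups of samples have in common and to derive labels from them, which then provide a basis for classifying new samples (that is, for obtaining a score).
Because the number of classes is unknown, I think a hierarchical clustering algorithm is appropriate here; the final number of clusters should be around 10 to 20. Dimensionality reduction can be done through correlation analysis of the variables. Radar charts can be used to visualize and score the variables of each sample.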
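A minimal sketch of agglomerative (hierarchical) clustering with average linkage and Euclidean distance, written with the standard library only. The linkage choice, the function name `agglomerative`, and the toy points are assumptions. For a real dataset a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would likely be more practical.

```python
import math


def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per point and repeatedly merges the two
    closest clusters until `n_clusters` remain. Returns lists of indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge q into p
        del clusters[q]
    return clusters


# Toy two-dimensional points (hypothetical).
points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
clusters = agglomerative(points, n_clusters=3)
print(clusters)
```

Stopping at a chosen number of clusters is only one option. Cutting the merge sequence at a distance threshold is another way to handle an unknown number of classes.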
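Some open questions: how should the data be preprocessed (standardization, etc.)? Without preprocessing, if we cluster with, say, K-Means using Euclidean distance, the result tends to collapse into a clustering driven by one or a few variables. We therefore need a way to adjust the variable weights, or a different distance measure; otherwise the clustering result will not reflect reality. What is certain is that the survival-time variable should be the main factor influencing the clustering result.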
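One possible preprocessing step, sketched under assumptions: z-score standardization so that no variable dominates the Euclidean distance merely because of its scale, plus an optional per-variable weight. The function names, the two-column toy data (survival months and a marker on a 0 to 1 scale), and the weights are all hypothetical.

```python
import math
import statistics


def standardize(rows):
    """Z-score each column (population standard deviation)."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]
        for row in rows
    ]


def weighted_euclidean(a, b, weights):
    """Euclidean distance with a nonnegative weight per variable."""
    return math.sqrt(sum(w * (u - v) ** 2 for w, u, v in zip(weights, a, b)))


# Hypothetical samples: (survival time in months, marker on a 0-1 scale).
rows = [[12.0, 0.2], [14.0, 0.9], [60.0, 0.3]]
z = standardize(rows)

# Raw distances are driven almost entirely by the first column.
print(math.dist(rows[0], rows[1]), math.dist(rows[0], rows[2]))
# After standardization both columns contribute on a comparable scale.
print(math.dist(z[0], z[1]), math.dist(z[0], z[2]))
# A larger weight on survival time makes it the main factor again, by choice.
print(weighted_euclidean(z[0], z[2], weights=[3.0, 1.0]))
```

Standardizing first and then weighting makes the emphasis on survival time an explicit decision rather than an accident of units.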
How can the survival data be put to good use? The survival data contain censored observations; how should these be analyzed?
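One widely used option for censored data, which these notes do not cover, is the Kaplan-Meier product-limit estimator of the survival function. The sketch below is offered as a possible starting point and is not a method prescribed by the source. The `(time, event)` record layout and the data are hypothetical.

```python
def kaplan_meier(records):
    """Product-limit estimate of S(t) at each distinct death time.

    records: list of (time, event) pairs, where event is True for an
    observed death and False for a right-censored observation.
    Returns a list of (time, survival_estimate) pairs.
    """
    death_times = sorted({t for t, event in records if event})
    survival = 1.0
    curve = []
    for t in death_times:
        at_risk = sum(1 for x, _ in records if x >= t)  # still under observation
        deaths = sum(1 for x, event in records if x == t and event)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical right-censored data in months.
records_km = [(3, True), (5, False), (6, True), (6, True), (8, False), (10, True)]
curve = kaplan_meier(records_km)
for t, s in curve:
    print(t, round(s, 4))
```

Censored patients count in the at-risk set up to their censoring time, so they contribute information without being treated as deaths.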
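5.2.Regression
The goal of regression is to find which factors the survival time is related to, or in other words, which factors the score is related to.
The candidate models are a Bayesian model and the Cox model.
Main questions: which regression approach should be used? Which measures should be used to judge how good the model is?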
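One commonly used measure of fit for survival models is the concordance index (C-index), which asks how often the model ranks the patient who fails earlier as the one at higher risk. It is offered here only as a candidate answer to the question above, not as the source's choice. The function name, the tie handling, and the data are assumptions.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs ranked correctly by the risk scores.

    A pair (i, j) is comparable when patient i has an observed event
    strictly before patient j's recorded time. Tied scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical data: higher risk score should mean shorter survival.
times_c = [2, 4, 6, 8]
events_c = [True, True, False, True]
good_scores = [0.9, 0.7, 0.5, 0.1]
bad_scores = [0.1, 0.5, 0.7, 0.9]

print(concordance_index(times_c, events_c, good_scores))
print(concordance_index(times_c, events_c, bad_scores))
```

A value near 1 indicates good ranking, 0.5 is what random scores would give on average, and values near 0 indicate a reversed ranking.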
References:
[1] Lee, Elisa T., and John Wang. Statistical methods for survival data analysis. Vol. 476. John Wiley & Sons, 2003.