Some Notes on the Cox Model

1.CENSORED DATA

Many researchers consider survival data analysis to be merely the application of two conventional statistical methods to a special type of problem: parametric if the distribution of survival times is known to be normal and nonparametric if the distribution is unknown.

This assumption would be true if the survival times of all the subjects were exact and known; however, some survival times are not.

Further, the survival distribution is often skewed, or far from being normal. Thus there is a need for new statistical techniques.

2.Three Types of Censoring

2.1.Type I Censoring

Scenario: In animal studies, time and/or cost limitations often mean the researcher cannot wait for all the animals to die. One option is to observe for a fixed period of time, say six months, after which the surviving animals are sacrificed.

Definition:
Survival times recorded for the animals that died during the study period are the times from the start of the experiment to their death. These are called exact or uncensored observations.

The survival times of the sacrificed animals are not known exactly but are recorded as at least the length of the study period. These are called censored observations. Some animals could be lost or die accidentally. Their survival times, from the start of the experiment to loss or death, are also censored observations.

In type I censoring, if there are no accidental losses, all censored observations equal the length of the study period.

2.2.Type II Censoring

Scenario: Another option in animal studies is to wait until a fixed portion of the animals have died, say 80 of 100, after which the surviving animals are sacrificed.

Definition:
In type II censoring, if there are no accidental losses, all censored observations equal the largest uncensored observation.

2.3.Type III Censoring

Scenario: In most clinical and epidemiologic studies the period of study is fixed and patients enter the study at different times during that period. Some may die before the end of the study; their exact survival times are known. Others may withdraw before the end of the study and are lost to follow-up. Still others may be alive at the end of the study.

Definition:
For "lost" patients, survival times are at least from their entrance to the last contact. For patients still alive, survival times are at least from entry to the end of the study. The latter two kinds of observations are censored observations. Since the entry times are not simultaneous, the censored times are also different. This is type III censoring.

2.4.Additional Notes

Type I and type II censored observations are also called singly censored data, and type III, progressively censored data, by Cohen (1965). Another commonly used name for type III censoring is random censoring. All of these types of censoring are right censoring or censoring to the right. There are also left censoring and interval censoring cases. Left censoring occurs when it is known that the event of interest occurred prior to a certain time t, but the exact time of occurrence is unknown.

Interval censoring occurs when the event of interest is known to have occurred between times a and b.
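
The three right-censoring schemes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lifetimes, entry times, and cutoffs are made-up values, the function names are hypothetical, and accidental losses and withdrawals are left out. Each observation is stored as a pair (observed time, event flag), where the flag is True for an exact observation and False for a censored one.

```python
def type_i(lifetimes, study_length):
    """Type I: stop at a fixed time; survivors are censored at study_length."""
    return [(min(t, study_length), t <= study_length) for t in lifetimes]

def type_ii(lifetimes, n_deaths):
    """Type II: stop at the n_deaths-th death; survivors are censored at that time."""
    cutoff = sorted(lifetimes)[n_deaths - 1]
    return [(min(t, cutoff), t <= cutoff) for t in lifetimes]

def type_iii(lifetimes, entry_times, study_end):
    """Type III: staggered entry; each subject is followed from entry to study_end."""
    observations = []
    for t, entry in zip(lifetimes, entry_times):
        follow_up = study_end - entry  # longest time this subject can be observed
        observations.append((min(t, follow_up), t <= follow_up))
    return observations

# Hypothetical true lifetimes (months) and entry times; not data from the source.
lifetimes = [1.2, 7.5, 3.1, 9.0, 0.4, 6.6, 2.8, 11.3, 4.9, 5.5]
entries = [0, 1, 2, 3, 4, 5, 0.5, 1.5, 2.5, 3.5]

obs_i = type_i(lifetimes, study_length=6.0)
obs_ii = type_ii(lifetimes, n_deaths=8)
obs_iii = type_iii(lifetimes, entries, study_end=8.0)

for name, obs in (("I", obs_i), ("II", obs_ii), ("III", obs_iii)):
    censored = sorted(t for t, event in obs if not event)
    print(f"type {name}: censored times = {censored}")
```

With these inputs, every censored time under type I equals the study length, every censored time under type II equals the largest uncensored observation, and the censored times under type III differ from subject to subject.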

3.Three Main Functions

3.1.Survival Function

Definition:

The survival function S(t) is defined as the probability that an individual survives longer than t:

\begin{aligned} S(t) &=P(\text { an individual survives longer than } t) =P(T>t) \end{aligned}

For cumulative distribution function F(t) of T,

\begin{aligned} S(t) &=1-P(\text { an individual fails before } t)=1-F(t) \end{aligned}

Estimation:

In practice, if there are no censored observations, the survivorship function is estimated as the proportion of patients surviving longer than t :

\hat{S}(t)=\frac{\text { number of patients surviving longer than } t}{\text { total number of patients }}

where the circumflex denotes an estimate of the function. When censored observations are present, the numerator cannot always be determined.
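
As a rough sketch, the proportion estimate can be coded directly. The second function is a hand-written Kaplan-Meier (product-limit) estimate, a standard technique that these notes do not cover; it is included only to show one common way of handling censored observations. The data and function names are hypothetical.

```python
def empirical_survival(times, t):
    """Proportion of subjects surviving longer than t (valid only without censoring)."""
    return sum(1 for x in times if x > t) / len(times)

def kaplan_meier(times, events):
    """Product-limit estimate: returns [(death_time, S_hat)] at each distinct death time."""
    death_times = sorted({t for t, e in zip(times, events) if e})
    s_hat, curve = 1.0, []
    for d in death_times:
        at_risk = sum(1 for t in times if t >= d)  # still under observation at d
        deaths = sum(1 for t, e in zip(times, events) if e and t == d)
        s_hat *= 1 - deaths / at_risk
        curve.append((d, s_hat))
    return curve

times = [2, 3, 3, 5, 8]                         # hypothetical survival times
all_died = [True] * 5                           # no censoring
some_censored = [True, True, False, True, False]

s_at_3 = empirical_survival(times, 3)           # 2 of 5 subjects survive longer than 3
km_full = kaplan_meier(times, all_died)         # agrees with the proportion estimate
km_cens = kaplan_meier(times, some_censored)    # censored subjects leave the risk set
print(s_at_3, km_full, km_cens)
```

Without censoring the two estimates agree at every death time; with censoring only the product-limit form remains well defined, because it needs just the number at risk and the number of deaths at each death time.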

3.2.Density Function

Definition:

The survival time T has a probability density function defined as the limit of the probability that an individual fails in the short interval t to t+\Delta t per unit width \Delta t, or simply the probability of failure in a small interval per unit time.

f(t)=\lim _{\Delta t \rightarrow 0} \frac{P[\text { an individual dying in the interval }(t, t+\Delta t)]}{\Delta t}

Properties:

  1. f (t) is a nonnegative function.

  2. The area between the density curve and the t axis is equal to 1.

Estimation:

In practice, if there are no censored observations, the probability density function f (t) is estimated as the proportion of patients dying in an interval per unit width.

\hat{f}(t)=\frac{\text { number of patients dying in the interval beginning at time } t}{(\text { total number of patients }) \times(\text { interval width })}

When censored observations are present, this estimate is not applicable. The density function is also known as the unconditional failure rate.
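
A minimal sketch of this interval estimate, assuming no censoring; the death times and the function name are hypothetical. It also checks the second property above: the estimated density times the interval width sums to 1 over intervals that cover all deaths.

```python
def density_estimate(death_times, start, width):
    """Deaths in [start, start + width) divided by (total patients * width)."""
    deaths = sum(1 for t in death_times if start <= t < start + width)
    return deaths / (len(death_times) * width)

death_times = [1, 2, 2, 3, 4, 6, 7, 7, 8, 9]   # hypothetical, no censoring
width = 2

f_hat_2 = density_estimate(death_times, start=2, width=width)  # 3 deaths / (10 * 2)
area = sum(density_estimate(death_times, s, width) * width for s in range(0, 10, width))
print(f_hat_2, area)
```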

3.3.Hazard Function

Definition:

The hazard function h(t) of survival time T gives the conditional failure rate. This is defined as the probability of failure during a very small time interval per unit time, assuming that the individual has survived to the beginning of the interval, or as the limit of the probability that an individual fails in a very short interval, t to t+\Delta t, per unit width \Delta t, given that the individual has survived to time t:

h(t)=\lim _{\Delta t \rightarrow 0} \frac{P\left[\text { an individual fails in the time interval }(t, t+\Delta t) \mid \text { the individual has survived to } t\right]}{\Delta t}

The hazard function can also be defined in terms of the cumulative distribution function F(t) and the probability density function f (t):

h(t)=\frac{f(t)}{1-F(t)}

The hazard function is also known as the instantaneous failure rate, force of mortality, conditional mortality rate, and age-specific failure rate.

It gives the risk of failure per unit time during the aging process.

The cumulative hazard function is defined as:

H(t)=\int_{0}^{t} h(x) d x

Estimation:

In practice, when there are no censored observations the hazard function is estimated as the proportion of patients dying in an interval per unit time, given that they have survived to the beginning of the interval:

\hat{h}(t)=\frac{\text { number of patients dying in the interval beginning at time } t}{(\text { number of patients surviving at } t) \times(\text { interval width })}

=\frac{\text { number of patients dying per unit time in the interval }}{\text { number of patients surviving at } t}

Actuaries usually use the average hazard rate of the interval in which the number of patients dying per unit time in the interval is divided by the average number of survivors at the midpoint of the interval:

\hat{h}(t)=\frac{\text { number of patients dying per unit time in the interval }}{(\text { number of patients surviving at } t)-(\text { number of deaths in the interval }) / 2}
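
Both hazard estimates can be sketched with one small function. This assumes no censoring and uses made-up death times; the function name and the `actuarial` flag are hypothetical.

```python
def hazard_estimate(death_times, start, width, actuarial=False):
    """Deaths per unit time in [start, start + width) per patient surviving at start.

    With actuarial=True the denominator is reduced by half the deaths in the
    interval, approximating the number of survivors at the interval midpoint.
    """
    at_risk = sum(1 for t in death_times if t >= start)
    deaths = sum(1 for t in death_times if start <= t < start + width)
    denominator = at_risk - deaths / 2 if actuarial else at_risk
    return (deaths / width) / denominator

death_times = [1, 2, 2, 3, 4, 6, 7, 7, 8, 9]   # hypothetical, no censoring

h_plain = hazard_estimate(death_times, start=2, width=2)                  # (3/2) / 9
h_actuarial = hazard_estimate(death_times, start=2, width=2, actuarial=True)  # (3/2) / 7.5
print(h_plain, h_actuarial)
```

The actuarial version is always at least as large as the plain one, because it divides by fewer survivors.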

3.4.Relationships Among the Three Functions

1.h(t),f(t),S(t):
h(t)=\frac{f(t)}{S(t)}

2.f(t),S(t):
f(t)=\frac{d}{d t}[1-S(t)]=-S^{\prime}(t)

3.h(t),S(t):
h(t)=-\frac{S^{\prime}(t)}{S(t)}=-\frac{d}{d t} \log S(t)

4.H(t),h(t),S(t):
-\int_{0}^{t} h(x) d x=\log S(t)
H(t)=-\log S(t)
S(t)=\exp [-H(t)]=\exp \left[-\int_{0}^{t} h(x) d x\right]

5.f(t),h(t),H(t):
f(t)=h(t) \exp [-H(t)]
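
These identities can be checked numerically. The sketch below assumes a Weibull survival time with S(t)=\exp[-(\lambda t)^{\gamma}] purely as a convenient example (the notes do not single out any distribution); the parameter values are arbitrary.

```python
import math

LAM, GAMMA = 0.5, 1.5   # hypothetical Weibull scale and shape

def S(t):
    return math.exp(-(LAM * t) ** GAMMA)

def h(t):
    return LAM * GAMMA * (LAM * t) ** (GAMMA - 1)

def f(t):
    return h(t) * S(t)   # relationship 1: h(t) = f(t) / S(t)

def H_numeric(t, steps=10000):
    """Midpoint-rule approximation of the integral of h from 0 to t."""
    dx = t / steps
    return sum(h((i + 0.5) * dx) for i in range(steps)) * dx

t, eps = 2.0, 1e-6
dS_dt = (S(t + eps) - S(t - eps)) / (2 * eps)   # central-difference S'(t)
H_t = H_numeric(t)

print("f(t) vs -S'(t):        ", f(t), -dS_dt)                   # relationship 2
print("h(t) vs -S'(t)/S(t):   ", h(t), -dS_dt / S(t))            # relationship 3
print("H(t) vs -log S(t):     ", H_t, -math.log(S(t)))           # relationship 4
print("f(t) vs h(t)exp(-H(t)):", f(t), h(t) * math.exp(-H_t))    # relationship 5
```

Each printed pair should agree up to the error of the finite difference and the numerical integral.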

4.Cox Proportional Hazards Model

4.1.Background

Parametric methods require the survival distribution to be specified. In practice, however, the exact form of the underlying survival distribution is usually unknown and we may not be able to find an appropriate model. Therefore, the use of parametric methods in identifying significant prognostic factors is somewhat limited.

The Cox (1972) proportional hazards model does not require knowledge of the underlying distribution. The hazard function in this model can take on any form, including that of a step function, but the hazard functions of different individuals are assumed to be proportional, with a ratio that does not depend on time. The usual likelihood function is replaced by the partial likelihood function. The important fact is that statistical inference based on the partial likelihood function is similar to that based on the likelihood function.

4.2.Method

The proportionality assumption implies that the hazard function given a set of covariates \mathbf{x}=\left(x_{1}, x_{2}, \ldots, x_{p}\right)^{\prime} can be written as the product of an underlying hazard function and a function, say g\left(x_{1}, \ldots, x_{p}\right), of only the covariates, that is,

h\left(t | x_{1}, \ldots, x_{p}\right)=h_{0}(t) g\left(x_{1}, \ldots, x_{p}\right) \quad \text { or } \quad h(t | \mathbf{x})=h_{0}(t) g(\mathbf{x})

The underlying hazard function, h_{0}(t), represents how the risk changes with time, and g(\mathbf{x}) represents the effect of covariates. h_{0}(t) can be interpreted as the hazard function when all covariates are ignored or when g(\mathbf{x})=1, and is also called the baseline hazard function. The hazard ratio of two individuals with different covariates \mathbf{x}_{1} and \mathbf{x}_{2} is

\frac{h\left(t | \mathbf{x}_{1}\right)}{h\left(t | \mathbf{x}_{2}\right)}=\frac{h_{0}(t) g\left(\mathbf{x}_{1}\right)}{h_{0}(t) g\left(\mathbf{x}_{2}\right)}=\frac{g\left(\mathbf{x}_{1}\right)}{g\left(\mathbf{x}_{2}\right)}

which is a constant, independent of time.

The Cox (1972) proportional hazards model assumes that g(\mathbf{x}) is an exponential function of the covariates, that is,

g(\mathbf{x})=\exp \left(\sum_{j=1}^{p} b_{j} x_{j}\right)=\exp \left(\mathbf{b}^{\prime} \mathbf{x}\right)

The hazard function is:

h(t | \mathbf{x})=h_{0}(t) \exp \left(\sum_{j=1}^{p} b_{j} x_{j}\right)=h_{0}(t) \exp \left(\mathbf{b}^{\prime} \mathbf{x}\right)

where \mathbf{b}=\left(b_{1}, \ldots, b_{p}\right)^{\prime} denotes the coefficients of the covariates. These coefficients can be estimated from the observed data and indicate the magnitude of the effects of their corresponding covariates.
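
To make the estimation step concrete, here is a minimal sketch that maximizes the partial log-likelihood mentioned in 4.1 by Newton-Raphson. It is a simplification under loud assumptions: a single covariate, no tied event times, made-up data, and a hypothetical function name. It is not a substitute for a tested library routine.

```python
import math

def cox_fit_1d(times, events, x, iters=25):
    """Newton-Raphson on the Cox partial log-likelihood for one covariate.

    Assumes no tied event times. Each observed event contributes
    b * x_i - log(sum of exp(b * x_j) over subjects still at risk).
    """
    b = 0.0
    for _ in range(iters):
        grad, hess = 0.0, 0.0
        for i, (t_i, e_i) in enumerate(zip(times, events)):
            if not e_i:
                continue  # censored subjects appear only inside risk sets
            risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
            w = [math.exp(b * x[j]) for j in risk]
            w_sum = sum(w)
            mean = sum(wj * x[j] for wj, j in zip(w, risk)) / w_sum
            mean_sq = sum(wj * x[j] ** 2 for wj, j in zip(w, risk)) / w_sum
            grad += x[i] - mean              # first derivative of log partial likelihood
            hess -= mean_sq - mean ** 2      # second derivative (non-positive)
        if hess == 0:
            break
        b -= grad / hess
    return b

# Hypothetical data: x = 1 for treated, 0 for control; event False means censored.
times = [6, 9, 12, 15, 2, 4, 7, 10]
events = [True, True, False, True, True, True, True, False]
x = [1, 1, 1, 1, 0, 0, 0, 0]

b_hat = cox_fit_1d(times, events, x)
print("b_hat =", b_hat, "hazard ratio =", math.exp(b_hat))
```

Note that the baseline hazard h_{0}(t) never appears in the code: it cancels out of every term of the partial likelihood, which is why no distributional assumption is needed to estimate b.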

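For example, suppose x_1 is a treatment indicator, with x_1=1 for a patient receiving the experimental drug and x_1=0 for one receiving placebo, and all other covariates are equal. The hazard ratio of the two patients is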

\frac{h\left(t | x_{1}=1\right)}{h\left(t | x_{1}=0\right)}=\exp \left(b_{1}\right)

Thus, the two treatments are equally effective if b_1=0, and the experimental drug carries a lower (higher) risk of failure than placebo if b_1<0 (b_1>0).
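
A short numerical illustration of this ratio, with everything in it assumed for the example: the coefficients, the second covariate (age), and the linear baseline hazard are hypothetical. The point is only that the ratio equals \exp(b_1) at every time t, whatever baseline is used.

```python
import math

def relative_risk(b, x):
    """g(x) = exp(b' x)."""
    return math.exp(sum(bj * xj for bj, xj in zip(b, x)))

def hazard(t, x, b, baseline):
    """h(t | x) = h0(t) * exp(b' x)."""
    return baseline(t) * relative_risk(b, x)

def baseline(t):
    return 0.1 * t            # hypothetical baseline hazard h0(t)

b = [-0.5, 0.03]              # hypothetical coefficients: treatment, age
treated, placebo = [1, 60], [0, 60]   # same age, different treatment

ratios = [hazard(t, treated, b, baseline) / hazard(t, placebo, b, baseline)
          for t in (1.0, 5.0, 10.0)]
print(ratios, math.exp(b[0]))
```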

5.Tasks

5.1.Clustering

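The dataset has fewer than 500 samples, and its dimensionality is roughly 20 or less.

The goal is to find characteristics shared by groups of samples and derive labels from them, providing a basis for classifying new samples (that is, obtaining a score).

Because the number of classes is unknown, I think hierarchical clustering is a suitable choice here; the final number of clusters should be around 10 to 20. Correlation analysis of the variables can be used for dimensionality reduction. Radar charts can be used to visualize and score the variables of each sample.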

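Some open questions: how should the data be preprocessed (standardization and so on)? Without preprocessing, if we cluster with, say, K-Means and Euclidean distance, the result tends to collapse into clusters driven by one or a few variables. We therefore need some way to adjust the weights of the variables, or a different distance; otherwise the clusters will not reflect reality. One thing is certain: the survival-time variable should be the main factor influencing the clustering result.
How can the survival data be put to good use? The survival data contain censored observations; how should these be analyzed?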
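
The scale problem described above can be seen in a toy example. The sketch below uses z-score standardization, which is only one candidate for the preprocessing question raised here, not a settled choice; the four samples and the two variables are made up.

```python
import math

def zscore_columns(rows):
    """Standardize each column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def euclid(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical samples: (survival time in days, biomarker on a 0-1 scale).
rows = [[100, 0.10], [110, 0.90], [400, 0.12], [420, 0.88]]
scaled = zscore_columns(rows)

# Sample 0 vs 1 differ mainly in the biomarker; sample 0 vs 2 mainly in survival time.
raw_ratio = euclid(rows[0], rows[1]) / euclid(rows[0], rows[2])
scaled_ratio = euclid(scaled[0], scaled[1]) / euclid(scaled[0], scaled[2])
print(raw_ratio, scaled_ratio)
```

On the raw scale the biomarker difference is almost invisible next to the survival-time difference; after standardization the two differences are of similar size. If survival time should dominate, as suggested above, that would then have to be imposed deliberately through variable weights rather than left to the units of measurement.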

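5.2.Regression

The goal of regression is to find which factors survival time depends on, or equivalently which factors the score depends on.

The candidate models are a Bayes model and the Cox model.

Main questions: which regression approach should be used? Which metrics should be used to judge how good a model is?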
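
One metric often used for survival models is the concordance index (Harrell's C), which measures how often the model ranks a pair of patients in the right order while respecting censoring. The notes do not commit to any metric, so the sketch below is just one candidate; the data and the function name are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the higher risk score fails first.

    A pair (i, j) is comparable when subject i has an observed event and
    times[i] < times[j]. Tied risk scores count as half concordant.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier failure of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

times = [2, 4, 6, 8]                 # hypothetical follow-up times
events = [True, True, False, True]   # third subject is censored
risk = [0.9, 0.4, 0.5, 0.1]          # hypothetical model risk scores, e.g. b'x

c_index = concordance_index(times, events, risk)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.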


References

[1]Lee, Elisa T., and John Wang. Statistical methods for survival data analysis. Vol. 476. John Wiley & Sons, 2003.

[2]Kurtz, David M., et al. "Dynamic Risk Profiling Using Serial Tumor Biomarkers for Personalized Outcome Prediction." Cell (2019).
