This article is based on Chapter 7 of the book Advances in Financial Machine Learning.
A Quick Recap of CV
CV splits observations drawn from an IID process into two sets: the training set and the testing set. Each observation in the complete dataset belongs to one, and only one, set. This is done so as to prevent leakage from one set into the other, since that would defeat the purpose of testing on unseen data.
k-fold CV (k=5)
- The dataset is partitioned into k subsets
- For i = 1, …, k:
  - The ML algorithm is trained on all subsets excluding i
  - The fitted ML algorithm is tested on i
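For reference, a minimal sketch of this procedure using scikit-learn's KFold splitter; X and y are placeholder feature/label arrays, and the random forest merely stands in for any ML algorithm:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
# X, y are placeholder numpy arrays of features and labels.
kf = KFold(n_splits=5, shuffle=False)
scores = []
for train_idx, test_idx in kf.split(X):
    clf = RandomForestClassifier()
    clf.fit(X[train_idx], y[train_idx])                  # train on all subsets excluding i
    scores.append(clf.score(X[test_idx], y[test_idx]))   # test on subset i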
In finance, CV is typically used in two settings: model development (like hyperparameter tuning) and backtesting. Backtesting is a complex subject that we will discuss thoroughly in Chapters 10-16. In this chapter, we will focus on CV for model development.
Why K-fold CV Fails In Finance
Leakage takes place when the training set contains information that also appears in the testing set.
One reason k-fold CV fails in finance is that observations cannot be assumed to be drawn from an IID (independent and identically distributed) process:
- Because of the serial correlation, Xt ≈ Xt+1
- Because labels are derived from overlapping datapoints, Yt ≈ Yt+1
By placing t and t+1 in different sets, information is leaked.
When a classifier is first trained on (Xt, Yt), and then it is asked to predict E[Yt+1|Xt+1] based on an observed Xt+1, this classifier is more likely to achieve Yt+1 = E[Yt+1|Xt+1] even if X is an irrelevant feature.
A second reason is that the testing set is used multiple times in the process of developing a model, leading to multiple testing and selection bias.
To summarize: cross-validation requires that the testing data remain unseen, so that it genuinely evaluates the model fitted on the training set. Once information is shared between the training and testing sets, "data leakage" has occurred.
When cross-validation is applied to financial data, leakage arises because serial correlation in the time series makes Xt ≈ Xt+1, and because labels are usually computed from overlapping data points (for example, rolling averages), so Yt ≈ Yt+1 as well. The mappings Xt → Yt and Xt+1 → Yt+1 are therefore similar, and placing them in different sets means the testing set effectively uses information from the training set.
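A toy illustration of the label overlap (the random walk below is made up): when labels are forward returns over overlapping windows, adjacent labels share most of their underlying bars and are strongly serially correlated.
import numpy as np
import pandas as pd
np.random.seed(0)
logp = pd.Series(np.cumsum(0.01 * np.random.randn(1000)))  # hypothetical log-price path
y = logp.diff(5).shift(-5)     # forward 5-bar return; adjacent windows share 4 of their 5 bars
print(y.autocorr(lag=1))       # roughly 0.8, i.e. Yt ≈ Yt+1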
How to Reduce Leakage
- Drop from the training set any observation i where Yi is a function of information used to determine Yj, and j belongs to the testing set
- Avoid overfitting the classifier. In this way, even if some leakage occurs, the classifier will not be able to profit from it. Use:
  - Early stopping of the base estimators
  - Bagging of classifiers, while controlling for oversampling on redundant examples, so that the individual classifiers are as diverse as possible:
    - Set max_samples to the average uniqueness
    - Apply sequential bootstrap
To avoid the problems above, there are generally two remedies.
First, remove from the training set any information that also appears in the testing set.
Second, prevent the model from overfitting: if the model does not overfit, it cannot exploit whatever leakage remains (a sketch of the bagging option listed above follows).
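For the bagging remedy, a minimal sketch assuming avg_uniqueness has already been computed as the average label uniqueness from Chapter 4 (a float in (0, 1]); X and y are placeholders:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# avg_uniqueness is assumed to come from the Chapter 4 computation (fraction of samples per estimator).
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=avg_uniqueness,  # limit oversampling of redundant (overlapping) examples
    bootstrap=True,
)
bag.fit(X, y)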
Purged K-Fold Cross-Validation
- Purging: one way to reduce leakage is to purge from the training set all observations whose labels overlapped in time with those labels included in the testing set
- Embargo: since financial features often incorporate series that exhibit serial correlation (like ARMA processes), we should eliminate from the training set observations that immediately follow an observation in the testing set
For the first remedy (removing information), the author proposes a solution: Purged K-Fold Cross-Validation. This method involves two operations: purging and embargo.
Both operations remove data, but each targets leakage arising in a different situation.
Purging
Consider a testing observation whose label Yj is decided based on the information set Φj. In order to prevent the type of leakage described in the previous section, we would like to purge from the training set any observation whose label Yi is decided based on an information set Φi such that Φi ∩ Φj ≠ ∅.
For example, consider a label Yj that is a function of observations in the closed range t ∈ [tj,0, tj,1], Yj = f[[tj,0, tj,1]]. A label Yi = f[[ti,0, ti,1]] overlaps with Yj if any of the three sufficient conditions is met:
- tj,0 ≤ ti,0 ≤ tj,1
- tj,0 ≤ ti,1 ≤ tj,1
- ti,0 ≤ tj,0 ≤ tj,1 ≤ ti,1
Suppose neighbouring labels Yi and Yj are computed from overlapping data points, where Yj spans the time interval from tj,0 to tj,1 and Yi spans ti,0 to ti,1. The three conditions above enumerate the possible overlaps, and we need to purge from the training set the observations that leak information in any of these cases:
def getTrainTimes(t1, testTimes):
    """
    Given testTimes, find the times of the training observations.
    - t1.index: Time when the observation started.
    - t1.value: Time when the observation ended.
    - testTimes: Times of testing observations.
    """
    trn = t1.copy(deep=True)
    for i, j in testTimes.items():  # i: test start, j: test end
        df0 = trn[(i <= trn.index) & (trn.index <= j)].index  # train starts within test
        df1 = trn[(i <= trn) & (trn <= j)].index              # train ends within test
        df2 = trn[(trn.index <= i) & (j <= trn)].index        # train envelops test
        trn = trn.drop(df0.union(df1).union(df2))
    return trn
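A toy usage sketch (the dates below are made up): t1 maps each observation's start time to the time its label is determined, and testTimes does the same for the testing observations.
import pandas as pd
idx = pd.date_range('2020-01-01', periods=10, freq='D')
t1 = pd.Series(idx + pd.Timedelta(days=2), index=idx)     # each label spans three days
testTimes = pd.Series([t1.iloc[5]], index=[t1.index[5]])  # a single testing observation
print(getTrainTimes(t1, testTimes))                       # overlapping training observations are purged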
When leakage is present, the model's measured performance improves simply by increasing k toward T, where T is the number of observations: the larger k is, the more overlapping observations end up in the training set. We can therefore increase k and watch whether performance keeps improving, as a way to detect leakage.
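A hedged sketch of that check with scikit-learn (clf, X, y are placeholders): if the mean score keeps climbing as k grows, leakage should be suspected.
from sklearn.model_selection import KFold, cross_val_score
# clf, X, y are placeholders for the estimator and data used above.
for k in (5, 10, 20, 50):
    cv = KFold(n_splits=k, shuffle=False)
    print(k, cross_val_score(clf, X, y, cv=cv).mean())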
Embargo
Because the time series is serially correlated, the earliest training observations that follow the testing set may still use information from the testing set, which is supposed to remain unseen. We therefore also remove the training observations that immediately follow the testing set.
We do not need to worry about whether the earliest testing observations use information from the training set, because the training information is available to us in any case.
In general we set the embargo period to h ≈ 0.01T, where T is the number of bars. The exact value is the practitioner's choice; one way to assess whether a given h is sufficient is to increase k and observe whether model performance keeps improving.
import pandas as pd

def getEmbargoTimes(times, pctEmbargo):
    """
    Get the embargo end time for each bar.
    - times: index of bar times.
    - pctEmbargo: embargo size as a fraction of the number of bars (h = pctEmbargo * T).
    """
    step = int(times.shape[0] * pctEmbargo)
    if step == 0:
        mbrg = pd.Series(times, index=times)              # no embargo
    else:
        mbrg = pd.Series(times[step:], index=times[:-step])               # shift end times forward by h bars
        mbrg = pd.concat([mbrg, pd.Series(times[-1], index=times[-step:])])  # last bars embargo to the final bar
    return mbrg
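A hedged sketch of how the embargo map might feed into the purge, reusing the toy t1 and testTimes from the purging sketch above and assuming every testing end time also appears in the bar index used to build mbrg: each testing interval's end is pushed forward to its embargoed time before calling getTrainTimes.
mbrg = getEmbargoTimes(t1.index, pctEmbargo=0.2)              # exaggerated embargo for the 10-bar toy
testTimes_emb = pd.Series(mbrg.loc[testTimes.values].values,  # widen test intervals by the embargo
                          index=testTimes.index)
trainTimes = getTrainTimes(t1, testTimes_emb)                 # purge against the widened intervals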
Some end-of-chapter exercises:
7.1 Why is shuffling a dataset before conducting k-fold CV generally a bad idea in finance? What is the purpose of shuffling? Why does shuffling defeat the purpose of k-fold CV in financial datasets?
After shuffling, the data are scrambled, so similar adjacent observations end up in both the training and testing sets, and the testing set ends up using information from the training set. The purpose of shuffling is to remove any dependence of the fit on the ordering of the observations, which is harmless under IID sampling; with serially correlated financial data, however, it spreads nearly identical observations across both sets and defeats the purpose of testing on unseen data.
7.2 Take a pair of matrices (X, y), representing observed features and labels. These could be one of the datasets derived from the exercises in Chapter 3.
(a) Derive the performance from a 10-fold CV of an RF classifier on (X, y), without shuffling.
(b) Derive the performance from a 10-fold CV of an RF on (X, y), with shuffling.
(c) Why are both results so different?
(d) How does shuffling leak information?
See 7.1.
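A minimal sketch for running both experiments (X, y are placeholders for one of the Chapter 3 datasets):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
clf = RandomForestClassifier(n_estimators=100)
print(cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=False)).mean())                 # (a) no shuffling
print(cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0)).mean())  # (b) with shuffling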
7.3 Take the same pair of matrices (X, y) you used in exercise 2.
(a) Derive the performance from a 10-fold purged CV of an RF on (X, y), with 1% embargo.
(b) Why is the performance lower?
(c) Why is this result more realistic?
Because the observations whose information leaks into the testing set have been purged from the training set, so performance is no longer inflated by leakage.
7.4 In this chapter we have focused on one reason why k-fold CV fails in financial applications, namely the fact that some information from the testing set leaks into the training set. Can you think of a second reason for CV’s failure?
Model selection bias? The testing set is used multiple times in the process of developing a model, leading to multiple testing and selection bias.
7.5 Suppose you try one thousand configurations of the same investment strategy, and perform a CV on each of them. Some results are guaranteed to look good, just by sheer luck. If you only publish those positive results, and hide the rest, your audience will not be able to deduce that these results are false positives, a statistical fluke. This phenomenon is called “selection bias.”
(a) Can you imagine one procedure to prevent this?
(b) What if we split the dataset in three sets: training, validation, and testing? The validation set is used to evaluate the trained parameters, and the testing is run only on the one configuration chosen in the validation phase. In what case does this procedure still fail?
(c) What is the key to avoiding selection bias?