This article is based on Chapter 7 of the book Advances in Financial Machine Learning.
A Quick Recap of CV
CV splits observations drawn from an IID process into two sets: the training set and the testing set. Each observation in the complete dataset belongs to one, and only one, set. This is done so as to prevent leakage from one set into the other, since that would defeat the purpose of testing on unseen data.
k-fold CV (k=5)
- The dataset is partitioned into k subsets
- For i = 1, …, k:
  - The ML algorithm is trained on all subsets excluding i
  - The fitted ML algorithm is tested on i
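For reference, a minimal sketch of this procedure using scikit-learn's KFold splitter; X and y are placeholder feature/label arrays, and the random forest merely stands in for any ML algorithm:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
# X, y are placeholder numpy arrays of features and labels.
kf = KFold(n_splits=5, shuffle=False)
scores = []
for train_idx, test_idx in kf.split(X):
    clf = RandomForestClassifier()
    clf.fit(X[train_idx], y[train_idx])                  # train on all subsets excluding i
    scores.append(clf.score(X[test_idx], y[test_idx]))   # test on subset i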
In finance, CV is typically used in two settings: model development (like hyperparameter tuning) and backtesting. Backtesting is a complex subject that we will discuss thoroughly in Chapters 10-16. In this chapter, we will focus on CV for model development.
Why K-fold CV Fails In Finance
Leakage takes place when the training set contains information that also appears in the testing set.
One reason k-fold CV fails in finance is that observations cannot be assumed to be drawn from an IID (independent and identically distributed) process:
- Because of the serial correlation, Xt ≈ Xt+1
- Because labels are derived from overlapping datapoints, Yt ≈ Yt+1
By placing t and t+1 in different sets, information is leaked.
When a classifier is first trained on (Xt, Yt), and then it is asked to predict E[Yt+1|Xt+1] based on an observed Xt+1, this classifier is more likely to achieve Yt+1 = E[Yt+1|Xt+1] even if X is an irrelevant feature.
A second reason is that the testing set is used multiple times in the process of developing a model, leading to multiple testing and selection bias.
To summarize: cross-validation requires that the testing data remain unseen, so that it genuinely evaluates the model fitted on the training set. Once information is shared between the training and testing sets, "data leakage" has occurred.
When cross-validation is applied to financial data, leakage arises because serial correlation in the time series makes Xt ≈ Xt+1, and because labels are usually computed from overlapping data points (for example, rolling averages), so Yt ≈ Yt+1 as well. The mappings Xt → Yt and Xt+1 → Yt+1 are therefore similar, and placing them in different sets means the testing set effectively uses information from the training set.
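A toy illustration of the label overlap (the random walk below is made up): when labels are forward returns over overlapping windows, adjacent labels share most of their underlying bars and are strongly serially correlated.
import numpy as np
import pandas as pd
np.random.seed(0)
logp = pd.Series(np.cumsum(0.01 * np.random.randn(1000)))  # hypothetical log-price path
y = logp.diff(5).shift(-5)     # forward 5-bar return; adjacent windows share 4 of their 5 bars
print(y.autocorr(lag=1))       # roughly 0.8, i.e. Yt ≈ Yt+1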
How to Reduce Leakage
- Drop from the training set any observation i where Yi is a function of information used to determine Yj, and j belongs to the testing set
- Avoid overfitting the classifier. In this way, even if some leakage occurs, the classifier will not be able to profit from it. Use:
  - Early stopping of the base estimators
  - Bagging of classifiers, while controlling for oversampling on redundant examples, so that the individual classifiers are as diverse as possible:
    - Set max_samples to the average uniqueness
    - Apply sequential bootstrap
To avoid the problems above, there are generally two remedies.
First, remove from the training set any information that also appears in the testing set.
Second, prevent the model from overfitting: if the model does not overfit, it cannot exploit whatever leakage remains (a sketch of the bagging option listed above follows).
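For the bagging remedy, a minimal sketch assuming avg_uniqueness has already been computed as the average label uniqueness from Chapter 4 (a float in (0, 1]); X and y are placeholders:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# avg_uniqueness is assumed to come from the Chapter 4 computation (fraction of samples per estimator).
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=avg_uniqueness,  # limit oversampling of redundant (overlapping) examples
    bootstrap=True,
)
bag.fit(X, y)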
Purged K-Fold Cross-Validation
- Purging: one way to reduce leakage is to purge from the training set all observations whose labels overlapped in time with those labels included in the testing set
- Embargo: since financial features often incorporate series that exhibit serial correlation (like ARMA processes), we should eliminate from the training set observations that immediately follow an observation in the testing set
For the first remedy (removing information), the author proposes a solution: Purged K-Fold Cross-Validation. This method involves two operations: purging and embargo.
Both operations remove data, but each targets leakage arising in a different situation.
Purging
Consider a testing observation whose label Yj is decided based on the information set Φj. In order to prevent the type of leakage described in the previous section, we would like to purge from the training set any observation whose label Yi is decided based on an information set Φi such that Φi ∩ Φj ≠ ∅.
For example, consider a label Yj that is a function of observations in the closed range t ∈ [tj,0, tj,1], Yj = f[[tj,0, tj,1]]. A label Yi = f[[ti,0, ti,1]] overlaps with Yj if any of the three sufficient conditions is met:
- tj,0 ≤ ti,0 ≤ tj,1
- tj,0 ≤ ti,1 ≤ tj,1
- ti,0 ≤ tj,0 ≤ tj,1 ≤ ti,1
Suppose neighbouring labels Yi and Yj are computed from overlapping data points, where Yj spans the time interval from tj,0 to tj,1 and Yi spans ti,0 to ti,1. The three conditions above enumerate the possible overlaps, and we need to purge from the training set the observations that leak information in any of these cases:
def getTrainTimes(t1, testTimes):
    """
    Given testTimes, find the times of the training observations.
    - t1.index: Time when the observation started.
    - t1.value: Time when the observation ended.
    - testTimes: Times of testing observations.
    """
    trn = t1.copy(deep=True)
    for i, j in testTimes.items():  # i: test start, j: test end
        df0 = trn[(i <= trn.index) & (trn.index <= j)].index  # train starts within test
        df1 = trn[(i <= trn) & (trn <= j)].index              # train ends within test
        df2 = trn[(trn.index <= i) & (j <= trn)].index        # train envelops test
        trn = trn.drop(df0.union(df1).union(df2))
    return trn
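A toy usage sketch (the dates below are made up): t1 maps each observation's start time to the time its label is determined, and testTimes does the same for the testing observations.
import pandas as pd
idx = pd.date_range('2020-01-01', periods=10, freq='D')
t1 = pd.Series(idx + pd.Timedelta(days=2), index=idx)     # each label spans three days
testTimes = pd.Series([t1.iloc[5]], index=[t1.index[5]])  # a single testing observation
print(getTrainTimes(t1, testTimes))                       # overlapping training observations are purged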
When leakage is present, the model's measured performance improves simply by increasing k toward T, where T is the number of observations: the larger k is, the more overlapping observations end up in the training set. We can therefore increase k and watch whether performance keeps improving, as a way to detect leakage.
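A hedged sketch of that check with scikit-learn (clf, X, y are placeholders): if the mean score keeps climbing as k grows, leakage should be suspected.
from sklearn.model_selection import KFold, cross_val_score
# clf, X, y are placeholders for the estimator and data used above.
for k in (5, 10, 20, 50):
    cv = KFold(n_splits=k, shuffle=False)
    print(k, cross_val_score(clf, X, y, cv=cv).mean())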
Embargo
Because the time series is serially correlated, the earliest training observations that follow the testing set may still use information from the testing set, which is supposed to remain unseen. We therefore also remove the training observations that immediately follow the testing set.
We do not need to worry about whether the earliest testing observations use information from the training set, because the training information is available to us in any case.
In general we set the embargo period to h ≈ 0.01T, where T is the number of bars. The exact value is the practitioner's choice; one way to assess whether a given h is sufficient is to increase k and observe whether model performance keeps improving.
import pandas as pd

def getEmbargoTimes(times, pctEmbargo):
    """
    Get the embargo end time for each bar.
    - times: index of bar times.
    - pctEmbargo: embargo size as a fraction of the number of bars (h = pctEmbargo * T).
    """
    step = int(times.shape[0] * pctEmbargo)
    if step == 0:
        mbrg = pd.Series(times, index=times)              # no embargo
    else:
        mbrg = pd.Series(times[step:], index=times[:-step])               # shift end times forward by h bars
        mbrg = pd.concat([mbrg, pd.Series(times[-1], index=times[-step:])])  # last bars embargo to the final bar
    return mbrg
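A hedged sketch of how the embargo map might feed into the purge, reusing the toy t1 and testTimes from the purging sketch above and assuming every testing end time also appears in the bar index used to build mbrg: each testing interval's end is pushed forward to its embargoed time before calling getTrainTimes.
mbrg = getEmbargoTimes(t1.index, pctEmbargo=0.2)              # exaggerated embargo for the 10-bar toy
testTimes_emb = pd.Series(mbrg.loc[testTimes.values].values,  # widen test intervals by the embargo
                          index=testTimes.index)
trainTimes = getTrainTimes(t1, testTimes_emb)                 # purge against the widened intervals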
Some end-of-chapter exercises:
7.1 Why is shuffling a dataset before conducting k-fold CV generally a bad idea in finance? What is the purpose of shuffling? Why does shuffling defeat the purpose of k-fold CV in financial datasets?
After shuffling, the data are scrambled, so similar adjacent observations end up in both the training and testing sets, and the testing set ends up using information from the training set. The purpose of shuffling is to remove any dependence of the fit on the ordering of the observations, which is harmless under IID sampling; with serially correlated financial data, however, it spreads nearly identical observations across both sets and defeats the purpose of testing on unseen data.
7.2 Take a pair of matrices (X, y), representing observed features and labels. These could be one of the datasets derived from the exercises in Chapter 3.
(a) Derive the performance from a 10-fold CV of an RF classifier on (X, y), without shuffling.
(b) Derive the performance from a 10-fold CV of an RF on (X, y), with shuffling.
(c) Why are both results so different?
(d) How does shuffling leak information?
See 7.1.
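A minimal sketch for running both experiments (X, y are placeholders for one of the Chapter 3 datasets):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
clf = RandomForestClassifier(n_estimators=100)
print(cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=False)).mean())                 # (a) no shuffling
print(cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0)).mean())  # (b) with shuffling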
7.3 Take the same pair of matrices (X, y) you used in exercise 2.
(a) Derive the performance from a 10-fold purged CV of an RF on (X, y), with 1% embargo.
(b) Why is the performance lower?
(c) Why is this result more realistic?
Because the observations whose information leaks into the testing set have been purged from the training set, so performance is no longer inflated by leakage.
7.4 In this chapter we have focused on one reason why k-fold CV fails in financial applications, namely the fact that some information from the testing set leaks into the training set. Can you think of a second reason for CV’s failure?
Model selection bias? The testing set is used multiple times in the process of developing a model, leading to multiple testing and selection bias.
7.5 Suppose you try one thousand configurations of the same investment strategy, and perform a CV on each of them. Some results are guaranteed to look good, just by sheer luck. If you only publish those positive results, and hide the rest, your audience will not be able to deduce that these results are false positives, a statistical fluke. This phenomenon is called “selection bias.”
(a) Can you imagine one procedure to prevent this?
(b) What if we split the dataset in three sets: training, validation, and testing? The validation set is used to evaluate the trained parameters, and the testing is run only on the one configuration chosen in the validation phase. In what case does this procedure still fail?
(c) What is the key to avoiding selection bias?