Kaggle time series tutorial solution

Notes and solutions for the time series tutorial: Linear Regression With Time Series.

Takeaways

This is not really an interesting tutorial, because it is too simple. But I'll list some take-aways.

Components of time series

  • Time dependent component: trend and seasonality
    • Moving average for trend
    • Fourier transform for seasonality
  • Serially dependent component: cycles
    • Autocorrelation for cycles

Winners of Kaggle forecasting competitions have often included moving averages and other rolling statistics in their feature sets. Such features seem to be especially useful when used with GBDT algorithms like XGBoost.

Many time series can be closely described by an additive model of just these three components plus some essentially unpredictable, entirely random error:

\text{series} = \text{trend} + \text{seasons} + \text{cycles} + \text{error}

Learn a hybrid model

We could imagine learning the components of a time series as an iterative process: first learn the trend and subtract it out from the series, then learn the seasonality from the detrended residuals and subtract the seasons out, then learn the cycles and subtract the cycles out, and finally only the unpredictable error remains.

Learning the components of Mauna Loa CO2 step by step. Subtract the fitted curve (blue) from its series to get the series in the next step

There are generally two ways a regression algorithm can make predictions: either by transforming the features or by transforming the target.

  • Feature-transforming algorithms learn some mathematical function that takes features as an input and then combines and transforms them to produce an output that matches the target values in the training set.

    • Examples: linear regression and neural networks.
    • Can extrapolate target values beyond training set.
  • Target-transforming algorithms use the features to group the target values in the training set and make predictions by averaging values in a group; a set of feature just indicates which group to average.

    • Examples: Decision trees and nearest neighbors.
    • Can not extrapolate, so can not predict the trend.

As a result, we need a "hybrid" model, which uses feature-transforming algorithm (e.g., linear regression) to extrapolate the trend, and then apply a target-transforming algorithm (e.g., XGBoost) to model the de-trended residuals.

1 Linear Regression With Time Series

Use two features unique to time series: lags and time steps.

1.1 Interpret linear regression with the time dummy

Hardcover book sales

The linear regression line has an equation of (approximately) Hardcover = 3.33 * Time + 150.5. Over 6 days how much on average would you expect hardcover sales to change?

Easy, it shall be 3.33 * 6 = 19.98.

1.2 Interpret linear regression with a lag feature

image.png

One of these series has the equation target = 0.95 * lag_1 + error and the other has the equation target = -0.95 * lag_1 + error, differing only by the sign on the lag feature. Can you tell which equation goes with each series?

In the first graph, neighbor points tend to have the same sign, so it shall be target = 0.95 * lag_1 + error

1.3 Fit a time-step feature

Complete the code below to create a linear regression model with a time-step feature on the series of average product sales. The target is in a column called 'sales'.

from sklearn.linear_model import LinearRegression

df = average_sales.to_frame()

# YOUR CODE HERE: Create a time dummy
time = range(0, len(average_sales))

df['time'] = time 

# YOUR CODE HERE: Create training data
X = df[['time']]  # features
y = df.sales  # target

# Train the model
model = LinearRegression()
model.fit(X, y)

# Store the fitted values as a time series with the same time index as
# the training data
y_pred = pd.Series(model.predict(X), index=X.index)


# Check your answer
q_3.check()

First, we create a time dummy with range, and then use it as feature and pass to LinearRegression module.
Note that the feature must be a data frame, not a series. So use df[['time']] instead of df.time.

1.4 Fit a lag feature to Store Sales

Complete the code below to create a linear regression model with a lag feature on the series of average product sales. The target is in a column of df called 'sales'.

df = average_sales.to_frame()

# YOUR CODE HERE: Create a lag feature from the target 'sales'
lag_1 = df.sales.shift(1)

df['lag_1'] = lag_1  # add to dataframe

X = df.loc[:, ['lag_1']].dropna()  # features
y = df.loc[:, 'sales']  # target
y, X = y.align(X, join='inner')  # drop corresponding values in target

# YOUR CODE HERE: Create a LinearRegression instance and fit it to X and y.
model = LinearRegression()
model.fit(X, y)

# YOUR CODE HERE: Create Store the fitted values as a time series with
# the same time index as the training data
y_pred = pd.Series(model.predict(X), index=X.index)


# Check your answer
q_4.check()

Easy, it is a complement of 2.3, emphasizes the usage of LinearRegression.

2 Trend

Model long-term changes with moving averages and the time dummy.

2.1 Determine trend with a moving average plot

The US Retail Sales dataset contains monthly sales data for a number of retail industries in the United States. Make a moving average plot to estimate the trend for this series.

US Food and Beverage Sales

We need to use pandas.DataFrame.rolling method to get the "moving average". It has 3 important arguments:

  • window: size of the moving window. In this case, since it's monthly data, we can set the window to a year (12 months)
  • center: whether set the labels at the center of the window.
  • min_periods: Minimum number of observations in window required to have a value (otherwise result is NA). If we don't want NAs in the result, since we have "center" set to True, in the begin and end, we need at least 6 data points to generate a value.
# YOUR CODE HERE: Add methods to `food_sales` to compute a moving
# average with appropriate parameters for trend estimation.
trend = food_sales.rolling(12, center=True, min_periods=6).mean()
# Check your answer
q_1.check()

# Make a plot
ax = food_sales.plot(**plot_params, alpha=0.5)
ax = trend.plot(ax=ax, linewidth=3)
moving average

2.2 Identify trend

What order polynomial trend might be appropriate for the Food and Beverage Sales series? Can you think of a non-polynomial curve that might work even better?

Intuitively, because of the upwards bend, an order 2 polynomial might be appropriate.
(If you've worked with economic time series before, you might guess that the growth rate in Food and Beverage Sales is best expressed as a percent change. Percent change can often be modeled using an exponential curve. )

2.3 Create a Trend Feature

Use DeterministicProcess to create a feature set for a cubic trend model. Also create features for a 90-day forecast.

from statsmodels.tsa.deterministic import DeterministicProcess

y = average_sales.copy()  # the target

# YOUR CODE HERE: Instantiate `DeterministicProcess` with arguments
# appropriate for a cubic trend model
dp = DeterministicProcess(
    index=average_sales.index,  # dates from the training data
    order=3,             # the time dummy (trend)
    drop=True,           # drop terms if necessary to avoid collinearity
)

# YOUR CODE HERE: Create the feature set for the dates given in y.index
X = dp.in_sample()

# YOUR CODE HERE: Create features for a 90-day forecast.
X_fore = dp.out_of_sample(steps=90)


# Check your answer
q_3.check()

A deterministic process is a technical term for a time series that is non-random or completely determined, like the const and trend series are. Features derived from the time index will generally be deterministic

To make a forecast, we apply our model to out_of_sample features. "Out of sample" refers to times outside of the observation period of the training data.

Note that this is not the model. The comment has made it clear: "Create features for...". The X is just features:

DeterministicProcess only creates features

trend = time dummy
trend squared = (time dummy)^2
trend cubed = (time dummy)^3

We can use LinearRegression after DeterministicProcess to obtain the model:

model = LinearRegression()
model.fit(X, y)

y_pred = pd.Series(model.predict(X), index=X.index)
y_fore = pd.Series(model.predict(X_fore), index=X_fore.index)

ax = y.plot(**plot_params, alpha=0.5, title="Average Sales", ylabel="items sold")
ax = y_pred.plot(ax=ax, linewidth=3, label="Trend", color='C0')
ax = y_fore.plot(ax=ax, linewidth=3, label="Trend Forecast", color='C3')
ax.legend();
linear model

2.4 Understand risks of forecasting with high-order polynomials

Overfitting.

3 Seasonality

Create indicators and Fourier features to capture periodic change.

3.1 Determine seasonality

What kind of seasonality do you see evidence of?

periodogram

It suggests a strong weekly seasonality. It appears there may be some biweekly components as well. The background is that it is a store sales data, wages in the public sector are paid out biweekly.

The periodogram tells you the strength of the frequencies in a time series. Specifically, the value on the y-axis of the graph is (a^2 + b^2) / 2, where a and b are the coefficients of the sine and cosine at that frequency after a Fourier transform.

3.2 Create seasonal features

Use DeterministicProcess and CalendarFourier to create:
- indicators for weekly seasons and
- Fourier features of order 4 for monthly seasons.

y = average_sales.copy()

# YOUR CODE HERE
fourier = CalendarFourier(freq="M", order=4)
dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=1,
    # YOUR CODE HERE
    seasonal=True,
    additional_terms=[fourier],
    drop=True,
)
X = dp.in_sample()

# Check your answer
q_2.check()

The task breaks into two parts. seasonal=True will generate "indicators for weekly seasons" in features (X), and CalendarFourier(freq="M", order=4) generates "Fourier features of order 4 for monthly seasons".

['const', 'trend', 
's(2,7)', 's(3,7)', 's(4,7)', 's(5,7)', 's(6,7)', 's(7,7)',
'sin(1,freq=M)', 'cos(1,freq=M)', 'sin(2,freq=M)', 'cos(2,freq=M)', 'sin(3,freq=M)', 'cos(3,freq=M)', 'sin(4,freq=M)', 'cos(4,freq=M)']

s(?, 7) is an encoding of weekday.

s(?, 7) example

For each i=1, …, order, sin(i, freq=M) can be calculated by:
\sin (2\pi i \tau)

Where \tau is the proportion of time in a month. e.g.:

sin(i, freq=M) example

The first day is the origin (0, 0). The second is 1/31 month.

With these features, we can fit our linear model with code below:

model = LinearRegression().fit(X, y)
y_pred = pd.Series(
    model.predict(X),
    index=X.index,
    name='Fitted',
)
seasonal fit

We can remove the seasonal components from the sales, which is called deseasonalizing.

periodogram of the deseasonalized series

3.3 Check for remaining seasonality

Based on these periodograms, how effectively does it appear your model captured the seasonality in Average Sales? Does the periodogram agree with the time plot of the deseasonalized series?

The model was able to capture the seasonal variation in average sales, but it lacks the "peak" values (which are holidays, we'll discuss later).

If we plot the deseasonalized graph with holidays, we get below graph:

National and Regional Holidays

This means the seasonality can not capture the holiday features.

3.4 Create holiday features

What kind of features could you create to help your model make use of holiday information?

# YOUR CODE HERE
X_holidays = pd.get_dummies(holidays)

X2 = X.join(X_holidays, on='date').fillna(0.0)

# Check your answer
q_4.check()

We use pd.get_dummies to encode holidays with one-hot encoding, and use them as extra features by merging them on date.

After that, we train the linear model with the extra holiday features:

model = LinearRegression().fit(X2, y)
y_pred = pd.Series(
    model.predict(X2),
    index=X2.index,
    name='Fitted',
)
predict sales with holiday features

4 Time Series as Features

Predict the future from the past with a lag embedding.

Not every product family has sales showing cyclic behavior, and neither does the series of average sales. Sales of magazines, however, show patterns of growth and decay not well characterized by trend or seasons. In this question and the next, you'll model cycles in magazine sales using lag features.

Trend and seasonality will both create serial dependence that shows up in correlograms and lag plots. To isolate any purely cyclic behavior, we'll start by deseasonalizing the series.

magazine sales

4.1 Plotting cycles

Does this deseasonalized series show cyclic patterns? To confirm our intuition, we can try to isolate cyclic behavior using a moving-average plot just like we did with trend. The idea is to choose a window long enough to smooth over short-term seasonality, but short enough to still preserve the cycles.

Create a seven-day moving average from y, the series of magazine sales. Use a centered window, but don't set the min_periods argument.

# YOUR CODE HERE
y_ma = y.rolling(7, center=True).mean()


# Plot
ax = y_ma.plot()
ax.set_title("Seven-Day Moving Average");
moving average

4.2 Examine serial dependence in Store Sales

Are any of the lags significant according to the correlogram? Does the lag plot suggest any relationships that weren't apparent from the correlogram?

partial autocorrelation and lag plot

We can see the lag 1 and lag 6 are significant. The lag plot suggests some non-linear effect.

4.3 Examine the time series features

Recall from the tutorial that a leading indicator is a series whose values at one time can be used to predict the target at a future time -- a leading indicator provides "advance notice" of changes in the target.

The competition dataset includes a time series that could potentially be useful as a leading indicator -- the onpromotion series, which contains the number of items on a special promotion that day. Since the company itself decides when to do a promotion, there's no worry about "lookahead leakage"; we could use Tuesday's onpromotion value to forecast sales on Monday, for instance.

Does it appear that either leading or lagging values of onpromotion could be useful as a feature?

on promotion

The lag plot indicates that both leading and lagged values of on promotion are correlated to magazine sales.

TODO: explain how it determines the on promotion values useful.

4.4 Create time series features

Create the features indicated in the solution to Question 3. If no features from that series would be useful, use an empty dataframe pd.DataFrame() as your answer.

# YOUR CODE HERE: Make features from `y_deseason`
X_lags = make_lags(y_deseason, lags=1)

# YOUR CODE HERE: Make features from `onpromotion`
# You may want to use `pd.concat`
X_promo = pd.concat([make_lags(onpromotion, lags=1), onpromotion, make_leads(onpromotion, leads=1)], axis=1)

# YOUR CODE HERE: Make features from `oil`
X_oil = pd.DataFrame()
# pd.read_csv(comp_dir / 'oil.csv', parse_dates=['date'], infer_datetime_format=True).set_index('date')

X = pd.concat([X_lags, X_promo, X_oil], axis=1).dropna()
y, X = y.align(X, join='inner')

# Check your answer
q_4.check()

4.5 Create statistical features

Computing rolling statistics to be used as features is similar except we need to take care to avoid lookahead leakage.

  • First, the result should be set at the right end of the window instead of the center -- that is, we should use center=False (the default) in the rolling method.
  • Second, the target should be lagged a step.

Edit the code in the next cell to create the following features:
- 14-day rolling median (median) of lagged target
- 7-day rolling standard deviation (std) of lagged target
- 7-day sum (sum) of items "on promotion", with centered window

y_lag = mag_sales.loc[:, 'sales'].shift(1)
onpromo = mag_sales.loc[:, 'onpromotion']

# 28-day mean of lagged target
mean_7 = y_lag.rolling(7).mean()
# YOUR CODE HERE: 14-day median of lagged target
median_14 = y_lag.rolling(14).median()
# YOUR CODE HERE: 7-day rolling standard deviation of lagged target
std_7 = y_lag.rolling(7).std()
# YOUR CODE HERE: 7-day sum of promotions with centered window
promo_7 = onpromo.rolling(7, center=True).sum()


# Check your answer
q_5.check()

We don't lag the onpromo because the company itself decides when to do a promotion, there's no worry about "lookahead leakage"

5 Hybrid Models

Combine the strengths of two forecasters with this powerful technique.

# You'll add fit and predict methods to this minimal class
class BoostedHybrid:
    def __init__(self, model_1, model_2):
        self.model_1 = model_1
        self.model_2 = model_2
        self.y_columns = None  # store column names from fit method

5.1 Define fit method for boosted hybrid

Complete the fit definition for the BoostedHybrid class.

def fit(self, X_1, X_2, y):
    # YOUR CODE HERE: fit self.model_1
    self.model_1.fit(X_1, y)

    y_fit = pd.DataFrame(
        # YOUR CODE HERE: make predictions with self.model_1
        self.model_1.predict(X_1),
        index=X_1.index, columns=y.columns,
    )

    # YOUR CODE HERE: compute residuals
    y_resid = y - y_fit
    y_resid = y_resid.stack().squeeze() # wide to long

    # YOUR CODE HERE: fit self.model_2 on residuals
    self.model_2.fit(X_2, y_resid)

    # Save column names for predict method
    self.y_columns = y.columns
    # Save data for question checking
    self.y_fit = y_fit
    self.y_resid = y_resid


# Add method to class
BoostedHybrid.fit = fit


# Check your answer
q_1.check()

Follow steps below:

  1. train model1 with feature X1, target y
  2. make prediction y1 with model1
  3. compute residuals with y - y1
  4. train model2 with feature X2, target y - y1

5.2 Define predict method for boosted hybrid

Now define the predict method for the BoostedHybrid class.

def predict(self, X_1, X_2):
    y_pred = pd.DataFrame(
        # YOUR CODE HERE: predict with self.model_1
        self.model_1.predict(X_1),
        index=X_1.index, columns=self.y_columns,
    )
    y_pred = y_pred.stack().squeeze()  # wide to long

    # YOUR CODE HERE: add self.model_2 predictions to y_pred
    y_pred += self.model_2.predict(X_2)
    
    return y_pred.unstack()  # long to wide


# Add method to class
BoostedHybrid.predict = predict


# Check your answer
q_2.check()

Follow steps below:

  1. make prediction y1 with feature X1
  2. make prediction y2 with feature X2
  3. add y1 and y2

5.3 Train boosted hybrid

Create the hybrid model by initializing a BoostedHybrid class with LinearRegression() and XGBRegressor() instances.

# YOUR CODE HERE: Create LinearRegression + XGBRegressor hybrid with BoostedHybrid
model = BoostedHybrid(LinearRegression(), XGBRegressor())

# YOUR CODE HERE: Fit and predict
model.fit(X_1, X_2, y)
y_pred = model.predict(X_1, X_2)

y_pred = y_pred.clip(0.0)


# Check your answer
q_3.check()

6 Forecasting With Machine Learning

Apply ML to any forecasting task with these four strategies.

6.1 Match description to dataset

Consider the following three forecasting tasks:

a. 3-step forecast using 4 lag features with a 2-step lead time

b. 1-step forecast using 3 lag features with a 1-step lead time

c. 3-step forecast using 4 lag features with a 1-step lead time

answer: 2-1-3

  • The number of forecasting steps is the number of columns under Targets.
  • The number of lags is the number of columns under Features.
  • The lead time is the smallest lag step possible

6.2 Identify the forecasting task for Store Sales competition

Try to identify the forecast origin and the forecast horizon. How many steps are within the forecast horizon? What is the lead time for the forecast?

define a forecast task
  • The "forecast origin" is time at which you are making a forecast. You may consider it to be the last timestamp of your training data.

  • The "forecast horizon" is the time range for which you are making a forecast.

  • The time between the origin and the horizon is the "lead time".

16 steps are within the forecast horizon, leading time is 1.

The training dataset ends on 2017/08/15 (the forecast origin), the test set comprises the dates 2017/08/16 to 2017/08/31, which includes 16 days (the forecast horizon). There is 1 day between the origin and horizon, so the lead time is 1 day.

6.3 Create multistep dataset for store sales

Create targets suitable for the Store Sales forecasting task. Use 4 days of lag features. Drop any missing values from both targets and features.

# YOUR CODE HERE
y = family_sales.loc[:, 'sales']

# YOUR CODE HERE: Make 4 lag features
X = make_lags(y, lags=4).dropna()

# YOUR CODE HERE: Make multistep target
y = make_multistep_target(y, steps=16).dropna()

y, X = y.align(X, join='inner', axis=0)

# Check your answer
q_3.check()

6.4 Forecast with the DirRec strategy

Instatiate a model that applies the DirRec strategy to XGBoost.

DirRec strategy is a combination of the direct and recursive strategies: train a model for each step and use forecasts from previous steps as new lag features.

from sklearn.multioutput import RegressorChain

# YOUR CODE HERE
model = RegressorChain(XGBRegressor())

# Check your answer
q_4.check()
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 212,185评论 6 493
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,445评论 3 385
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 157,684评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,564评论 1 284
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,681评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,874评论 1 290
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,025评论 3 408
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,761评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,217评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,545评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,694评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,351评论 4 332
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,988评论 3 315
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,778评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,007评论 1 266
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,427评论 2 360
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,580评论 2 349

推荐阅读更多精彩内容