Notes and solutions for the time series tutorial: Linear Regression With Time Series.
Takeaways
This is not a very interesting tutorial, because it is quite simple, but I'll list some takeaways.
Components of time series
- Time-dependent components: trend and seasonality
  - Moving average for trend
  - Fourier transform for seasonality
- Serially dependent component: cycles
  - Autocorrelation for cycles
Winners of Kaggle forecasting competitions have often included moving averages and other rolling statistics in their feature sets. Such features seem to be especially useful when used with GBDT algorithms like XGBoost.
Many time series can be closely described by an additive model of just these three components plus some essentially unpredictable, entirely random error:
series = trend + seasons + cycles + error
Learn a hybrid model
We could imagine learning the components of a time series as an iterative process: first learn the trend and subtract it from the series, then learn the seasonality from the detrended residuals and subtract the seasons out, then learn the cycles and subtract them out, until finally only the unpredictable error remains.
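Here's a rough sketch of that idea in pandas (assuming y is a daily pd.Series; the yearly window and the weekly grouping are illustrative choices of mine, not the tutorial's code):
trend = y.rolling(365, center=True, min_periods=183).mean()   # learn the trend
y_detrended = y - trend                                       # subtract it out
# learn a weekly seasonal pattern from the detrended residuals
seasons = y_detrended.groupby(y_detrended.index.dayofweek).transform("mean")
y_deseasoned = y_detrended - seasons                          # subtract the seasons out
# cycles would next be modeled on y_deseasoned (e.g., with lag features);
# whatever remains afterwards is the unpredictable error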
There are generally two ways a regression algorithm can make predictions: either by transforming the features or by transforming the target.
- Feature-transforming algorithms learn some mathematical function that takes features as input, then combines and transforms them to produce an output that matches the target values in the training set.
  - Examples: linear regression and neural networks.
  - Can extrapolate target values beyond the training set.
- Target-transforming algorithms use the features to group the target values in the training set and make predictions by averaging values within a group; a set of features just indicates which group to average.
  - Examples: decision trees and nearest neighbors.
  - Cannot extrapolate, so cannot predict the trend.
As a result, we need a "hybrid" model, which uses a feature-transforming algorithm (e.g., linear regression) to extrapolate the trend, and then applies a target-transforming algorithm (e.g., XGBoost) to model the detrended residuals.
1 Linear Regression With Time Series
Use two features unique to time series: lags and time steps.
1.1 Interpret linear regression with the time dummy
The linear regression line has an equation of (approximately) Hardcover = 3.33 * Time + 150.5. Over 6 days, how much on average would you expect hardcover sales to change?
Easy: it should be 3.33 * 6 = 19.98.
1.2 Interpret linear regression with a lag feature
One of these series has the equation target = 0.95 * lag_1 + error and the other has the equation target = -0.95 * lag_1 + error, differing only by the sign on the lag feature. Can you tell which equation goes with each series?
In the first graph, neighboring points tend to have the same sign, which is exactly what a positive coefficient on lag_1 produces, so it should be target = 0.95 * lag_1 + error.
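To see why, we can simulate both processes (a quick check of my own, not part of the exercise):
import numpy as np

rng = np.random.default_rng(0)

def simulate(coef, n=200):
    # target = coef * lag_1 + error
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = coef * y[t - 1] + rng.normal()
    return y

y_pos = simulate(0.95)   # smooth: neighboring points tend to share a sign
y_neg = simulate(-0.95)  # jagged: neighboring points tend to alternate signs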
1.3 Fit a time-step feature
Complete the code below to create a linear regression model with a time-step feature on the series of average product sales. The target is in a column called 'sales'.
import pandas as pd
from sklearn.linear_model import LinearRegression
df = average_sales.to_frame()
# YOUR CODE HERE: Create a time dummy
time = range(0, len(average_sales))
df['time'] = time
# YOUR CODE HERE: Create training data
X = df[['time']] # features
y = df.sales # target
# Train the model
model = LinearRegression()
model.fit(X, y)
# Store the fitted values as a time series with the same time index as
# the training data
y_pred = pd.Series(model.predict(X), index=X.index)
# Check your answer
q_3.check()
First, we create a time dummy with range, and then use it as the feature passed to LinearRegression.
Note that the feature must be a DataFrame, not a Series, so use df[['time']] instead of df.time.
1.4 Fit a lag feature to Store Sales
Complete the code below to create a linear regression model with a lag feature on the series of average product sales. The target is in a column of df called 'sales'.
df = average_sales.to_frame()
# YOUR CODE HERE: Create a lag feature from the target 'sales'
lag_1 = df.sales.shift(1)
df['lag_1'] = lag_1 # add to dataframe
X = df.loc[:, ['lag_1']].dropna() # features
y = df.loc[:, 'sales'] # target
y, X = y.align(X, join='inner') # drop corresponding values in target
# YOUR CODE HERE: Create a LinearRegression instance and fit it to X and y.
model = LinearRegression()
model.fit(X, y)
# YOUR CODE HERE: Store the fitted values as a time series with
# the same time index as the training data
y_pred = pd.Series(model.predict(X), index=X.index)
# Check your answer
q_4.check()
Easy: it complements 1.3, emphasizing the usage of LinearRegression, now with a lag feature.
2 Trend
Model long-term changes with moving averages and the time dummy.
2.1 Determine trend with a moving average plot
The US Retail Sales dataset contains monthly sales data for a number of retail industries in the United States. Make a moving average plot to estimate the trend for this series.
We need to use the pandas.DataFrame.rolling method to get the "moving average". It has 3 important arguments:
- window: size of the moving window. Since this is monthly data, we can set the window to a year (12 months).
- center: whether to set the labels at the center of the window.
- min_periods: minimum number of observations in the window required to have a value (otherwise the result is NA). With center set to True, only 6 data points fall inside the window at the beginning and end of the series, so min_periods=6 avoids NAs there.
# YOUR CODE HERE: Add methods to `food_sales` to compute a moving
# average with appropriate parameters for trend estimation.
trend = food_sales.rolling(12, center=True, min_periods=6).mean()
# Check your answer
q_1.check()
# Make a plot
ax = food_sales.plot(**plot_params, alpha=0.5)
ax = trend.plot(ax=ax, linewidth=3)
2.2 Identify trend
What order polynomial trend might be appropriate for the Food and Beverage Sales series? Can you think of a non-polynomial curve that might work even better?
Intuitively, because of the upwards bend, an order 2 polynomial might be appropriate.
(If you've worked with economic time series before, you might guess that the growth rate in Food and Beverage Sales is best expressed as a percent change. Percent change can often be modeled using an exponential curve; see the sketch below.)
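Here's a minimal sketch of that idea, assuming food_sales is a positive-valued series and X_time is a time-dummy feature like the one built in 1.3 (both names are mine, not the exercise's):
import numpy as np
from sklearn.linear_model import LinearRegression

# A linear trend in log-sales corresponds to an exponential curve
# (constant percent growth) in the original units.
log_trend_model = LinearRegression().fit(X_time, np.log(food_sales))
y_exp_trend = np.exp(log_trend_model.predict(X_time))  # back to sales units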
2.3 Create a Trend Feature
Use DeterministicProcess to create a feature set for a cubic trend model. Also create features for a 90-day forecast.
from statsmodels.tsa.deterministic import DeterministicProcess
y = average_sales.copy() # the target
# YOUR CODE HERE: Instantiate `DeterministicProcess` with arguments
# appropriate for a cubic trend model
dp = DeterministicProcess(
    index=average_sales.index,  # dates from the training data
    order=3,                    # cubic trend: time dummy, its square, and its cube
    drop=True,                  # drop terms if necessary to avoid collinearity
)
# YOUR CODE HERE: Create the feature set for the dates given in y.index
X = dp.in_sample()
# YOUR CODE HERE: Create features for a 90-day forecast.
X_fore = dp.out_of_sample(steps=90)
# Check your answer
q_3.check()
A deterministic process is a technical term for a time series that is non-random or completely determined, like the const and trend series are. Features derived from the time index will generally be deterministic.
To make a forecast, we apply our model to the out_of_sample features. "Out of sample" refers to times outside the observation period of the training data.
Note that this is not the model. The comment makes it clear: "Create features for...". X is just the features:
- trend = time dummy
- trend_squared = (time dummy)^2
- trend_cubed = (time dummy)^3
We can use LinearRegression after DeterministicProcess to obtain the model:
model = LinearRegression()
model.fit(X, y)
y_pred = pd.Series(model.predict(X), index=X.index)
y_fore = pd.Series(model.predict(X_fore), index=X_fore.index)
ax = y.plot(**plot_params, alpha=0.5, title="Average Sales", ylabel="items sold")
ax = y_pred.plot(ax=ax, linewidth=3, label="Trend", color='C0')
ax = y_fore.plot(ax=ax, linewidth=3, label="Trend Forecast", color='C3')
ax.legend();
2.4 Understand risks of forecasting with high-order polynomials
Overfitting: high-order terms (e.g., t^11 for an order-11 trend) tend to diverge rapidly outside the training period, making forecasts very unreliable.
3 Seasonality
Create indicators and Fourier features to capture periodic change.
3.1 Determine seasonality
What kind of seasonality do you see evidence of?
It suggests a strong weekly seasonality, and there appear to be some biweekly components as well. The context: this is store sales data, and wages in the public sector are paid out biweekly.
The periodogram tells you the strength of the frequencies in a time series. Specifically, the value on the y-axis of the graph is (a^2 + b^2) / 2, where a and b are the coefficients of the sine and cosine at that frequency after a Fourier transform.
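A periodogram can be computed with scipy.signal.periodogram; a sketch for an assumed daily series ts, with fs=365 so frequencies are in cycles per year:
from scipy.signal import periodogram

frequencies, spectrum = periodogram(
    ts,
    fs=365,            # daily observations, so 365 per year
    detrend="linear",  # remove the trend first
    window="boxcar",
    scaling="spectrum",
)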
3.2 Create seasonal features
Use DeterministicProcess and CalendarFourier to create:
- indicators for weekly seasons, and
- Fourier features of order 4 for monthly seasons.
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess

y = average_sales.copy()
# YOUR CODE HERE
fourier = CalendarFourier(freq="M", order=4)
dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=1,
    # YOUR CODE HERE
    seasonal=True,               # weekly seasonal indicators
    additional_terms=[fourier],  # monthly Fourier features
    drop=True,
)
X = dp.in_sample()
# Check your answer
q_2.check()
The task breaks into two parts: seasonal=True generates the "indicators for weekly seasons" in the features (X), and CalendarFourier(freq="M", order=4) generates the "Fourier features of order 4 for monthly seasons".
The resulting feature columns are:
['const', 'trend',
's(2,7)', 's(3,7)', 's(4,7)', 's(5,7)', 's(6,7)', 's(7,7)',
'sin(1,freq=M)', 'cos(1,freq=M)', 'sin(2,freq=M)', 'cos(2,freq=M)', 'sin(3,freq=M)', 'cos(3,freq=M)', 'sin(4,freq=M)', 'cos(4,freq=M)']
s(?, 7) is an encoding of the weekday.
For each i = 1, ..., order, sin(i, freq=M) is computed as sin(2 * pi * i * tau), where tau is the proportion of the month that has elapsed. E.g., the first day of a 31-day month is the origin (tau = 0), and the second day has tau = 1/31.
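A sketch of that computation for freq="M" (assuming tau is measured from the start of each calendar month, per the description above; not taken from the statsmodels source):
import numpy as np
import pandas as pd

index = pd.date_range("2017-01-01", "2017-01-31", freq="D")
tau = (index.day - 1) / index.daysinmonth  # proportion of the month elapsed
i = 1  # harmonic order
sin_i = np.sin(2 * np.pi * i * tau)  # the sin(1,freq=M) column
cos_i = np.cos(2 * np.pi * i * tau)  # the cos(1,freq=M) column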
With these features, we can fit our linear model with code below:
model = LinearRegression().fit(X, y)
y_pred = pd.Series(
model.predict(X),
index=X.index,
name='Fitted',
)
We can remove the seasonal components from the sales by subtracting the fitted values, which is called deseasonalizing.
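Concretely (using y and the fitted y_pred from above):
# Subtract the fitted seasonal component to deseasonalize the series
y_deseason = y - y_pred
ax = y_deseason.plot(title="Average Sales (deseasonalized)")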
3.3 Check for remaining seasonality
Based on these periodograms, how effectively does it appear your model captured the seasonality in Average Sales? Does the periodogram agree with the time plot of the deseasonalized series?
The model was able to capture the seasonal variation in average sales, but it misses the "peak" values (which are holidays; we'll discuss them next).
If we plot the deseasonalized series with the holidays marked (plot omitted here), the remaining spikes line up with the holidays. This means the seasonal features cannot capture the holiday effects.
3.4 Create holiday features
What kind of features could you create to help your model make use of holiday information?
# YOUR CODE HERE
X_holidays = pd.get_dummies(holidays)
X2 = X.join(X_holidays, on='date').fillna(0.0)
# Check your answer
q_4.check()
We use pd.get_dummies to one-hot encode the holidays, and use them as extra features by joining them on date.
After that, we train the linear model with the extra holiday features:
model = LinearRegression().fit(X2, y)
y_pred = pd.Series(
model.predict(X2),
index=X2.index,
name='Fitted',
)
4 Time Series as Features
Predict the future from the past with a lag embedding.
Not every product family has sales showing cyclic behavior, and neither does the series of average sales. Sales of magazines, however, show patterns of growth and decay not well characterized by trend or seasons. In this question and the next, you'll model cycles in magazine sales using lag features.
Trend and seasonality will both create serial dependence that shows up in correlograms and lag plots. To isolate any purely cyclic behavior, we'll start by deseasonalizing the series.
4.1 Plotting cycles
Does this deseasonalized series show cyclic patterns? To confirm our intuition, we can try to isolate cyclic behavior using a moving-average plot just like we did with trend. The idea is to choose a window long enough to smooth over short-term seasonality, but short enough to still preserve the cycles.
Create a seven-day moving average from y, the series of magazine sales. Use a centered window, but don't set the min_periods argument.
# YOUR CODE HERE
y_ma = y.rolling(7, center=True).mean()
# Plot
ax = y_ma.plot()
ax.set_title("Seven-Day Moving Average");
4.2 Examine serial dependence in Store Sales
Are any of the lags significant according to the correlogram? Does the lag plot suggest any relationships that weren't apparent from the correlogram?
We can see that lag 1 and lag 6 are significant. The lag plot suggests some non-linear effects.
4.3 Examine the time series features
Recall from the tutorial that a leading indicator is a series whose values at one time can be used to predict the target at a future time -- a leading indicator provides "advance notice" of changes in the target.
The competition dataset includes a time series that could potentially be useful as a leading indicator -- the onpromotion
series, which contains the number of items on a special promotion that day. Since the company itself decides when to do a promotion, there's no worry about "lookahead leakage"; we could use Tuesday's onpromotion
value to forecast sales on Monday, for instance.
Does it appear that either leading or lagging values of onpromotion could be useful as a feature?
The lag plot indicates that both leading and lagged values of onpromotion are correlated with magazine sales.
TODO: explain how the lag plot determines whether the onpromotion values are useful.
4.4 Create time series features
Create the features indicated in the solution to Question 3. If no features from that series would be useful, use an empty dataframe pd.DataFrame() as your answer.
# YOUR CODE HERE: Make features from `y_deseason`
X_lags = make_lags(y_deseason, lags=1)
# YOUR CODE HERE: Make features from `onpromotion`
# You may want to use `pd.concat`
X_promo = pd.concat(
    [
        make_lags(onpromotion, lags=1),
        onpromotion,
        make_leads(onpromotion, leads=1),
    ],
    axis=1,
)
# YOUR CODE HERE: Make features from `oil`
X_oil = pd.DataFrame()
# pd.read_csv(comp_dir / 'oil.csv', parse_dates=['date'], infer_datetime_format=True).set_index('date')
X = pd.concat([X_lags, X_promo, X_oil], axis=1).dropna()
y, X = y.align(X, join='inner')
# Check your answer
q_4.check()
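The make_lags and make_leads helpers are defined earlier in the notebook; a plausible reconstruction (my sketch, not the exact course code):
import pandas as pd

def make_lags(ts, lags):
    # one column per backward shift: y_lag_1, ..., y_lag_{lags}
    return pd.concat(
        {f"y_lag_{i}": ts.shift(i) for i in range(1, lags + 1)},
        axis=1,
    )

def make_leads(ts, leads):
    # one column per forward shift: y_lead_1, ..., y_lead_{leads}
    return pd.concat(
        {f"y_lead_{i}": ts.shift(-i) for i in range(1, leads + 1)},
        axis=1,
    )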
4.5 Create statistical features
Computing rolling statistics to be used as features is similar, except we need to take care to avoid lookahead leakage:
- First, the result should be set at the right end of the window instead of the center -- that is, we should use center=False (the default) in the rolling method.
- Second, the target should be lagged a step.
Edit the code in the next cell to create the following features:
- 14-day rolling median (median) of lagged target
- 7-day rolling standard deviation (std) of lagged target
- 7-day sum (sum) of items "on promotion", with centered window
y_lag = mag_sales.loc[:, 'sales'].shift(1)
onpromo = mag_sales.loc[:, 'onpromotion']
# 7-day mean of lagged target
mean_7 = y_lag.rolling(7).mean()
# YOUR CODE HERE: 14-day median of lagged target
median_14 = y_lag.rolling(14).median()
# YOUR CODE HERE: 7-day rolling standard deviation of lagged target
std_7 = y_lag.rolling(7).std()
# YOUR CODE HERE: 7-day sum of promotions with centered window
promo_7 = onpromo.rolling(7, center=True).sum()
# Check your answer
q_5.check()
We don't lag onpromo because the company itself decides when to do a promotion, so there's no worry about "lookahead leakage".
5 Hybrid Models
Combine the strengths of two forecasters with this powerful technique.
# You'll add fit and predict methods to this minimal class
class BoostedHybrid:
    def __init__(self, model_1, model_2):
        self.model_1 = model_1
        self.model_2 = model_2
        self.y_columns = None  # store column names from fit method
5.1 Define fit method for boosted hybrid
Complete the fit definition for the BoostedHybrid class.
def fit(self, X_1, X_2, y):
    # YOUR CODE HERE: fit self.model_1
    self.model_1.fit(X_1, y)

    y_fit = pd.DataFrame(
        # YOUR CODE HERE: make predictions with self.model_1
        self.model_1.predict(X_1),
        index=X_1.index, columns=y.columns,
    )

    # YOUR CODE HERE: compute residuals
    y_resid = y - y_fit
    y_resid = y_resid.stack().squeeze()  # wide to long

    # YOUR CODE HERE: fit self.model_2 on residuals
    self.model_2.fit(X_2, y_resid)

    # Save column names for predict method
    self.y_columns = y.columns
    # Save data for question checking
    self.y_fit = y_fit
    self.y_resid = y_resid

# Add method to class
BoostedHybrid.fit = fit
# Check your answer
q_1.check()
Follow the steps below:
- train model_1 with features X_1 and target y
- make predictions y_fit with model_1
- compute the residuals y - y_fit
- train model_2 with features X_2 and the residuals y - y_fit as target
5.2 Define predict method for boosted hybrid
Now define the predict method for the BoostedHybrid class.
def predict(self, X_1, X_2):
    y_pred = pd.DataFrame(
        # YOUR CODE HERE: predict with self.model_1
        self.model_1.predict(X_1),
        index=X_1.index, columns=self.y_columns,
    )
    y_pred = y_pred.stack().squeeze()  # wide to long

    # YOUR CODE HERE: add self.model_2 predictions to y_pred
    y_pred += self.model_2.predict(X_2)

    return y_pred.unstack()  # long to wide

# Add method to class
BoostedHybrid.predict = predict
# Check your answer
q_2.check()
Follow the steps below:
- make predictions y_1 with model_1 on features X_1
- make predictions y_2 with model_2 on features X_2
- add y_1 and y_2
5.3 Train boosted hybrid
Create the hybrid model by initializing a BoostedHybrid class with LinearRegression() and XGBRegressor() instances.
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

# YOUR CODE HERE: Create LinearRegression + XGBRegressor hybrid with BoostedHybrid
model = BoostedHybrid(LinearRegression(), XGBRegressor())
# YOUR CODE HERE: Fit and predict
model.fit(X_1, X_2, y)
y_pred = model.predict(X_1, X_2)
y_pred = y_pred.clip(0.0)
# Check your answer
q_3.check()
6 Forecasting With Machine Learning
Apply ML to any forecasting task with these four strategies.
6.1 Match description to dataset
Consider the following three forecasting tasks:
a. 3-step forecast using 4 lag features with a 2-step lead time
b. 1-step forecast using 3 lag features with a 1-step lead time
c. 3-step forecast using 4 lag features with a 1-step lead time
Answer: 2-1-3.
- The number of forecasting steps is the number of columns under Targets.
- The number of lags is the number of columns under Features.
- The lead time is the smallest lag step possible.
6.2 Identify the forecasting task for Store Sales competition
Try to identify the forecast origin and the forecast horizon. How many steps are within the forecast horizon? What is the lead time for the forecast?
The "forecast origin" is time at which you are making a forecast. You may consider it to be the last timestamp of your training data.
The "forecast horizon" is the time range for which you are making a forecast.
The time between the origin and the horizon is the "lead time".
16 steps are within the forecast horizon, leading time is 1.
The training dataset ends on 2017/08/15
(the forecast origin), the test set comprises the dates 2017/08/16
to 2017/08/31
, which includes 16 days (the forecast horizon). There is 1 day between the origin and horizon, so the lead time is 1 day.
6.3 Create multistep dataset for store sales
Create targets suitable for the Store Sales forecasting task. Use 4 days of lag features. Drop any missing values from both targets and features.
# YOUR CODE HERE
y = family_sales.loc[:, 'sales']
# YOUR CODE HERE: Make 4 lag features
X = make_lags(y, lags=4).dropna()
# YOUR CODE HERE: Make multistep target
y = make_multistep_target(y, steps=16).dropna()
y, X = y.align(X, join='inner', axis=0)
# Check your answer
q_3.check()
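The make_multistep_target helper is provided by the notebook; reconstructed from its usage, it plausibly looks like this (my sketch, not the exact course code):
import pandas as pd

def make_multistep_target(ts, steps):
    # one column per forecast step: y_step_1 is the current day,
    # y_step_2 the next day, and so on
    return pd.concat(
        {f"y_step_{i + 1}": ts.shift(-i) for i in range(steps)},
        axis=1,
    )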
6.4 Forecast with the DirRec strategy
Instantiate a model that applies the DirRec strategy to XGBoost.
The DirRec strategy is a combination of the direct and recursive strategies: train a model for each step, and use forecasts from previous steps as new lag features.
from sklearn.multioutput import RegressorChain
# YOUR CODE HERE
model = RegressorChain(XGBRegressor())
# Check your answer
q_4.check()
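Usage then follows the usual fit/predict pattern (a sketch; X and y are the features and multistep targets from 6.3):
# RegressorChain fits one regressor per target column and feeds each model's
# predictions to the next model as extra features -- the DirRec idea.
model.fit(X, y)
y_pred = pd.DataFrame(model.predict(X), index=y.index, columns=y.columns)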