Discovery the motivation of behavior from electricity consumption signal

Key Points

This chapter briefly elaborates how to analyze the motivation of people's operation on a system from the system electricity consumption signal and other data.

  • Objective: understand how, and why occupants interact with the system.
  • System: ventilation system in passive houses with adjustable flow rate option.
  • Raw data: electricity consumption signal; environment sensor records (temperature, humidity, CO2 etc.); 3-min interval * 2 years (2013-2015): 325946 rows × 25 features.
  • Technique: Noise reduction (Gaussian filter); Edge detection (1st derivative Gaussian filter), Feature selection (L1-penalized logistic regression, recursive feature elimination)

Below Figure 1 shows the overall pipeline I designed.

flowchart

(1) After essential preprocessing and cleaning (NaNs are backfilled), start with a system electricity consumption signal like Figure 2 below. A sudden change in the signal could imply the occupants' interaction with the system (e.g. once the occupant turn the flow rate into a higher option there should be a steep increasing edge on the electricity consumption signal). First thing to do is filtering out the noise (caused by wind etc. or system itself) and "fake operation" (status change with too-short duration).

Figure 2 Demo of Electricity Consumption Signal

(2) Through a finely-tuned 1st derivative Gaussian filter, the noise and "fake operation" could be filtered out and the valid operations would be marked out, like shown in Figure 3 below.

Noise reduced signal
1st Derivative filtered signal
Operation Marked

(3) Then the marked data set would undergo an undersampling process since the dataset is now skewed (The no. of records marked with 'no operation' is far more than ones with operation, either increase or decrease). The undersampling process ensures the data set has balanced scales with each class, for the effectiveness of following classification algorithm.

(4) After undersampling, the training set would be normalized and then fed into a L1-penalized logistic regression classifier. Since linear model penalized with L1 norm has sparse solutions i.e. many of its estimated coefficients would be zero, it could be used for feature selection purpose. Figure 4 below shows an example of the coefficients output in a certain experiment.

Coefficient output of logistic regression model

Then the logistic regression runs repeatedly to make a recursive feature elimination (first, the estimator is trained on the initial set of features and weights are assigned to each one of them. Then, features whose absolute weights are the smallest are pruned from the current set features). At last, the most informative feature combination (judged by cross-validation accuracy) in this case could be determined, like below Figure 5 shows: these features implies this occupant's motivation for his/her behavior.

Best feature combination after recursive feature elimination

(5) Repeat the process above for different occupants. The results imply there are different kinds of people since their "best feature combination" vary a lot: e.g. some of them are with strong "time pattern" while others may be more sensitive to indoor environment, like temperature etc. A K-Means clustering could help us demonstrate this by grouping the occupants into different user profiles.

Grouping: different user profiles found

From here below is technical log regarding relevant theory and code to realize the whole process.

Technical Details: Noise reduction & Edge detection

Context

There is a ventilation system (with heat recovery) in one passive house, of which the ventilation flow rate is controlled by a fan system, and adjustable by occupants. There are 3 available options (let's say, low, medium, high rate respectively)for the fan flow rate setting.

The electricity consumption of the fan system is recorded by a smart meter in terms of pulse. Obviously, occupants' flow rate setting could put significant influence on the electricity consumption and we could calibrate when and how people adjust their ventilation system based on the electricity consumption.

However, on the one hand, with the influence of back pressure, wind speed etc. the record is not something like a clear 3-stage square wave, instead it is quite noisy. On the other hand, we got many different houses (with similar structure but with different scales of records) within our research. They made it is not really practical to calibrate the ventilation setting position by fixed intervals (like pulse < 3 == position 1; 3 < pulse < 5 == position 2 etc.). We need a new algorithmic method to do this job.

This is a tiny piece of the elec. consumption record (day 185 in year 2014, house #9):


</div>

Methodology

Describe what we want to do in a few words: smooth the noise and detect the edge automatically, without any reset like boundary interval needed. This is actually a classic problem in signal processing or computer vision field. For this 1D signal the simplest solution maybe Gaussian derivative filter, for similar problems in 2D matrix (images) the canny edge detector could be effective. The figure may give you a vivid impression of what we are going to do:


</div>

Basic idea

Tune a Gaussian derivative filter to properly smooth the noise and take 1st derivative, then set an appropriate threshold to detect the edge.

Terms

Gaussian filter

For noise smoothing or "image blur". In layman's words, replace each point by the weighted average of its neighbors, the weights come from Gaussian distribution, then normalize the results.


</div>

Gaussian

(if you are dealing with 2d matrix (images), use 2-D Gaussian instead.)


</div>

Effect:

Gaussian derivative filter

For noise smoothing and edge detection. In layman's words, replace each point by the weighted average of its neighbors, the weights come from the 1st derivative of Gaussian distribution, then normalize the results.


</div>


</div>

Effect:


</div>

Advantages of Gaussian Kernel compared to other low-pass filter:

  • Being possible to derive from a small set of scale-space axioms.
  • Does not introduce new spurious structures at coarse scales that do not correspond to simplifications of corresponding structures at finer scales.

Scale Space

Representing an signal/image as a one-parameter family of smoothed signals/images, parametrized by the size of the smoothing kernel used for suppressing fine-scale structures. Specially for Gaussian kernels: t = sigma^2.


</div>

Results

Finished a demo of auto edge detection in our elec. consumption record, which contains a tuned Gaussian derivative filter, edge position detected, and scale space plot.

Original


</div>

Gaussian filter smoothed (sigma = 8)


</div>

1st derivative Gaussian filtered (sigma = 8)


</div>

Edge position detected (threshold = 0.07 * global min/max)


</div>

Scale Space (sigma = range (1,9))


</div>

Finer Tuning

In practice, it is usually needed to use different tailored tune strategies for the parameters to meet the specific requirements aroused by researchers. E.g. in a case the experts from built environment would like to filter out short-lived status (even they maybe quite steep in terms of pulse number). The strategies is carefully increase sigma (by which you are flattening the Gaussian curve, so the weights of center would be less significant so that the short peaks could be better wiped out by its flat neighbors) and also, properly increase the threshold would help (by which it would be more difficult for the derivatives of smoothed short peaks to pass the threshold and be recognized as one effective operation). Once the sigma and threshold reached an optimized combination, the results would be something like below for this case:

Edge position detected (Sigma = 10, threshold = 0.35 * global min/max)


</div>

In a larger scale, see how does our finely-tuned lazy filter work to filter the fake operations out! (Sigma = 20, threshold = 0.5 * global min/max)


</div>

Reference

Technical Details: Feature selection

Before feature selection I made an undersampling to the data set to ensure every class shares a balanced weight in the whole dataset (before which the ratio is something like 150,000 no operation, 400 increase, 400 decrease).

The feature selection process is carried out in Python with scikit-learn. First each feature in the data set need to be standardized since the objective function of the l1 regularized linear model we use in this case assumes that all features are centered on zero and have variance in the same order. If a feature has a significantly lager scale or variance compared to others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected. In this case I used sklearn.preprocessing.scale() to standardize each feature to zero mean and unit variance.

Then the standardized data set was fed into a recursive feature elimination with cross-validation (REFCV) loop with a L1-penalized logistic regression kernel since linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero, which could be used for feature selection purpose.

Below is the main part of the coding script for this session (ipynb format).

Feature selection after Gaussian filter
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import math

Data Loading

ventpos = pd.read_csv("/Users/xinyuyangren/Desktop/1demo.csv")
ventpos = ventpos.fillna(method='backfill')
ventpos.head()
ventpos.op.value_counts()

UnderSampling

sample_size = math.ceil((sum(ventpos.op == 1) + sum(ventpos.op == -1))/2)
sample_size
noop_indices = ventpos[ventpos.op == 0].index
noop_indices
random_indices = np.random.choice(noop_indices, sample_size, replace=False)
random_indices
noop_sample = ventpos.loc[random_indices]
up_sample = ventpos[ventpos.op == 1]
down_sample = ventpos[ventpos.op == -1]
op_sample = pd.concat([up_sample,down_sample])
op_sample.head()

Feature selection: up operation

undersampled_up = pd.concat([up_sample,noop_sample])
undersampled_up.head()
#generate month/hour attribute from datetime string  
undersampled_up.dt = pd.to_datetime(undersampled_up.dt)
t = pd.DatetimeIndex(undersampled_up.dt)
hr = t.hour
undersampled_up['HourOfDay'] = hr
month = t.month
undersampled_up['Month'] = month
year = t.year
undersampled_up['Year'] = year
undersampled_up.head()
for col in undersampled_up:
    print col
def remap(x):
    if x == 't':
        x = 0
    else:
        x = 1
    return x

for col in ['wc_lr', 'wc_kitchen', 'wc_br3', 'wc_br2', 'wc_attic']:
    w = undersampled_up[col].apply(remap)
    undersampled_up[col] = w
undersampled_up.head()
openwin = undersampled_up.wc_attic + undersampled_up.wc_br2 + undersampled_up.wc_br3 + undersampled_up.wc_kitchen + undersampled_up.wc_lr
undersampled_up['openwin'] = openwin;
undersampled_up = undersampled_up.drop(['wc_lr', 'wc_kitchen', 'wc_br3', 'wc_br2', 'wc_attic','Year','dt','pulse_channel_ventilation_unit'],axis = 1)
undersampled_up.head()
for col in undersampled_up:
    print col

Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
#shuffle the order
undersampled_up = undersampled_up.reindex(np.random.permutation(undersampled_up.index))
undersampled_up.head()
y = undersampled_up.pop('op')
# Columnwise Normalizaion
from sklearn import preprocessing
X_scaled = pd.DataFrame()
for col in undersampled_up:
    X_scaled[col] = preprocessing.scale(undersampled_up[col])
X_scaled.head()
from sklearn import cross_validation
lg = LogisticRegression(penalty='l1',C = 0.1)
scores = cross_validation.cross_val_score(lg, X_scaled, y, cv=10)
#The mean score and the 95% confidence interval of the score estimate
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
clf = lg.fit(X_scaled, y)
plt.figure(figsize=(12,9))
y_pos = np.arange(len(X_scaled.columns))
plt.barh(y_pos,abs(clf.coef_[0]))
plt.yticks(y_pos + 0.4,X_scaled.columns)
plt.title('Feature Importance from Logistic Regression')

REFCV FEATURE OPTIMIZATIN

from sklearn.feature_selection import RFECV
selector = RFECV(lg, step=1, cv=10)
selector = selector.fit(X_scaled, y)
mask = selector.support_ 
mask
selector.ranking_
X_scaled.keys()[mask]
selector.score(X_scaled, y)
X_selected = pd.DataFrame()
for col in X_scaled.keys()[mask]:
    X_selected[col] = X_scaled[col]
X_selected.head()
scores = cross_validation.cross_val_score(lg, X_selected, y, cv=10)
#The mean score and the 95% confidence interval of the score estimate
scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
clf_final = lg.fit(X_selected,y)
y_pos = np.arange(len(X_selected.columns))
plt.barh(y_pos,abs(clf_final.coef_[0]))
plt.yticks(y_pos + 0.4,X_scaled.columns)
plt.title('Feature Importance After RFECV Logistic Regression')
Reference
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 213,254评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,875评论 3 387
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 158,682评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,896评论 1 285
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,015评论 6 385
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,152评论 1 291
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,208评论 3 412
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,962评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,388评论 1 304
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,700评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,867评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,551评论 4 335
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,186评论 3 317
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,901评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,142评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,689评论 2 362
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,757评论 2 351

推荐阅读更多精彩内容

  • 闹钟调晚了半小时,如厕思考后,已近七点十分,为了能够专注,带上耳机,选择了一首久石让《人生的回转木马》,伴随着这个...
    刘小权PCC认证教练阅读 814评论 0 4
  • Version是app发布时用户看到的版本号。 Build的为了方便开发者多次提交binary, 比如被苹果rej...
    XLsn0w阅读 489评论 0 1
  • 01你选择什么样的生活? 与一妹纸外出,转搭地铁,恰逢早高峰,地铁的人很多。对于广州早高峰这种拥挤的场景,我非常熟...
    念之声阅读 307评论 0 1
  • 1.使用Three.js渲染导出的DAE 在Three.js中使用Collada(即.dae)文件的话,首先得要用...
    kingder阅读 3,359评论 1 1
  • 四十多年来,这三十天使人难忘,叫人激动! 无意间的一次刷屏,邂逅了这个‘’好报三十天写作‘’群,于是,年少时的那个...
    宝天曼风景阅读 290评论 0 0