Topological Anomaly Detection

Posted on August 4, 2014 by shaggorama

(tl;dr: https://github.com/dmarx/Topological-Anomaly-Detection)

Recently I had the pleasure of attending a presentation by Dr. Bill Basener, one of the authors of this paper, which describes an outlier analysis technique called Topological Anomaly Detection (TAD). Despite the intimidating name, the algorithm is extremely simple, both to understand and to implement.

The technique is essentially a density-based outlier detection algorithm that, instead of calculating local densities, constructs a graph of the data using nearest neighbors. The algorithm differs from other kNN outlier detection algorithms in that instead of setting 'k' as a parameter, you set a maximal inter-observation distance (called the graph "resolution" by Gartley and Basener). If the distance between two points is less than the graph resolution, add an edge between those two observations to the graph. Once the full graph is constructed, determine which connected components comprise the "background" of the data by setting some threshold percentage of observations 'p': any component containing fewer observations than this threshold is considered an anomalous component, and all the observations (nodes) in that component are outliers.

As you probably gathered from my description, it’s a pretty simple algorithm to implement. Let’s walk through it step by step.

First, we need to construct a distance matrix. SciPy has a great utility for this called pdist:

from scipy.spatial.distance import pdist

distance_matrix = pdist(X)

One thing that’s nice about this is that it will work for either numpy arrays or pandas dataframes, as long as the observations are on the rows and the features are on the columns. The output of pdist is, by default, a “condensed” distance matrix, which is just the upper triangular matrix collapsed into a flat vector. This is nice because it takes up (just) less than half the memory, but it’s slightly more cumbersome to work with than a straight up distance matrix.
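To make the condensed layout concrete, here's a tiny illustrative sketch (my addition, not from the original post) comparing pdist's flat vector with the square matrix that scipy.spatial.distance.squareform would expand it into:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Four toy points in 2-D: n = 4 observations yield n*(n-1)/2 = 6 pairwise distances.
X_toy = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.]])

condensed = pdist(X_toy)        # shape (6,): d(0,1), d(0,2), d(0,3), d(1,2), d(1,3), d(2,3)
square = squareform(condensed)  # shape (4, 4): the full symmetric distance matrix

print(condensed.shape)  # (6,)
print(square.shape)     # (4, 4)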

We're ultimately going to treat this distance matrix as an adjacency matrix, but before we do, we need to replace all entries in the distance matrix (ok, distance vector) that are larger than the graph resolution "r" with zeroes to remove those edges from the graph. I think it's a little silly to expect the user to come up with a discrete distance to use for the resolution, so we're going to give the user the option to provide a quantile instead as the 'rq' parameter below.

It’s tempting to just modify the distance matrix in place, but we’re going to need it later to score outliers, so we’re going to work with a copy instead.

import numpy as np

def trim_adjacency_matrix(adj, r=None, rq=.1):
    if r is None:
        # This is really just a lazy quantile function.
        q = int(np.floor(len(adj)*rq))
        print "q:", q
        r = np.sort(adj)[q]
        print "r:", r
    adj2 = adj.copy()
    adj2[adj > r] = 0
    return adj2, r
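Just to sanity-check the trimming, here's a quick usage sketch (my own illustrative addition, borrowing the iris data that the post loads later):

from sklearn import datasets

iris = datasets.load_iris()
adj = pdist(iris.data)                        # 150 observations -> 11175 pairwise distances
edges, r = trim_adjacency_matrix(adj, rq=.1)  # keep roughly the closest 10% of pairs as edges
print("%d of %d pairwise distances kept as edges" % ((edges > 0).sum(), len(edges)))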

Tada! We’ve got our graph. If you don’t work with graph data a lot you might not see it, but an adjacency matrix is just a way of encoding a graph. But, the adjacency matrix isn’t enough: we want to mine the connected components from the graph. Thankfully, this is trivial with networkx.

Now, networkx expects a square matrix if we're going to build a graph using an adjacency matrix, but we have a vector. We could convert this to a full matrix by calling scipy.spatial.distance.squareform, but this will take up double the space in memory, and it's possible that a user is working with a large enough dataset that this will be a problem, so let's work around the condensed distance matrix.

There's probably a better way to go about this, such as a formula that converts (i, j) pairs into the appropriate index into this condensed matrix (vector), but I was lazy when I wrote this up, so instead of trying to find the right formula I took some inspiration from a stackoverflow post discussing how to work with condensed distance matrices, which led me to iterate through each vector index paired with its corresponding (i, j) index in the square matrix using enumerate(itertools.combinations). This essentially gives me an edge list, permitting me to build up the graph one edge at a time:

from itertools import combinations
import networkx as nx

def construct_graph(edges, n):
    g = nx.Graph()
    # Walk the condensed distance vector in lockstep with the (i, j) pairs it
    # encodes; a nonzero entry means the pair is within the resolution.
    for z, ij in enumerate(combinations(range(n), 2)):
        d = edges[z]
        if d:
            i, j = ij
            g.add_edge(i, j, weight=d)
    return g
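For what it's worth, the direct (i, j)-to-condensed-index formula mentioned above does exist. Here's a sketch of it (this helper is my addition, not part of the original code, which sticks with the enumerate(combinations(...)) loop):

def condensed_index(i, j, n):
    # Index into a pdist-style condensed vector for the pair (i, j) of n points,
    # assuming i < j; this mirrors pdist's row-major upper-triangle ordering.
    return n*i - i*(i + 1)//2 + (j - i - 1)

# e.g. with n = 4 points, the pair (1, 3) sits at position 4 in the condensed vector
assert condensed_index(1, 3, 4) == 4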

Having constructed this graph structure, we can extract the connected components and score them as “background” or “anomaly” by counting the number of nodes in each component and comparing against the ‘p’ threshold. As a convenience for later, I’m also going to add “class” and “color” attributes to all the nodes in the graph.

def flag_anomalies(g, min_pts_bgnd, node_colors={'anomalies':'r', 'background':'b'}):
    res = {'anomalies':[], 'background':[]}
    # Components smaller than the background threshold are anomalous.
    for c in nx.connected_components(g):
        if len(c) < min_pts_bgnd:
            res['anomalies'].extend(c)
        else:
            res['background'].extend(c)
    # Tag each node with its class and a plotting color for later.
    for type, array in res.iteritems():
        for node_id in array:
            g.node[node_id]['class'] = type
            g.node[node_id]['color'] = node_colors[type]
    return res, g
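To see the 'p' threshold in action, here's a small illustrative check (again my own addition, reusing the iris edges from the earlier sketch; with n = 150 and p = 0.1, the background cutoff is 15 observations):

g = construct_graph(edges, iris.data.shape[0])
print(sorted(len(c) for c in nx.connected_components(g)))  # a few tiny components vs. large background ones
classed, g = flag_anomalies(g, min_pts_bgnd=iris.data.shape[0]*.1)
print("%d anomalies, %d background" % (len(classed['anomalies']), len(classed['background'])))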

Last but not least, let's score those anomalies: each outlier's score is simply its distance to the nearest "background" observation. For convenience, I'm going to wrap these scores in a pandas.Series, but this is the only place in our code we're using pandas, so it would actually speed us up a bit not to do it this way (because then we could completely eliminate the pandas import):

import pandas as pd

def calculate_anomaly_scores(classed, adj, n):
    # Score each anomaly by its distance to the nearest background observation.
    scores = {}
    for a in classed['anomalies']:
        scores[a] = 0
        for z, ij in enumerate(combinations(range(n), 2)):
            i, j = ij
            if (i == a or j == a) and (
                    i in classed['background'] or
                    j in classed['background']):
                d = adj[z]
                if scores[a]:
                    scores[a] = np.min([scores[a], d])
                else:
                    scores[a] = d
    return pd.Series(scores)
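A quick illustrative peek at the scores (continuing the same hypothetical iris setup from the earlier sketches):

scores = calculate_anomaly_scores(classed, adj, iris.data.shape[0])
print(scores.describe())  # summary statistics of the anomaly scores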

Great! Now let's put all the pieces together to construct our outlier classification tool.

def tad_classify(X, method='euclidean', r=None, rq=.1, p=.1, distances=None):
    # Use precomputed distances if they were provided; otherwise compute them.
    if distances is None:
        adj = pdist(X, method)
    else:
        adj = distances
    edges, r = trim_adjacency_matrix(adj, r, rq)
    n = X.shape[0]
    g = construct_graph(edges, n)
    classed, g = flag_anomalies(g, n*p)
    scores = calculate_anomaly_scores(classed, adj, n)
    return {'classed':classed, 'g':g, 'scores':scores, 'r':r, 'min_pts_bgnd':n*p, 'distances':adj}

Now that we’ve built this thing, let’s try it out on the iris data. I’m going to visualize the result using a pairs plot (a “scatter_matrix” in pandas) which will allow us to see how the outliers relate to the rest of the data across all pairs of dimensions along which we can slice the data.

import matplotlib.pyplot as plt
from pandas.tools.plotting import scatter_matrix
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data)
res = tad_classify(df)

# Flag the detected anomalies and make them stand out in the pairs plot.
df['anomaly'] = 0
df.anomaly.ix[res['classed']['anomalies']] = 1
scatter_matrix(df.ix[:, :4], c=df.anomaly, s=(25 + 50*df.anomaly), alpha=.8)
plt.show()

The pairs plot is a great way to visualize the data, but if we had more than 4 dimensions this wouldn't be a viable option. Instead, let's reduce the dimensionality of the data using PCA, just for visualization purposes. While we're at it, let's actually visualize how the observations are connected in the graph components to get a better idea of what the algorithm is doing.

from sklearn.decomposition import PCA

g = res['g']
# Project the data to 2-D for layout purposes only; the graph itself was
# built in the original feature space.
X_pca = PCA().fit_transform(df)
pos = dict((i, (X_pca[i, 0], X_pca[i, 1])) for i in range(X_pca.shape[0]))
colors = [node[1]['color'] for node in g.nodes(data=True)]
labels = {}
for node in g.nodes():
    if node in res['classed']['anomalies']:
        labels[node] = node
    else:
        labels[node] = ''

nx.draw(g, pos=pos, node_color=colors, labels=labels)
plt.show()

If we reduce the graph resolution, we get more observations classed as outliers. Here's what it looks like with rq=0.05 (the default, used above, is 0.10):

And here’s rq=0.03:

In case these images don’t give you the intuition, reducing the graph resolution results in breaking up the graph into more and more components. Changing the ‘p’ parameter has a much less granular effect (as long as most of the observations are in a small number of components): changing ‘p’ will have basically no effect for small increments, but above a threshold we end up classifying large swathes of the graph as outliers.
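If you'd rather see this numerically than visually, a quick sweep over the resolution quantile (my own illustrative addition) shows the effect directly:

# Smaller rq -> lower resolution distance -> more, smaller components -> more outliers.
for rq in (.1, .05, .03):
    out = tad_classify(iris.data, rq=rq)
    print("rq=%.2f: %d anomalies" % (rq, len(out['classed']['anomalies'])))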

The images above have slightly different layouts because I ran PCA fresh each time. In retrospect, it would have looked better if I hadn’t done that, but you get the idea. Maybe I’ll fix that later.
