Notes: Anomaly Detection in Time Series Data Using LSTMs and Automatic Thresholding

### 基础攻防领域的异常检测问题

```count(distinct request) group by src_ip,dst_ip,dst_port
count(distinct value) group by src_ip,dst_ip,key```

### 3-sigma与滑动窗口

3sigma方法中，异常值被定义为一组结果值中与平均值的偏差超过三倍标准差的值。在标注正态分布的假设下，超过3sigma的数据概率为0.003，因此可认为是异常值。

DataCon DNS方向第一名思路分享pdf

### 动态阈值算法

Anomaly Detection in Time Series Data Using LSTMs and Automatic Thresholding

1. 滑动窗口+LSTM学习时序行为，并给出下一步预测。

1. 求预测值与真实观测值的误差。

1. 对误差使用EWMA平滑。

1. 采用动态阈值方法判定该误差是否为异常。

https://github.com/khundman/telemanom

```def find_epsilon(e_s, error_buffer, sd_lim=12.0):
'''Find the anomaly threshold that maximizes function representing tradeoff between a) number of anomalies
and anomalous ranges and b) the reduction in mean and st dev if anomalous points are removed from errors
(see https://arxiv.org/pdf/1802.04431.pdf)
Args:
e_s (array): residuals between y_test and y_hat values (smoothes using ewma)
error_buffer (int): if an anomaly is detected at a point, this is the number of surrounding values
to add the anomalous range. this promotes grouping of nearby sequences and more intuitive results
sd_lim (float): The max number of standard deviations above the mean to calculate as part of the
argmax function
Returns:
sd_threshold (float): the calculated anomaly threshold in number of standard deviations above the mean
'''
mean = np.mean(e_s)
sd = np.std(e_s)
max_s = 0
sd_threshold = sd_lim # default if no winner or too many anomalous ranges
for z in np.arange(2.5, sd_lim, 0.5):
epsilon = mean + (sd*z)
pruned_e_s, pruned_i, i_anom  = [], [], []
for i,e in enumerate(e_s):
if e < epsilon:
pruned_e_s.append(e)
pruned_i.append(i)
if e > epsilon:
for j in range(0, error_buffer):
if not i + j in i_anom and not i + j >= len(e_s):
i_anom.append(i + j)
if not i - j in i_anom and not i - j < 0:
i_anom.append(i - j)
if len(i_anom) > 0:
# preliminarily group anomalous indices into continuous sequences (# sequences needed for scoring)
i_anom = sorted(list(set(i_anom)))
groups = [list(group) for group in mit.consecutive_groups(i_anom)]
E_seq = [(g[0], g[-1]) for g in groups if not g[0] == g[-1]]
perc_removed = 1.0 - (float(len(pruned_e_s)) / float(len(e_s)))
mean_perc_decrease = (mean - np.mean(pruned_e_s)) / mean
sd_perc_decrease = (sd - np.std(pruned_e_s)) / sd
s = (mean_perc_decrease + sd_perc_decrease) / (len(E_seq)**2 + len(i_anom))
#             print('z=',z,'s=',s)
# sanity checks
if s >= max_s and len(E_seq) <= 5 and len(i_anom) < (len(e_s)*0.5):
sd_threshold = z
max_s = s
return sd_threshold #multiply by sd to get epsilon```

```def calc(x,y,rolling_window,err_buffer,max_z=6):
y = y.astype(float)
mean = y.rolling(rolling_window).mean()
ewma = y.ewm(span=rolling_window*0.03,ignore_na=True).mean()
std = y.rolling(rolling_window).std()
epsilon = y.rolling(rolling_window).apply(lambda x:find_epsilon(x,err_buffer,max_z),raw=True)
#     epsilon = y.rolling(rolling_window).apply(lambda x:find_epsilon_new(x,err_buffer,max_z),raw=True)
threshold_test = mean+epsilon*std
threshold_3sigma = mean+3*std
print('calc on: \nlen(y)',len(y),'\nrolling_window',rolling_window,'\nerr_buffer',err_buffer,'\nmax_z',max_z)
plt.figure(figsize=(20,5))
plt.plot(x,y,label='y') # 观测值
plt.plot(x,mean,label='mean') # 均值
plt.plot(x,ewma,label='ewma') # 加权平均
plt.plot(x,threshold_3sigma,label='z_3') # 3-sigma
plt.plot(x,threshold_test,label='z_dynamic') # 动态阈值
for i in range(len(y)):
if y[i]>threshold_3sigma[i]: # 3-sigma检测结果
plt.vlines(x[i],ymin=-1,ymax=-0.1)
if y[i]>threshold_test[i]: # 动态阈值检测结果
plt.vlines(x[i],ymin=-2,ymax=-1.1)
plt.legend()
plt.show()```

`sd_threshold = sd_lim # default if no winner or too many anomalous ranges`

`threshold_ewma = threshold_test.ewm(span=rolling_window*0.03,ignore_na=True).mean()`