## Contents

5.1.2 The Forget Gate of the LSTM
5.1.3 The Input Gate of the LSTM
5.1.4 The Cell-State Update of the LSTM
5.1.5 The Output Gate of the LSTM
5.1.6 Summary of LSTM Forward Propagation
5.2 Backward Gradient Derivation for the LSTM
5.3 Why the LSTM Mitigates Vanishing Gradients

$\boldsymbol{a}^{\boldsymbol{l}}$ denotes the output of $\boldsymbol{z}^{\boldsymbol{l}}$ after the activation function.

## 5.1.2 The Forget Gate of the LSTM

$\boldsymbol{f}^{(t)} = \sigma\left( \boldsymbol{W}_{f}\boldsymbol{h}^{(t-1)} + \boldsymbol{U}_{f}\boldsymbol{x}^{(t)} + \boldsymbol{b}_{f} \right)$

## 5.1.3 The Input Gate of the LSTM

$\boldsymbol{i}^{(t)} = \sigma\left( \boldsymbol{W}_{i}\boldsymbol{h}^{(t-1)} + \boldsymbol{U}_{i}\boldsymbol{x}^{(t)} + \boldsymbol{b}_{i} \right)$

$\boldsymbol{a}^{(t)} = \tanh\left( \boldsymbol{W}_{a}\boldsymbol{h}^{(t-1)} + \boldsymbol{U}_{a}\boldsymbol{x}^{(t)} + \boldsymbol{b}_{a} \right)$

## 5.1.4 The Cell-State Update of the LSTM

$\boldsymbol{C}^{(t)} = \boldsymbol{C}^{(t-1)}\odot\boldsymbol{f}^{(t)} + \boldsymbol{a}^{(t)}\odot\boldsymbol{i}^{(t)}$

## 5.1.5 The Output Gate of the LSTM

$\boldsymbol{o}^{(t)} = \sigma\left( \boldsymbol{W}_{o}\boldsymbol{h}^{(t-1)} + \boldsymbol{U}_{o}\boldsymbol{x}^{(t)} + \boldsymbol{b}_{o} \right)$

$\boldsymbol{h}^{(t)} = \boldsymbol{o}^{(t)}\odot\tanh\left( \boldsymbol{C}^{(t)} \right)$

## 5.1.6 Summary of LSTM Forward Propagation

1) Update the forget-gate output:

$\boldsymbol{f}^{(t)} = \sigma\left( \boldsymbol{W}_{f}\boldsymbol{h}^{(t-1)} + \boldsymbol{U}_{f}\boldsymbol{x}^{(t)} + \boldsymbol{b}_{f} \right)$

2) Update the input gate and the candidate state it controls:

$\boldsymbol{i}^{(t)} = \sigma\left( \boldsymbol{W}_{i}\boldsymbol{h}^{(t-1)} + \boldsymbol{U}_{i}\boldsymbol{x}^{(t)} + \boldsymbol{b}_{i} \right)$

$\boldsymbol{a}^{(t)} = \tanh\left( \boldsymbol{W}_{a}\boldsymbol{h}^{(t-1)} + \boldsymbol{U}_{a}\boldsymbol{x}^{(t)} + \boldsymbol{b}_{a} \right)$

3) Update the cell state, so that $\boldsymbol{C}^{(t-1)} \rightarrow \boldsymbol{C}^{(t)}$:

$\boldsymbol{C}^{(t)} = \boldsymbol{C}^{(t-1)}\odot\boldsymbol{f}^{(t)} + \boldsymbol{a}^{(t)}\odot\boldsymbol{i}^{(t)}$

4) Update the output gate and the hidden state it controls, so that $\boldsymbol{h}^{(t-1)} \rightarrow \boldsymbol{h}^{(t)}$:

$\boldsymbol{o}^{(t)} = \sigma\left( \boldsymbol{W}_{o}\boldsymbol{h}^{(t-1)} + \boldsymbol{U}_{o}\boldsymbol{x}^{(t)} + \boldsymbol{b}_{o} \right)$

$\boldsymbol{h}^{(t)} = \boldsymbol{o}^{(t)}\odot\tanh\left( \boldsymbol{C}^{(t)} \right)$

5) Obtain the prediction at the current time step $t$:

${\hat{\boldsymbol{y}}}^{(t)} = \sigma\left( {\boldsymbol{V}\boldsymbol{h}^{(t)} + \boldsymbol{c}} \right)$
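The five steps above can be sketched in NumPy. This is a minimal illustration rather than a production implementation; the parameter names (`Wf`, `Uf`, ..., `V`, `c`) mirror the matrices in the text, and the dimensions are arbitrary toy values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, p):
    """One LSTM forward step following steps 1)-5)."""
    f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x + p["bf"])  # 1) forget gate
    i = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x + p["bi"])  # 2) input gate
    a = np.tanh(p["Wa"] @ h_prev + p["Ua"] @ x + p["ba"])  #    candidate state
    C = C_prev * f + a * i                                 # 3) cell update
    o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x + p["bo"])  # 4) output gate
    h = o * np.tanh(C)                                     #    hidden state
    y_hat = sigmoid(p["V"] @ h + p["c"])                   # 5) prediction
    return C, h, y_hat

# toy parameters: nh hidden units, nx input features, 2 output units
rng = np.random.default_rng(0)
nh, nx = 4, 3
p = {}
for g in "fiao":
    p["W" + g] = 0.1 * rng.standard_normal((nh, nh))
    p["U" + g] = 0.1 * rng.standard_normal((nh, nx))
    p["b" + g] = np.zeros(nh)
p["V"] = 0.1 * rng.standard_normal((2, nh))
p["c"] = np.zeros(2)

C1, h1, y1 = lstm_step(rng.standard_normal(nx), np.zeros(nh), np.zeros(nh), p)
```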

## 5.2 Backward Gradient Derivation for the LSTM

Define the two error terms propagated backward through time as

$\boldsymbol{\delta}_{h}^{(t)} = \frac{\partial L}{\partial\boldsymbol{h}^{(t)}}$

$\boldsymbol{\delta}_{C}^{(t)} = \frac{\partial L}{\partial\boldsymbol{C}^{(t)}}$

At the final time step $T$ they are

$\boldsymbol{\delta}_{h}^{(T)} = \boldsymbol{V}^{T}\left( \hat{\boldsymbol{y}}^{(T)} - \boldsymbol{y}^{(T)} \right)$

$\boldsymbol{\delta}_{C}^{(T)} = \left( \frac{\partial\boldsymbol{h}^{(T)}}{\partial\boldsymbol{C}^{(T)}} \right)^{T}\frac{\partial L}{\partial\boldsymbol{h}^{(T)}} = \boldsymbol{\delta}_{h}^{(T)}\odot\boldsymbol{o}^{(T)}\odot\tanh'\left( \boldsymbol{C}^{(T)} \right)$
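As a sanity check, this terminal condition can be verified numerically: with $\boldsymbol{h} = \boldsymbol{o}\odot\tanh(\boldsymbol{C})$, differentiate the scalar $\boldsymbol{\delta}_{h}^{T}\boldsymbol{h}$ with respect to $\boldsymbol{C}$ by central differences and compare with the closed form. A minimal NumPy sketch using random toy vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
o = rng.uniform(0.1, 0.9, n)       # output-gate activations at step T
C = rng.standard_normal(n)         # cell state C^{(T)}
delta_h = rng.standard_normal(n)   # incoming error δ_h^{(T)}

# closed form: δ_C^{(T)} = δ_h ⊙ o ⊙ (1 - tanh²(C)), since tanh'(z) = 1 - tanh²(z)
delta_C = delta_h * o * (1.0 - np.tanh(C) ** 2)

# central-difference gradient of δ_h · (o ⊙ tanh(C)) w.r.t. each C_k
eps = 1e-6
num = np.empty(n)
for k in range(n):
    e = np.zeros(n); e[k] = eps
    num[k] = (delta_h @ (o * np.tanh(C + e))
              - delta_h @ (o * np.tanh(C - e))) / (2 * eps)
```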

The loss reaches $\boldsymbol{h}^{(t)}$ along five paths:

1) $l(t) \rightarrow \boldsymbol{h}^{(t)}$

2) $\boldsymbol{h}^{(t+1)} \rightarrow \boldsymbol{o}^{(t+1)} \rightarrow \boldsymbol{h}^{(t)}$

3) $\boldsymbol{C}^{(t+1)} \rightarrow \boldsymbol{i}^{(t+1)} \rightarrow \boldsymbol{h}^{(t)}$

4) $\boldsymbol{C}^{(t+1)} \rightarrow \boldsymbol{a}^{(t+1)} \rightarrow \boldsymbol{h}^{(t)}$

5) $\boldsymbol{C}^{(t+1)} \rightarrow \boldsymbol{f}^{(t+1)} \rightarrow \boldsymbol{h}^{(t)}$

$\boldsymbol{\delta}_{h}^{(t)} = \frac{\partial L\left( t \right)}{\partial\boldsymbol{h}^{(t)}} = \frac{\partial l\left( t \right)}{\partial\boldsymbol{h}^{(t)}} + \left( \frac{\partial\boldsymbol{C}^{(t+1)}}{\partial\boldsymbol{h}^{(t)}} \right)^{T}\boldsymbol{\delta}_{C}^{(t+1)} + \left( \frac{\partial\boldsymbol{h}^{(t+1)}}{\partial\boldsymbol{o}^{(t+1)}}\frac{\partial\boldsymbol{o}^{(t+1)}}{\partial\boldsymbol{h}^{(t)}} \right)^{T}\boldsymbol{\delta}_{h}^{(t+1)}$

$\frac{\partial l\left( t \right)}{\partial\boldsymbol{h}^{(t)}} = \boldsymbol{V}^{T}\left( \hat{\boldsymbol{y}}^{(t)} - \boldsymbol{y}^{(t)} \right)$

$d\boldsymbol{C}^{(t+1)} = \boldsymbol{C}^{(t)}\odot d\boldsymbol{f}^{(t+1)} + \boldsymbol{a}^{(t+1)}\odot d\boldsymbol{i}^{(t+1)} + \boldsymbol{i}^{(t+1)}\odot d\boldsymbol{a}^{(t+1)}$

$= diag\left( \boldsymbol{C}^{(t)} \right)d\boldsymbol{f}^{(t+1)} + diag\left( \boldsymbol{a}^{(t+1)} \right)d\boldsymbol{i}^{(t+1)} + diag\left( \boldsymbol{i}^{(t+1)} \right)d\boldsymbol{a}^{(t+1)}$

Substituting the differentials of the three gates with respect to $\boldsymbol{h}^{(t)}$ (treating $\boldsymbol{x}^{(t+1)}$ as constant):

$d\boldsymbol{C}^{(t+1)} = diag\left( \boldsymbol{C}^{(t)}\odot\boldsymbol{f}^{(t+1)}\odot\left( 1 - \boldsymbol{f}^{(t+1)} \right) \right)\boldsymbol{W}_{f}d\boldsymbol{h}^{(t)} + diag\left( \boldsymbol{a}^{(t+1)}\odot\boldsymbol{i}^{(t+1)}\odot\left( 1 - \boldsymbol{i}^{(t+1)} \right) \right)\boldsymbol{W}_{i}d\boldsymbol{h}^{(t)} + diag\left( \boldsymbol{i}^{(t+1)}\odot\left( 1 - \left( \boldsymbol{a}^{(t+1)} \right)^{2} \right) \right)\boldsymbol{W}_{a}d\boldsymbol{h}^{(t)}$

$\frac{\partial\boldsymbol{C}^{(t+1)}}{\partial\boldsymbol{h}^{(t)}} = diag\left( \boldsymbol{C}^{(t)}\odot\boldsymbol{f}^{(t+1)}\odot\left( 1 - \boldsymbol{f}^{(t+1)} \right) \right)\boldsymbol{W}_{f} + diag\left( \boldsymbol{a}^{(t+1)}\odot\boldsymbol{i}^{(t+1)}\odot\left( 1 - \boldsymbol{i}^{(t+1)} \right) \right)\boldsymbol{W}_{i} + diag\left( \boldsymbol{i}^{(t+1)}\odot\left( 1 - \left( \boldsymbol{a}^{(t+1)} \right)^{2} \right) \right)\boldsymbol{W}_{a}$
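This Jacobian can likewise be checked against finite differences. The sketch below uses toy dimensions and folds the $\boldsymbol{x}^{(t+1)}$ terms into the bias vectors; it builds the analytic $\partial\boldsymbol{C}^{(t+1)}/\partial\boldsymbol{h}^{(t)}$ and compares it with a numerical Jacobian of the cell update:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 3
Wf, Wi, Wa = (0.5 * rng.standard_normal((n, n)) for _ in range(3))
bf, bi, ba = (rng.standard_normal(n) for _ in range(3))
C_t = rng.standard_normal(n)   # C^{(t)}, held fixed
h_t = rng.standard_normal(n)   # h^{(t)}, the variable of differentiation

def cell_next(h):
    """C^{(t+1)} as a function of h^{(t)} alone."""
    f = sigmoid(Wf @ h + bf)
    i = sigmoid(Wi @ h + bi)
    a = np.tanh(Wa @ h + ba)
    return C_t * f + a * i

# analytic Jacobian, term by term as in the formula above
f = sigmoid(Wf @ h_t + bf)
i = sigmoid(Wi @ h_t + bi)
a = np.tanh(Wa @ h_t + ba)
J = (np.diag(C_t * f * (1 - f)) @ Wf
     + np.diag(a * i * (1 - i)) @ Wi
     + np.diag(i * (1 - a ** 2)) @ Wa)

# central-difference Jacobian, one column per perturbed component of h^{(t)}
eps = 1e-6
J_num = np.empty((n, n))
for k in range(n):
    e = np.zeros(n); e[k] = eps
    J_num[:, k] = (cell_next(h_t + e) - cell_next(h_t - e)) / (2 * eps)
```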

Similarly, along path 2), which passes through $\boldsymbol{o}^{(t+1)}$ only:

$d\boldsymbol{h}^{(t+1)} = \tanh\left( \boldsymbol{C}^{(t+1)} \right)\odot d\boldsymbol{o}^{(t+1)} = diag\left( \tanh\left( \boldsymbol{C}^{(t+1)} \right) \right)diag\left( \boldsymbol{o}^{(t+1)}\odot\left( 1 - \boldsymbol{o}^{(t+1)} \right) \right)\boldsymbol{W}_{o}d\boldsymbol{h}^{(t)} = diag\left( \tanh\left( \boldsymbol{C}^{(t+1)} \right)\odot\boldsymbol{o}^{(t+1)}\odot\left( 1 - \boldsymbol{o}^{(t+1)} \right) \right)\boldsymbol{W}_{o}d\boldsymbol{h}^{(t)}$

The cell-state error follows the recursion

$\boldsymbol{\delta}_{C}^{(t)} = \left( \frac{\partial\boldsymbol{h}^{(t)}}{\partial\boldsymbol{C}^{(t)}} \right)^{T}\boldsymbol{\delta}_{h}^{(t)} + \left( \frac{\partial\boldsymbol{C}^{(t+1)}}{\partial\boldsymbol{C}^{(t)}} \right)^{T}\boldsymbol{\delta}_{C}^{(t+1)} = \boldsymbol{o}^{(t)} \odot \left( 1 - \tanh^{2}\left( \boldsymbol{C}^{(t)} \right) \right) \odot \boldsymbol{\delta}_{h}^{(t)} + \boldsymbol{f}^{(t+1)} \odot \boldsymbol{\delta}_{C}^{(t+1)}$

With $\boldsymbol{\delta}_{h}^{(t)}$ and $\boldsymbol{\delta}_{C}^{(t)}$ available, the parameter gradients follow. Writing $\boldsymbol{z}^{(t)} = \boldsymbol{W}_{f}\boldsymbol{h}^{(t-1)} + \boldsymbol{U}_{f}\boldsymbol{x}^{(t)} + \boldsymbol{b}_{f}$ for the forget-gate pre-activation,

$\frac{\partial L}{\partial\boldsymbol{W}_{f}} = \sum\limits_{t = 1}^{T}\left( \frac{\partial\boldsymbol{C}^{(t)}}{\partial\boldsymbol{z}^{(t)}} \right)^{T}\frac{\partial L}{\partial\boldsymbol{C}^{(t)}}\left( \boldsymbol{h}^{(t-1)} \right)^{T}$

Considering only the forget-gate path through $\boldsymbol{z}^{(t)}$:

$d\boldsymbol{C}^{(t)} = \boldsymbol{C}^{(t-1)} \odot d\boldsymbol{f}^{(t)} = diag\left( \boldsymbol{C}^{(t-1)} \right)diag\left( \boldsymbol{f}^{(t)} \odot \left( 1 - \boldsymbol{f}^{(t)} \right) \right)d\boldsymbol{z}^{(t)} = diag\left( \boldsymbol{f}^{(t)} \odot \left( 1 - \boldsymbol{f}^{(t)} \right) \odot \boldsymbol{C}^{(t-1)} \right)d\boldsymbol{z}^{(t)}$

$\frac{\partial L}{\partial\boldsymbol{W}_{f}} = \sum\limits_{t = 1}^{T}\left( \boldsymbol{\delta}_{C}^{(t)} \odot \boldsymbol{C}^{(t-1)} \odot \boldsymbol{f}^{(t)} \odot \left( 1 - \boldsymbol{f}^{(t)} \right) \right)\left( \boldsymbol{h}^{(t-1)} \right)^{T}$
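A quick way to validate one summand of this gradient is to freeze everything except the forget gate, define a toy scalar loss $L = \boldsymbol{\delta}_{C}^{T}\boldsymbol{C}^{(t)}$, and compare the closed form with central differences. All tensors below are random toy data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 3
Wf = rng.standard_normal((n, n))
bf = rng.standard_normal(n)
h_prev = rng.standard_normal(n)    # h^{(t-1)}
C_prev = rng.standard_normal(n)    # C^{(t-1)}
delta_C = rng.standard_normal(n)   # stands in for δ_C^{(t)}
ai = rng.standard_normal(n)        # frozen a^{(t)} ⊙ i^{(t)} contribution

def loss(W):
    """Toy scalar loss δ_C · C^{(t)} with only the forget gate varying."""
    f = sigmoid(W @ h_prev + bf)
    return delta_C @ (C_prev * f + ai)

# closed form: (δ_C ⊙ C^{(t-1)} ⊙ f ⊙ (1 - f)) (h^{(t-1)})^T
f = sigmoid(Wf @ h_prev + bf)
grad = np.outer(delta_C * C_prev * f * (1 - f), h_prev)

# central-difference gradient, one entry of Wf at a time
eps = 1e-6
grad_num = np.zeros_like(Wf)
for r in range(n):
    for c in range(n):
        E = np.zeros_like(Wf); E[r, c] = eps
        grad_num[r, c] = (loss(Wf + E) - loss(Wf - E)) / (2 * eps)
```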

## 5.3 Why the LSTM Mitigates Vanishing Gradients

Gradients in an LSTM propagate along many paths, but along $\boldsymbol{C}^{(t)} = \boldsymbol{C}^{(t-1)}\odot\boldsymbol{f}^{(t)} + \boldsymbol{a}^{(t)}\odot\boldsymbol{i}^{(t)}$ there are only element-wise multiplications and additions, so the gradient flow along this path is the most stable. Along the other paths, however, the gradient behaves much like in a vanilla RNN: the same weight matrices are still multiplied together repeatedly.
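A small numerical illustration of this point, assuming a hypothetical sequence of forget gates held at 0.95 and a toy recurrent matrix with spectral radius 0.5:

```python
import numpy as np

T, n = 50, 4

# cell path: ∂C^{(t)}/∂C^{(t-1)} is just diag(f^{(t)}), so over T steps
# the gradient scales by the element-wise product of the forget gates
f_seq = np.full((T, n), 0.95)          # forget gates held near 1
cell_scale = np.prod(f_seq, axis=0)    # 0.95**50 ≈ 0.077: decays gently

# vanilla-RNN-style path: the same weight matrix is multiplied T times
W = 0.5 * np.eye(n)                    # toy recurrent matrix, spectral radius 0.5
rnn_scale = np.linalg.matrix_power(W, T)  # entries shrink like 0.5**50 ≈ 8.9e-16
```

When the network learns to keep the forget gates close to 1, the cell-path factor stays near 1 for long spans, which is exactly how the LSTM preserves long-range gradients.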
