$$p_1 \cdot \log \frac{1}{p_1} + p_2 \cdot \log \frac{1}{p_2} \tag{1}$$

$$\sum_i p_i \log \frac{1}{p_i} \tag{2}$$

$$\begin{aligned} \sum_i p_i \log \frac{1}{p_i} &= \sum_i \left(p_i \cdot (\log 1 - \log p_i)\right) \\ &= \sum_i - p_i \log p_i \\ &= - \sum_i p_i \log p_i \end{aligned} \tag{3}$$
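As a quick numerical check of Equation (3), here is a minimal sketch (the `entropy` helper is ours, not from the text; the natural logarithm is assumed, since the base only rescales the result):

```python
import numpy as np

def entropy(p):
    """Shannon entropy: -sum_i p_i * log(p_i), natural log."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

print(entropy([0.5, 0.5]))  # fair coin: log 2 ≈ 0.6931
print(entropy([0.9, 0.1]))  # a biased coin is more predictable: lower entropy
```

The fair coin maximizes the entropy of a two-outcome distribution, which matches the intuition that it is the hardest coin to predict.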

## Maximum Likelihood Estimation

$$P(C_1,C_2,\cdots,C_{10}|\theta) = \prod_{i=1}^{10} P(C_i|\theta) \tag{4}$$

where $C_i \in \{0,1\}$ is the outcome of the $i$-th coin toss. The expression gives the probability that a coin with parameter $\theta$ produces the observed sequence $C_1, C_2, \cdots, C_{10}$.

## The Maximum Likelihood Method

$$P(x_1,x_2,x_3,\cdots,x_n|W,b) \tag{5}$$

where $n$ is the number of images and $x_i \in \{0,1\}$ is the label of the $i$-th image, with a positive example labeled $1$.

$$P(x_1,x_2,x_3,\cdots,x_n|W,b) =\prod_{i=1}^n P(x_i|W,b) \tag{6}$$

where $W, b$ are the network's parameters. Feeding each sample through the network gives a predicted probability $y_i$, so the factors can be written in terms of the outputs:

$$P(x_1,x_2,x_3,\cdots,x_n|W,b) =\prod_{i=1}^n P(x_i|y_i) \tag{7}$$

For each label $x_i$ we obtain a different probability value depending on the network output $y_i$. Since $x_i$ is binary, each factor can be written compactly as:

$$P(x_1,x_2,x_3,\cdots,x_n|W,b) = \prod_{i=1}^n y_i^{x_i}(1-y_i)^{1-x_i} \tag{8}$$

$$\begin{aligned} \log \left( \prod_{i=1}^n y_i^{x_i}(1-y_i)^{1-x_i} \right) &= \sum_{i=1}^n \log \left(y_i^{x_i}(1-y_i)^{1-x_i} \right) \\ &= \sum_{i=1}^n \left(x_i \cdot \log y_i + (1-x_i)\cdot \log (1-y_i) \right) \end{aligned} \tag{9}$$

$$\min\, - \sum_{i=1}^n \left(x_i \cdot \log y_i + (1-x_i)\cdot \log (1-y_i) \right) \tag{10}$$
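Equation (10) can be checked directly in code. A sketch with made-up labels and predictions (the arrays below are hypothetical):

```python
import numpy as np

def neg_log_likelihood(x, y):
    """-sum_i [x_i*log(y_i) + (1-x_i)*log(1-y_i)], as in eq. (10)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(-np.sum(x * np.log(y) + (1 - x) * np.log(1 - y)))

x = np.array([1, 0, 1])           # true labels (hypothetical)
good = np.array([0.9, 0.1, 0.8])  # confident, correct predictions
bad = np.array([0.5, 0.5, 0.5])   # uninformative predictions
print(neg_log_likelihood(x, good))  # small loss
print(neg_log_likelihood(x, bad))   # larger loss
```

Predictions closer to the true labels yield a smaller negative log-likelihood, which is why minimizing it is equivalent to maximizing the likelihood.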

KL divergence, also known as relative entropy, measures the distance between two distributions. Let $P$ and $Q$ be two probability distributions; the relative entropy of $P$ with respect to $Q$ is:

$$D_{KL}(P||Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)} \tag{11}$$

where $i$ ranges over all possible events.

#### Properties

1. It is not symmetric: $D(P||Q) \neq D(Q||P)$.
2. It is non-negative: $D(P||Q) \geq 0$.
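Both properties can be verified numerically. A sketch with two arbitrary distributions (the values of `p` and `q` are made up):

```python
import numpy as np

def kl(p, q):
    """D_KL(P||Q) = sum_i P(i) * log(P(i)/Q(i)), as in eq. (11)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl(p, q), kl(q, p))  # different values: KL is not symmetric
print(kl(p, p))            # 0 when the two distributions coincide
```

Note that `kl(p, q)` and `kl(q, p)` are both non-negative yet unequal, and the divergence vanishes only when the distributions match.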

Suppose a fair coin is tossed and the observed sequence is: HHTHHTTHTHTH

To compare how well two coins explain this observation, form the likelihood ratio:

$$\frac{P(\text{observations}|\text{fair coin})}{P(\text{observations}|\text{biased coin})}$$

If coin 1 has head/tail probabilities $p_1, p_2$ and the sequence contains $N_H$ heads and $N_T$ tails, then:

$$P(\text{observations}|\text{coin 1}) = p_1^{N_H} p_2^{N_T}$$

and for coin 2 with probabilities $q_1, q_2$:

$$P(\text{observations}|\text{coin 2}) = q_1^{N_H} q_2^{N_T}$$

$$\frac{P(\text{observations}|\text{coin 1})}{P(\text{observations}|\text{coin 2})} = \frac{p_1^{N_H} p_2^{N_T}}{q_1^{N_H} q_2^{N_T}} \tag{12}$$

Taking the log and normalizing by the number of tosses $N$ (for large $N$, $N_H/N \to p_1$ and $N_T/N \to p_2$):

$$\begin{aligned} \frac{1}{N}\log \left( \frac{p_1^{N_H} p_2^{N_T}}{q_1^{N_H} q_2^{N_T}} \right) &= \frac{N_H}{N}\log p_1 + \frac{N_T}{N}\log p_2 - \frac{N_H}{N}\log q_1 - \frac{N_T}{N}\log q_2 \\ &= p_1\log p_1 + p_2 \log p_2 - p_1 \log q_1 - p_2 \log q_2 \\ &= p_1 \log \frac{p_1}{q_1} + p_2 \log \frac{p_2}{q_2} \end{aligned}$$

which is exactly the KL divergence between the two coins' distributions.
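The derivation says the per-toss log likelihood ratio converges to $D_{KL}$. A numeric sketch (the coins are hypothetical: a fair coin $p = (0.5, 0.5)$ versus a biased one $q = (0.8, 0.2)$, with tosses simulated from the fair coin):

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2 = 0.5, 0.5  # true (fair) coin
q1, q2 = 0.8, 0.2  # biased coin

N = 100_000
heads = rng.random(N) < p1
n_h, n_t = heads.sum(), N - heads.sum()

# Per-toss log likelihood ratio of the observed sequence
per_toss = (n_h * np.log(p1) + n_t * np.log(p2)
            - n_h * np.log(q1) - n_t * np.log(q2)) / N
# Closed-form KL divergence between the two coins
kl = p1 * np.log(p1 / q1) + p2 * np.log(p2 / q2)
print(per_toss, kl)  # the two values nearly agree for large N
```

As $N$ grows, the empirical frequencies $N_H/N$ and $N_T/N$ concentrate around $p_1$ and $p_2$, so the two printed values converge.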

KL divergence is a natural fit for deep learning, because a deep-learning model is essentially trying to model the true distribution of the observed samples.

The cross-entropy between the true distribution and the model distribution is:

$$H(P^*, P) = - \sum_i P^*(i) \log P(i) \tag{13}$$

where $P^*$ denotes the true distribution, $P$ the model distribution, and $i$ ranges over all possible events.

## KL Divergence and Cross-Entropy

$$D_{KL}(P^*||P) = D_{KL}\left( P^*(y|x_i) \,||\, P(y|x_i;\theta) \right) \tag{14}$$

$$\begin{aligned} D_{KL}(P^*||P) &= \sum_y P^*(y|x_i) \log \frac{P^*(y|x_i)}{P(y|x_i;\theta)} \\ &= \sum_y P^*(y|x_i) \left[\log P^*(y|x_i) - \log P(y|x_i;\theta) \right] \\ &= \sum_y P^*(y|x_i)\log P^*(y|x_i) - \sum_y P^*(y|x_i) \log P(y|x_i;\theta) \end{aligned} \tag{15}$$

The first term $\sum_y P^*(y|x_i)\log P^*(y|x_i)$ does not involve the parameters $\theta$, so minimizing the KL divergence over $\theta$ amounts to minimizing only the second term $-\sum_y P^*(y|x_i)\log P(y|x_i;\theta)$:

$$\arg\min_{\theta} D_{KL}(P^*||P) \equiv \arg\min_{\theta} - \sum_y P^*(y|x_i) \log P(y|x_i;\theta) \tag{16}$$

$$\arg\min_{\theta} D_{KL}(P^*||P) \equiv \arg\min_{\theta} H(P^*,P) \tag{17}$$
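The reason the two minimizations coincide is that $D_{KL}(P^*||P) = H(P^*,P) - H(P^*)$, and $H(P^*)$ does not depend on the model. A quick check with made-up distributions (the vectors below are hypothetical):

```python
import numpy as np

p_star = np.array([0.7, 0.2, 0.1])  # fixed "true" distribution (hypothetical)
q = np.array([0.3, 0.4, 0.3])       # some model distribution (hypothetical)

def ce(p, q):
    """Cross-entropy H(p, q) = -sum_i p_i * log(q_i)."""
    return float(-np.sum(p * np.log(q)))

def kl(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return float(np.sum(p * np.log(p / q)))

# H(p*) is a constant independent of q:
h_star = float(-np.sum(p_star * np.log(p_star)))
print(kl(p_star, q) + h_star, ce(p_star, q))  # the two values are equal
```

Since the two objectives differ only by the constant $H(P^*)$, any $\theta$ that minimizes one minimizes the other.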

## Cross-Entropy Loss

For a single sample with $n$ classes, the cross-entropy loss is:

$$loss=-\sum_{j=1}^n y_j \log \hat y_j \tag{18}$$

$$\mathcal L = -\sum_{i=1}^m \sum_{j=1}^n y_{ij} \log \hat y_{ij} \tag{19}$$

where $m$ is the number of samples and $n$ is the number of classes.

For binary classification, the loss for a single sample is:

$$loss = - \left[ y_1 \log \hat y_1 + (1-y_1)\log (1-\hat y_1) \right] \tag{20}$$

$$\mathcal L = - \sum_{i=1}^m \left[ y_i \log \hat y_i + (1-y_i)\log(1-\hat y_i) \right] \tag{21}$$

A positive example is labeled $1$ and a negative example $0$, so for each sample only one of the two terms above is nonzero.

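Equation (19) can be sketched directly for one-hot labels (the sample values below are made up):

```python
import numpy as np

def ce_loss(y, y_hat):
    """L = -sum_i sum_j y_ij * log(y_hat_ij), as in eq. (19)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(-np.sum(y * np.log(y_hat)))

# Two samples, three classes; rows of y are one-hot labels (hypothetical):
y = np.array([[1, 0, 0],
              [0, 0, 1]])
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
print(ce_loss(y, y_hat))  # = -(log 0.7 + log 0.6)
```

Because the labels are one-hot, only the predicted probability of each sample's true class contributes to the loss.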

## Mean Squared Error vs. Cross-Entropy

```python
import numpy as np
import matplotlib.pyplot as plt

def cross_entropy(y_hat, y):
    return -np.log(y_hat) if y == 1 else -np.log(1 - y_hat)

y_hat = np.arange(0.01, 1, 0.01)
plt.plot(y_hat, cross_entropy(y_hat, 1), label='y=1')
plt.plot(y_hat, cross_entropy(y_hat, 0), label='y=0')
plt.legend()
plt.show()
```

```python
> cross_entropy(0.1, 1)
2.3025850929940455
```

```python
def mse(y_hat, y):
    return (y - y_hat) ** 2

plt.plot(y_hat, mse(y_hat, 1), label='y=1')
plt.plot(y_hat, mse(y_hat, 0), label='y=0')
plt.legend()
```

```python
> mse(0.1, 1)
0.81
```

For the same confidently wrong prediction ($\hat y = 0.1$ when $y = 1$), MSE gives $0.81$ while cross-entropy gives about $2.30$: cross-entropy penalizes confident mistakes far more heavily.

## Gradient of the Cross-Entropy Loss

The softmax function is:

$$\hat y_i = \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}$$
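One practical note (an implementation detail, not part of the derivation that follows): softmax is usually computed with the maximum logit subtracted first, which avoids overflow without changing the output:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, float)
    e = np.exp(z - z.max())  # shifting by max(z) leaves the ratios unchanged
    return e / e.sum()

print(softmax([1.0, 2.0, 3.0]))           # probabilities summing to 1
print(softmax([1001.0, 1002.0, 1003.0]))  # same output despite huge inputs
```

Multiplying numerator and denominator by $e^{-\max_k z_k}$ leaves the definition above intact, so the two calls print identical distributions.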

$$\frac{\partial \hat y_i}{\partial z_j} = \frac{\partial}{\partial z_j}\left(\frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}}\right)$$

There are two cases, $i = j$ and $i \neq j$: when $i = j$, the derivative of $e^{z_i}$ with respect to $z_j$ is $e^{z_i}$; when $i \neq j$, it is $0$.

When $i = j$, by the quotient rule:

$$\begin{aligned} \frac{\partial \hat y_i}{\partial z_j} &= \frac{e^{z_i}\cdot \sum_{k=1}^K e^{z_k} - e^{z_i} \cdot e^{z_j}}{(\sum_{k=1}^K e^{z_k})^2} \\ &= \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}} - \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}} \cdot \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \\ &= \hat y_i - \hat y_i^2 = \hat y_i(1 - \hat y_i) \end{aligned}$$

When $i \neq j$:

$$\begin{aligned} \frac{\partial \hat y_i}{\partial z_j} &= \frac{0 \cdot \sum_{k=1}^K e^{z_k} - e^{z_i} \cdot e^{z_j}}{(\sum_{k=1}^K e^{z_k})^2} \\ &= - \frac{e^{z_i}}{\sum_{k=1}^K e^{z_k}} \cdot \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \\ &= - \hat y_i \hat y_j \end{aligned}$$

$$L = -\sum_k y_k \log \hat y_k$$

$$\begin{aligned} \frac{\partial L}{\partial z_j} &= \frac{\partial \left(-\sum_k y_k \log \hat y_k\right)}{\partial z_j} \\ &= -\sum_k y_k \frac{1}{\hat y_k} \frac{\partial \hat y_k}{\partial z_j} \\ &= -y_j \cdot \hat y_j(1 - \hat y_j)\frac{1}{\hat y_j} - \sum_{k \neq j} y_k \frac{1}{\hat y_k}(-\hat y_k \hat y_j) \\ &= -y_j(1 - \hat y_j) + \sum_{k \neq j} y_k \hat y_j \\ &= -y_j + y_j \hat y_j + \sum_{k \neq j} y_k \hat y_j \\ &= -y_j + \hat y_j \sum_k y_k \\ &= \hat y_j - y_j \end{aligned}$$

where the last step uses $\sum_k y_k = 1$, since $y$ is a one-hot (or probability) label vector.
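The result $\partial L/\partial z_j = \hat y_j - y_j$ can be verified with a finite-difference check (a sketch with random logits; the sizes and seed are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, y):
    """Cross-entropy of softmax(z) against the label vector y."""
    return float(-np.sum(y * np.log(softmax(z))))

rng = np.random.default_rng(0)
z = rng.normal(size=5)  # random logits
y = np.eye(5)[2]        # one-hot label, class 2

analytic = softmax(z) - y  # the derived gradient

# Central finite differences as an independent check
numeric = np.zeros_like(z)
eps = 1e-6
for j in range(len(z)):
    dz = np.zeros_like(z)
    dz[j] = eps
    numeric[j] = (loss(z + dz, y) - loss(z - dz, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny: the two gradients agree
```

This kind of gradient check is a standard way to validate a hand-derived backward pass before trusting it in training code.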
