## 2  Problem Formulation

$$X[t,k]=S[t,k]+N[t,k] \tag{1}$$

$$|\hat{S}[t,k]|=G[t,k]\,|X[t,k]| \tag{2}$$

$$G[t,k]=n\big(g(f(|X[l,k]|));\,\Theta\big), \quad l \leq t \tag{3}$$
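Equations 1 to 3 can be illustrated with a minimal NumPy sketch. It assumes the spectrogram is stored as a complex array of shape (frames, frequency bins), and that multiplying the complex noisy spectrum by a real gain is an acceptable way to realize Equation 2 (this keeps the noisy phase, which the equations themselves do not specify). The function `toy_gain` is a hypothetical placeholder for the network $n(\cdot;\Theta)$, not the model used in the paper.

```python
import numpy as np

def toy_gain(magnitude_history):
    """Hypothetical stand-in for n(.; Theta): an arbitrary ratio in (0, 1)."""
    current_power = magnitude_history[-1] ** 2
    average_power = np.mean(magnitude_history ** 2, axis=0)
    return current_power / (current_power + average_power + 1e-12)

def enhance_online(noisy_stft, gain_fn):
    """Apply a causal, real-valued gain frame by frame (Equations 2 and 3)."""
    enhanced = np.zeros_like(noisy_stft)
    for t in range(noisy_stft.shape[0]):
        history = np.abs(noisy_stft[: t + 1])   # only frames l <= t are visible
        gain = gain_fn(history)                 # shape: (num_bins,)
        enhanced[t] = gain * noisy_stft[t]      # scales |X|, keeps the noisy phase
    return enhanced

rng = np.random.default_rng(0)
shape = (6, 4)  # (frames, frequency bins), illustrative sizes only
speech = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.3 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
noisy = speech + noise                          # Equation 1
enhanced = enhance_online(noisy, toy_gain)
```

Because each gain is computed from frames up to and including $t$, changing a later frame leaves all earlier outputs untouched, which is the constraint $l \leq t$ in Equation 3.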

## 3  State-of-the-Art Online Noise Suppression

$$L(\vec{G};\vec{S},\vec{X})=\operatorname{mean}\left(\|\vec{S}-\vec{G}\odot\vec{X}\|_{2}^{2}\right) \tag{4}$$
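The loss in Equation 4 can be sketched as follows. The sketch assumes that the mean is taken over all time-frequency bins and that $\vec{S}$ and $\vec{X}$ are arrays of the same shape; taking `np.abs` of the error lets the same function accept either magnitude or complex spectra, which is a choice made here for illustration.

```python
import numpy as np

def mse_loss(gain, clean, noisy):
    """Mean squared error between the clean target and the gained noisy input."""
    error = clean - gain * noisy          # element-wise (Hadamard) product
    return np.mean(np.abs(error) ** 2)    # assumption: mean over all bins

clean = np.array([[1.0, 2.0]])
noisy = np.array([[2.0, 4.0]])

perfect = mse_loss(np.full((1, 2), 0.5), clean, noisy)   # 0.0
no_suppression = mse_loss(np.ones((1, 2)), clean, noisy) # 2.5
```

In this toy case the noisy magnitude is exactly twice the clean one, so a gain of 0.5 drives the loss to zero, while a gain of 1 leaves an error of $(1^2 + 2^2)/2 = 2.5$.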

## 4.1  Feature Representation

$$f_{\mathrm{LPS}}(|X[t,k]|)=\log\left(\max\left(|X[t,k]|^{2},\,10^{-12}\right)\right) \tag{5}$$

$$g_{\mathrm{G}}(f(|X[t,k]|))=\left[f(|X[t,k]|)-\mu_{f(x)}[k]\right]/\sigma_{f(x)}[k] \tag{6}$$
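A short sketch of the log-power-spectrum feature (Equation 5) followed by global normalization (Equation 6) is given below. It assumes the natural logarithm and assumes that the per-bin statistics $\mu_{f(x)}[k]$ and $\sigma_{f(x)}[k]$ are precomputed over the time axis of some reference data; here they are taken from the same random array purely for illustration.

```python
import numpy as np

def log_power_spectrum(magnitude, floor=1e-12):
    """Equation 5: log of the power spectrum, floored to avoid log(0)."""
    return np.log(np.maximum(magnitude ** 2, floor))

def global_normalize(features, mean, std):
    """Equation 6: per-frequency-bin standardization with fixed statistics."""
    return (features - mean) / std

rng = np.random.default_rng(0)
magnitude = np.abs(rng.standard_normal((200, 5)))   # (frames, bins), toy data

features = log_power_spectrum(magnitude)
mean = features.mean(axis=0)   # assumption: statistics come from reference data
std = features.std(axis=0)
normalized = global_normalize(features, mean, std)
```

The floor of $10^{-12}$ bounds the feature from below at $\log(10^{-12})$, so silent bins do not produce $-\infty$.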

$$\mu_{f(x)}[t,k]=c\,\mu_{f(x)}[t-1,k]+(1-c)\,f(|X[t,k]|) \tag{7}$$

$$\sigma_{f(x)}^{2}[t,k]=c\,\sigma_{f(x)}^{2}[t-1,k]+(1-c)\,f(|X[t,k]|)^{2} \tag{8}$$

$$g_{\mathrm{FD}}(f(|X[t,k]|))=\frac{f(|X[t,k]|)-\mu_{f(x)}[t,k]}{\sqrt{\sigma_{f(x)}^{2}[t,k]-\mu_{f(x)}^{2}[t,k]}} \tag{9}$$
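The recursive normalization in Equations 7 to 9 can be sketched as a per-frame loop. Note that $\sigma_{f(x)}^{2}$ in Equation 8 is a running second moment, so the variance in Equation 9 is obtained by subtracting the squared running mean. The smoothing constant `c=0.99`, the initial states (zero mean, unit second moment), and the small variance floor `eps` are all assumptions made for this sketch; the equations do not fix them.

```python
import numpy as np

def online_normalize(features, c=0.99, eps=1e-8):
    """Frequency-dependent online normalization (Equations 7 to 9)."""
    features = np.asarray(features, dtype=float)
    num_frames, num_bins = features.shape
    mean = np.zeros(num_bins)            # assumed initial state
    second_moment = np.ones(num_bins)    # assumed initial state
    out = np.empty_like(features)
    for t in range(num_frames):
        mean = c * mean + (1.0 - c) * features[t]                         # Eq. 7
        second_moment = c * second_moment + (1.0 - c) * features[t] ** 2  # Eq. 8
        variance = np.maximum(second_moment - mean ** 2, eps)             # guard
        out[t] = (features[t] - mean) / np.sqrt(variance)                 # Eq. 9
    return out

rng = np.random.default_rng(0)
features = rng.standard_normal((50, 4))   # (frames, bins), toy features
normalized = online_normalize(features)
```

Since the statistics at frame $t$ depend only on frames up to $t$, this normalization is causal and adds no look-ahead.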

## 4.3  Loss Functions

$$L_{\text{speech}}=\operatorname{mean}\left(\left\|\vec{S}_{\mathrm{SA}}-(\vec{G}\odot\vec{S})_{\mathrm{SA}}\right\|_{2}^{2}\right) \tag{10}$$

$$L_{\text{noise}}=\operatorname{mean}\left(\|\vec{G}\odot\vec{N}\|_{2}^{2}\right) \tag{11}$$

$$L\left(\vec{G};\vec{S}_{\mathrm{SA}},\vec{N}\right)=\alpha L_{\text{speech}}+(1-\alpha)L_{\text{noise}} \tag{12}$$

$$\alpha=\frac{\mathrm{SNR}}{\mathrm{SNR}+\beta} \tag{13}$$
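Equations 10 to 13 combine a speech-distortion term and a residual-noise term with an SNR-dependent weight. The sketch below rests on several assumptions: the subscript SA is read as a selection of speech-active frames given by a boolean mask, SNR and $\beta$ are treated as positive linear-scale values, the means are taken over all selected bins, and the speech term is set to zero when no frame is active. The argument names are hypothetical.

```python
import numpy as np

def weighted_loss(gain, speech, noise, speech_active, snr, beta=1.0):
    """Weighted sum of speech distortion and residual noise (Eqs. 10 to 13)."""
    if np.any(speech_active):
        speech_sa = speech[speech_active]               # speech-active frames only
        gained_sa = (gain * speech)[speech_active]
        loss_speech = np.mean(np.abs(speech_sa - gained_sa) ** 2)   # Eq. 10
    else:
        loss_speech = 0.0                               # assumed fallback
    loss_noise = np.mean(np.abs(gain * noise) ** 2)     # Eq. 11
    alpha = snr / (snr + beta)                          # Eq. 13
    return alpha * loss_speech + (1.0 - alpha) * loss_noise   # Eq. 12

speech = np.array([[1.0, 1.0], [0.0, 0.0]])   # (frames, bins), toy magnitudes
noise = np.ones((2, 2))
speech_active = np.array([True, False])

half = weighted_loss(np.full((2, 2), 0.5), speech, noise, speech_active, snr=1.0)
keep_all = weighted_loss(np.ones((2, 2)), speech, noise, speech_active, snr=3.0)
mute_all = weighted_loss(np.zeros((2, 2)), speech, noise, speech_active, snr=3.0)
```

With SNR = 3 and $\beta$ = 1 the weight is $\alpha = 0.75$, so muting everything (pure speech distortion) costs 0.75 while passing everything (pure residual noise) costs 0.25: at higher SNR the loss favors preserving speech over removing noise.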
