## 2  Problem Formulation

$$X[l,k]=S[l,k]+D[l,k] \tag{1}$$

where $X[l,k]$, $S[l,k]$, and $D[l,k]$ denote the STFT coefficients of the noisy speech, the clean speech, and the noise, respectively, at time frame $l$ and frequency bin $k$.

$$\operatorname{IRM}[l, k]=\sqrt{\frac{|S[l, k]|^{2}}{|S[l, k]|^{2}+|D[l, k]|^{2}}} \tag{2}$$

$$\operatorname{PSM}[l, k]=\frac{|S[l, k]|}{|X[l, k]|} \cos \left(\theta_{S[l, k]}-\theta_{X[l, k]}\right) \tag{3}$$
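The two training targets above can be computed directly from the clean and noise STFTs. A minimal NumPy sketch, using random complex spectra as stand-ins for real STFT frames (shapes and variable names are illustrative assumptions):

```python
import numpy as np

# Hypothetical STFT matrices: L time frames x K frequency bins.
# S is clean speech, D is noise; the noisy mixture is X = S + D (Eq. 1).
rng = np.random.default_rng(0)
L, K = 4, 8
S = rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))
D = rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))
X = S + D

# Ideal ratio mask (Eq. 2): sqrt(|S|^2 / (|S|^2 + |D|^2)), bounded in [0, 1].
IRM = np.sqrt(np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(D) ** 2))

# Phase-sensitive mask (Eq. 3): |S|/|X| * cos(theta_S - theta_X).
PSM = np.abs(S) / np.abs(X) * np.cos(np.angle(S) - np.angle(X))
```

Note that Eq. (3) is algebraically identical to the real part of $S/X$, since $\cos(\theta_S-\theta_X)=\operatorname{Re}(S\bar{X})/(|S||X|)$; unlike the IRM, the PSM can therefore be negative or exceed 1 when the phases disagree.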

## 3.2  TF Attention Module

$$\tilde{\mathbf{Y}}=\mathbf{Y} \odot \mathbf{TF}_{\mathbf{A}} \tag{4}$$

$$\mathbf{Z}_{\mathbf{F}}(k)=\frac{1}{L} \sum_{l=1}^{L} \mathbf{Y}(l, k) \tag{5}$$

$$\mathbf{Z}_{\mathbf{T}}(l)=\frac{1}{d_{\text{model}}} \sum_{k=1}^{d_{\text{model}}} \mathbf{Y}(l, k) \tag{6}$$

$$\mathbf{F}_{\mathbf{A}}=\sigma\left(f_{2}^{FA}\left(\delta\left(f_{1}^{FA}\left(\mathbf{Z}_{\mathbf{F}}\right)\right)\right)\right) \tag{7}$$

$$\mathbf{T}_{\mathbf{A}}=\sigma\left(f_{2}^{TA}\left(\delta\left(f_{1}^{TA}\left(\mathbf{Z}_{\mathbf{T}}\right)\right)\right)\right) \tag{8}$$

$$\mathbf{TF}_{\mathbf{A}}=\mathbf{T}_{\mathbf{A}} \otimes \mathbf{F}_{\mathbf{A}} \tag{9}$$

$$\mathbf{TF}_{\mathbf{A}}(l, k)=\mathbf{T}_{\mathbf{A}}(l)\,\mathbf{F}_{\mathbf{A}}(k) \tag{10}$$
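Eqs. (4)–(10) together form a squeeze-and-excitation-style module: average-pool along time and along features, pass each pooled vector through two fully connected layers ($f_1$ with ReLU $\delta$, $f_2$ with sigmoid $\sigma$), and fuse the two 1-D attention vectors by outer product into a 2-D map that rescales $\mathbf{Y}$. A minimal NumPy sketch, assuming a reduction ratio $r$ in the bottleneck and using random weights as stand-ins for the learned $f_1$, $f_2$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
L, d_model, r = 5, 8, 2          # frames, feature dim, reduction ratio (assumed)
Y = rng.standard_normal((L, d_model))

# Squeeze: average over time (Eq. 5) and over features (Eq. 6).
Z_F = Y.mean(axis=0)             # shape (d_model,)
Z_T = Y.mean(axis=1)             # shape (L,)

# Excitation: two FC layers per branch, ReLU then sigmoid (Eqs. 7-8).
W1_F = rng.standard_normal((d_model, d_model // r))
W2_F = rng.standard_normal((d_model // r, d_model))
W1_T = rng.standard_normal((L, L // r))
W2_T = rng.standard_normal((L // r, L))
F_A = sigmoid(np.maximum(Z_F @ W1_F, 0.0) @ W2_F)   # frequency attention, (d_model,)
T_A = sigmoid(np.maximum(Z_T @ W1_T, 0.0) @ W2_T)   # time attention, (L,)

# Outer product fuses the two 1-D attentions into a 2-D map (Eqs. 9-10),
# which then rescales Y elementwise (Eq. 4).
TF_A = np.outer(T_A, F_A)        # (L, d_model)
Y_tilde = Y * TF_A
```

Because both attention vectors pass through a sigmoid, every entry of $\mathbf{TF}_{\mathbf{A}}$ lies in $(0,1)$, so the module can only attenuate (never amplify) time-frequency units of $\mathbf{Y}$.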
