## II. Multimodal Fusion Methods

a) Fusion by simple operations

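- Concatenation: concatenation can combine either low-level input features [1][2][3] or high-level features, that is, features extracted by pre-trained models [3][4][5].

- Weighted sum: for a weighted sum with scalar weights, an iterative method requires the vectors produced by the pre-trained models to have a fixed dimensionality that matches across modalities, with elements arranged in an order suitable for element-wise addition [6]. To meet this requirement, a fully connected layer can be used to control the dimensionality and reorder the dimensions.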
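The two simple operations above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the method of any cited paper: the feature sizes (512 and 300), the shared size (256), the scalar weights (0.7 and 0.3), and the helper names `concat_fusion`, `make_projection`, and `weighted_sum_fusion` are all assumptions made for the example, and the projection weights are random where a real system would train them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical high-level features from two pre-trained encoders.
image_feat = rng.standard_normal(512)  # e.g. a visual embedding
text_feat = rng.standard_normal(300)   # e.g. a sentence embedding


def concat_fusion(features):
    """Fuse by concatenation: the output size is the sum of the input sizes."""
    return np.concatenate(features, axis=-1)


def make_projection(in_dim, out_dim, rng):
    """Fully connected layer (W, b); random here, trained jointly in practice."""
    W = rng.standard_normal((out_dim, in_dim)) / np.sqrt(in_dim)
    b = np.zeros(out_dim)
    return W, b


def weighted_sum_fusion(features, projections, weights):
    """Project each modality to a shared size, then add with scalar weights."""
    fused = 0.0
    for x, (W, b), w in zip(features, projections, weights):
        fused = fused + w * (W @ x + b)  # element-wise addition needs equal sizes
    return fused


shared_dim = 256
projections = [
    make_projection(512, shared_dim, rng),
    make_projection(300, shared_dim, rng),
]

concat_out = concat_fusion([image_feat, text_feat])
sum_out = weighted_sum_fusion([image_feat, text_feat], projections, weights=[0.7, 0.3])

print(concat_out.shape, sum_out.shape)  # (812,) (256,)
```

Concatenation keeps every input dimension, so the fused vector grows with each modality, whereas the weighted sum keeps a fixed size but only works once the projections have brought all modalities to the same dimensionality.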

b) Attention-based fusion

c) Bilinear pooling-based fusion

#### Factorization of bilinear pooling

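MUTAN is a multimodal tensor-based Tucker decomposition method. It uses Tucker decomposition [39] to factorize the original three-dimensional weight tensor operator into a low-dimensional core tensor and the three two-dimensional weight matrices used by MLB (multimodal low-rank bilinear pooling) [40]. The core tensor models the interactions between the modalities. MCB (multimodal compact bilinear pooling) can be viewed as a MUTAN with fixed diagonal input factor matrices and a sparse, fixed core tensor, and MLB can be viewed as a MUTAN whose core tensor is the identity tensor.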
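The Tucker factorization described above can be illustrated with a small NumPy sketch. It is not the MUTAN implementation: it omits the nonlinearities and the additional structure the model places on the core tensor, and all sizes and names (`W_q`, `W_v`, `W_o`, `core`, `tucker_fusion`, `identity_core`) are assumptions chosen for the example. The sketch checks that fusing through the factors gives the same result as applying the full three-dimensional weight tensor, and that an identity core tensor reduces the interaction to an element-wise (Hadamard) product, as in the MLB special case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: input dims (d_q, d_v), output dim d_o, core dims (t_q, t_v, t_o).
d_q, d_v, d_o = 8, 10, 6
t_q, t_v, t_o = 3, 4, 5

# Tucker factors: three 2-D factor matrices and one small 3-D core tensor.
W_q = rng.standard_normal((d_q, t_q))
W_v = rng.standard_normal((d_v, t_v))
W_o = rng.standard_normal((d_o, t_o))
core = rng.standard_normal((t_q, t_v, t_o))


def tucker_fusion(q, v):
    """Bilinear fusion computed through the factors, without the full tensor."""
    q_low = q @ W_q                                   # shape (t_q,)
    v_low = v @ W_v                                   # shape (t_v,)
    z = np.einsum("i,j,ijk->k", q_low, v_low, core)   # shape (t_o,)
    return W_o @ z                                    # shape (d_o,)


def full_bilinear_fusion(q, v):
    """Reference: rebuild the full d_q x d_v x d_o weight tensor, then apply it."""
    full = np.einsum("abc,ia,jb,kc->ijk", core, W_q, W_v, W_o)
    return np.einsum("i,j,ijk->k", q, v, full)


def identity_core(t):
    """Identity tensor: ones on the superdiagonal, zeros elsewhere."""
    eye3 = np.zeros((t, t, t))
    idx = np.arange(t)
    eye3[idx, idx, idx] = 1.0
    return eye3


q = rng.standard_normal(d_q)
v = rng.standard_normal(d_v)
y_fact = tucker_fusion(q, v)
y_full = full_bilinear_fusion(q, v)
print(np.allclose(y_fact, y_full))  # True

# Parameter counts: full tensor versus Tucker factors.
n_full = d_q * d_v * d_o
n_tucker = W_q.size + W_v.size + W_o.size + core.size
print(n_full, n_tucker)  # 480 154

# With an identity core, the interaction is an element-wise product.
a = rng.standard_normal(4)
b = rng.standard_normal(4)
hadamard_from_core = np.einsum("i,j,ijk->k", a, b, identity_core(4))
print(np.allclose(hadamard_from_core, a * b))  # True
```

Even at these toy sizes the factorized form needs 154 parameters instead of 480, and the gap widens quickly as the input and output dimensions grow.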

## III. Summary

Zhang, C., Yang, Z., He, X., & Deng, L. (2020). Multimodal intelligence: Representation learning, information fusion, and applications. IEEE Journal of Selected Topics in Signal Processing.

## References

[1] B. Nojavanasghari, D. Gopinath, J. Koushik, T. Baltrušaitis, and L.-P. Morency, “Deep multimodal fusion for persuasiveness prediction,” in Proc. ICMI, 2016.

[2] H. Wang, A. Meghawat, L.-P. Morency, and E. Xing, “Select-additive learning: Improving generalization in multimodal sentiment analysis,” in Proc. ICME, 2017.

[3] A. Anastasopoulos, S. Kumar, and H. Liao, “Neural language modeling with visual features,” in arXiv:1903.02930, 2019.

[4] V. Vielzeuf, A. Lechervy, S. Pateux, and F. Jurie, “CentralNet: A multilayer approach for multimodal fusion,” in Proc. ECCV, 2018.

[5] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus, “Simple baseline for visual question answering,” in arXiv:1512.02167, 2015.

[6] J.-M. Pérez-Rúa, V. Vielzeuf, S. Pateux, M. Baccouche, and F. Jurie, “MFAS: Multimodal fusion architecture search,” in Proc. CVPR, 2019.

[7] B. Zoph and Q. Le, “Neural architecture search with reinforcement learning,” in Proc. ICLR, 2017.

[8] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, F.-F. Li, A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search,” in Proc. ECCV, 2018.

[9] J.-M. Pérez-Rúa, M. Baccouche, and S. Pateux, “Efficient progressive neural architecture search,” in Proc. BMVC, 2019.

[10] X. Yang, P. Molchanov, and J. Kautz, “Multilayer and multimodal fusion of deep neural networks for video classification,” in Proc. ACM MM, 2016.

[11] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. ICLR, 2015.

[12] A. Graves, G. Wayne, and I. Danihelka, “Neural turing machines,” in arXiv:1410.5401, 2014.

[13] Y. Zhu, O. Groth, M. Bernstein, and F.-F. Li, “Visual7W: Grounded question answering in images,” in Proc. CVPR, 2016.

[14] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. ICML, 2015.

[15] K. Shih, S. Singh, and D. Hoiem, “Where to look: Focus regions for visual question answering,” in Proc. CVPR, 2016.

[16] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in Proc. CVPR, 2016.

[17] H. Xu and K. Saenko, “Ask, attend and answer: Exploring question-guided spatial attention for visual question answering,” in Proc. ECCV, 2016.

[18] C. Xiong, S. Merity, and R. Socher, “Dynamic memory networks for visual and textual question answering,” in Proc. ICML, 2016.

[19] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proc. CVPR, 2018.

[20] P. Lu, H. Li, W. Zhang, J. Wang, and X. Wang, “Co-attending free-form regions and detections with multi-modal multiplicative feature embedding for visual question answering,” in Proc. AAAI, 2018.

[21] W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, and J. Gao, “Object-driven text-to-image synthesis via adversarial training,” in Proc. CVPR, 2019.

[22] J. Lu, J. Yang, D. Batra, and D. Parikh, “Hierarchical question-image co-attention for visual question answering,” in Proc. NIPS, 2016.

[23] H. Nam, J.-W. Ha, and J. Kim, “Dual attention networks for multimodal reasoning and matching,” in Proc. CVPR, 2017.

[24] H. Fan and J. Zhou, “Stacked latent attention for multimodal reasoning,” in Proc. CVPR, 2018.

[25] A. Osman and W. Samek, “DRAU: Dual recurrent attention units for visual question answering,” Computer Vision and Image Understanding, vol. 185, pp. 24–30, 2019.

[26] I. Schwartz, A. Schwing, and T. Hazan, “High-order attention models for visual question answering,” in Proc. NIPS, 2017.

[27] J. Arevalo, T. Solorio, M. Montes-y-Gómez, and F. González, “Gated multimodal units for information fusion,” in Proc. ICLR, 2017.

[28] J.-H. Kim, S.-W. Lee, D.-H. Kwak, M.-O. Heo, J. Kim, J.-W. Ha, and B.-T. Zhang, “Multimodal residual learning for visual QA,” in Proc. NIPS, 2016.

[29] H. Noh, P. Seo, and B. Han, “Image question answering using convolutional neural network with dynamic parameter prediction,” in Proc. CVPR, 2016.

[30] J. Tenenbaum and W. Freeman, “Separating style and content with bilinear models,” Neural Computation, vol. 12, pp. 1247–1283, 2000.

[31] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, “Tensor fusion network for multimodal sentiment analysis,” in Proc. EMNLP, 2017.

[32] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell, “Compact bilinear pooling,” in Proc. CVPR, 2016.

[33] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data streams,” in Proc. ICALP, 2002.

[34] N. Pham and R. Pagh, “Fast and scalable polynomial kernels via explicit feature maps,” in Proc. SIGKDD, 2013.

[35] A. Fukui, D. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, “Multimodal compact bilinear pooling for visual question answering and visual grounding,” in Proc. EMNLP, 2016.

[36] J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang, “Hadamard product for low-rank bilinear pooling,” in Proc. ICLR, 2017.

[37] Z. Yu, J. Yu, J. Fan, and D. Tao, “Multi-modal factorized bilinear pooling with co-attention learning for visual question answering,” in Proc. ICCV, 2017.

[38] Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao, “Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, pp. 5947–5959, 2018.

[39] L. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, pp. 279–311, 1966.

[40] H. Ben-younes, R. Cadene, M. Cord, and N. Thome, “MUTAN: Multimodal tucker fusion for visual question answering,” in Proc. ICCV, 2017.

[41] L. De Lathauwer, “Decompositions of a higher-order tensor in block terms - Part II: Definitions and uniqueness,” SIAM Journal on Matrix Analysis and Applications, vol. 30, pp. 1033–1066, 2008.

[42] H. Ben-younes, R. Cadene, N. Thome, and M. Cord, “BLOCK: Bilinear superdiagonal fusion for visual question answering and visual relationship detection,” in Proc. AAAI, 2019.

[43] Z. Liu, Y. Shen, V. Lakshminarasimhan, P. Liang, A. Zadeh, and L.-P. Morency, “Efficient low-rank multimodal fusion with modality-specific factors,” in Proc. ACL, 2018.

[44] A. Fukui, D. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, “Multimodal compact bilinear pooling for visual question answering and visual grounding,” in Proc. EMNLP, 2016.

[45] J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang, “Hadamard product for low-rank bilinear pooling,” in Proc. ICLR, 2017.

[46] Z. Yu, J. Yu, J. Fan, and D. Tao, “Multi-modal factorized bilinear pooling with co-attention learning for visual question answering,” in Proc. ICCV, 2017.

[47] L. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, pp. 279–311, 1966.

[48] J.-H. Kim, J. Jun, and B.-T. Zhang, “Bilinear attention networks,” in Proc. NeurIPS, 2018.