Multimodal sentiment analysis aims to integrate text, audio, and video data to produce accurate sentiment predictions. Existing methods mainly focus on fusing information from multimodal data. However, interactions among modality-specific heterogeneous features, together with irrelevant and conflicting information across modalities, can hinder further performance improvement. To address this, a Transformer-based adaptive learning multimodal sentiment analysis model is proposed. First, three multilayer perceptrons extract features from the text, audio, and video data. A Transformer separation encoder then disentangles inter-modal sentiment-consistency features from sentiment-heterogeneity features; among the heterogeneity features, the textual modality plays the dominant role, providing attentional reinforcement to the audio and video modalities. Finally, a bidirectional sentiment query learning method is designed to fuse the sentiment features, yielding a more comprehensive and complementary multimodal emotional representation and enhancing multimodal emotion recognition. The method was evaluated on three popular multimodal datasets: CH-SIMS, CMU-MOSI, and CMU-MOSEI. The F1 score reached 82.31% on CH-SIMS, 0.6% higher than the second-best model; 86.55% on CMU-MOSI, 0.5% higher than the second-best model; and 86.64% on CMU-MOSEI, 0.4% higher than the second-best model. Extensive experiments demonstrate that the proposed model is competitive with existing models.
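To make the pipeline concrete, the following is a minimal PyTorch sketch of the architecture as described in the abstract. It is an illustrative assumption, not the authors' implementation: the module names, input feature dimensions (768-d text, 74-d audio, 35-d video), single-layer encoders, residual text-guided attention, and learnable sentiment queries (only the query-to-feature direction of the bidirectional fusion is sketched) are all hypothetical.

```python
import torch
import torch.nn as nn

class SeparationEncoder(nn.Module):
    """Assumed realization of the 'Transformer separation encoder': each
    modality is passed through two encoders, one yielding sentiment-consistency
    features and one yielding sentiment-heterogeneity features."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.shared = nn.TransformerEncoder(layer(), num_layers=1)   # consistency
        self.private = nn.TransformerEncoder(layer(), num_layers=1)  # heterogeneity

    def forward(self, x):
        return self.shared(x), self.private(x)

class TextGuidedAttention(nn.Module):
    """Text-dominated reinforcement: audio/video heterogeneity features query
    the text heterogeneity features and are strengthened via a residual add."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, av, text):
        out, _ = self.attn(av, text, text)
        return av + out

class AdaptiveMSA(nn.Module):
    def __init__(self, in_dims=(768, 74, 35), dim=128, n_queries=8):
        super().__init__()
        # One MLP per modality (text, audio, video) to project raw features.
        self.proj = nn.ModuleList(
            nn.Sequential(nn.Linear(d, dim), nn.ReLU(), nn.Linear(dim, dim))
            for d in in_dims
        )
        self.sep = SeparationEncoder(dim)
        self.text_guided = TextGuidedAttention(dim)  # shared for audio and video
        # Learnable sentiment queries; only the query-to-feature direction of
        # the bidirectional sentiment query fusion is sketched here.
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.fuse = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.head = nn.Linear(dim, 1)  # regression score, as in CMU-MOSI/MOSEI

    def forward(self, text, audio, video):
        t, a, v = (p(x) for p, x in zip(self.proj, (text, audio, video)))
        (tc, th), (ac, ah), (vc, vh) = (self.sep(x) for x in (t, a, v))
        # Text heterogeneity features reinforce audio and video heterogeneity.
        ah = self.text_guided(ah, th)
        vh = self.text_guided(vh, th)
        feats = torch.cat([tc, ac, vc, th, ah, vh], dim=1)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        fused, _ = self.fuse(q, feats, feats)  # queries attend to all features
        return self.head(fused.mean(dim=1)).squeeze(-1)

model = AdaptiveMSA()
score = model(torch.randn(2, 20, 768), torch.randn(2, 50, 74), torch.randn(2, 50, 35))
print(score.shape)  # torch.Size([2])
```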
@article{s14102025ijsea14101019,
  title   = "Adaptive Learning Multimodal Sentiment Analysis based on Transformer",
  journal = "International Journal of Science and Engineering Applications (IJSEA)",
  volume  = "14",
  number  = "10",
  pages   = "126--135",
  year    = "2025",
  author  = "Shu Wei-Liang"}