Investigating HuBERT-based Speech Emotion Recognition Generalisation Capability

Li, Letian, Glackin, Cornelius, Cannings, Nigel, Veneziano, Vito, Barker, Jack, Oduola, Olakunle, Woodruff, Chris, Laird, Thea, Laird, James and Sun, Yi (2024) Investigating HuBERT-based Speech Emotion Recognition Generalisation Capability. In: The 23rd International Conference on Artificial Intelligence and Soft Computing 2024, 2024-06-16 - 2024-06-20.


Transformer-based architectures have made significant progress in speech emotion recognition (SER). However, most published SER research trains and tests models on data from the same corpus, so the resulting models generalise poorly to unseen data collected from different corpora. To address this, we applied the HuBERT model to a combined training set consisting of five publicly available datasets (IEMOCAP, RAVDESS, TESS, CREMA-D, and 80% of CMU-MOSEI) and conducted cross-corpus testing on the Strong Emotion (StrEmo) Dataset (a natural dataset collected by the authors) and two publicly available datasets (SAVEE and the remaining 20% of CMU-MOSEI). Our best result achieved an F1 score of 0.78 over the three test sets, with an F1 score of 0.86 for StrEmo specifically. We also release a spreadsheet of key information on the StrEmo dataset as supplementary material to the conference.