Investigating HuBERT-based Speech Emotion Recognition Generalisation Capability
Transformer-based architectures have made significant progress in speech emotion recognition (SER). However, most published SER research trains and tests models on data from the same corpus, which results in poor generalisation to unseen data collected from different corpora. To address this, we applied the HuBERT model to a combined training set consisting of five publicly available datasets (IEMOCAP, RAVDESS, TESS, CREMA-D, and 80% of CMU-MOSEI) and conducted cross-corpus testing on the Strong Emotion (StrEmo) Dataset (a natural dataset collected by the authors) and two publicly available datasets (SAVEE and the remaining 20% of CMU-MOSEI). Our best result achieved an F1 score of 0.78 over the three test sets, with an F1 score of 0.86 for StrEmo specifically. Additionally, we release a spreadsheet of key information on the StrEmo dataset as supplementary material to the conference.
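The train/test protocol described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `cross_corpus_split`, the `(corpus, utterance_id)` tuple layout, and the seeded random shuffle used to divide CMU-MOSEI are all assumptions, since the abstract does not state how the 80/20 partition was drawn.

```python
import random

# Corpus roles as stated in the abstract.
TRAIN_ONLY = ["IEMOCAP", "RAVDESS", "TESS", "CREMA-D"]
TEST_ONLY = ["StrEmo", "SAVEE"]
SPLIT_CORPUS = "CMU-MOSEI"  # 80% goes to training, 20% to testing


def cross_corpus_split(utterances, train_fraction=0.8, seed=0):
    """Partition (corpus, utterance_id) pairs into train and test lists.

    Assumption: CMU-MOSEI is divided by a seeded random shuffle. The
    abstract gives only the 80/20 proportions, not the procedure.
    """
    train, test, mosei = [], [], []
    for corpus, utt_id in utterances:
        if corpus in TRAIN_ONLY:
            train.append((corpus, utt_id))
        elif corpus in TEST_ONLY:
            test.append((corpus, utt_id))
        elif corpus == SPLIT_CORPUS:
            mosei.append((corpus, utt_id))
        else:
            raise ValueError(f"unknown corpus: {corpus}")

    # Shuffle only the corpus that contributes to both sides.
    rng = random.Random(seed)
    rng.shuffle(mosei)
    cut = int(len(mosei) * train_fraction)
    return train + mosei[:cut], test + mosei[cut:]


# Hypothetical toy input: ten placeholder utterance IDs per corpus.
demo = [(c, i) for c in TRAIN_ONLY + TEST_ONLY + [SPLIT_CORPUS] for i in range(10)]
train_set, test_set = cross_corpus_split(demo)
print(len(train_set), len(test_set))
```

Keeping StrEmo and SAVEE entirely out of the training list is what makes the evaluation cross-corpus: the model never sees those recording conditions or speakers before testing.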
Item Type | Conference or Workshop Item (Paper) |
---|---|
Date Deposited | 15 May 2025 17:11 |
Last Modified | 31 May 2025 23:09 |
- Camera-ready_Paper_for_ICAISC_2024.pdf (Published Version; licence: Unspecified)