Increasing Robustness of FER against Speech
... however, as Ciprian has already described in one of our previous blog posts, *context* is still a factor that can radically change the perception of human emotion from facial expressions. One concrete example of contextual bias stems from *speech effects*.
My master's thesis **Increasing Robustness of Facial Emotion Recognition against Speech** tries to tackle the problem of improving the performance of our models for speaking subjects. With the help of my supervisors Matthias and Ciprian and the team at TAWNY, I built and trained models for emotion recognition on talking subjects.
# The Problem
How exactly does the problem manifest itself? Current FER models analyze the facial movements of a subject and predict an emotion based on them. The subject in Figure 1 shows a distinct smile and raised cheeks. Models will recognize this and (correctly) predict a happy emotion.
Let's now take a look at a set of predictions for a speaking subject. Figure 2 shows predictions on selected frames from a video of the RAVDESS dataset [https://zenodo.org/record/1188976](https://zenodo.org/record/1188976). The underlying emotion of the video is sad, yet the models predict different emotions in certain frames. The effect of speech on the facial movements was big enough to throw off the models. Even human annotators might be fooled by this!
Figure 2: (C) Zenodo.org
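To make the effect a bit more tangible, here is a minimal sketch of how one could quantify it: count how many per-frame predictions deviate from the emotion label of the whole video. The function name `frame_disagreement` and the example predictions are made up for illustration; they are not taken from the thesis, from our models, or from the frames in Figure 2.

```python
from collections import Counter


def frame_disagreement(frame_predictions, video_label):
    """Return the share of frames whose predicted emotion differs from
    the video-level label, plus a count of each predicted emotion."""
    if not frame_predictions:
        raise ValueError("frame_predictions must not be empty")
    counts = Counter(frame_predictions)
    # Counter returns 0 for labels that never occur, so this is safe
    # even if the video-level label was never predicted.
    disagreeing = len(frame_predictions) - counts[video_label]
    return disagreeing / len(frame_predictions), counts


# Hypothetical per-frame outputs for one "sad" video (illustrative only).
frame_predictions = [
    "sad", "sad", "angry", "sad", "surprised", "sad", "neutral", "sad",
]
rate, counts = frame_disagreement(frame_predictions, "sad")
print(f"Share of frames disagreeing with the video label: {rate:.3f}")
print(f"Predicted emotions per frame: {dict(counts)}")
```

A high disagreement rate on videos of speaking subjects, compared with a low one on silent subjects, would be one simple indicator of the speech effect described above.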
# The Way Forward
The first step in writing a thesis will always be to get a feel for the lay of the land. I researched existing methodologies for the issue at hand and found several existing approaches that I ultimately decided to reimplement and improve on.
Existing approaches were not our only point of inquiry, though. We were also interested in how well humans recognize emotions. To find potential biases in the human perception of emotion in speaking subjects, we implemented a labeling tool and conducted a small study in which participants were asked to label the emotions they perceived in selected videos and images.
These insights enabled us to build several models and compare them with each other. The knowledge we gained will be very helpful for building models that will work on a wide variety of inputs and produce high-quality emotional estimations.
# Final Thoughts
So, what have we learned? Accounting for divergent contexts can be challenging for tasks like FER, and it requires specialised models to perform adequately in real-world contexts. The task of compensating for speech effects has given us great insights for our future projects, and helped us think outside the box for the challenges ahead.
Interested in doing research with TAWNY, too? Just contact us and let's work out a topic together. Additionally, feel free to download our success stories.
Finally, we would like to say a big THANK YOU to MARCEL: it was a pleasure to work with you in this field of research. Keep on rocking!