Increasing Robustness of FER against Speech

Facial Emotion Recognition (FER) has taken many leaps forwards in the last year...
A report by our Master's graduate Marcel Baur

Figure 1: (C) TAWNY

 

... however, as Ciprian has already described in one of our previous blog posts, *context* is still a factor that can radically change the perception of human emotion from facial expressions. One concrete example of contextual bias stems from *speech effects*.
My master's thesis **Increasing Robustness of Facial Emotion Recognition against Speech** tries to tackle the problem of improving the performance of our models for speaking subjects. With the help of my supervisors Matthias and Ciprian and the team at TAWNY, I built and trained models for emotion recognition on talking subjects.
 
# The Problem
 
How exactly does the problem manifest itself? Current FER models analyze the facial movements of a subject and make an emotional prediction based on them. The subject in figure 1 shows a distinct smile and raised cheeks. Models will recognize this and (correctly) predict a happy emotion.
 
Let's now take a look at a set of predictions for a speaking subject. Figure 2 shows predictions on selected frames from a video of the RAVDESS dataset [[https://zenodo.org/record/1188976](https://zenodo.org/record/1188976)]. The underlying emotion of the video is sad, but the models predict different emotions in certain frames. The effect of speech on the facial movements was big enough to throw off the models. Even human annotators might be fooled by this!
 
Figure 2: (C) Zenodo.org
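
To make this effect a bit more tangible, one could count how many per-frame predictions agree with the emotion label of the whole clip. The snippet below is only a minimal sketch of that idea and not the method used in the thesis; the function names `frame_agreement` and `majority_vote` and the sample labels are made up for illustration.

```python
from collections import Counter


def frame_agreement(frame_predictions, clip_label):
    """Return the fraction of frames whose predicted emotion equals the clip-level label."""
    if not frame_predictions:
        raise ValueError("frame_predictions must not be empty")
    matches = sum(1 for label in frame_predictions if label == clip_label)
    return matches / len(frame_predictions)


def majority_vote(frame_predictions):
    """Return the most frequent per-frame label (ties go to the label seen first)."""
    if not frame_predictions:
        raise ValueError("frame_predictions must not be empty")
    return Counter(frame_predictions).most_common(1)[0][0]


# Hypothetical per-frame outputs for a clip labeled "sad" (illustrative values, not real results).
frames = ["sad", "sad", "angry", "sad", "surprised", "sad"]

agreement = frame_agreement(frames, "sad")  # share of frames matching the clip label
clip_guess = majority_vote(frames)          # simple clip-level estimate
print(f"agreement={agreement:.2f}, majority={clip_guess}")
```

A low agreement value for a clip with a clear underlying emotion would hint that speech movements are distorting individual frames, while a majority vote over the frames is one simple (assumed) way to smooth out such outliers.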

# The Way Forward
 
The first step in writing a thesis will always be to get a feel for the lay of the land. I researched existing methodologies for the issue at hand and found several existing approaches that I ultimately decided to reimplement and improve on.
 
Existing approaches were not the only point of inquiry, though. We were also interested in how well humans recognize emotions. To find potential biases in the human perception of emotion in speaking subjects, we implemented a labeling tool and conducted a small study in which the participants were asked to label the perceived emotions in certain videos and images.
 
These insights enabled us to build several models and compare them with each other. The knowledge we gained will be very helpful for building models that work on a wide variety of inputs and produce high-quality emotion estimates.
 

# Final Thoughts

 

So, what have we learned? Accounting for divergent contexts can be challenging for tasks like FER and requires specialised models to perform adequately in real-world contexts. The task of compensating for speech effects has given us great insights for our future projects and helped us think outside the box for the challenges ahead.

Interested in research with TAWNY, too? Just contact us and let's find a topic. Additionally, feel free to download our success stories.

 

In the end, we would like to say a big THANK YOU to MARCEL. It was a pleasure to work with you in this field of research - keep on rocking!

 

 
 


 

 
