Including Context in Facial Expression Recognition
Many actors in the newly emerging industry of Emotional AI claim to provide technology that automatically infers the emotional state of any human in any context. This claim usually rests on a set of core assumptions:
First, that highly nonlinear mappings between a set of facial expressions and emotional states can be inferred automatically through deep neural networks. Simply put, there is a general assumption that any problem can be solved by gathering a lot of labelled data and feeding it to a deep neural network.
Second, that people can reliably infer someone’s emotional state from a set of facial movements alone. More often than not, this means that in practice context is ignored when recognizing emotions from faces.
This is problematic for several reasons. Emotion cannot be reliably inferred from faces alone, because context plays an important role. Moreover, while it is true that deep neural networks learn better with more labelled data, the quality of the labels is essential: if fed biased labels, a neural network will learn these biases and make predictions accordingly. At TAWNY, we take these challenges seriously in order to push the robustness and accuracy of the next generation of facial expression recognition technology.
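The claim that a model trained on biased labels reproduces that bias can be shown in a toy simulation (a hypothetical sketch using scikit-learn; the data and variable names are invented for illustration). Here annotators systematically mislabel one group of faces as negative, and the trained classifier scores identical expressions differently depending on group membership:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: feature 0 = smile intensity, feature 1 = group membership (0/1).
# The true emotion depends only on smile intensity.
n = 2000
smile = rng.uniform(0, 1, n)
group = rng.integers(0, 2, n)
X = np.column_stack([smile, group])

# Biased annotators: faces from group 1 are labelled "negative"
# 40% of the time regardless of the actual expression.
y_true = (smile > 0.5).astype(int)
flip = (group == 1) & (rng.uniform(0, 1, n) < 0.4)
y_biased = np.where(flip, 0, y_true)

clf = LogisticRegression().fit(X, y_biased)

# The same smiling face is scored differently depending on group:
probe = np.array([[0.8, 0], [0.8, 1]])
p = clf.predict_proba(probe)[:, 1]
print(p)  # the group-1 probability is noticeably lower: the bias was learned
```

Nothing in the optimization objects to the biased labels; the model simply fits them, which is exactly why label quality matters as much as label quantity.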
Facial Expressions of Emotions in Context
The problem of object classification from images has been a hot topic in machine vision for decades. Much of the debate in the 90's and early 2000's was about designing general descriptors that would provide increased discriminative power for classifying images of objects. The deep learning revolution turned this problem on its head. It turned out that by simply showing neural networks examples of what classes look like, they could discover, through optimization, better descriptors than any human could design. The whole problem of object recognition was thus reduced to labelling large amounts of examples, and virtually every visual recognition problem was soon dominated by this approach.
If asked to tell apart images of cats from dogs, with few exceptions, different people would agree on the appropriate category almost 100% of the time (see Figure 1.a). But telling emotions from faces is far more complicated. For example, when a group of people was asked to rate how pleasant or unpleasant the person in Figure 1.b feels, a surprising lack of agreement emerged. The answers clearly fall into two clusters, one group pointing towards very positive emotions and the other towards exactly the opposite. This shows that a facial expression can be judged radically differently depending on the imagined context.
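One simple way to quantify this difference is the fraction of annotators who choose the modal (most frequent) label. A minimal sketch, with made-up annotation counts rather than the actual study data:

```python
from collections import Counter

def percent_agreement(labels):
    """Fraction of annotators choosing the modal (most frequent) label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical annotations (illustrative numbers, not real study data)
animal_labels  = ["dog"] * 19 + ["cat"] * 1                  # near-unanimous
emotion_labels = ["joy"] * 9 + ["agony"] * 8 + ["surprise"] * 3  # split votes

print(percent_agreement(animal_labels))   # 0.95
print(percent_agreement(emotion_labels))  # 0.45
```

When agreement drops towards chance level, the "ground truth" fed to a network is effectively a coin flip, and no amount of model capacity can recover a signal that the labels do not contain.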
Figure 1. Agreement on whether the image in (a) shows a dog or a cat is consistently higher than agreement on the emotion of the person in (b). Categorizing faces by emotion is not well defined, which can result in noisy labels and biased predictors.
Context matters when judging emotion from faces. Take the examples below. On the left, a set of faces is shown out of context. Most people, if asked, would be inclined to associate them with sadness. However, on the right the same faces are shown in context, and one can easily infer that all these examples actually correspond to high-valence, almost ecstatic moments of victory and reward. Numerous examples like these exist. Several studies have shown how body posture [1,2], the facial expressions of others [3], or other factors such as age and race [4] can bias annotators. This points to an important conclusion: labelling large amounts of faces with emotional labels, without prior knowledge or context, can result in highly biased automatic predictors.
Figure 2. When assessing emotion from faces, context matters. On the left, examples of faces out of context; on the right, the same faces with context provided.
At TAWNY, we acknowledge and take such matters seriously, and we constantly refine our predictors by incrementally adding labels in context. The basic principle is the following: facial expressions themselves are objective, as they are nothing more than particular activations of facial muscles. Assigning an emotional label, in contrast, is an interpretation of that facial expression. It is therefore desirable to postpone these interpretations and tightly link them to well defined contexts, yielding less biased and less noisy predictors. In this spirit, one can first learn emotion-agnostic facial expression representations from large quantities of unlabelled data; at this stage, no emotional interpretation is made. Then, the mapping from facial representation to emotional state is tuned in context. This means that a single facial expression classifier can be replaced by a set of experts that offer different interpretations according to the context in which the expression is perceived. This has an additional advantage: instead of labelling large numbers of faces out of context, one can focus on smaller numbers of labels in context, which ultimately reduces labelling costs. Finally, specific compensations for perceptual biases (age, race, gender, social, speech) complete the picture.
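The two-stage idea above can be sketched in PyTorch (a hypothetical architecture for illustration only, not TAWNY's actual system): a frozen, emotion-agnostic face encoder shared across all contexts, plus one lightweight interpretation head per context.

```python
import torch
import torch.nn as nn

class ContextualEmotionModel(nn.Module):
    """Hypothetical sketch: frozen face encoder + per-context expert heads."""

    def __init__(self, encoder, feat_dim, contexts, n_emotions):
        super().__init__()
        self.encoder = encoder               # pretrained on unlabelled faces
        for p in self.encoder.parameters():
            p.requires_grad = False          # stage 1 stays emotion-agnostic
        # Stage 2: one small expert per context, trained on in-context labels
        self.experts = nn.ModuleDict({
            c: nn.Linear(feat_dim, n_emotions) for c in contexts
        })

    def forward(self, face, context):
        z = self.encoder(face)               # facial-expression representation
        return self.experts[context](z)      # context-specific interpretation

# Stand-in encoder for illustration; in practice this would be a deep network
# pretrained with self-supervision on large quantities of unlabelled faces.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(64, 32), nn.ReLU())
model = ContextualEmotionModel(encoder, 32, ["sports", "interview"], 5)

logits = model(torch.randn(4, 8, 8), "sports")
print(logits.shape)  # torch.Size([4, 5])
```

Because only the small expert heads are trained per context, each one needs far fewer labels than a monolithic face-to-emotion classifier, which is where the labelling-cost reduction comes from.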
[1] Aviezer et al. “The inherently contextualized nature of facial emotion perception”. (2017).
[2] Aviezer et al. “Angry, disgusted, or afraid? Studies on the malleability of emotion perception”. (2008).
[3] Matsumoto, D., Keltner, D., Shiota, M., O’Sullivan, M., & Frank, M. “Facial expressions of emotions”. In M. Lewis, J. M. Haviland-Jones, & L. F. Barrett (Eds.), Handbook of emotions (3rd ed., pp. 211–234). New York, NY: Macmillan. (2008).
[4] Barrett, Lisa Feldman, et al. “Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements”. Psychological Science in the Public Interest 20.1 (2019): 1–68.