Including Context in Facial Expressions Recognition
Many actors in the newly emerging industry of Emotional AI claim to be able to provide technology that automatically infers the emotional state of any human in general contexts. This claim usually rests on a set of core assumptions. First, that highly nonlinear mappings between a set of facial expressions and emotional states can be inferred automatically through deep neural networks. Simply put, there is a general assumption that any problem can be solved by gathering a lot of labeled data and feeding it to a deep neural network. The second assumption is that people can reliably infer someone’s emotional state from a set of facial movements. More often than not, this means that in practice the context is ignored when recognizing emotions from faces.
This is problematic for several reasons. Unfortunately, emotion cannot be reliably inferred from faces alone, because context plays an important role.
Moreover, while it is indeed true that deep neural networks learn better with more labeled data, the quality of the labels is essential. If fed biased labels, a neural network will learn these biases and make predictions accordingly. At Tawny we take these challenges seriously in order to push the robustness and accuracy of the next generation of facial expression recognition technology.
Facial Expressions of Emotions in Context
The problem of object classification from images has been a hot topic in machine vision for decades. Much of the debate in the '90s and early 2000s was about designing general descriptors that would provide increased discriminative power for classifying images of objects. The deep learning revolution turned this problem on its head. It turned out that by simply giving neural networks examples of what classes look like, they could discover, through optimization, better descriptors than any human could design. Thus the whole problem of object recognition was reduced to labeling large amounts of examples, and virtually every visual recognition problem was soon dominated by this approach.
If asked to tell apart images of cats from dogs, with few exceptions, different people would agree on the appropriate category almost 100% of the time (see Figure 1.a). But telling emotions from faces is far more complicated. For example, when a group of people was asked how pleasant or unpleasant the person in Figure 1.b feels, they showed a surprising lack of agreement. The answers clearly fall into two clusters, one group pointing towards very positive emotions and the other towards exactly the opposite. This shows that a facial expression can be judged radically differently depending on the imagined context.
Figure 1. Agreement on whether the image in (a) shows a dog or a cat is consistently higher than agreement on the emotion of the person in (b). Categorizing faces by emotions is not well defined, which can result in noisy labels and biased predictors.
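The agreement gap between the two tasks can be made concrete with a toy computation. A minimal sketch follows; the vote counts and valence ratings are fabricated purely for illustration, not real annotation data.

```python
import numpy as np

# Hypothetical annotations: 20 votes on a cat/dog image, and 8 valence
# ratings (on a -1..1 scale) for an out-of-context facial expression.
cat_dog_votes = np.array(["dog"] * 19 + ["cat"])  # near-unanimous
face_valence = np.array([0.9, 0.8, 0.85, -0.7, -0.9, 0.95, -0.8, -0.75])

def agreement(labels) -> float:
    """Fraction of annotators agreeing with the majority label."""
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / counts.sum()

print(agreement(cat_dog_votes))          # 0.95  -- object labels are stable
print(agreement(np.sign(face_valence)))  # 0.5   -- two clusters, no majority
```

The second call collapses the ratings to their sign, exposing the bimodal split described above: half the annotators read the face as very positive, half as very negative.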
Context matters when judging emotion from faces. Take the examples below. On the left, a set of faces is shown out of context. Most people, if asked, would be inclined to associate them with sadness. Nevertheless, on the right the same faces are shown in context. One can easily infer that all these examples actually correspond to high-valence, almost ecstatic moments of victory and reward. Numerous examples like these exist. Several studies have shown how body posture [1,2], the facial expressions of others, or other factors such as age and race can bias annotators. This points to the important conclusion that labelling large amounts of faces with emotional labels without prior knowledge or context can result in highly biased automatic predictors.
Figure 2. When assessing emotion from faces, context matters. On the left, examples of faces are shown out of context; on the right, context is provided.
At Tawny, we acknowledge and take such matters seriously. We constantly refine our predictors by incrementally adding labels in context. An example of additional context at labeling time is shown in Figure 3. If only the image of the subject were shown, several interpretations would be possible. However, because a context is attached, namely that the subject is watching an advertisement for a food product, and because the annotator has the entire video available when labeling, a better judgment can be made. Under these conditions, it is considerably easier to decide whether the subject is irritated or disgusted, and whether this is actually caused by the stimulus or not.
When actually training predictors, the following basic principle can be applied. A facial expression itself is objective: it has to do only with the particular activations of facial muscles. Assigning an emotional label is an interpretation of that facial expression. Postponing these interpretations and tightly linking them to well-defined contexts is desirable for less biased and less noisy predictors. In this sense, one can first learn emotion-agnostic facial expression representations from large quantities of unlabelled data. At this point, no emotional interpretation is made. Then, one can tune the mapping from facial representation to emotional state in a given context, e.g. watching an advertisement for a food product. This means that a single facial expression classifier can be replaced with a set of experts that offer different interpretations according to the context in which the faces are perceived. This also has an additional advantage: instead of labeling large numbers of faces out of context, one can focus on smaller numbers of contextualized labels. Finally, specific compensations for perceptual biases (age, race, gender, social, speech) can complete the picture.
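The two-stage principle can be sketched in a few lines of code. This is an illustrative toy, not our production system: the encoder is a stub standing in for a model pretrained on unlabelled faces, the random linear heads stand in for trained context experts, and all names, contexts, and class lists are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_face(image: np.ndarray) -> np.ndarray:
    """Stub for a frozen, emotion-agnostic face encoder learned from
    unlabelled data. Here it just flattens the image to a toy 8-dim vector."""
    return image.reshape(-1)[:8]

class ContextExpert:
    """A small head that interprets the shared embedding for one context."""
    def __init__(self, n_classes: int):
        self.n_classes = n_classes
        self.w = rng.normal(size=(8, n_classes))  # toy untrained weights

    def predict(self, embedding: np.ndarray) -> int:
        return int(np.argmax(embedding @ self.w))

# One expert per labelling context; class sets are hypothetical examples.
experts = {
    "food_ad": ContextExpert(n_classes=3),  # e.g. neutral / irritated / disgusted
    "sports": ContextExpert(n_classes=3),   # e.g. neutral / tense / ecstatic
}

def predict_emotion(image: np.ndarray, context: str) -> int:
    """Shared emotion-agnostic encoding, then a context-specific interpretation."""
    return experts[context].predict(encode_face(image))

face = rng.normal(size=(4, 4))  # toy stand-in for a face image
label_food = predict_emotion(face, "food_ad")
label_sport = predict_emotion(face, "sports")
```

The design point is that only the small per-context heads need contextualized labels; the shared encoder never commits to an emotional interpretation.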
[1] Aviezer et al. "The inherently contextualized nature of facial emotion perception." (2017).
[2] Aviezer et al. "Angry, disgusted, or afraid? Studies on the malleability of emotion perception." (2008).
[3] Matsumoto, D., Keltner, D., Shiota, M., O’Sullivan, M., & Frank, M. "Facial expressions of emotions." In M. Lewis, J. M. Haviland-Jones, & L. F. Barrett (Eds.), Handbook of emotions (3rd ed., pp. 211–234). New York, NY: Macmillan. (2008).
[4] Barrett, Lisa Feldman, et al. "Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements." Psychological Science in the Public Interest 20.1 (2019): 1-68.