Automatic 3D facial expression recognition
Rafael Monteiro
September 16, 2013
Date Performed: September 10, 2013
Instructors: Claudio Esperança
Ricardo Marroquim
1 Introduction
Facial expressions are an important aspect of human emotion communication.
They indicate the emotional state of a subject, his or her personality, among
other features. According to Bettadapura [1], their study began with clinical
and psychological purposes, but with recent advances in computer vision,
computer science researchers became interested in developing systems to
automatically detect those expressions.
Automatic facial expression recognition has several applications, such as in
HCI (Human-Computer Interaction), where interfaces could be developed to
respond to certain user expressions, as in games, communication tools, etc.
Although humans can easily recognize a specific facial expression, its
identification by computer systems is not that easy. There are several
challenges involved, such as illumination changes, occlusion, and the presence
of beards, glasses, etc. [2].
In the 70s, one of the first problems faced by researchers was: how to
accurately describe an expression? In 1971, Paul Ekman wrote a study claiming
facial expressions were universal across different cultures [5], and in his
research he defined six basic expressions, which he considered universal
because they can be identified in any culture: joy, sadness, fear, surprise,
disgust and anger [3]. Examples are shown in Figure 1. Later, in 2001, Parrott
identified 136 emotional states and categorized them into three levels:
primary, secondary and tertiary emotions [4]. Primary emotions are Ekman's six
basic emotions, and the other two levels form a hierarchy below them.
Figure 1: Universal expressions: joy, sadness, fear, surprise, disgust and anger
In 1977, Ekman and Friesen developed a methodology to measure expressions
in a more precise way by creating FACS (Facial Action Coding System) [6].
FACS defines basic expression components called Action Units (AUs). They
describe small facial movements, such as raising the inner brows (AU1), or
wrinkling the nose (AU9), and so on. These action units can be combined to
form facial expressions.
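As a toy illustration of how AU combinations could be mapped to expressions, the sketch below hard-codes a few combinations commonly associated with the basic expressions; the exact AU sets are simplified, and the function name is our own, not part of FACS.

```python
# Illustrative mapping from FACS Action Unit combinations to prototypical
# expressions. The AU sets below are simplified examples, not a full FACS table.
AU_COMBINATIONS = {
    frozenset({6, 12}): "joy",            # cheek raiser + lip corner puller
    frozenset({1, 4, 15}): "sadness",     # inner brow raiser + brow lowerer + lip corner depressor
    frozenset({1, 2, 5, 26}): "surprise", # brow raisers + upper lid raiser + jaw drop
    frozenset({4, 5, 7, 23}): "anger",    # brow lowerer + lid raiser/tightener + lip tightener
    frozenset({9, 15}): "disgust",        # nose wrinkler + lip corner depressor
}

def classify_aus(active_aus):
    """Return the expression whose AU set exactly matches the detected AUs."""
    return AU_COMBINATIONS.get(frozenset(active_aus), "unknown")

print(classify_aus([6, 12]))  # joy
print(classify_aus([2]))      # unknown
```

Real AU-based recognizers are more tolerant than this exact-match lookup: they score partial matches and AU intensities rather than requiring the full combination.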
A discussion about the universality of human expressions arose in 1994, when
Russell questioned Ekman's position and discussed several points indicating
that human expressions are not universal across different cultures [7]. In the
same year, Ekman wrote a paper refuting Russell's arguments one by one [8].
Since then, Ekman's position has been widely accepted, and the claim that
human expressions are universal across cultures has been sustained.
Facial expression recognition research has many fields of study. One of them
is 3D facial expression recognition. These systems are based on facial surface
information obtained by creating a 3D model of the subject's face, and they
try to identify the expression in this model. This report discusses some
approaches used in this field. There is a major division between static and
dynamic studies. Static studies are performed on a single picture of a
subject, in which the expression is identified, while dynamic studies consider
the temporal behavior of expressions (see Figure 2). A good example of a
dynamic study is micro-expression analysis. A micro-expression is an
expression that happens in a very short interval of time, generally between
1/25th and 1/15th of a second. Micro-expressions generally occur when a
subject tries to conceal an expression but fails, and the expression appears
on the face for a brief moment.
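The short duration of micro-expressions has a direct consequence for capture hardware: the camera must run fast enough to record several frames of an event lasting as little as 1/25th of a second. A back-of-the-envelope sketch, where the three-frame requirement and the helper name are illustrative assumptions:

```python
def min_fps(duration_s, frames_needed):
    """Minimum capture rate (frames per second) needed to record
    `frames_needed` frames of an event lasting `duration_s` seconds."""
    return frames_needed / duration_s

# The shortest micro-expressions last about 1/25 s; to record at least
# three frames of one, the camera must run at 75 fps or more.
print(min_fps(1 / 25, 3))  # 75.0
```

This is one reason why ordinary 25-30 fps video is marginal for micro-expression analysis: at best a single frame of the event is captured.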
Figure 2: Example of a dynamic facial expression system
One major problem in facial expression studies is capturing spontaneous
expressions. Most facial expression databases are composed of simulated
expressions, such as the ones displayed in Figure 3. It is easier to ask
subjects to display these expressions than to capture expressions generated
spontaneously based on emotional reactions to real-world stimuli. An
interesting development
occurred when Sebe et al. proposed a solution to this problem by using a kiosk
with a camera [9]. People would stop by and watch videos, displaying genuine
emotions while their faces were captured by the camera. At the end of the
study, subjects were asked whether they would allow their images to be used
for academic purposes.
Figure 3: Examples of clearly non-spontaneous facial expressions
2 Facial expression systems
There are many approaches used by facial expression systems. In a recent
survey, Sandbach et al. reviewed the state of the art and noticed that most
systems are organized in three steps: face acquisition, face tracking and
alignment, and expression recognition [10].
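The three-step organization noted by the survey can be summarized as a pipeline skeleton. Every name and placeholder body below is illustrative, not part of any cited system:

```python
# Skeleton of the three-step pipeline (acquisition -> tracking/alignment ->
# recognition). All bodies are placeholders; no real method is implemented.
from typing import Any, List

def acquire_face(raw: Any) -> Any:
    """Step 1: build a 3D face model (e.g. structured light, stereo)."""
    return raw  # placeholder: return the input as a stand-in mesh

def track_and_align(meshes: List[Any]) -> List[Any]:
    """Step 2: align the meshes over time (rigid ICP or non-rigid variants)."""
    return meshes  # placeholder: assume already aligned

def recognize_expression(meshes: List[Any]) -> str:
    """Step 3: extract features and classify (e.g. distances + SVM, HMM)."""
    return "neutral"  # placeholder label

def pipeline(raw_sequence: List[Any]) -> str:
    meshes = [acquire_face(r) for r in raw_sequence]
    return recognize_expression(track_and_align(meshes))

print(pipeline(["frame0", "frame1"]))  # neutral
```

The sections below discuss concrete techniques that could fill in each of these three steps.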
2.1 Face acquisition
Face acquisition is the step that generates a 3D model of the subject's face.
There are several approaches, such as single image reconstruction, structured
light, stereo photometry and multi-view stereo acquisition.
Single image reconstruction methods are an emerging research topic because of
their simplicity: only a single image is required, captured with an ordinary
camera in a non-restricted environment. Blanz and Vetter developed a method
called 3D Morphable Models (3DMM), which statistically builds a model
combining 3D shape and 2D texture information [11]. The method can generate
linear combinations of different expressions and use them to synthesize
expressions and detect them on facial models. The main disadvantages are that
some initialization is required and the method is not robust to partial
occlusions.
Structured light techniques are based on projecting a light pattern onto the
subject's face, analyzing the pattern deformations and recovering 3D shape
information. Figure 4 shows an example of such a system. Hall-Holt and
Rusinkiewicz developed a system using multiple patterns, which are alternately
projected onto the face [12]. An image without the pattern can also be
captured in order to incorporate 2D texture information into the 3D model.
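Once the projected pattern has been decoded into per-pixel projector-camera correspondences, recovering depth reduces to triangulation. A minimal sketch of that final step, with pattern decoding omitted and the numbers invented:

```python
# Final triangulation step of a structured-light scanner: once the pattern
# is decoded into per-pixel correspondences, depth follows from
# z = baseline * focal_length / disparity. The numbers below are invented.
def depth_from_disparity(baseline_m, focal_px, disparity_px):
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return baseline_m * focal_px / disparity_px

# 10 cm projector-camera baseline, 800 px focal length, 40 px disparity:
print(depth_from_disparity(0.10, 800, 40))  # 2.0 (metres)
```

The hard part of such systems is the decoding itself (e.g. the stripe boundary codes of [12]), which this sketch deliberately leaves out.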
Figure 4: Illustration of a structured light system
Stereo photometry is a variation of structured light techniques which uses
more than one light source, each of which can emit a different color, as shown
in Figure 5. Such systems can retrieve surface normals, which can be
integrated in order to recover 3D shape information. Jones et al. developed a
system which uses three lights switching on and off in a cycle around the
camera [13]. The system performs well using either visible or infrared light.
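The core computation behind such systems can be sketched with a Lambertian reflectance model: three known light directions and three measured intensities per pixel give a linear system whose solution is the albedo-scaled normal. This is a simplified illustration that ignores calibration, shadows and specularities:

```python
import numpy as np

# Lambertian photometric-stereo core: I = L @ (albedo * n), with one known
# light direction per row of L. Solving the 3x3 system per pixel yields the
# albedo-scaled normal g; its norm is the albedo, its direction the normal.
def recover_normal(light_dirs, intensities):
    L = np.asarray(light_dirs, dtype=float)   # 3x3: one light direction per row
    I = np.asarray(intensities, dtype=float)  # 3 measured intensities
    g = np.linalg.solve(L, I)                 # g = albedo * normal
    albedo = np.linalg.norm(g)
    return g / albedo, albedo

# Synthetic check: a normal pointing straight at the camera, three tilted lights.
lights = [[0.5, 0.0, 0.866],
          [-0.5, 0.0, 0.866],
          [0.0, 0.5, 0.866]]
true_n = np.array([0.0, 0.0, 1.0])
measured = np.asarray(lights) @ true_n        # simulated intensities, albedo = 1
n, rho = recover_normal(lights, measured)
print(np.round(n, 3), round(rho, 3))          # normal ~ [0, 0, 1], albedo ~ 1.0
```

The per-pixel normals recovered this way are then integrated over the image to obtain the 3D surface, as the text describes.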
Figure 5: Illustration of a stereo photometry system
Multi-view stereo acquisition systems use more than one camera to
simultaneously capture images from different angles and combine these images
to reconstruct the scene. Beeler et al. developed a system which uses high-end
cameras and standard illumination, achieving impressive results with
sub-millimeter accuracy [14].
2.2 Face tracking and alignment
The second step performed by most facial expression systems is face tracking
and alignment. Given two meshes, the problem is to align them in 3D space so
that they can be tracked over time. There are two kinds of alignment: rigid
approaches, which assume similar meshes without large transformations, and
non-rigid approaches, which deal with large transformations. Most rigid
approaches rely on the traditional ICP (Iterative Closest Point) algorithm
[15]. As for non-rigid approaches, there are several different ways to perform
the alignment.
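The heart of one rigid ICP iteration is the closed-form rigid transform between two point sets with known correspondences; full ICP alternates this step with closest-point matching. A sketch using the SVD-based (Procrustes/Kabsch) solution, tested on synthetic data:

```python
import numpy as np

# One building block of rigid ICP: the optimal rotation/translation between
# two point sets with known correspondences, via the SVD-based solution.
# Full ICP alternates this step with closest-point matching.
def best_rigid_transform(src, dst):
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)             # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cd - R @ cs
    return R, t

# Synthetic check: rotate a toy point cloud 30 degrees about z and shift it.
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([1.0, -2.0, 0.5])
pts = np.random.default_rng(0).random((10, 3))
moved = pts @ R_true.T + t_true
R, t = best_rigid_transform(pts, moved)
print(np.allclose(R, R_true), np.allclose(t, t_true))  # True True
```

With perfect correspondences the transform is recovered in one shot; the iteration in ICP exists only because correspondences must be re-estimated after each alignment.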
Amberg et al. created a variant of ICP which adds a stiffness parameter to
control the rigidity of the transformation at each iteration [16]. The
stiffness starts at a high value and is reduced at each iteration, so that the
matching gradually allows a non-rigid transformation to be performed. Rueckert
et al. used an FFD (Free-Form Deformation) model which performs deformations
using control points [17]. By reducing the number of control points, computing
time can be reduced as well. See Figure 6 for an example of an FFD model.
Figure 6: Free-Form Deformation model
Wang et al. used harmonic maps to perform the alignment [18]. The face is
mapped from 3D space to 2D space, by projecting the mesh into a disc, as shown
in Figure 7, thus reducing one dimension. Different discs can be compared in
order to perform alignment. Sun et al. used a similar technique called conformal
mapping, which maps the mesh into a 2D space, preserving the angles between
edges [19]. Tsalakanidou and Malassiotis modified ASMs (Active Shape Models)
[20] to work in 3D, using a face model with the most prominent features, such
as eyes, nose, etc. [21]. Figure 8 shows examples of ASMs plotted on faces.
Figure 7: Harmonic maps

Figure 8: Active Shape Models

2.3 Expression recognition

The third and last step of a facial expression system is to recognize the
expression. In this step, descriptors are extracted, selected and classified
using artificial intelligence techniques. Features can be static or dynamic.
Static features are mostly used on a single image, whereas dynamic features
remain stable across time and can be tracked through successive frames in a
video analysis. Temporal modeling can be done in order to analyze the dynamics
of the expression through time. Most systems use HMMs (Hidden Markov Models)
[22] to perform this task. Common static features are distance-based features,
patch-based features, morphable models and 2D representations.
Distance-based features rely on distances between facial attributes, such as the
distance between the corners of the mouth, or between the mouth and the eye,
and so on. Soyel and Demirel used 3D distances to recognize expressions [23].
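A minimal sketch of such distance-based features, with invented landmark names and coordinates; the actual landmark sets used in [23] differ:

```python
import math

# Toy distance-based feature extraction: Euclidean distances between chosen
# 3D landmark pairs, concatenated into a feature vector. Landmark names,
# coordinates and pairs are invented for illustration.
def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_features(landmarks, pairs):
    return [dist(landmarks[a], landmarks[b]) for a, b in pairs]

landmarks = {
    "mouth_left":  (-2.0, -3.0, 0.5),
    "mouth_right": ( 2.0, -3.0, 0.5),
    "eye_left":    (-1.5,  1.0, 0.0),
}
pairs = [("mouth_left", "mouth_right"), ("eye_left", "mouth_left")]
print(distance_features(landmarks, pairs))  # mouth width 4.0, eye-to-mouth ~4.06
```

The resulting vector (mouth width, eye-to-mouth distance, and so on) is what gets fed to a classifier; an expression such as joy widens the mouth distance, while surprise increases the eye-to-mouth distances.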
Maalej et al. used patch-based features, where patches are small regions of the
mesh represented as surface curves [24], as shown in Figure 9. Patches are
compared against templates by computing the geodesic distance between them.
Ramanathan et al. used a MEM (Morphable Expression Model), where base
expressions are defined and any expression can be modeled as a linear
combination of these base expressions by using morphing parameters [25].
These parameters define a parameter space, where similar expressions form clus-
ters. A new expression is identified by finding the parameters which generate
the closest expression and passing these parameters to a classifier. Berretti et al.
used 2D representations, where the depth map of the face is computed, generat-
ing a 2D image [26]. Classification is done using SIFT (Scale Invariant Feature
Transform) descriptors [27] and SVMs (Support Vector Machines) [28].
Figure 9: Patch-based descriptors
As for dynamic features, there are a few approaches. Le et al. used facial level
curves, since their variation through time can be tracked and calculated using
Chamfer distances [29]. Figure 10 shows an example of such curves. Sandbach
et al. used FFDs to model the lattice deformation over time, and they used
HMMs to perform temporal analysis [30].
Figure 10: Facial level curves
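The HMM-based temporal modeling mentioned above can be illustrated with a minimal discrete-observation forward algorithm: each expression gets its own HMM, and a sequence of quantized frame features is scored against every model. All parameters below are made up for illustration:

```python
import numpy as np

# Minimal forward algorithm for a discrete HMM, sketching how a sequence of
# quantized frame features could be scored against one model per expression.
def forward_likelihood(pi, A, B, obs):
    """P(obs | model) via the forward recursion.
    pi: initial state probs (n,), A: transition probs (n, n),
    B: emission probs (n, m), obs: sequence of observation indices."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # predict, then weight by emission
    return alpha.sum()

# Two toy 2-state ("onset -> apex") models with different emission profiles.
pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])
B_smile = np.array([[0.8, 0.2],
                    [0.1, 0.9]])
B_frown = np.array([[0.2, 0.8],
                    [0.9, 0.1]])
obs = [0, 0, 1, 1]  # quantized features: neutral, neutral, apex, apex
scores = {"smile": forward_likelihood(pi, A, B_smile, obs),
          "frown": forward_likelihood(pi, A, B_frown, obs)}
print(max(scores, key=scores.get))  # smile
```

Real systems work the same way at a higher dimension: one HMM is trained per expression, and a new sequence is labeled with the model that assigns it the highest likelihood.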
Feature classification is generally performed using well-known classifiers,
such as AdaBoost and its variations [31], k-NNs (k-Nearest Neighbors) [32],
Neural Networks [33], SVMs [28], etc.
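As a concrete illustration of the simplest of these classifiers, here is a minimal k-NN over feature vectors; the training data is synthetic:

```python
import math
from collections import Counter

# Minimal k-nearest-neighbors classifier over feature vectors, standing in
# for the off-the-shelf classifiers cited above. Training data is synthetic.
def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs."""
    neighbors = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "neutral"), ((0.1, 0.2), "neutral"),
         ((2.0, 2.1), "joy"), ((2.2, 1.9), "joy"), ((1.9, 2.0), "joy")]
print(knn_predict(train, (2.0, 2.0)))  # joy
```

In a real system the 2D points would be replaced by the distance, patch or depth-map feature vectors described above, and k would be tuned on validation data.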
3 Future challenges
Research on 3D facial expression recognition is evolving, but there are still
some challenges to consider. One is the construction of more spontaneous
expression databases, since most existing databases were built using
artificial expressions. Furthermore, the development of systems capable of
distinguishing a spontaneous expression from an artificial one is also
desirable. Recognition of expressions other than Ekman's six universal
expressions is important, since most systems focus only on these six. Temporal
analysis is still in its infancy. More focus on this area is required,
especially on the analysis of micro-expressions, which are very hard to
detect. Improving algorithm performance is also a crucial factor. Ideally, all
systems should work in real time.
References
[1] V. Bettadapura. Face expression recognition and analysis: The state of the
art. CoRR, abs/1203.6722, 2012.
[2] M. Pantic and L. J. M. Rothkrantz. Automatic analysis of facial expressions:
The state of the art. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22:1424–1445, 2000.
[3] P. Ekman. Universals and Cultural Differences in Facial Expressions of
Emotion. University of Nebraska Press, 1971.
[4] W.G. Parrott. Emotions in Social Psychology: Essential Readings. Key
readings in social psychology. Psychology Press, 2001.
[5] P. Ekman and W. V. Friesen. Constants across cultures in the face and emo-
tion. Journal of Personality and Social Psychology, 17(2):124–129, 1971.
[6] P. Ekman and W. V. Friesen. Manual for the Facial Action Coding System.
Consulting Psychologists Press, 1977.
[7] J. A. Russell. Is there universal recognition of emotion from facial ex-
pressions? A review of the cross-cultural studies. Psychological Bulletin,
115(1):102–141, 1994.
[8] P. Ekman. Strong evidence for universals in facial expressions: a reply to
Russell's mistaken critique. Psychological Bulletin, 115(2):268–287, 1994.
[9] N. Sebe, M.S. Lew, I. Cohen, Yafei Sun, T. Gevers, and T.S. Huang. Au-
thentic facial expression analysis. In Automatic Face and Gesture Recog-
nition, 2004. Proceedings. Sixth IEEE International Conference on, pages
517–522, 2004.
[10] G. Sandbach, S. Zafeiriou, M. Pantic, and L. Yin. Static and dynamic 3d
facial expression recognition: A comprehensive survey. Image and Vision
Computing, 30(10):683–697, 2012. 3D Facial Behaviour Analysis and
Understanding.
[11] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces.
In Proceedings of the 26th annual conference on Computer graphics and
interactive techniques, SIGGRAPH ’99, pages 187–194, New York, NY,
USA, 1999. ACM Press/Addison-Wesley Publishing Co.
[12] O. Hall-Holt and S. Rusinkiewicz. Stripe boundary codes for real-time
structured-light range scanning of moving objects. In Eighth IEEE Inter-
national Conference on Computer Vision, pages 359–366, 2001.
[13] A. Jones, G. Fyffe, Xueming Yu, Wan-Chun Ma, J. Busch, R. Ichikari,
M. Bolas, and P. Debevec. Head-mounted photometric stereo for perfor-
mance capture. In Visual Media Production (CVMP), 2011 Conference for,
pages 158–164, 2011.
[14] T. Beeler, B. Bickel, P. Beardsley, B. Sumner, and M. Gross. High-quality
single-shot capture of facial geometry. In ACM SIGGRAPH 2010 papers,
SIGGRAPH ’10, pages 40:1–40:9, New York, NY, USA, 2010. ACM.
[15] P.J. Besl and Neil D. McKay. A method for registration of 3-d shapes. Pat-
tern Analysis and Machine Intelligence, IEEE Transactions on, 14(2):239–
256, 1992.
[16] B. Amberg, S. Romdhani, and T. Vetter. Optimal step nonrigid icp algo-
rithms for surface registration. In Computer Vision and Pattern Recogni-
tion, 2007. CVPR ’07. IEEE Conference on, pages 1–8, 2007.
[17] D. Rueckert, L. I. Sonoda, C. Hayes, D. L. G. Hill, M. O. Leach, and D. J.
Hawkes. Nonrigid registration using free-form deformations: Application
to breast mr images. IEEE Transactions on Medical Imaging, 18:712–721,
1999.
[18] Y. Wang, M. Gupta, S. Zhang, S. Wang, X. Gu, D. Samaras, and P. Huang.
High resolution tracking of non-rigid motion of densely sampled 3d data
using harmonic maps. Int. J. Comput. Vision, 76(3):283–300, March 2008.
[19] Y. Sun, X. Chen, M. Rosato, and L. Yin. Tracking vertex flow and model
adaptation for three-dimensional spatiotemporal face analysis. Systems,
Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions
on, 40(3):461–474, 2010.
[20] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham. Active shape
models-their training and application. Computer Vision and Image Under-
standing, 61(1):38 – 59, 1995.
[21] F. Tsalakanidou and S. Malassiotis. Real-time facial feature tracking from
2d-3d video streams. In 3DTV-Conference: The True Vision - Capture,
Transmission and Display of 3D Video (3DTV-CON), 2010, pages 1–4,
2010.
[22] L. E. Baum and T. Petrie. Statistical Inference for Probabilistic Functions
of Finite State Markov Chains. The Annals of Mathematical Statistics,
37(6):1554–1563, 1966.
[23] H. Soyel and H. Demirel. Facial expression recognition using 3d facial
feature distances. In Mohamed Kamel and Aurélio Campilho, editors, Image
Analysis and Recognition, volume 4633 of Lecture Notes in Computer Sci-
ence, pages 831–838. Springer Berlin Heidelberg, 2007.
[24] A. Maalej, B. Ben Amor, M. Daoudi, A. Srivastava, and S. Berretti. Local
3d shape analysis for facial expression recognition. In Pattern Recognition
(ICPR), 2010 20th International Conference on, pages 4129–4132, 2010.
[25] S. Ramanathan, A. Kassim, Y.V. Venkatesh, and W.S. Wah. Human facial
expression recognition using a 3d morphable model. In Image Processing,
2006 IEEE International Conference on, pages 661–664, 2006.
[26] S. Berretti, B. Ben Amor, M. Daoudi, and A. del Bimbo. 3d facial expres-
sion recognition using sift descriptors of automatically detected keypoints.
The Visual Computer, 27(11):1021–1036, 2011.
[27] D.G. Lowe. Object recognition from local scale-invariant features. In Com-
puter Vision, 1999. The Proceedings of the Seventh IEEE International
Conference on, volume 2, pages 1150–1157 vol.2, 1999.
[28] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning,
20(3):273–297, 1995.
[29] V. Le, H. Tang, and T.S. Huang. Expression recognition from 3d dynamic
faces using robust spatio-temporal shape features. In Automatic Face Ges-
ture Recognition and Workshops (FG 2011), 2011 IEEE International Con-
ference on, pages 414–421, 2011.
[30] G. Sandbach, S. Zafeiriou, M. Pantic, and D. Rueckert. Recognition of
3d facial expression dynamics. Image and Vision Computing, 30(10):762 –
773, 2012. 3D Facial Behaviour Analysis and Understanding.
[31] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting, 1995.
[32] David Bremner, Erik Demaine, Jeff Erickson, John Iacono, Stefan Langer-
man, Pat Morin, and Godfried Toussaint. Output-sensitive algorithms for
computing nearest-neighbour decision boundaries. In F. Dehne, J. Sack,
and M. Smid, editors, Algorithms and Data Structures, volume 2748 of
Lecture Notes in Computer Science, pages 451–461. Springer Berlin Hei-
delberg, 2003.
[33] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall
PTR, Upper Saddle River, NJ, USA, 2nd edition, 1998.