title: Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion
authors: Ng, Evonne; Joo, Hanbyul; Hu, Liwen; Li, Hao; Darrell, Trevor; Kanazawa, Angjoo; Ginosar, Shiry
date: 2022-04-18

We present a framework for modeling interactional communication in dyadic conversations: given multimodal inputs of a speaker, we autoregressively output multiple possibilities of corresponding listener motion. We combine the motion and speech audio of the speaker using a motion-audio cross-attention transformer. Furthermore, we enable non-deterministic prediction by learning a discrete latent representation of realistic listener motion with a novel motion-encoding VQ-VAE. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions. Moreover, it produces realistic 3D listener facial motion synchronous with the speaker (see video). We demonstrate that our method outperforms baselines qualitatively and quantitatively via a rich suite of experiments. To facilitate this line of research, we introduce a novel and large in-the-wild dataset of dyadic conversations. Code, data, and videos are available at https://evonneng.github.io/learning2listen/.

"Thus the body of the speaker dances in time with his speech. Further, the body of the listener dances in rhythm with that of the speaker!" - CONDON AND OGSTON, 1966

When we speak, it is rarely in a void - rather, there is often a listener at the other end of the conversation. As a speaker, we are acutely aware of what the listener is doing. A slight off-sync motion or a diverted look may throw us off, suggesting the listener is bored or otherwise preoccupied, leaving us feeling misunderstood [41].
Indeed, successful conversations rely on a coordinated dance between the speaker and the listener in which the two signal to each other that they are communicating with one another and not with anyone else [41]. This chameleon effect [12] of nonverbal mimicry during conversation results in smoother interactions, increases the liking between interaction partners, establishes rapport [43], and may even predict the long-term outcome of psychotherapy [56]. Interestingly, nonverbal feedback from a listener, such as head movement, is more central to keeping a conversation flowing than content-based replies [11].

In this work, we propose a computational framework that can similarly provide nonverbal feedback in response to a speaker in a contextual and timely manner. Such an ability is critical for virtual agents to meaningfully interact with humans, for whom nonverbal communication is central from infancy [61]. Modeling nonverbal feedback during dyadic interaction is a difficult problem, as listener responses are non-deterministic in nature. Moreover, speakers are inherently multimodal, as they communicate both verbally via speech and nonverbally via face and body motion. Capturing interaction in its natural setting requires addressing both challenges.

The task of modeling human conversations has a long history. However, unlike traditional rule-based methods [5, 10, 25, 32] or methods that rely on modeling hand-defined simple motion characteristics such as smiles [58] or head nods [25, 32], we wish to model the true complexity of the interaction. This is hard to achieve and generalize using conventional database methods that generate motion via a lookup into a database of ground-truth motion [40, 59, 67]. We therefore learn to model these dyadic conversational dynamics implicitly, in a data-driven way, by directly observing human conversations in in-the-wild videos. Given a video of a speaker, we extract their speech audio and facial motion (Figure 1, left).
We combine information from both modalities using a motion-audio cross-attention transformer. From this multimodal speaker input, we learn to autoregressively synthesize multiple modes of motion representing different possible responses of a listener who moves synchronously with the speaker (Figure 1, right).

Modeling the non-determinism in listener responses is a key element in capturing conversational dynamics. Previous attempts to tackle this problem applied various techniques but fell short of achieving realistic outputs [38]. We propose to learn a realistic manifold of listener motion by quantizing the space of listener motion with a novel sequence-encoding VQ-VAE [64], which efficiently captures a wide range of motion in a discrete format that is well-suited for learning. To the best of our knowledge, we are the first to extend VQ-VAE models to the domain of motion synthesis. The learned discrete codebook of listener motion allows us to predict a multinomial distribution of future motion. From this distribution, we can sample a wide range of possible modes of motion representing different perceptually plausible listeners, capturing their inherent non-deterministic nature. Furthermore, we demonstrate that our learned discrete latent codes stay on the manifold of realistic motion, ensuring no motion drift occurs even in long-horizon predictions. Meanwhile, the autoregressive nature of our method allows us to consider speaker sequences of any length.

To support our data-driven approach to modeling human conversation, data is needed in the form of videotaped dyadic interactions where both parties are ideally filmed from a head-on frontal view. This kind of data is hard to come by. While the first investigation of interactional synchrony in conversation dates back to Condon and Ogston in 1966 [17], current studies still mostly rely on in-lab footage [13, 22, 27, 32] or small-scale motion-capture datasets [7, 38].
Notable exceptions are [20, 50], yet the footage has not been made publicly available. We collect a large-scale source of data in the form of split-screen recorded online interviews where the speaker and listener are captured in frontal view. Our dataset, which consists of 72 hours of in-the-wild conversations, enables the investigation of dyadic communication using the latest machine learning methods.

We evaluate the synthesized listener motion against ground truth as well as baseline methods and ablations via an extensive quantitative study. We employ a wide array of metrics to test the realism and diversity of the synthesized motion, and the synchronization of the listener's motion with that of the speaker. While measuring realism and diversity centers on the generated motion of the listener in isolation, synchrony captures aspects of the dyad as a whole. We further corroborate our quantitative findings by inviting human observers to evaluate our results. While we assess our method using the raw 3D mesh output, we additionally illustrate our results by translating the 3D output to pixels for viewing purposes only, as synthesized video provides a richer perceptual context.

Under both quantitative and qualitative measures, our method significantly outperforms all baselines. Our synthesized listeners were deemed plausible by human observers when compared to ground-truth motion. This highlights our method's ability to produce realistic-looking motion that is synchronous with a given speaker.

Our main contribution is our learning-based approach towards understanding human interactional communication in conversation. We combine multimodal speaker inputs via motion-audio cross-attention. We extend vector quantization to the domain of motion synthesis and learn a quantized space of motion in which we autoregressively predict multiple modes of perceptually realistic listener motion.
To support future endeavors in this direction, we publicly release a novel dataset of 72 hours of in-the-wild dyadic conversational videos with detailed 3D annotations capturing subtleties in expression and fine-grained head motion.

We discuss related work concerned with conversational agents and motion synthesis. For a review of interactional motion in human communication, see Appendix A.

Interactional Motion in Conversational Agents. Prior works on conversational avatars manually incorporated different aspects of interactional motion [5, 10, 25, 32, 60]. These approaches designed rule-based methods to generate agents that can interact via appropriate facial gestures [25, 32, 60], speech [10], or a combination of modalities [5]. All these methods use lab-recorded motion-capture sequences. These either limit the variety of captured gestures or rely on simplifying assumptions for motion generation that do not hold for in-the-wild data.

Prior data-driven methods predict the 2D motion of one person in a conversation as a function of the other's motion [20, 50]. These require a pre-defined dictionary obtained by clustering motion frequencies or 2D facial keypoints from the training set. In contrast, we reason in 3D and learn a discretized latent space that captures the manifold of facial motion. Other methods using 3D investigate interactional dynamics while focusing on full 3D body motion and turn-taking [2, 39]. Others tackle the problem of facial gestures in conversation by simplifying the task to predicting head nods [2], estimating head pose [26], or generating a single image of a facial expression that summarizes the entire speaker sequence [33, 50]. In contrast, our method captures the natural complexity of interactions by considering the full range of facial expressions and head rotations. Recent methods began generating 3D facial motion with additional inputs from the listener, such as text [14] or speech [37, 38].
Most similar to our approach is that of Jonell et al. [38], who propose a Glow-based method [28, 42]. However, their method takes as input the full temporal context of listener audio and is reported to perform better without any audio input. In contrast, our method does not use any listener audio as additional input. Additionally, we quantitatively demonstrate that each of the input modalities is essential to its performance.

Conditional Motion Synthesis. Gestural motion synthesis has previously relied on convolutional auto-encoders to learn a representation of human motion [20, 23, 38, 39, 49]. Some methods incorporated an adversarial loss [23, 49] or experimented with flow models [38] and other sampling-based methods [20] to generate more diverse and realistic motion. Recent works demonstrated the success of transformers in generating diverse motion with long-range dependencies [9, 44, 45, 55]. These generate possible motion segments conditioned on action [55], 3D human motion trajectories in a scene conditioned on a goal [9], or dance motion from audio [44, 45]. Similarly, we employ a transformer-based predictor for conditional motion synthesis. Additionally, to the best of our knowledge, we are the first to demonstrate the benefits of using vector quantization (VQ-VAE [64]) for improved motion synthesis. In essence, rather than relying on the addition of Perlin noise [54] for improved realism, we learn the fine details of realistic motion in a data-driven way.

Our goal is to model the conversational dynamics between a speaker and a listener. To test whether our model captures the subtleties of face-to-face communication, we synthesize the interactional motion responses of the listener, which are known to be essential to the flow of conversation [12, 41, 43].

Figure 2. Overview: We predict a distribution over future listener motion conditioned on multimodal inputs from a speaker.
We use cross-modal attention to fuse the speaker audio and motion input, and a novel sequence-encoding VQ-VAE to discretize past listener motion. Our autoregressive Predictor outputs a distribution over the K discrete codebook indices, from which we sample a code for the next timestep. We obtain the continuous future listener motion by decoding the sampled codebook index.

We define the following task: given the 3D facial motion and audio of the speaker, we autoregressively predict the corresponding facial motion of the listener. To represent the ongoing flow of conversation, we define a transformer-based predictor, P, that learns to model temporally long-range patterns in the input sequence (Sec. 3.4). The predictor takes two inputs: one corresponding to the speaker and the other to the listener (Figure 2). To model the speaker's audio and facial motion, we introduce a motion-audio cross-modal transformer that learns to fuse the two modalities (Sec. 3.3). To represent the manifold of realistic listener facial motion, we extend VQ-VAE [64] to the domain of motion synthesis and learn a codebook of a discrete latent space (Sec. 3.2). This discrete representation enables us to predict a multinomial distribution over the next timestep of motion. Thus, the output of the autoregressive predictor is a distribution over possible synchronous and realistic listener responses, from which we can sample multiple trajectories.

Let F_{1:T} = (f_1, ..., f_T) be a temporal sequence of facial motions f_i. We use F^S and F^L to denote the motion of the speaker and listener, respectively. For each timestep t ∈ [1, T], we take as input a speaker's facial motion F^S_{1:t} = (f^S_1, ..., f^S_t) and their corresponding speaker audio sequence A^S_{1:t}, along with any previously predicted past listener motion F̂^L_{1:t-1}, if available.
Our predictor, P, then autoregressively predicts the corresponding listener facial motion one timestep at a time:

f̂^L_t ∼ P(F^S_{1:t}, A^S_{1:t}, F̂^L_{1:t-1}),

where P learns to model the distribution over the next timestep of listener motion.

To obtain speaker-only audio, we filter out all listener audio back-channels using sound-source separation [51]. To represent the motion, we estimate the 3D facial expressions and orientations from video frames of human conversations using a 3D Morphable Face Model (3DMM) [4, 8, 46, 53]. 3DMMs are parametric facial models that allow us to directly regress disentangled coefficients corresponding to facial expression, head orientation, and identity-specific shape from a single image [69]. This process results in facial expression coefficients β_t ∈ R^{d_m}, where d_m is the dimension of the expression coefficient, a normalized 3D head pose R_t ∈ SO(3), and shape coefficients that we discard to obtain an identity-agnostic representation. Our facial representation at time t, f_t ∈ R^{d_m+3}, is a concatenation of expression and orientation (in Euler angles):

f_t = [β_t, R_t].

We normalize facial orientation by computing the mean frontal face direction per video (i.e., the orientation at rest pose) and align all head poses in the sequence with respect to this rest pose. This allows us to achieve a camera-view-agnostic representation. In contrast to the 2D representations used in some prior works [20, 50], our 3D representation is invariant to changes in facial shape, scale, and camera pose, allowing us to generalize across new faces and camera viewpoints.

We extend the use of VQ-VAE [64] to produce multiple realistic modes of different listener responses. VQ-VAE was originally proposed as a method to learn a quantized codebook of image elements from which images could be synthesized autoregressively. Convolutional architectures were used both for learning the codebook and for recombining the discrete elements into images [64].
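The autoregressive rollout described above can be sketched as a simple loop: at each step the predictor consumes the speaker's motion and audio up to the current frame plus all previously predicted listener frames. The `predictor` callable below is a hypothetical stand-in for the trained transformer P (which in the paper operates on quantized codes in windows of w frames); the toy mirror predictor and all dimensions are illustrative assumptions.

```python
import numpy as np

def predict_listener(speaker_feats, speaker_audio, predictor, horizon):
    """Autoregressive rollout sketch: at step t, the predictor sees the
    speaker's motion+audio up to t and every previously predicted listener
    frame, and emits the next listener frame."""
    listener = []
    for t in range(horizon):
        past = (np.stack(listener) if listener
                else np.zeros((0, speaker_feats.shape[1])))  # no history yet
        listener.append(predictor(speaker_feats[: t + 1],
                                  speaker_audio[: t + 1], past))
    return np.stack(listener)

# toy predictor: copy the speaker's current frame (cf. the Mirror baseline)
toy = lambda m, a, past: m[-1]

T, d = 32, 56
rng = np.random.default_rng(0)
sp = rng.standard_normal((T, d))      # speaker facial features
au = rng.standard_normal((T, 128))    # speaker audio features (assumed dim)
out = predict_listener(sp, au, toy, T)
assert out.shape == (T, d)
```

In the actual model, the loop body would predict a distribution over codebook indices, sample one, and decode it to a w-frame motion chunk rather than a single frame.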
While the synthesis step was later replaced by transformer architectures that can learn long-range connections [18], image-generation approaches employ a convolutional encoder-decoder pair. This is well-suited for images but not for temporal sequences, where convolving over the temporal domain may lose high-frequency information. We design a novel sequence-encoding VQ-VAE in which we utilize transformers for the encoder-decoder pair. To the best of our knowledge, we are the first to apply a VQ-VAE to the domain of motion generation. The advantages of this method are three-fold: (1) it allows us to predict a multinomial distribution over future motion from which we can sample many possible output modes, (2) the learned discrete latent codes allow us to stay on the manifold of realistic motion, ensuring no drift occurs (a problem for methods that directly regress continuous outputs [3]), and (3) it produces realistic motion that captures high-frequency movements.

Specifically, we train a VQ-VAE transformer encoder E and decoder D. To handle the temporal nature of the input, we learn to model longer listener motion sequences in terms of shorter temporal components. Rather than considering static expressions/rotations independently, the latent embedding covers multiple frames, allowing it to learn motion dynamics. The latent embedding represents motion segments of temporal window size w from a discrete codebook Z = {z_k}, k = 1, ..., K, where z_k ∈ R^{d_z}, that we jointly learn with E and D. Z maps each of the K codebook entries to a discrete code element of dimension d_z. As shown in Figure 3, we can then approximate any raw listener motion segment x. First, we encode x into ẑ = E(x) ∈ R^{τ×d_z}, where τ = T/w is the length of the patch-wise encoded sequence.
Second, we obtain the quantized sequence z_q via an element-wise quantization function q(·) that maps each element of the encoded sequence ẑ to its closest codebook entry:

q(ẑ_i) = argmin_{z_k ∈ Z} ||ẑ_i − z_k||.

Finally, the reconstruction x̂ ≈ x is given by:

x̂ = D(z_q) = D(q(E(x))).

We train E, D, and the codebook with the loss function [64]

L_VQ = ||x − x̂||_2 + ||sg[E(x)] − z_q||²_2 + β ||sg[z_q] − E(x)||²_2,

where ||x − x̂||_2 is a reconstruction loss, sg[·] is a stop-gradient operation, and ||sg[z_q] − E(x)||²_2 is a "commitment loss" [64]. After learning the codebook of listener motion, we use the pretrained encoder to quantize the listener motion input to the predictor (Figure 2).

Figure 3. Motion VQ-VAE that learns a discrete listener motion codebook. The input is a T-length sequence of raw listener facial motion (expression coefficients and 3D head rotations). The transformer sequence-encoder E compresses the input into an embedding that gets mapped to its closest quantized codebook element in Z. The transformer decoder D decodes the quantized embedding into an approximate reconstruction of the input. We train on a reconstruction loss and commitment loss (Eq. 6). Not only does the VQ-VAE allow us to learn a representation robust to drift from autoregressive inference, it also enables non-deterministic motion synthesis.

From the speaker, we take as input both audio a = A^S_{1:t+w} and facial motion m = F^S_{1:t+w}. Here, w is the amount of additional future context we see from the speaker. This context acts as a feedback delay that is beneficial in improving learned synchrony for robotics [66]. In contrast to the listener motion, we do not quantize the speaker inputs. While we experimented with both options, we found that speaker motion quantization did not improve results, and quantizing the audio deteriorated the results significantly. We conclude that while quantization is beneficial for predicted motion, for the quality of results as well as sampling capabilities, it is not advantageous for input modalities.

We learn to fuse the audio and motion modalities together using cross-modal attention. Cross-modal attention of text and audio [1] or language and vision [35, 47, 63] has been shown to outperform early or late fusion. We extend its use to successfully fuse information from motion and audio, a task that proved difficult for previous approaches [38]. We additionally experimented with a naive method of concatenating audio and motion, but this resulted in empirically worse results due to overly long conditioning sequences. Applying cross-modal attention along a temporal sequence also allows different modalities to discover some temporal re-alignment [1]. This is especially helpful for encoding speaker inputs, since a speaker's motion may not always align with their speech (e.g., a delay for dramatic effect).

We compute the Queries Q_a for the cross-modal attention operation from the audio input, and the Keys K_m and Values V_m from the motion. We then apply a series of cross-modal attention blocks on the motion modality, where the audio queries are always computed from the raw audio:

CrossAttn(Q_a, K_m, V_m) = softmax(Q_a K_m^T / √d_k) V_m.

Here, d_k is the transformer hidden dimension. The cross-modal transformer outputs an intermediate embedding that incorporates information from both the audio and motion of the speaker. Additional convolutional layers temporally downsample the sequence to match the size of the quantized listener sequence. The final speaker encoding is an embedding m ∈ R^{(τ+1)×d_k}. We experimentally verify that this method of fusion outperforms others (Table 1).

We design a transformer-based predictor module, P, to capture long-range correlations in the input data. Building off [45], we employ full-attention masking on the inputs, which has shown promising results in generating long-range motion in an autoregressive manner.
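A single cross-modal attention step, with queries from audio and keys/values from motion, can be sketched in numpy. This is a minimal, unbatched, single-head sketch; the projection weights, feature dimensions, and absence of residual connections and layer norm are simplifying assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio, motion, W_q, W_k, W_v):
    """Queries Q_a come from audio; Keys K_m and Values V_m from motion.
    audio: (T_a, d_a); motion: (T_m, d_m'); W_* project into a shared d_k."""
    Q = audio @ W_q                                   # (T_a, d_k)
    K = motion @ W_k                                  # (T_m, d_k)
    V = motion @ W_v                                  # (T_m, d_k)
    d_k = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # (T_a, T_m)
    return attn @ V                                   # (T_a, d_k)

rng = np.random.default_rng(0)
T, d_a, d_m, d_k = 16, 128, 56, 64                    # illustrative sizes
audio  = rng.standard_normal((T, d_a))
motion = rng.standard_normal((T, d_m))
W_q = rng.standard_normal((d_a, d_k)) * 0.1
W_k = rng.standard_normal((d_m, d_k)) * 0.1
W_v = rng.standard_normal((d_m, d_k)) * 0.1
fused = cross_modal_attention(audio, motion, W_q, W_k, W_v)
assert fused.shape == (T, d_k)
```

Because the attention weights span the whole motion sequence, each audio timestep can attend to motion frames at a different time, which is how the temporal re-alignment mentioned above arises.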
However, with our discrete latent code representation, our model is additionally able to capture multiple modes of output by predicting the distribution over possible next motions. Furthermore, we enable multimodal inputs by means of cross-attention.

P takes as input the multimodal speaker embedding m as well as the sequence of previously predicted listener motion. Rather than representing the listener's quantized motion as a sequence of codebook vectors z_q, for the purpose of prediction we use the parallel representation of a sequence of corresponding codebook indices, s = s_{1:τ} ∈ {1, ..., K}^τ. Specifically, we discretize past continuous listener motion x = F^L_{1:t} by encoding it via the pre-trained encoder E and quantization q (Section 3.2). We then obtain the sequence of indices of the nearest codebook entry per element via I(·), an element-wise inverse-lookup function that returns the index of a given codebook element:

s = I(q(E(x))).

Given speaker input m and listener input s_{1:τ}, the predictor outputs p(s_{τ+1}) ∈ R^K, the multinomial distribution of the next listener codebook index across the K entries:

p(s_{τ+1}) = P(m, s_{1:τ}).

We can then sample from p(s_{τ+1}) to obtain an index k into the codebook Z. We perform a codebook lookup to retrieve the corresponding quantized element z_k of listener motion, which we pass through the decoder D. The output is the predicted continuous future listener motion ŷ = F̂^L_{t+1:t+1+w} of length w.

We train our network with a cross-entropy loss on the codebook index s_{τ+1}:

L = −log p(s*_{τ+1}),

where the target codebook index s*_{τ+1} at τ+1 is computed from ground-truth future facial motion y = F^L_{t+1:t+1+w}. At train time, we follow teacher forcing and use ground-truth listener motion y as past listener input. We randomly mask prior timesteps in [1, τ] to facilitate autoregressive learning. At test time, we input zeros for timesteps without prior listener predictions and adjust the masking to ignore these timesteps.
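Sampling the next codebook index from the predicted distribution can be sketched as follows. The paper states that nucleus sampling [31] is used at test time; the numpy implementation below (and the function name, `top_p` default, and toy logits) is a hedged sketch of that idea, not the authors' code.

```python
import numpy as np

def sample_next_index(logits, rng, top_p=0.9):
    """Nucleus (top-p) sampling over the K codebook indices: keep the
    smallest set of indices whose probability mass reaches top_p, then
    sample from the renormalized distribution."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # indices by descending prob
    cdf = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cdf, top_p) + 1]
    p = probs[keep] / probs[keep].sum()      # renormalize the nucleus
    return int(rng.choice(keep, p=p))

rng = np.random.default_rng(0)
K = 200                                      # codebook size from the paper
logits = rng.standard_normal(K)              # toy predictor output
k = sample_next_index(logits, rng)
assert 0 <= k < K
```

The sampled index k is then looked up in the codebook and decoded by D into the next w frames of continuous listener motion; sampling repeatedly from the same distribution yields the multiple output modes discussed above.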
This allows us to autoregressively predict future listener motion for arbitrary-length input. No ground-truth past listener motion is seen by the network at test time.

Figure 4. Synchrony of expressions between speaker and listener measured by PCC across a sequence. We convert the expression sequence to a 1D lip-curvature time series according to [24]. Ours best matches the synchrony seen in ground truth. NN produces sequences that are too synchronous with the speaker. a+m and m fail to follow the major trends seen in ground truth, such as periods of (a) high synchrony when both the listener and speaker are laughing, and (b) low/no synchrony when the speaker speaks and the listener continues smiling.

Due to the recent COVID-19 pandemic, videotaped interviews have migrated towards teleconferencing platforms that feature a split-screen panel with the host on one side of the screen and the interviewee on the other. This setup is especially advantageous for studying face-to-face communication, since both individuals directly face the camera. To cover a broad range of expressions from diverse settings and people, we extract the facial motion and audio for 72 hours of videos from 6 YouTube channels. Each channel features a plethora of interviewees and hosts from a variety of backgrounds.

We leverage a state-of-the-art facial expression extraction method, DECA [21], to recover the 3D head pose and expression coefficients from in-the-wild videos. DECA estimates the pose, expression, and shape parameters according to the FLAME 3DMM [46]. The 3DMM defines 50 expression coefficients along with a 3D jaw rotation (d_m = 53), and 3D head rotation in Euler angles as described in Sec. 3.1. For audio, we use sound-source separation [51] to isolate the speaker's voice. We use these expressions, poses, and speaker-only audio as pseudo ground truth to train our codebook (Eq. 6) and prediction model (Eq. 10). See Appendix C for details. We release this large-scale, novel dataset.
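The per-frame facial representation assembled from these DECA/FLAME outputs (Sec. 3.1) can be sketched as below. One simplification to flag: the paper normalizes head orientation against the per-video rest pose; here we approximate that by subtracting the mean Euler angles, whereas a faithful implementation would compose rotations relative to the rest-pose rotation.

```python
import numpy as np

def facial_representation(expr, jaw, head_euler):
    """Build per-frame features f_t: 50 FLAME expression coefficients plus
    3D jaw rotation (d_m = 53), concatenated with head orientation in Euler
    angles normalized against the per-video rest pose (here: mean-subtracted,
    a simplification of the rotation-based alignment in the paper)."""
    rest = head_euler.mean(axis=0, keepdims=True)      # per-video rest pose
    head = head_euler - rest                           # view-agnostic orientation
    return np.concatenate([expr, jaw, head], axis=-1)  # (T, 50 + 3 + 3) = (T, 56)

T = 64
rng = np.random.default_rng(0)
f = facial_representation(rng.standard_normal((T, 50)),   # expression coeffs
                          rng.standard_normal((T, 3)),    # jaw rotation
                          rng.standard_normal((T, 3)))    # head Euler angles
assert f.shape == (64, 56)
```

The resulting 56-dimensional frames are what the sequence-encoding VQ-VAE consumes in windows of w frames.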
We evaluate our model's ability to effectively translate the speaker's audio and motion into corresponding listener motion. We employ an extensive set of quantitative metrics to measure the realism, diversity, and synchrony of the listener's facial motion. Further, we perform a perceptual study to corroborate the quantitative results. All evaluations are done against the raw ground-truth listener motion y. We discuss person-agnostic listener models in Appendix D.

Implementation Details. We use w = 8, T = 64, K = 200, d_z = 256, t = 32. We add random masking of input past listener motion. While we train on many different input speaker identities, each codebook and predictor model is trained on a specific listener (i.e., person-specific listener behavior for any speaker input). For all, we use a train/val/test split of 70%/20%/10%. Quantitative results are aggregated over all listener models. At test time, we use nucleus sampling [31]. To improve the visual perceptibility of our results, we also train a person-specific mesh-to-pixel visualization module to directly translate 3DMM predictions to a picture of the listener (Figure 1); see Appendix B and the video. However, since photorealistic generation is not the main focus of our work, all evaluations are done on the 3D mesh reconstructions, which are the direct outputs of our model.

Evaluation Metrics. Quantifying motion realism is a difficult problem that cannot be reduced to a single metric. We thus evaluate our predictions along multiple axes based on a composition of metrics from prior work. Our evaluation suite is based on the notion that good listeners should display (1) realistic and (2) diverse motion that is (3) synchronous with the motion of the speaker. We assess expression and rotation separately according to these three pillars:

• L2: Distance to ground-truth expression coefficients/pose.
• Frechet distance for realism: Motion realism measured by the distribution distance between generated and ground-truth motion sequences, following [45]. We directly calculate the Frechet distance (FD) [30] in the expression space R^{T×d_m} or the head pose space R^{T×3} on the full motion sequence.
• Variation for diversity: Variance in motion across a sequence. We calculate the variance across the time-series sequence of expression coefficients or 3D rotations.
• SI for diversity: Diverseness of predictions. As in [68], we empirically run k-means to cluster all listener expressions/rotations from the training set. We compute the average entropy (Shannon index) of the cluster-id histogram of predicted sequences; k = 15, 9 for expression and rotation, respectively.
• Paired FD for synchrony: Quality of listener-speaker dynamics measured by distribution distances on listener-speaker pairs (P-FD). We calculate FD [30] on concatenated listener-speaker expressions R^{T×(d_m+d_m)} / poses R^{T×(3+3)}.
• PCC for synchrony: The Pearson correlation coefficient (PCC), a popular metric used to quantify global synchrony in psychology [6, 57]. It measures how a listener covaries with a speaker over a 1D time series. We calculate lip curvature [24] to measure smile synchrony (Fig. 4). For rotation, we measure synchrony in up/down head motion (nods).
• TLCC for synchrony: We further analyze the leader-follower relationship between our generated listeners and the input speakers by calculating the time-lagged cross-correlation (TLCC) [6]. For x ∈ [0, 60] frames (up to 2 s), we shift the speaker forward by x frames and calculate the correlation between the delayed speaker and the corresponding listener. The peak correlation indicates when the two time series are most synchronized. We also use this analysis to find the optimal delay for the Mirror Delay baseline below.

Baselines. We compare to the following baselines:

• NN motion: A segment-search method commonly used for synthesis in graphics.
Given an input speaker motion, we find its nearest neighbor from the training set and use its corresponding listener segment as the prediction. We found NN on the full 64-frame sequence to work better than NN on smaller subsequences that are then interpolated together.
• NN audio: Same as above, but we find the NN via audio embeddings obtained from a pretrained VGGish [29] model.
• Random: Return a randomly chosen 64-frame motion sequence of a listener from the training set.
• Median: A simple yet strong baseline exploiting the prior that the listener is often still; returns the median expression/pose from the training set.
• Mirror: Return the speaker's motion, smoothed.
• Mirror Delay: Here we mirror the speaker's smoothed motion delayed by 17 frames (≈ 0.5 s). While [20] delayed by 90 frames, we analytically found the optimal lag according to time-lagged cross-correlation as discussed above.
• Let's Face It (LFI) [38]: SOTA interlocutor-aware 3D avatar generation, re-trained on our data. Details in Appendix D.
• Random Expression: A walk over the 3DMM space; returns a random face at each timestep.
• Ours Random Walk: A walk over codebook indices.

Table 1 shows our proposed method outperforms all other competing methods across a variety of metrics. Overall, Ours achieves the best balance of performance across the various metrics. Rather than evaluating on L2 performance alone, our full suite of metrics provides a well-rounded view of the qualities of good listeners. For instance, while Median performs competitively against Ours on L2, it suffers in terms of motion diversity (variation, SI). As a result, this baseline produces less realistic listeners, as noted by our realism metrics (FD, P-FD). However, more variation in the facial gestures is not necessarily better. While NN motion, NN audio, and Random produce diversity similar to real motion, the expression synchrony (PCC) for these baselines is severely lacking. The incongruous listeners hinder the realism of the dyad as a whole (P-FD).
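The time-lagged cross-correlation used in the synchrony analysis above can be sketched directly from its description: shift the speaker forward by x frames, correlate with the listener, and take the lag with peak correlation. The toy listener below, which echoes the speaker 17 frames later, is an illustrative construction, not data from the paper.

```python
import numpy as np

def tlcc(speaker, listener, max_lag=60):
    """Time-lagged cross-correlation: for each lag x in [0, max_lag] frames,
    shift the speaker forward by x and correlate with the listener. The
    argmax lag estimates the listener's typical response delay."""
    corrs = []
    for x in range(max_lag + 1):
        s = speaker[: len(speaker) - x] if x else speaker
        l = listener[x:]
        corrs.append(np.corrcoef(s, l)[0, 1])
    return int(np.argmax(corrs)), corrs

# toy dyad: a listener that echoes the speaker 17 frames (~0.5 s) later
rng = np.random.default_rng(0)
speaker = rng.standard_normal(600)     # e.g., a 1D lip-curvature series [24]
listener = np.roll(speaker, 17)
lag, corrs = tlcc(speaker, listener)
assert lag == 17
```

Applied to a 1D signal such as lip curvature, this recovers the ≈ 17-frame (≈ 0.5 s) response delay that the paper reports for both ground truth and its predictions.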
That said, a mime that mirrors the speaker, like Mirror and Mirror Delay, looks uncanny due to excessive variation and synchrony. Ours delicately balances realism, diversity, and synchrony.

The weaker performance of LFI [38] demonstrates the advantages of our approach. LFI [38] was far less robust when re-trained on our in-the-wild data. Unable to learn realistic listener motion, LFI [38] exhibits worse realism (FD, P-FD). Even when evaluated on the LFI [38] dataset, ours outperforms. These results and visual comparisons are in Appendix D.

Additionally, we quantitatively demonstrate a major advantage of our method's VQ-VAE in learning a robust and realistic manifold of listener motion. Ours Random Walk is competitive against Random, where we sample full sequences of real motion. It significantly outperforms Random Expression, where we randomly sample static expressions and rotations at each timestep. This demonstrates that random walks along the codebook still produce realistic motion, though it may not be in sync with the speaker. Finally, the average TLCC calculated for GT and Ours were both ≈ 17 frames, reflecting an average listener response time of ≈ 0.5 s. As mentioned above, we use this response time as the optimal delay for the Mirror Delay baseline. See Appendix D for the full analysis.

Table 2. Ablations. Effect of ablating key components of our method. ↓ indicates lower is better; for no arrow, closer to GT is better. CA denotes cross-attention. We bold best performances that are statistically significant. For FD and P-FD, results are shown in the units indicated above.

Model Ablations. Table 2 quantifies the contributions of each component of our method. In NoVQ a+m, we remove the VQ-VAE and use raw listener motion as the input and output representations. NoVQ a+m produces unrealistic, overly smoothed sequences. Adding the VQ-VAE gives a significant performance boost, which further confirms the importance of the codebook in generating realistic motion.
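The vector-quantization step at the heart of this ablation (Sec. 3.2) can be sketched in numpy. This forward-pass sketch omits the straight-through gradient trick of [64]; note that the codebook and commitment losses are numerically identical here and differ only in which side the stop-gradient is applied to during training.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoded element to its nearest codebook entry, q(z_i) =
    argmin_k ||z_i - z_k||, and compute the VQ-VAE codebook/commitment
    loss values (gradient routing via sg[.] is omitted in this sketch)."""
    # squared distances between encoded elements (tau, d_z) and entries (K, d_z)
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)                        # codebook indices s_1:tau
    z_q = codebook[idx]                           # quantized sequence
    codebook_loss = ((z_e - z_q) ** 2).mean()     # ~ ||sg[E(x)] - z_q||^2
    commit_loss = codebook_loss                   # same value; sg[.] differs
    return z_q, idx, codebook_loss, commit_loss

rng = np.random.default_rng(0)
tau, K, d_z = 8, 200, 256                         # sizes from the paper
z_e = rng.standard_normal((tau, d_z))             # stand-in encoder output E(x)
Z = rng.standard_normal((K, d_z))                 # codebook entries
z_q, idx, cb_loss, cm_loss = quantize(z_e, Z)
assert z_q.shape == (tau, d_z) and idx.shape == (tau,)
```

Because every output element is forced onto a codebook row, the decoder only ever sees embeddings from the learned motion manifold, which is the property the NoVQ ablation removes.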
Furthermore, we demonstrate that utilizing both audio and motion as input (a+m, via concatenation) slightly improves performance over using just one or the other (a and m). However, Ours achieves a more substantial improvement by combining both modalities via cross-attention (CA). See Appendix D for details of the ablation architectures. To corroborate our quantitative results and gain insight into how our synthesized listeners perceptually compare to real motion, we conducted an A/B test on Amazon Mechanical Turk. Since all quantitative trends were consistent across all listener identities, we randomly chose a single identity for the evaluation. We visualized listener motion using videos of grayscale 3D facial meshes. Participants watched a series of video pairs. In each pair, one video was generated from our model; the other was produced by an ablation or a baseline. Participants were then asked to identify the video containing the listener that looks like it is listening and paying more attention to the speaker. Each video was 8 seconds long at a resolution of 849 × 450 (downsampled from 1132 × 600 so that two vertically stacked videos fit on different screen sizes), and after each pair, participants were given unlimited time to respond. Since the most tell-tale moments for whether a listener is truly listening occur during defining moments (the speaker tells a joke, shares a sad story, etc.) that elicit strong responses, we manually curated such notable moments from our held-out test data. We then randomly sampled 50 of these sequences and predicted a corresponding 3D listener facial motion sequence using each method. For every test sequence, each A/B comparison was made by 3 evaluators. We compared our strongest baseline NN motion and ablation a+m against our proposed model and recorded the percentage of times our method was preferred over the baseline models, or vice versa. Ours significantly outperformed both.
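The significance of such pairwise preference rates can be checked against the 50% chance level with a two-sided binomial test. This is a stdlib sketch under illustrative counts (a preference rate around 75% of 150 judgments), not an analysis reported in the paper:

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Two-sided binomial-test p-value for k successes in n trials."""
    pmf = lambda i: comb(n, i) * p**i * (1 - p)**(n - i)
    observed = pmf(k)
    # Sum the probability of every outcome at least as extreme.
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= observed + 1e-12)

# e.g., ~113 of 150 judgments preferring one method is far from chance.
p = binom_two_sided_p(113, 150)
```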
75.3% of the total 150 evaluators preferred Ours over NN, and 71.1% preferred Ours over a+m. These statistics reflect the quantitative trends in Table 2. Furthermore, in a comparison against avatars rendered from ground-truth listeners, evaluators preferred Ours 50.1% of the time. This highlights the perceptual realism of our predicted listener motion. In this work, we explored the synchronicity of motion between a speaker and a listener. To this end, we employ a motion-audio cross-attention transformer to handle the multiple modalities of speaker inputs. Furthermore, we enable non-deterministic motion synthesis with a VQ-VAE. Trained on a novel, in-the-wild dataset of dyadic conversations, our method autoregressively outputs convincing 3D listener facial motion that correlates with a given speaker. While videotaped teleconferencing data lends itself to data collection, it has inherent limitations (e.g., no eye contact, time delays introduced by remote connections, etc.). A future direction would be to apply this study to in-person conversations, which would allow us to incorporate gaze. Furthermore, as we only model listener motion in response to a speaker, modeling the full dyadic cycle of back-and-forth effects remains future work. While our goal is to understand conversational dynamics, we discuss concerns about misuse of this technology in the Appendix. Please see the Appendix for per-listener results, implementation details, ablation architectures, multiple-mode output evaluation, etc. Acknowledgements. The authors would like to thank Justine Cassell, Alyosha Efros, Alison Gopnik, Jitendra Malik, and the Facebook FRL team for many insightful conversations and comments; Dave Epstein and Karttikeya Mangalam for Transformer advice; and Ruilong Li and Ethan Weber for technical support.
The work of Ng and Darrell is supported in part by DoD, including DARPA's XAI, LwLL, Machine Common Sense and/or SemaFor programs, which also support Hu and Li in part, as well as BAIR's industrial alliance programs. Ginosar's work is funded by the NSF under Grant #2030859 to the Computing Research Association for the CIFellows Project. Parent authors would like to thank their children for the daily reminder that they should learn how to listen. Interactional Motion in Human Communication. Humans are able to enter into synchronous negotiations with others from early infancy [61]. This interaction primitive is so central to human communication that infants believe anything that behaves synchronously with them is an independent agent, even if it is not human-like in appearance [36]. The infant-mother dyad of affective synchrony of motion [48, 58] is of particular importance. Face-to-face infant-mother synchrony has long-term consequences: it can predict a child's temperament years later [19]. The study of face-to-face communication has classically been hindered not only by a lack of data, but also by a lack of computational methods that could analyze it. The earliest studies of motion patterns in dyadic conversations involved manual analysis of videotaped data [17, 41]. Condon and Ogston [17] describe an interactional synchrony in which the motion of the listener flows in rhythm with the speech and motion of the speaker. Kendon [41] extends their study of head motion to that of the upper body, manually transcribing motion patterns of speakers and listeners in video recordings. While several coding systems for interaction synchrony [16, 34, 62] were later developed, they all still required laborious manual annotation efforts. With the advent of modern computer vision, frame-differencing methods provided much-needed automation [52]. Several works computationally demonstrate the existence of distinctive interactional motion in face-to-face interactions. Riehle et al.
employ correlation analysis [6] on electromyography (EMG) signals of recorded muscle activations and suggest that people typically synchronize their smiles with those of their interlocutors within 1 second. Other methods for detecting synchrony use facial 2D [15] or full-body 3D [22] keypoint detectors. Our method leverages the existence of interactional synchrony to learn realistic facial motion of a listener in response to a speaker's speech and motion by training on a large dataset of in-the-wild conversations.
VQ-VAE Details. The VQ-VAE is composed of 3 convolutional layers of kernel size 5, stride 1, and padding 2. Each convolutional layer is followed by a max pooling operation. Given an input sequence of length 32, the convolutional layers produce a sequence of length 4 (window size w = 8).
Figure 5. Cross-modal transformer detail. We combine the different modalities of the speaker's input via a cross-modal transformer. Each modality first passes through a linear layer. Then, a stack of cross-attention blocks treats the audio as queries and the speaker's motion as the keys and values. Finally, 1D convolutions temporally downsample the output to match the temporal extent of the quantized listener input.
We optimize using Adam with a batch size of 32 and learning-rate warm-up steps. The train/val/test split is 70/20/10. We then use the frozen model downstream to quantize the listener inputs to the Predictor. At test time, the average L2 error induced by quantization on the test set is 11.32 on expression and 1.02 on rotation. Cross-modal Transformer Details. The cross-modal transformer takes as input the raw motion representation mentioned in the main text (Sec. 3.1). To process the audio, we use the audio processing library librosa to resample the audio to a sampling rate of 16000 and to obtain its mel-spectrogram. This results in a sequence 4 times longer than the motion sequence extracted at 30 fps.
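The 4× rate mismatch and its resolution by temporal pooling can be sketched as follows. This is a sketch with assumed feature dimensions, not the paper's code:

```python
import numpy as np

def align_audio_to_motion(audio_feats, motion_len):
    """Max-pool audio features down to the motion frame rate.

    audio_feats: (4*T, D) mel-spectrogram frames, roughly 4x the
                 number of 30 fps motion frames.
    motion_len:  T, the number of motion frames to align to.
    """
    T, D = motion_len, audio_feats.shape[1]
    factor = audio_feats.shape[0] // T            # = 4 here
    # Pool each group of `factor` audio frames into one motion-rate frame.
    pooled = audio_feats[: T * factor].reshape(T, factor, D).max(axis=1)
    return pooled                                  # (T, D)
```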
As a result, we apply max pooling to downsample the audio to a sequence length that matches the motion sequence. We feed the downsampled audio and the motion independently through a linear layer for each modality to obtain their respective projected embeddings. These are then fed into the cross-modal transformer shown in Figure 5. The cross-modal transformer has hidden size 1024, 8 heads, and 12 layers. Following the transformer are 3 convolutional layers of kernel size 5, stride 1, and padding 2. Each convolutional layer is followed by a max pooling operation that temporally downsamples the speaker embedding of length 32 to match the size of the listener embedding (τ = 4). The positional encoding of the transformer is learned. We first attempted not to downsample the speaker with convolutional layers, but feeding the full length-32 sequence into the Predictor as the speaker input resulted in empirically worse results across the board. We suspect this is due to overly long conditioning and a temporal mismatch between the speaker's much longer token sequence and the listener's shorter quantized sequence, which would require the network to learn the temporal alignment of the two different-length sequences on its own. Predictor Details. Our predictor is composed of a transformer with hidden size 200, 10 heads, and 5 layers. To convert the sequence of listener indices ∈ R^{τ×1} into an embedding ∈ R^{τ×d_k} that matches the size of the speaker embedding, we embed the indices using the Embedding function in PyTorch. We then concatenate the output of the cross-modal speaker embedding with the listener embedding to get a sequence ∈ R^{2τ×d_k}, which serves as input to the predictor. Since the transformer is a set-to-set operation, it outputs a sequence of length 2τ matching the input. During training we take the first 4 indices of the output and discard the remainder.
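The construction of the predictor's input can be sketched as follows. This is a numpy sketch of the shapes involved; the embedding table stands in for PyTorch's nn.Embedding, and the names are assumptions:

```python
import numpy as np

def build_predictor_input(listener_idx, speaker_emb, embed_table):
    """Concatenate embedded listener codes with the speaker embedding.

    listener_idx: (tau,) discrete codebook indices of past listener motion.
    speaker_emb:  (tau, d_k) cross-modal speaker embedding.
    embed_table:  (K, d_k) learned index-embedding matrix
                  (playing the role of nn.Embedding).
    """
    listener_emb = embed_table[listener_idx]            # (tau, d_k)
    seq = np.concatenate([speaker_emb, listener_emb])   # (2*tau, d_k)
    return seq

# The set-to-set transformer then maps (2*tau, d_k) -> (2*tau, d_k),
# and only the leading output positions are kept as index predictions.
```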
The additional 3 indices are used as a temporal regularizer, similar to [41]. We train the predictor jointly with the cross-modal transformer for 1000 epochs (≈ 12 hours on 8 GPUs) with a learning rate of 0.01 and 4,000 warm-up steps. At test time, we take only the first index. To facilitate autoregressive learning, we mask out the attention on portions of the past listener input. Given a listener sequence of length τ, we end up with a τ × τ attention matrix. During training, 50% of the time we sample a random number x ∈ [0, τ] and set the attention for rows above x, i.e., time-steps after x, negligibly low. For the remaining 50%, we do not mask anything out. At test time, we start by masking everything out and then gradually reduce the mask as we autoregressively predict the output. This ensures we only see the past listeners we predict. Visualization. To improve the visual perceptibility of our results, we also train a person-specific 3DMM-to-video translation network. We adopt the state-of-the-art video-to-video synthesis method [65] to translate the grayscale 3DMM renderings into full frames of a photo-realistic target video, in which the target listener mimics the facial expression and head motion of the grayscale visualization. Our network learns to simulate the static background and the entire listener: the face region is conditioned on the 3DMM rendering, while other components, such as hair and torso, are compiled with the head pose. The training data is extracted from a single video clip. The ground-truth targets are the frames of the listener in the video, and the sources are the renderings of the corresponding 3DMM predictions. Pursuant to guidance from our local IRB, we have determined that there is no non-public PII and no human subjects under 45 CFR 46 in our dataset. Dataset Preprocessing. All frames are extracted at 30 fps.
We pre-process the video data by automatically removing irrelevant segments and annotating listener versus speaker splits. While these interview videos often contain views of both the host and the guest in a split-screen format, the view often switches back to a single individual, or to an inserted picture. We employed an off-the-shelf facial detection method (DECA [Feng et al. 2021]) to automatically find all relevant segments of two conversing individuals, e.g., removing parts of the video with fewer than two faces detected, with a face detected in the center of the screen and an extraneous false positive in the background, or with a face that was static for an extended period. Furthermore, to minimize the noisiness of the extracted pseudo-ground-truth annotations, we removed frames where most of the 2D keypoints estimated by DECA were missing, as well as sequences with sudden extreme movement. To ensure that we feed speaker segments as input and use listener segments as pseudo-ground-truth labels, we additionally extract listener versus speaker splits for each video. We automatically detect the splits by using active speaker identification [Chung et al. 2016] in conjunction with sound source separation [Owens et al. 2018]. We found that using either method independently led to noisy speaker predictions, for different reasons; combining the two in a voting-based approach led to more reliable splits. We removed sequences where both individuals speak at the same time, but kept sequences where the listener utters a few words (e.g., to show agreement). Ablation Architectures. In Table 2 of the main paper, we perform multiple ablations to test the validity of our method's components. We further describe and sketch out the ablation architectures as shown in Figure 6. LFI [Jonell et al. 2020] Implementation Details. Since we could not get access to their trained model or their video dataset, we needed to retrain their method on our video dataset.
Training on our dataset also allows us to compare both methods more fairly, since there is a large distribution shift between their dataset and ours. Rather than changing their model, we use the same setup provided (inputs, configs, etc.) and simply retrain on our dataset. We use the config including both audio and motion for the listener and speaker, which may provide their model with slightly more context, as described in Section 2 of the main paper. We use the same optimizer, learning rates, and training schedule as defined by the authors of the paper. We train the model for 3 days on 8 GPUs.
Figure 6. Ablation architectures (cf. Table 2). a) NoVQ a+m: We remove the VQ-VAE component, taking the raw listener as input and outputting the raw listener. b) m / c) a: We take as input only the motion m or the audio a from the speaker, replacing the cross-modal transformer with a normal transformer in the single-modal case. d) a+m: Rather than using a cross-modal transformer, we simply pass each modality through a transformer and perform fusion via concatenation.
Please note that we excluded two of our listeners because their videos include guest hosts who serve the listener role in many videos; we therefore did not have enough data to learn a person-specific model from the remaining data. However, we will still release these portions of the data. The decision to train different models per person comes from the idea that each person has a characteristic way of listening. By training person-specific models, we can capture these more fine-grained characteristic details of listener motion, which is difficult to do when considering all identities at once. Listener-agnostic modeling. While in the paper we focus on listener-specific models, experiments show that our method can be extended to new listeners. In a listener-agnostic setup, we train across many listeners and test on held-out listeners and speakers. All baselines in Tab.
1 are performed on listener-specific data, but when recomputed in an agnostic setup, the relative ordering of baselines does not change, with Ours-agnostic first (FD: 30.01, P-FD: 31.36) and NN motion-agnostic second best (FD: 57.29, P-FD: 57.91) (lower is better). Even in a listener-agnostic setup, our approach outperforms existing baselines. While our person-agnostic model does well, person-specific modeling better captures individualistic listening styles and mannerisms [20]. Evaluation on the LFI [38] dataset. As [38] only provides facial annotations (no audio/video), we did not include an evaluation against this data in the main paper. Still, we can use their data in a motion-only setting with no speaker audio input. As the models for LFI are not publicly available, we retrained LFI and our ablated model m using their facial annotations and train/test splits. On their data, our m (FD: 1.88, P-FD: 2.12, var: 1.82) outperforms LFI (FD: 2.97, P-FD: 3.10, var: 1.01). Metrics Details. Here, we further define our metrics. We denote x ∈ R^{N×T×F} as the input speaker motion sequence, y ∈ R^{N×T×F} as the ground-truth listener motion sequence, and ŷ ∈ R^{N×T×F} as our prediction. N is the number of test sequences, T the length of each sequence, and F the feature dimension.
• FD: the Fréchet distance between real and predicted listener motion, FD(y, ŷ) = ||μ_y − μ_ŷ||² + Tr(Σ_y + Σ_ŷ − 2(Σ_y Σ_ŷ)^{1/2}), where μ is the mean and Σ is the covariance matrix.
• variation: We calculate the variance along the axis representing the temporal extent T of the sequence. We then average over N and F.
• SI: We empirically perform k-means clustering to cluster the expression and rotation separately, as defined in the main paper. We then compute the entropy (Shannon index) of the cluster-ID histogram of all the samples. We report the entropy in the tables.
• P-FD: We use the exact same equation as FD above, but rather than using y and ŷ, we replace them with the concatenations [y, x] and [ŷ, x] respectively.
• PCC: The correlation coefficient is given as PCC(x, y) = Σ_t (x_t − x̄)(y_t − ȳ) / ( √(Σ_t (x_t − x̄)²) · √(Σ_t (y_t − ȳ)²) ), where t ∈ [0, T] is the timestep, and x̄ denotes the mean of x.
Table 4. Person B.
Comparison against ground-truth annotations (GT) on in-the-wild data. ↓ indicates lower is better; for no arrow, closer to GT is better. We bold best performances that are statistically significant. For FD and P-FD, results are shown in the units indicated above.
Table 6. Person D. Comparison against ground-truth annotations (GT) on in-the-wild data. ↓ indicates lower is better; for no arrow, closer to GT is better. We bold best performances that are statistically significant. For FD and P-FD, results are shown in the units indicated above.
Figure 7. For each method, we feed in the same 256-length test sequences and sample x ∈ [1, 200] output listener motion sequences. From the x sampled trajectories, we calculate the average minimum L2 distance to ground truth. For NN, we sample the top 200 64-length sequences. The loss from quantizing GT is shown (L2 = 17.13). While Ours and a+m start at around the same L2, Ours reaches a lower L2 in fewer samples.
Multiple Modes of Output Analysis. The benefits of our model are further shown in Figure 7, which tests how well each method captures the distribution of listener-speaker dynamics in the dataset. For each method, we feed in the same 256-length test sequences and sample x ∈ [1, 200] output listener motion sequences. From the x sampled trajectories, we calculate the average minimum L2 distance to ground truth. For NN, we sample the top 200 64-length sequences. The loss from quantizing GT is shown (L2 = 17.13). Ours requires fewer sampling steps to achieve motion that is closer to the actual ground-truth listener motion. While a+m starts out at a similar L2, it takes far more samples to reach a lower L2. This indicates that our method better models the distribution of actual listener-speaker dynamics seen in the dataset. TLCC Analysis.
As mentioned in the paper, we perform a time-lagged cross correlation analysis to get a sense of the leader-follower relationship between the speaker and listener, as well as to analytically find the optimal delay for our Delayed Mirror baseline. We provide the full analysis for all of the baselines in Table 7. The results demonstrate that the difference between the measured TLCC of GT and Ours is not significant. While our proposed framework is intended solely for improving the naturalness of next-generation human-machine interaction systems, such as virtual assistants and other entertainment applications, our AI-synthesized video output component produces highly realistic-looking humans that can be confused with a real person. Similar to many advanced video synthesis techniques (such as face-swap deepfakes and facial reenactment algorithms), this pipeline could potentially be misused by malicious actors, as the results can be highly convincing and the algorithm is easy to replicate as long as the data is available. For instance, a fake conversation between two people could be constructed even though it never occurred, or a fake video-chat participant could be used to attend an unauthorized meeting. While it can be difficult to fully prevent the misuse of such technology, we believe that this paper can help raise awareness of such capabilities, and we advocate for the safe use of video-synthesis techniques, for example using watermarks labeling the content as 'synthesized'. Furthermore, we believe that our speaker-to-listener translation approach and prediction models can inspire future video-manipulation detection algorithms to look for additional cues in the listener's performance.
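For reference, the time-lagged cross-correlation used in this analysis can be sketched as follows. This is a minimal single-feature sketch, not the paper's code:

```python
import numpy as np

def tlcc(speaker, listener, max_lag=60):
    """Return the listener lag (in frames) that maximizes the Pearson
    correlation between the speaker signal and the shifted listener."""
    T = len(speaker)

    def pcc(a, b):
        a = a - a.mean()
        b = b - b.mean()
        return float((a * b).sum() /
                     (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + 1e-8))

    # Correlate speaker[:T-lag] with listener[lag:] for each candidate lag.
    corrs = [pcc(speaker[: T - lag], listener[lag:])
             for lag in range(max_lag + 1)]
    return int(np.argmax(corrs))
```

Applied per feature dimension and averaged, this is the procedure that yields the ≈17-frame (≈0.5 s) listener response time reported above.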
References
[1] No gestures left behind: Learning relationships between spoken language and freeform gestures
[2] To react or not to react: End-to-end visual pose forecasting for personalized avatar during dyadic conversations
[3] Structured prediction helps 3D human motion modelling
[4] A morphable model for the synthesis of 3D faces
[5] Facilitating multiparty dialog with gaze, gesture, and speech
[6] Windowed cross-correlation and peak picking for the analysis of variability in the association between behavioral time series
[7] IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation
[8] FaceWarehouse: A 3D facial expression database for visual computing
[9] Long-term human motion prediction with scene context
[10] Animated conversation: Rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents
[11] The power of a nod and a glance: Envelope vs. emotional feedback in animated conversational agents
[12] The chameleon effect: The perception-behavior link and social interaction
[13] Synchronized affect in shared experiences strengthens social connection
[14] A face-to-face neural conversation model
[15] Unsupervised synchrony discovery in human interaction
[16] Mother-infant face-to-face interaction: The sequence of dyadic states at 3, 6, and 9 months
[17] Sound film analysis of normal and pathological behavior patterns
[18] Taming transformers for high-resolution image synthesis
[19] Mother-infant affect synchrony as an antecedent of the emergence of self-control
[20] Learn2Smile: Learning non-verbal interaction through observation
[21] Learning an animatable detailed 3D face model from in-the-wild images
[22] A reduced-dimensionality approach to uncovering dyadic modes of body motion in conversations
[23] Learning individual styles of conversational gesture
[24] A century of portraits: A visual historical record of American high school yearbooks
[25] Virtual rapport. In International Workshop on Intelligent Virtual Agents
[26] Predicting head pose in dyadic conversation
[27] Automated video analysis of non-verbal communication in a medical setting
[28] MoGlow: Probabilistic and controllable motion synthesis using normalising flows
[29] CNN architectures for large-scale audio classification
[30] GANs trained by a two time-scale update rule converge to a local Nash equilibrium
[31] The curious case of neural text degeneration
[32] Virtual rapport 2.0
[33] DyadGAN: Generating facial expressions in dyadic interactions
[34] Maximally discriminative facial movement coding system
[35] Perceiver IO: A general architecture for structured inputs & outputs
[36] Whose gaze will infants follow? The elicitation of gaze-following in 12-month-olds
[37] Learning non-verbal behavior for a social robot from YouTube videos
[38] Let's face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic settings
[39] Towards social artificial intelligence: Nonverbal social signal prediction in a triadic interaction
[40] Believable virtual characters in human-computer dialogs
[41] Movement coordination in social interaction: Some examples described
[42] Generative flow with invertible 1x1 convolutions
[43] Nonverbal synchrony and rapport: Analysis by the cross-lag panel technique
[44] Learning to generate diverse dance motions with transformer
[45] Learn to dance with AIST++: Music conditioned 3D dance generation
[46] Learning a model of facial shape and expression from 4D scans
[47] ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks
[48] Automated measurement of facial expression in infant-mother interaction: A pilot study
[49] Body2Hands: Learning to infer 3D hands from conversational gesture body dynamics
[50] Interactive generative adversarial networks for facial expression generation in dyadic interactions
[51] Audio-visual scene analysis with self-supervised multisensory features
[52] Frame-differencing methods for measuring bodily synchrony in conversation
[53] A 3D face model for pose and illumination invariant face recognition
[54] Improv: A system for scripting interactive actors in virtual worlds
[55] Action-conditioned 3D human motion synthesis with transformer VAE
[56] Nonverbal synchrony of head- and body-movement in psychotherapy: Different signals have different associations with outcome
[57] Quantifying facial expression synchrony in face-to-face dyadic interactions: Temporal dynamics of simultaneously recorded facial EMG signals
[58] Infants time their smiles to make their moms smile
[59] Behavioural facial animation using motion graphs and mind maps
[60] A conversational agent framework with multi-modal personality expression
[61] Communication and cooperation in early infancy: A description of primary intersubjectivity. Before Speech (Cambridge)
[62] Monadic phases: A structural descriptive analysis of infant-mother face to face interaction
[63] Multimodal transformer for unaligned multimodal language sequences
[64] Neural discrete representation learning
[65] Video-to-video synthesis
[66] Feedback delays can enhance anticipatory synchronization in human-machine interaction
[67] Spacetime faces: High-resolution capture for modeling and animation
[68] Generating 3D people in scenes without people
[69] State of the art on monocular 3D face reconstruction, tracking, and applications