BEAT

A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis

ECCV 2022

¹The University of Tokyo ²Keio University ³Digital Human Lab, HuaWei Technologies ⁴Japan Advanced Institute of Science and Technology

We present a new conversational gesture dataset (BEAT), together with a cascaded motion network (CaMN) as a baseline model for synthesizing realistic, vivid and human-like conversational gestures. BEAT contains 76 hours of 3D motion from a motion capture system, paired with 52D facial blendshape weights, audio, text, semantic relevancy annotations and emotion category annotations.

Paper Overview

Semantic Relevancy Annotations

To develop and evaluate the semantic relevancy between gestures and speech content, we provide a score and category label for each frame: no gestures (0), beat gestures (1), low-middle-high quality deictic gestures (2-3-4), iconic gestures (5-6-7), metaphoric gestures (8-9-10). The definition of gesture categories is from Four-Type-Gestures. The annotation tool is based on VGG Image Annotator and BABEL, and examples in the above video are in Annotation Tools.
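The score layout above can be decoded with a few lines of Python. This is only a minimal sketch: the function name `decode_semantic_score` and the label strings are hypothetical and not part of any BEAT tooling, and it assumes that the low-middle-high quality levels apply to the iconic and metaphoric ranges in the same way as to the deictic range.

```python
from typing import Optional, Tuple

# First score of each three-level gesture category, as listed in the text.
# The label strings are illustrative, not official BEAT identifiers.
_CATEGORY_BASE = {"deictic": 2, "iconic": 5, "metaphoric": 8}

# Assumption: each three-score range is ordered low, middle, high.
_QUALITY = ("low", "middle", "high")


def decode_semantic_score(score: int) -> Tuple[str, Optional[str]]:
    """Map a per-frame score (0-10) to (category, quality).

    Quality is None for the two categories that have a single score.
    """
    if score == 0:
        return ("none", None)
    if score == 1:
        return ("beat", None)
    for category, base in _CATEGORY_BASE.items():
        if base <= score <= base + 2:
            return (category, _QUALITY[score - base])
    raise ValueError(f"score must be in 0..10, got {score}")
```

For example, `decode_semantic_score(4)` yields `("deictic", "high")`, and any value outside 0 to 10 raises a `ValueError`.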

Emotional Gestures

Here we show raw captured gesture data with 8 emotions for each speaker. The emotion categories are neutral, happiness, anger, sadness, contempt, surprise, fear and disgust. We also render all data in All Rendered Videos.

Multi-Language Data

BEAT contains recordings in English (60h), Chinese (16h), Spanish (2h) and Japanese (2h). All non-English speakers also have English recordings with the same text content for comparison.

30 Speakers with Self-Talk and Conversation Recording

Half of the BEAT dataset consists of Self-Talk recordings (reading predefined text), which are intended for exploring the personality of different speakers given the same speech content. We list below a representative sample of six videos that include speakers of different ethnicities and genders. Each recording is one minute long, with 118 recordings in total.

The other half of the BEAT dataset consists of Conversation recordings (chatting with a director). Each recording is 10 minutes long, with 12 recordings in total.

Bibtex

@article{liu2022beat,
  title={BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis},
  author={Liu, Haiyang and Zhu, Zihao and Iwamoto, Naoya and Peng, Yichen and Li, Zhengqing and Zhou, You and Bozkurt, Elif and Zheng, Bo},
  journal={arXiv preprint arXiv:2203.05297},
  year={2022}
}

More Thanks

We thank Hailing Pi for communicating with the recording actors of the BEAT dataset. The website is inspired by the template of pixelnerf.

Licensed under a non-commercial license.