¹The University of Tokyo  ²Keio University  ³Digital Human Lab, Huawei Technologies  ⁴Japan Advanced Institute of Science and Technology
We present a new conversational gestures dataset (BEAT) together with a cascaded motion network (CaMN) model as a baseline for synthesizing realistic, vivid, and human-like conversational gestures. BEAT contains 76 hours of 3D motion captured with a motion-capture system, paired with 52-dimensional facial blendshape weights, audio, text, and annotations of semantic relevancy and emotion categories.
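For orientation, the sketch below shows one way the synchronized modalities of a single BEAT frame could be represented; the field names, shapes, and types are illustrative assumptions, not the released data format.

```python
# A hypothetical per-frame record for BEAT; field names, shapes, and types
# are assumptions for illustration only, not the released data format.
from dataclasses import dataclass
from typing import List

@dataclass
class BeatFrame:
    joint_rotations: List[float]      # 3D body and finger motion from the mocap system
    blendshape_weights: List[float]   # 52 facial blendshape weights
    audio_chunk: List[float]          # audio samples aligned to this frame
    word: str                         # time-aligned text token
    emotion: str                      # one of the eight emotion categories
    semantic_score: int               # 0-10 semantic-relevancy score (see below)
```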
To develop and evaluate the semantic relevancy between gestures and speech content, we provide a score and category label for each frame: no gesture (0), beat gestures (1), low-middle-high quality deictic gestures (2-3-4), iconic gestures (5-6-7), and metaphoric gestures (8-9-10). The definition of gesture categories follows Four-Type-Gestures. The annotation tool is based on VGG Image Annotator and BABEL, and the annotation examples shown in the video above are available in Annotation Tools.
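As a quick reference, here is a minimal sketch (not part of the official BEAT tooling) that decodes a frame-level score into its gesture category and quality level; applying the low/middle/high subdivision to iconic and metaphoric gestures is an assumption inferred from the three-score ranges above.

```python
# Minimal sketch: decode the 0-10 frame-level annotation into a gesture
# category and quality level. Extending the low/middle/high split to iconic
# and metaphoric gestures is an assumption, not stated in the text above.
from typing import Optional, Tuple

def decode_semantic_label(score: int) -> Tuple[str, Optional[str]]:
    if score == 0:
        return "no gesture", None
    if score == 1:
        return "beat", None
    if not 2 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    quality = ("low", "middle", "high")[(score - 2) % 3]
    category = ("deictic", "iconic", "metaphoric")[(score - 2) // 3]
    return category, quality

print(decode_semantic_label(6))  # -> ('iconic', 'middle')
```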
Here we show the raw captured gesture data with 8 emotions for each speaker. The emotion categories are neutral, happiness, anger, sadness, contempt, surprise, fear, and disgust. We also render all data in All Rendered Videos.
BEAT contains recordings in English (60h), Chinese (16h), Spanish (2h), and Japanese (2h). All non-English speakers also have English recordings with the same text content for comparison.
Half of the BEAT dataset consists of Self-Talk recordings (reading predefined text), which are designed to explore the personalities of different speakers given the same speech content. We list below a representative sample of six videos that include speakers of different ethnicities and genders. Each recording is one minute long, with 118 recordings in total.
The other half of the BEAT dataset consists of Conversation recordings (chatting with a director). Each recording is 10 minutes long, with 12 recordings in total.
We thank Hailing Pi for communicating with the recording actors of the BEAT dataset. The website is inspired by the template of pixelnerf.
Licensed under the Non-commercial license.