key: cord-0451521-8io1sh0n authors: Kim, Kangyeol; Park, Sunghyun; Lee, Jaeseong; Chung, Sunghyo; Lee, Junsoo; Choo, Jaegul title: AnimeCeleb: Large-Scale Animation CelebFaces Dataset via Controllable 3D Synthetic Models date: 2021-11-15 journal: nan DOI: nan sha: df78143f9871d54403b484e4d7fad316b0c2eafc doc_id: 451521 cord_uid: 8io1sh0n

Despite remarkable success in deep learning-based face-related models, these models are still limited to the domain of real human faces. On the other hand, the domain of animation faces has been studied less intensively due to the absence of a well-organized dataset. In this paper, we present a large-scale animation celebfaces dataset (AnimeCeleb) via controllable synthetic animation models to boost research on the animation face domain. To facilitate the data generation process, we build a semi-automatic pipeline based on an open 3D software and a developed annotation system. This leads to constructing a large-scale animation face dataset that includes multi-pose and multi-style animation faces with rich annotations. Experiments suggest that our dataset is applicable to various animation-related tasks such as head reenactment and colorization.

Over the years, animation characters have walked along with human beings, acting as beloved friends and giving emotional comfort to many people in daily life. Along with their popularity, animation characters have expanded their roles, and are being widely adopted from the entertainment industry to the marketing field and even for educational purposes. Moreover, recent advances in computer vision and graphics have further expedited the extensive spread of characters, paving the way for individual creators to easily design their own characters using 3D software (e.g., Blender 1 and Maya 2) and showcase their work on public online platforms. Versatile features such as customizable textures and morphing operations of 3D models have opened a new door for animation characters to be applied to broad real-world situations.

In this paper, we focus on constructing a 2D animation face dataset, aiming to tackle animation-related tasks including animation head reenactment [17, 30, 32, 37, 41, 42] and colorization [19, 21, 22, 46]. For this purpose, we utilize the collected user-generated 3D animation models as a data factory, where both rendered images and corresponding rich annotations such as various morph attributes and rendering styles can be easily sampled. To utilize the 3D models efficiently, we build a semi-automatic pipeline in which both manipulated 2D face images and the corresponding controlled attributes are obtained using Blender, an open-source computer graphics software. With the aid of the pipeline, we construct a large-scale animation face dataset, namely AnimeCeleb, which is applicable to existing 2D image-based methods in the animation domain [1-3, 5, 14, 31, 47].

Existing public 2D animation face datasets [2, 5, 47] have relied on web crawling during dataset construction. This inevitably leads to two problems: (i) a lack of detailed annotations accompanying the dataset, requiring additional post-processing to acquire further annotations, and (ii) inconsistency of the dataset due to excessive variance in drawing styles and scenes. Considering the data dependency of deep learning, these problems are the main obstacles that restrict the use of 2D face datasets to rather limited tasks. We show that exploiting the power of 3D software and 3D animation models can be an effective solution to these problems.
First, detailed annotations such as facial expressions and head rotation can be easily obtained because each image is generated from a known manipulation. Second, we are able to fully control the sampling environment, where multi-pose images with the same identity are available. As can be seen in Table 1, compared with existing public animation face datasets [2, 5, 47], AnimeCeleb relies on 3D model rendering to construct the animation face dataset. This ensures the production of a large-scale dataset that contains detailed annotations as well as multi-pose images with the same identity. In addition, AnimeCeleb contains a variety of styles in consideration of different manners of drawing. The data generation process involves 3D animation model collection, semantic annotation, and pose-conditional image sampling. To be specific, during semantic annotation, we develop an annotation mapping tool to match, filter, and group the unorganized morphs of the 3D animation models. We employ Blender to process script-level commands for visualization and automatic image rendering. To reveal applicable tasks using AnimeCeleb, we implement four animation-related tasks: animation head reenactment, animation colorization, animation image-to-image translation, and background/foreground harmonization, and show promising results on in-domain as well as out-of-domain animation face samples.

In summary, our contributions to the domain of animation research are as follows:
• We present a public large-scale animation face dataset called AnimeCeleb, which contains a set of high-quality images and corresponding rich annotations.
• We propose an efficient and novel data generation pipeline to extract animation face images from 3D animation models.
• By employing AnimeCeleb, we show various applicable tasks, such as animation head reenactment and colorization in the animation domain, demonstrating the usefulness of AnimeCeleb.

Animation Face Dataset Although there are abundant animation images online, collecting and annotating an animation character face dataset is not an easy task because open-source face detectors used during web crawling can malfunction due to the extreme complexity of animation textures. Due to this issue, only a few animation datasets [2, 5, 14, 31, 47] have been released to the community. However, existing datasets collected from unlisted online sources remain unorganized and noisy, narrowing their applicability for developing data-driven models. For example, previous work on head reenactment [7, 17, 32, 37, 41, 42] in recent years requires two frames extracted from the same video for training, which is infeasible using existing animation datasets. Departing from previous dataset collection approaches, we propose to utilize the power of synthetic 3D animation models as a dataset generator. A similar approach [3] has been proposed for commercial use, employing 3D animation models to construct an animation face dataset, but it was not released to the community. To the best of our knowledge, ours is the first publicly available animation face dataset that contains multiple aligned images with consistent texture. Specifically, AnimeCeleb is distinguishable from the previous approach [3] in terms of its data generation pipeline and diverse rendering styles.

Synthetic Human Face Dataset Prior to AnimeCeleb, there were numerous attempts to generate synthetic human face data using 3D human models.
A common approach is to take advantage of a 3D Morphable Model (3DMM) [4], a controllable parametric face model that can provide diverse face shapes. Early studies focused on using 3DMMs to construct parts of the face, including the eye region [35] or hockey masks [43]. Recent work [39] proposes a method to synthesize a realistic and diverse face training dataset by procedurally combining the 3D parametric face model with artist-created assets (e.g., hair, clothing). Our proposed dataset shares a similar spirit with previous work on synthetic human faces: a 3D graphics pipeline is used to construct a large-scale face dataset.

Vision for Animation Rapid development in computer vision and graphics has led to research progress in the animation field. Early studies focused on recognizing and detecting [28, 36, 44] the characters appearing in animation scenes, verifying the effectiveness of neural networks in processing the given images. Recently, with the advancement of generative modeling, many studies have been conducted on enhancing and generating animation contents. For example, animation colorization has been widely studied [15, 23, 29] for practical applications, aiming to reduce the human labor and time required for colorization. Furthermore, generative adversarial networks [16] promote the development of deep learning models for generative tasks in the animation field, e.g., style transfer [8, 9], image generation [20, 40] and video interpolation [34]. We believe that AnimeCeleb is able to boost the research progress on various tasks in the animation domain.

In this section, we first describe each step of the data generation process in Section 3.1. Next, details of AnimeCeleb, including dataset properties and dataset statistics, are given in Section 3.2.

3D Animation Model Collection (A.1) We collected 3D animation models (i.e., PMX files) from two different websites, DeviantArt 3 and Niconi solid 4, through manual downloading and crawling. Because all models are copyrighted by their creators, we carefully confirmed the scope of rights and obtained permission from the authors we could reach. Finally, we acquired 3613 usable 3D animation models in total after filtering out models with inappropriate appearances.

3D Animation Model Descriptions (A.2) The collected 3D animation models contain not only full-body information of animation characters, such as 3D meshes, bones and texture components, but also morphs that can alter the appearances of the 3D models. Most morphs are associated with a particular part of the body; for example, some morphs affect the arm shape of a character. The intensity of a morph is adjustable within [0, 1]; thus, by assigning a scalar value to each morph, we can vary the related attributes of the 3D model (e.g., opening/closing the mouth or the eyes). In this paper, we focus on meaningful morphs concerned with facial expressions. Furthermore, the head angles (i.e., yaw, pitch and roll) are governed by applying a rotation matrix to the neck bone, enabling us to manipulate the head pose by specifying three Euler angles.
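To make these controls concrete, the following sketch (Python with NumPy) shows one way a single pose could be represented: scalar morph intensities in [0, 1] plus a head rotation matrix composed from the three Euler angles. The rotation order, the function name, and the morph names are illustrative assumptions rather than part of the released pipeline.

    import numpy as np

    def head_rotation_matrix(yaw, pitch, roll):
        """Compose a 3x3 rotation matrix from Euler angles given in degrees."""
        # The yaw (y-axis) -> pitch (x-axis) -> roll (z-axis) order is an assumption for illustration.
        y, p, r = np.radians([yaw, pitch, roll])
        Ry = np.array([[np.cos(y), 0.0, np.sin(y)],
                       [0.0,       1.0, 0.0      ],
                       [-np.sin(y), 0.0, np.cos(y)]])
        Rx = np.array([[1.0, 0.0,        0.0       ],
                       [0.0, np.cos(p), -np.sin(p)],
                       [0.0, np.sin(p),  np.cos(p)]])
        Rz = np.array([[np.cos(r), -np.sin(r), 0.0],
                       [np.sin(r),  np.cos(r), 0.0],
                       [0.0,        0.0,       1.0]])
        return Ry @ Rx @ Rz

    # A hypothetical pose for one model: scalar morph intensities plus a neck-bone rotation.
    pose = {
        "morphs": {"eye_close_left": 0.8, "mouth_a": 0.3},  # morph names are illustrative
        "neck_rotation": head_rotation_matrix(yaw=10.0, pitch=-5.0, roll=0.0),
    }

Applying the rotation matrix to the neck bone and writing the morph scalars into the model yields one manipulated face, which is what the rendering step below captures.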
Image Rendering (B) To automate the sampling of animation face images from a 3D animation model, we develop a 2D face image generation system built on Blender, an open-source 3D computer graphics software that supports the visualization, manipulation and rendering of 3D models. To successfully render the animation face images in Blender, we need to consider three aspects: (i) camera alignment, (ii) light condition, and (iii) image resolution. First, with the aim of capturing the face out of the full body without supervision, we detect a bone named neck (首 in Japanese) and align the camera along the y-axis so that it vertically faces the neck. This allows us to crop the face region without laborious camera position adjustment. Second, with respect to the light field during rendering, we use a directional light pointing along the negative y-axis with its intensity uniformly drawn from the range [0.3, 0.9]. Lastly, we set the resolution of the rendered image to 256 × 256; we also keep the alpha channel so that the fully transparent background is retained. The alpha channel can be utilized to separate the foreground and background, which is useful for tasks such as animation segmentation and harmonization.
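The fragment below is a rough sketch of how such a setup might look in Blender's Python API (bpy). The armature and bone lookups, the camera offset, and the light orientation are assumptions made for illustration, not the authors' released scripts; only standard bpy calls are used.

    import math
    import random
    import bpy

    scene = bpy.context.scene

    # Output settings: Eevee engine, 256x256 PNG with an alpha channel and a transparent background.
    scene.render.engine = 'BLENDER_EEVEE'
    scene.render.resolution_x = 256
    scene.render.resolution_y = 256
    scene.render.film_transparent = True
    scene.render.image_settings.file_format = 'PNG'
    scene.render.image_settings.color_mode = 'RGBA'

    # Directional (sun) light with intensity drawn uniformly from [0.3, 0.9].
    sun_data = bpy.data.lights.new(name="KeyLight", type='SUN')
    sun_data.energy = random.uniform(0.3, 0.9)
    sun_obj = bpy.data.objects.new(name="KeyLight", object_data=sun_data)
    sun_obj.rotation_euler = (math.radians(-90.0), 0.0, 0.0)  # aim roughly along the negative y-axis (axis convention assumed)
    bpy.context.collection.objects.link(sun_obj)

    # Camera placed on the +y side of the neck bone and constrained to face it.
    armature = bpy.data.objects["Armature"]  # hypothetical object name
    neck_world = armature.matrix_world @ armature.pose.bones["首"].head
    cam_data = bpy.data.cameras.new(name="FaceCamera")
    cam_obj = bpy.data.objects.new(name="FaceCamera", object_data=cam_data)
    cam_obj.location = (neck_world.x, neck_world.y + 3.0, neck_world.z)  # offset distance is an assumption
    track = cam_obj.constraints.new(type='TRACK_TO')
    track.target = armature
    track.subtarget = "首"
    track.track_axis = 'TRACK_NEGATIVE_Z'
    track.up_axis = 'UP_Y'
    bpy.context.collection.objects.link(cam_obj)
    scene.camera = cam_obj

    # Render a single face image with the transparent background preserved.
    scene.render.filepath = "/tmp/animeceleb_sample.png"
    bpy.ops.render.render(write_still=True)

Because the rendering engine is real-time (see the discussion of Eevee below), looping a script of this kind over sampled poses and shader settings scales to the full dataset.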
Semantic Annotation (C) Each 3D animation model has a significantly different number of morphs; while some animation models have no morphs at all, others have more than 100. Beyond that issue, morph naming conventions vary according to the creator, which makes it difficult to apply a standardized criterion to discover the accurate semantics of an individual morph. The goal of the semantic annotation step is to identify facial expression-related morphs and annotate them according to a semantically accurate and unified naming convention. Importantly, this enables us to apply a consistent pose sampling policy for all 3D animation models when sampling facial expression-related morphs. For example, when a morph あ of a 3D animation model is identified as a facial expression-related morph and annotated as altering the mouth shape, such as that for pronouncing the syllable 'ah', we can use this morph to manipulate the facial expression of the model. To achieve semantic annotation, we first define 23 target morphs, which are considered meaningful semantics for representing facial expressions. We denote the original morphs as source morphs in the remainder of this section. Fig 3-left shows examples of the target morphs, which include meaningful semantics for three parts: eyes, eyebrows, and mouth. Next, we attempt to match the source morphs with the target morphs via manual annotation. However, morph-to-morph annotation is very time-consuming considering the number of 3D animation models and source morphs. Fortunately, source morphs with identical names tend to have the same semantics. Therefore, we take a two-stage approach: group annotation and individual inspection. The former is used to annotate all identically named source morphs as a single target morph; the latter is responsible for inspecting each source morph to check whether it functions correctly. Individual inspection is conducted by hand and successfully reduces the erroneous matching caused by group annotation. For this, we first render images after applying the source morphs, together with neutral images, for all 3D animation models. Afterwards, we match each source morph with one of the target morphs and confirm the matching by comparing the neutral and morph-applied images for each morph of a 3D animation model in the newly developed annotation tool. We provide details of the defined target morphs and the annotation tool in the supplementary material.

Data Sampling (D) For sampling, randomly sampled target morphs for each part (i.e., eye, eyebrow and mouth) are applied to the 3D animation model. The magnitudes of the morphs are determined by sampling from a uniform distribution over [0, 1] independently. In addition, a 3D rotation matrix is computed using yaw, pitch and roll values sampled between -20° and 20°; the matrix is used to transform the character's head pose. All sampled values are saved as a pose vector p ∈ R^20 paired with the rendered image. In the cases of eyes and eyebrows, the expression parts of the pose vector are sampled from the target morphs {both-semantic, right-semantic, left-semantic} to determine the values of right-semantic and left-semantic. A detailed description of the pose sampling process is provided in the supplementary material. Blender provides a real-time rendering engine built on OpenGL 5, called Eevee, which enables us to produce 3D-character-style 2D images. We diversify the rendering effect by utilizing different types of shaders, as shown in Figure 2, to provide 2D images with more diverse textures. Since morphs and head rotation are applied separately, two types of partitions are included in the dataset: a set of frontalized images with expressions (frontalized-expression) and head-rotated images with expressions (rotated-expression). The number of images sampled from a 3D model is set differently depending on the number of annotated target morphs that the model has. When the 3D model has more than five annotated target morphs, we generate 100 images; if not, just 20 images are obtained for that model.
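A minimal sketch of this sampling policy is shown below. It assumes a layout of 17 expression dimensions plus three head angles for the 20-dimensional pose vector and ignores the per-part both/right/left selection rule described above; the exact layout and rule are documented in the supplementary material.

    import numpy as np

    rng = np.random.default_rng()

    def sample_pose(num_expression_dims=17):
        """Sample one pose vector: expression magnitudes in [0, 1] and head angles in [-20, 20] degrees."""
        # The 17 + 3 split is an assumed layout of the 20-dimensional pose vector.
        expression = rng.uniform(0.0, 1.0, size=num_expression_dims)
        yaw, pitch, roll = rng.uniform(-20.0, 20.0, size=3)
        return np.concatenate([expression, [yaw, pitch, roll]])

    # Frontalized-expression partition: expressions only, head angles fixed to zero.
    frontal_pose = sample_pose()
    frontal_pose[-3:] = 0.0

    # Rotated-expression partition: expressions and head rotation applied together.
    rotated_pose = sample_pose()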
Overview To reveal the benefits of using AnimeCeleb, we implement four tasks: animation head reenactment, animation co-part segmentation, animation colorization, and image harmonization. For training and evaluation using AnimeCeleb, we employ recent representative baselines for animation head reenactment [30, 32], animation co-part segmentation [33], and animation colorization [19, 22]. We split the whole dataset into a training set of 3319 and a test set of 294 3D animation models. Images rendered with different shader styles are used to train the models: (S.1) for animation head reenactment and co-part segmentation, and (S.3) for animation colorization and image harmonization. Unlike those tasks, which require a training phase, we conduct the image harmonization experiment using an optimization-based approach [45].

Figure 6. Animation cross-identity head reenactment results. Given a source image (1st column) and a driving image (2nd column), PIRenderer [30] trained with AnimeCeleb successfully imitates the motion of the driving image (last column). On the other hand, FOMM [32] often fails to preserve identity due to the geometric gap between the source and driving images (3rd column).

Task Motivations During task selection, we focus on showing the applicability to animation-related tasks and on exploring the potential for improvement by augmenting an unused part (e.g., background). First, animation head reenactment is a promising and practical application, yet existing models [30, 32] trained with real-domain datasets [10, 26] often fail to generalize to out-of-domain animation faces. We address this limitation with a data-centric approach and compare the trained models in controlling animation faces. Second, we verify the effectiveness of AnimeCeleb as a training dataset for unsupervised co-part segmentation by implementing a video-based co-part segmentation baseline [33]. This shows that AnimeCeleb can be a powerful source for recognizing meaningful co-parts of an animation face. Third, we choose the animation colorization task because colorization, an essential process for creating animation, is important and practical in the animation field. Lastly, the image harmonization task is conducted to demonstrate extensive use cases of AnimeCeleb. In the following, we provide a brief overview of each task and show experimental results.

Figure 7. Given a source image from the Waifu dataset (1st column) and a driving image from AnimeCeleb (2nd column), both FOMM [32] and PIRenderer [30] trained with only AnimeCeleb successfully transfer the driving motion to the source image.

Figure 8. Animation co-part segmentation results. A trained model is able to recognize consistent semantics in different images, as seen in the Segmentation columns.

Head Reenactment The head reenactment task aims to transfer the motion of a driving image to a source image while preserving the source identity. Common approaches [30, 32, 41, 42] take advantage of two frames extracted from the same video (i.e., the same identity), where the frames serve as source and driving images, respectively. During training, the source and driving images are embedded to represent the identity and motion; motion representations include unsupervised landmarks and the outputs of off-the-shelf landmark and 3D Morphable Model (3DMM) parameter extractors [6, 13]. Thanks to the AnimeCeleb property of containing multiple images of the same identity, we implement two representative baselines: FOMM [32] and PIRenderer [30]. The former [32] uses unsupervised landmarks as pose representations and combines sparsely located landmarks to predict a dense flow field; the latter [30] utilizes spatially agnostic 3DMM parameters. Instead of 3DMM parameters, we use the generated pose vector as a motion descriptor to rotate the head and manipulate the facial expression of animation faces.

Figure 9. Colorization results in an automatic manner and by referring to in-domain and out-of-domain images. Pix2Pix [19] trained with AnimeCeleb successfully outputs plausible colorized images. Also, a reference-based model [23] successfully fills a given sketch image with the colors of the reference images.

Fig 6 shows qualitative results of the cross-identity motion imitation task. We observe that FOMM often fails to preserve detailed parts of a source image due to the geometric discrepancy of corresponding keypoints between the source and driving images, whereas PIRenderer performs well because the driving motion is transferred via spatially agnostic pose representations. This is also confirmed in Table 2, where PIRenderer outperforms FOMM by a large margin on AnimeCeleb. Note that a model trained with VoxCeleb [26] does not work well on AnimeCeleb test images due to the distribution gap. We evaluate animation head reenactment models on two tasks: (1) same-identity reconstruction, where the source and the driving images are of the same character, and (2) cross-identity motion transfer, where the source and the driving images are from different characters. We use Structural Similarity (SSIM) [38] and Peak Signal-to-Noise Ratio (PSNR) to estimate the reconstruction error. In addition, we employ Fréchet Inception Distance (FID) [18], which is a widely used metric for evaluating the quality of generated images.
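For reference, SSIM and PSNR can be computed per image pair with scikit-image (a recent version supporting channel_axis is assumed), while FID is typically computed over sets of images with a standard implementation such as pytorch-fid; the snippet below is an illustration, not the paper's evaluation code.

    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def reconstruction_scores(reference, reconstructed):
        """Compute PSNR and SSIM between two uint8 RGB images of the same size."""
        psnr = peak_signal_noise_ratio(reference, reconstructed, data_range=255)
        ssim = structural_similarity(reference, reconstructed, channel_axis=-1, data_range=255)
        return psnr, ssim

    # Toy usage with random images; in practice these would be the driving frame
    # and the model's reconstruction in the same-identity setting.
    ref = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    rec = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    print(reconstruction_scores(ref, rec))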
Out-of-Domain Head Reenactment To verify the effectiveness of AnimeCeleb in generalizing to out-of-domain animation heads, we collect 500 randomly sampled anime characters from Waifu Labs 6. Fig 7 presents randomly generated results using Waifu Labs anime characters as source images with driving images from AnimeCeleb. The implemented baselines [30, 32] show promising performance in transferring the driving motion to the source image. In particular, PIRenderer [30], a recently proposed AdaIN-based method, works well in transforming the source head angles as well as manipulating detailed facial expressions while preserving the vivid textures of the source images. We provide more qualitative results on animation-related images in the supplementary material.

Co-part Segmentation Recently, video-based co-part segmentation [33] has been proposed to tackle the problem of co-part segmentation by exploiting motion information to discover meaningful object parts. To this end, the model [33] leverages two frames sampled from the same video, a source and a driving image, and reconstructs the driving image during training via a predicted optical flow. Following this training strategy, we train the model [33] with AnimeCeleb. As can be seen in Fig 8, the trained model successfully discovers semantically consistent parts of the samples: overall faces, accessories on the head, and jaws.

Colorization In the animation field, automatic colorization is an important task for animation creators to reduce their effort during the labor-intensive painting process. Using a trained colorization model, creators are able to obtain colorized images given sketch images.

Figure 10. Animation harmonization results. F.G., B.G. and Acc. denote a foreground object, a background, and an accessory, respectively. The components for image harmonization (1st column) are well blended: the backgrounds and the accessories are refined with styles and textures similar to those of the foreground objects.

Early approaches [19] have addressed this problem in a fully automatic manner, without additional conditions, relying on the colorized results recommended by the model. However, such approaches prevent users from manipulating outputs with their desired colors, restricting their applicability to real-world colorization tasks. To overcome this inconvenience, recent methods [21, 22, 46] have proposed condition-based architectures to reflect given conditions, allowing users to input diverse conditions such as color palettes [46], text tags [21] and reference images [22]. We conduct character colorization tasks using both unconditional and conditional colorization baselines [19, 22]. Fig 9 illustrates the colorization results using the baselines trained with AnimeCeleb. As can be seen in Fig 9, the models show promising performance at painting the animation character sketch images, producing plausible colorization outputs both automatically and by following a given in-domain reference image. To demonstrate the broad generalization capacity of the reference-based model [22] trained with AnimeCeleb, we also use out-of-domain reference images crawled from online cartoons. We find that, not limited to in-domain reference images, the model also achieves plausible colorization outputs based on out-of-domain animation face images.

Image Harmonization Image harmonization aims to generate natural composite images given two images from different domains, achieving a good match in both content and style. To tackle this problem, both learning-based [11, 12, 24] and optimization-based [25, 27, 45] methods have been actively studied. We implement a representative optimization-based approach [45] to explore the applicability of AnimeCeleb to generating more diverse animation images.
Since AnimeCeleb images contain only a foreground object (i.e., an animation character face), composition with arbitrary backgrounds is a natural extension of AnimeCeleb, allowing its unused parts to be utilized. Not limited to background composition, decorative objects (e.g., sunglasses, scarves, caps and masks) are available sources to be exploited for composition. Thanks to the pre-computed segmentation masks, we can easily employ an optimization-based composition model [45] and show experimental results. As shown in Fig 10, both background and decorative-object composition with AnimeCeleb produce plausible results, demonstrating the potential of AnimeCeleb to provide full images with diverse backgrounds and multiple objects.
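Because every rendering ships with an alpha channel, a naive composition baseline is available even before harmonization. The sketch below (Pillow and NumPy, with hypothetical file names) alpha-blends a rendered face onto an arbitrary background; an optimization-based method such as [45] would then refine such a composite.

    import numpy as np
    from PIL import Image

    # Hypothetical file names: a 256x256 RGBA rendering and an arbitrary background image.
    face = np.asarray(Image.open("character_rgba.png").convert("RGBA"), dtype=np.float32) / 255.0
    background = np.asarray(Image.open("background.png").convert("RGB").resize((256, 256)), dtype=np.float32) / 255.0

    alpha = face[..., 3:4]  # foreground mask taken directly from the rendered alpha channel
    composite = alpha * face[..., :3] + (1.0 - alpha) * background

    Image.fromarray((composite * 255).astype(np.uint8)).save("naive_composite.png")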
In this section, we discuss potential issues and directions for improving AnimeCeleb in further research.

Camera Viewpoint Bias During the dataset generation process, we fixed the camera position with the aim of capturing frontal faces of animation characters. Although this enables us to extract character faces easily, the fixed camera position also constrains dataset diversity, failing to include multi-view animation face images. Such training dataset bias can be a bottleneck that weakens the generalizability of deep learning models trained with AnimeCeleb. To address this issue, we plan to consider the camera position as another variable for further improvement of AnimeCeleb.

Extension of Proposed Pipeline Due to limited resources, the proposed pipeline is designed to generate multi-pose and single-view animation faces with limited expressions. We believe that AnimeCeleb has room to be improved in three aspects: (i) adding more target morphs to diversify facial expressions, (ii) combining AnimeCeleb with various background images using copying or harmonization techniques, and (iii) exploiting full-body information to generate an animation body dataset. As future work, we will tackle these issues to construct an improved version of AnimeCeleb.

In this paper, we present AnimeCeleb, a large-scale animation face dataset, which is a valuable and practical resource for developing animation-related data-driven research. Departing from existing animation face datasets, we utilize 3D animation models to construct our animation face dataset by simulating facial expressions and head rotation, leading to a clean animation face dataset with rich annotations. For this purpose, we built a semi-automatic data generation pipeline based on Blender and a semantic annotation tool. In the pipeline, the collected 3D models serve as valuable sources for data generation, combined with the powerful functions of the 3D software and the rich features accompanying the models, including morphs. The conducted experiments demonstrate the utility of AnimeCeleb for broad animation-related tasks. In future work, we plan to extend AnimeCeleb to provide more diverse facial expressions in a multi-view environment.

References
Gwern animation face
Pramook animation face
A morphable model for the synthesis of 3d faces
Gwern Branwen, Anonymous, and Danbooru Community. Danbooru2019: A large-scale anime character illustration dataset
How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks)
Neural head reenactment with latent pose descriptors
Animegan: A novel lightweight gan for photo animation
Cartoongan: Generative adversarial networks for photo cartoonization
Deep speaker recognition
Dovenet: Deep image harmonization via domain verification
Improving the harmony of the composite image by spatial-separated attention module
Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set
Manga109 dataset and creation of metadata
Comicolorization: semi-automatic manga colorization
Marionette: Few-shot face reenactment preserving identity of unseen targets
GANs trained by a two time-scale update rule converge to a local nash equilibrium
Image-to-image translation with conditional adversarial networks
Towards the automatic anime characters creation with generative adversarial networks
Tag2pix: Line art colorization using text tag with secat and changing loss
Reference-based sketch image colorization using augmented-self reference and dense semantic correspondence
Reference-based sketch image colorization using augmented-self reference and dense semantic correspondence
Region-aware adaptive instance normalization for image harmonization
Deep painterly harmonization
Voxceleb: a large-scale speaker identification dataset
Poisson image editing
A faster r-cnn based method for comic characters face detection
Pirenderer: Controllable portrait image generation via semantic neural rendering
Daf:re: A challenging, crowd-sourced, large-scale, long-tailed dataset for anime character recognition
First order motion model for image animation
Motion-supervised co-part segmentation
Deep animation video interpolation in the wild
Learning-by-synthesis for appearance-based 3d gaze estimation
Face detection and face recognition of cartoon characters using feature extraction
One-shot free-view neural talking-head synthesis for video conferencing
Image quality assessment: from error visibility to structural similarity
Fake it till you make it: Face analysis in the wild using synthetic data alone
Disentangling style and content in anime illustrations
Fast bi-layer neural synthesis of one-shot realistic head avatars
Few-shot adversarial learning of realistic neural talking head models
Df2net: A dense-fine-finer network for detailed 3d face reconstruction
Acfd: Asymmetric cartoon face detector
Deep image blending
Real-time user-guided image colorization with learned deep priors
Cartoon face recognition: A benchmark dataset