Large-scale Classification of Fine-Art Paintings: Learning The Right Metric on The Right Feature

Babak Saleh, Ahmed Elgammal

Abstract: In the past few years, the number of fine-art collections that are digitized and publicly available has been growing rapidly. With the availability of such large collections of digitized artworks comes the need to develop multimedia systems to archive and retrieve this pool of data. Measuring the visual similarity between artistic items is an essential step for such multimedia systems and can benefit higher-level multimedia tasks. In order to model this similarity between paintings, we need to extract the appropriate visual features for paintings and find the best approach for learning a similarity metric based on these features. We investigate a comprehensive list of visual features and metric learning approaches to learn an optimized similarity measure between paintings. We develop a machine that is able to make aesthetic-related semantic-level judgments, such as predicting a painting's style, genre, and artist, as well as providing similarity measures optimized based on the knowledge available in the domain of art historical interpretation. Our experiments show the value of using this similarity measure for the aforementioned prediction tasks.

Keywords: similarity metric, visual features, metric learning, convolutional neural networks, style, genre, artist

Figure 1: Illustration of our system for the classification of fine-art paintings. We investigated a variety of visual features and metric learning approaches to recognize the style, genre, and artist of a painting.

1 Introduction

In the past few years, the number of fine-art collections that are digitized and publicly available has been growing rapidly. Such collections span classical, modern, and contemporary artworks. With the availability of such large collections of digitized artworks comes the need to develop multimedia systems to archive and retrieve this pool of data. Typically, these collections, in particular early modern ones, come with metadata in the form of annotations by art historians and curators, including information about each painting's artist, style, date, genre, etc. For online galleries displaying contemporary artwork, there is a need to develop automated recommendation systems that can retrieve "similar" paintings that the user might like to buy. This highlights the need to investigate metrics of visual similarity among digitized paintings that are optimized for the domain of painting.

The field of computer vision has made significant leaps in getting digital systems to recognize and categorize objects and scenes in images and videos. These advances have been driven by a widespread need for the technology, since cameras are everywhere now. However, a person looking at a painting can make sophisticated inferences beyond just recognizing a tree, a chair, or the figure of Christ. Even individuals without specific art historical training can make assumptions about a painting's genre (portrait or landscape), its style (impressionist or abstract), in what century it was created, the artists who likely created the work, and so on. Obviously, the accuracy of such assumptions depends on the viewer's level of knowledge and exposure to art history.
Learning and judging such complex visual concepts is an impressive ability of human perception [2]. The ultimate goal of our research is to develop a machine that is able to make aesthetic-related semantic-level judgments, such as predicting a painting's style, genre, and artist, as well as providing similarity measures optimized based on the knowledge available in the domain of art historical interpretation. Immediate questions that arise include, but are not limited to: What visual features should be used to encode information in images of paintings? How does one weigh different visual features to achieve a useful similarity measure? What type of art historical knowledge should be used to optimize such similarity measures? In this paper we address these questions and aim to provide answers that can benefit researchers in the area of computer-based analysis of art. Our work is based on a systematic methodology and a comprehensive evaluation on one of the largest available digitized art datasets.

Artists use different concepts to describe paintings. In particular, stylistic elements, such as space, texture, form, shape, color, tone, and line, are used. Other principles include movement, unity, harmony, variety, balance, contrast, proportion, and pattern. To these might be added physical attributes, like brush strokes, as well as subject matter and other descriptive concepts [13]. For the task of computer analysis of art, researchers have engineered and investigated various visual features that encode some of these artistic concepts, in particular brush strokes and color, which are encoded as low-level features such as texture statistics and color histograms (e.g. [19, 20]). Color and texture are highly prone to variations during the digitization of paintings; color is also affected by a painting's age. The effect of digitization on the computational analysis of paintings is investigated in great depth by Polatkan et al. [24]. This highlights the need to carefully design visual features that are suitable for the analysis of paintings.

Clearly, it would be a cumbersome process to engineer visual features that encode all the aforementioned artistic concepts. Recent advances in computer vision, using deep neural networks, have shown the advantage of "learning" features from data instead of engineering them. However, it would also be impractical to learn visual features that encode such artistic concepts, since that would require extensive annotation of these concepts in each image within a large training and testing dataset. Obtaining such annotations requires expertise in the field of art history that cannot be expected of typical crowd-sourcing annotators.

Given the aforementioned challenges to engineering or learning suitable visual features for paintings, in this paper we follow an alternative strategy. We investigate different state-of-the-art visual features, ranging from low-level to semantic-level, and then use metric learning to achieve similarity metrics between paintings that are optimized for specific prediction tasks, namely style, genre, and artist classification. We chose these tasks to optimize and evaluate the metrics since, ultimately, the goal of any art recommendation system would be to retrieve artworks that are similar along the directions of these high-level semantic concepts.
Moreover, annotations for these tasks are widely available and more often agreed upon by art historians and critics, which facilitates training and testing the metrics. In this paper we investigate a large space of visual features and learning methodologies for the aforementioned prediction tasks. We propose and compare three learning methodologies to optimize such tasks. We present the results of a comprehensive comparative study that spans four state-of-the-art visual features, five metric learning approaches, and the proposed three learning methodologies, evaluated on the aforementioned three artistic prediction tasks.

2 Related Work

On the subject of painting, computers have been used for a diverse set of tasks. Traditionally, image processing techniques have been used to provide art historians with quantification tools, such as pigmentation analysis, statistical quantification of brush strokes, etc. We refer the reader to [28, 5] for comprehensive surveys on this subject.

Several studies have addressed the question of which features should be used to encode information in paintings. Most of the research concerning the classification of paintings utilizes low-level features encoding color, shadow, texture, and edges. For example, Lombardi [20] presented a study of the performance of these types of features for the task of artist classification among a small set of artists, using several supervised and unsupervised learning methodologies. In that paper the style of the painting was identified as a result of recognizing the artist.

Since brushstrokes provide a signature that can help identify the artist, designing visual features that encode brushstrokes has been widely adopted (e.g. [25, 18, 22, 15, 6, 19]). Typically, texture statistics are used for that purpose. However, as mentioned earlier, texture features are highly affected by the digitization resolution. Researchers have also investigated the use of features based on local edge orientation histograms, such as SIFT [21] and HOG [10]. For example, [12] used SIFT features within a bag-of-words pipeline to discriminate among a set of eight artists. Arora et al. [3] presented a comparative study for the task of style classification, which evaluated low-level features, such as SIFT and Color SIFT [1], against semantic-level features, namely Classemes [29], which encode object presence in the image. It was found that semantic-level features significantly outperform low-level features for this task. However, the evaluation was conducted on a small dataset of 7 styles, with 70 paintings in each style. Carneiro et al. [9] also concluded that low-level texture and color features are not effective because of the inconsistent color and texture patterns that describe the visual classes in paintings.

More recently, Saleh et al. [26] used metric learning approaches for finding influence paths between painters based on their paintings. They evaluated three metric learning approaches to optimize a metric over low-level HOG features. In contrast to that work, the evaluation presented in this paper is much wider in scope, since we address three tasks (style, genre, and artist prediction), we cover features spanning from low-level to semantic-level, and we evaluate five metric learning approaches. Moreover, the dataset of [26] has only 1,710 images from 66 artists, while we conducted our experiments on 81,449 images painted by 1,119 artists.
Bar et al. [4] proposed an approach for style classification based on features obtained from a convolutional neural network pre-trained on an image categorization task. In contrast, we show that we can achieve better results with much lower-dimensional features that are directly optimized for style and genre classification. Lower dimensionality of the features is preferred for indexing large image collections.

3 Methodology

In this section we explain the methodology that we follow to find the most appropriate combination of visual features and metrics to produce accurate similarity measurements. We design these measurements to mimic the art historian's ability to categorize paintings based on their style, genre, and the artist who made them. In the first step, we extract visual features from the image. These visual features range from low-level (e.g. edges) to high-level (e.g. objects in the painting). More importantly, in the next step we learn how to adjust these features for different classification tasks by learning the appropriate metrics. Given the learned metric, we are able to project paintings from a high-dimensional space of raw visual information into a meaningful space of much lower dimensionality. Additionally, learning a classifier in this low-dimensional space can easily be scaled up to large collections. In the rest of this section, we first introduce our collection of fine-art paintings and explain the tasks that we target in this work. Then we explore the methodologies that we consider for finding the most accurate system for the aforementioned tasks. Finally, we explain the different types of visual features that we use to represent images of paintings and discuss the metric learning approaches that we applied to find a proper notion of similarity between paintings.

3.1 Dataset and Proposed Tasks

To gather our collection of fine-art paintings, we used the publicly available "Wikiart paintings" dataset, which, to the best of our knowledge, is the largest online public collection of digitized artworks. This collection has images of 81,449 fine-art paintings from 1,119 artists, spanning fifteen centuries of art up to contemporary artists. These paintings come from 27 different styles (Abstract, Byzantine, Baroque, etc.) and 45 different genres (Interior, Landscape, etc.). Previous work [26, 9] used different resources and assembled smaller collections with limited variability in terms of style, genre, and artists. The work of [4] is the closest to ours in terms of the data collection procedure, but the number of images in their collection is half of ours.

We target the automatic classification of paintings based on their style, genre, and artist, using visual features that are automatically extracted with computer vision algorithms. Each of these tasks has its own challenges and limitations. For example, there are large variations in visual appearance among paintings of one specific style, whereas this variation is much more limited for paintings by one artist. These larger intra-class variations suggest that style classification based on visual features is more challenging than artist classification.

For each of the tasks we selected a subset of the data that ensures enough samples for training and testing. In particular, for style classification we use a subset of the data with 27 styles, where each style has at least 1,500 paintings, with no restriction on genre or artist, for a total of 78,449 images. For genre classification we use a subset with 10 genre classes, where each genre has at least 1,500 paintings, with no restriction on style or artist, for a total of 63,691 images. Similarly, for artist classification we use a subset of 23 artists, where each of them has at least 500 paintings, for a total of 18,599 images. Table 1 lists the sets of style, genre, and artist labels. A sketch of this subset selection is shown below.
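Concretely, this subset selection is a frequency filter over the collection's metadata. The following minimal Python sketch illustrates it; the file name and column names are hypothetical placeholders, not part of the Wikiart dataset itself.

```python
# Hypothetical sketch of the per-task subset selection: keep only labels that
# have enough paintings for training and testing.
import pandas as pd

MIN_PAINTINGS = 1500  # threshold used for the style and genre subsets

meta = pd.read_csv("wikiart_metadata.csv")       # hypothetical metadata file
counts = meta["style"].value_counts()            # paintings per style label
kept = counts[counts >= MIN_PAINTINGS].index     # 27 styles pass in our data
style_subset = meta[meta["style"].isin(kept)]    # 78,449 paintings remain
print(f"{len(style_subset)} paintings across {len(kept)} styles")
```

The same filter applied to a "genre" column (threshold 1,500) and an "artist" column (threshold 500) would yield the other two subsets.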
3.2 Classification Methodology

To classify paintings based on their style, genre, or artist, we followed three methodologies.

Metric Learning: First, as depicted in Figure 1, we extract visual features from images of paintings. For each of the prediction tasks, we learn a similarity metric optimized for it, i.e. a style-optimized metric, a genre-optimized metric, and an artist-optimized metric. Each metric induces a projection to a corresponding feature space optimized for the corresponding task. Having learned the metric, we project the raw visual features into the new optimized feature space and learn classifiers for the corresponding prediction task. For that purpose, we learn a set of one-vs-all SVM classifiers for each of the labels in Table 1 for each of the tasks (a code sketch of this pipeline is given after Figure 3). While this first strategy focuses on classification based on combinations of one metric and one visual feature, the next two methodologies fuse different features or different metrics.

Feature fusion: The second methodology that we used for classification is depicted in Figure 2. In this case, we extract different types of visual features (four types, as explained below). Based on the prediction task (e.g. style), we learn the metric for each type of feature as before. After projecting these features separately, we concatenate them to form the final feature vector. Classification is then based on training classifiers on these final features. This feature fusion is important, as we want to capture different types of visual information by using different types of features. Concatenating all raw features and learning a metric on top of this huge feature vector would be computationally intractable; instead, we learn a metric on each feature type separately and, after projecting the features with these metrics, concatenate them for classification purposes.

Figure 2: Illustration of our second methodology – Feature Fusion.

Metric fusion: The third methodology (Figure 3) projects each visual feature using multiple metrics (in our experiments we used five metrics, as explained below) and then fuses the resulting optimized feature spaces to obtain a final feature vector for classification. This is an important strategy, because each of the metric learning approaches uses a different criterion to learn the similarity measurement. By learning all metrics individually (on the same type of feature), we make sure that we take all criteria into account (e.g. information theory along with neighborhood analysis).

Figure 3: Illustration of our third methodology – Metric Fusion.
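As an illustration of the first methodology, the sketch below chains a learned metric and one-vs-all SVM classifiers, using the open-source metric-learn and scikit-learn packages as stand-ins for our implementation; X and y are placeholders for a raw feature matrix and the per-painting task labels.

```python
# Minimal sketch of the metric learning methodology: learn a task-optimized
# metric, project the raw features, then train one-vs-all SVMs in that space.
import numpy as np
from metric_learn import LMNN                  # one of the five metrics we study
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X = np.load("features.npy")                    # placeholder: raw visual features
y = np.load("style_labels.npy")                # placeholder: e.g. style labels

metric = LMNN()                                # style-optimized Mahalanobis metric
metric.fit(X, y)
X_proj = metric.transform(X)                   # projection induced by the metric

clf = OneVsRestClassifier(LinearSVC(C=10))     # one one-vs-all SVM per label
clf.fit(X_proj, y)
```

The same pipeline applies to the genre- and artist-optimized metrics by swapping the label array.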
3.3 Visual Features

Visual features in the computer vision literature are either engineered and extracted in an unsupervised way (e.g. HOG, GIST) or learned by optimizing a specific task, typically the categorization of objects or scenes (e.g. CNN-based features). This results in high-dimensional feature vectors whose dimensions do not necessarily correspond to nameable (semantic-level) characteristics of an image. Based on this interpretability, visual features can be categorized into low-level and high-level. Low-level features are visual descriptors for which there is no explicit meaning attached to each dimension, while high-level visual features are designed to capture certain notions (usually objects). For this work, we investigated state-of-the-art representatives of these two categories.

Low-level Features, GIST: Human observers can rapidly capture the "gist" of a scene in a quick feed-forward sweep. Therefore, a computational model of "gist" is a reasonably essential tool for rapid scene classification. Gist has been modelled as average pooling of low-level biologically-inspired features (i.e. Gabor-like features) over non-overlapping subregions arranged on a fixed grid. The term "spatial envelope" has also been used to refer to this very low-dimensional representation of the scene [23]. Indeed, the gist model bypasses the procedures that are usually applied in scene classification, such as segmentation and the processing of individual objects. The dominant spatial structure of a scene is represented in a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness), which the gist model estimates using spectral and coarsely localized information. To calculate the gist features, each image is divided into 16 bins, and oriented Gabor filters (in 8 orientations) are applied over different scales (4 scales) in each bin. Finally, the average filter energy in each bin is calculated [24]. We followed this procedure and extracted a 512-dimensional GIST feature vector for each image; the sketch below illustrates the computation.
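For concreteness, the following schematic Python sketch mirrors the gist computation described above (4 scales x 8 orientations x 16 grid cells = 512 dimensions). It is not the reference implementation of Oliva and Torralba; the working image size and the frequency schedule are illustrative assumptions.

```python
# Schematic GIST: average Gabor filter energy over a 4x4 grid of non-overlapping
# subregions, for 4 scales and 8 orientations (4 * 8 * 16 = 512 dimensions).
import numpy as np
from skimage.filters import gabor
from skimage.transform import resize

def gist_descriptor(gray, n_scales=4, n_orients=8, grid=4):
    img = resize(gray, (128, 128), anti_aliasing=True)  # assumed working size
    feats = []
    for s in range(n_scales):
        freq = 0.25 / (2 ** s)               # assumed coarse-to-fine schedule
        for o in range(n_orients):
            theta = np.pi * o / n_orients    # filter orientation
            real, imag = gabor(img, frequency=freq, theta=theta)
            energy = np.hypot(real, imag)    # Gabor filter energy
            h, w = energy.shape[0] // grid, energy.shape[1] // grid
            for i in range(grid):            # average pooling per grid cell
                for j in range(grid):
                    feats.append(energy[i*h:(i+1)*h, j*w:(j+1)*w].mean())
    return np.asarray(feats)                 # 512-dimensional descriptor
```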
Learned Semantic-level Features: For the purpose of semantic representation of the images, we extracted three object-based representations: Classemes [29], Picodes [8], and CNN-based features [16]. In all three, each element of the feature vector represents the confidence of the presence of an object category in the image, so they provide a semantic encoding of the image. The list of object categories is user-specified and does not cover all object categories in the real world. Despite the limited number of categories in this type of modeling, these semantic encodings of images have shown remarkable results for the task of image search. However, when these features are learned, the object categories are generic, geared mostly toward realistic photographs, and not specifically designed for the purposes of art.

The first two features are designed to capture the presence of a set of basic-level object categories as follows: a list of entry-level categories (e.g. horse and cross) is used to download a large collection of images from the web. For each image a comprehensive set of low-level visual features is extracted, and one classifier is learned per category. For a given test image, these classifiers are applied to the image and their responses (confidences) form the final feature vector. We followed the implementation of [7] and extracted, for each image, a 2,659-dimensional real-valued Classeme feature vector and a 2,048-dimensional binary-valued Picodes feature vector.

Convolutional Neural Networks (CNN) [17] have shown remarkable performance for the task of large-scale image categorization [16]. The CNN of [16] has five convolutional layers followed by three fully connected layers. Bar et al. [4] showed that a combination of the outputs of these fully connected layers achieves superior performance for the task of style classification of paintings. Following this observation, we used the last layer of a pre-trained CNN [16] (1,000-dimensional real-valued vectors) as another feature vector.

3.4 Metric Learning

The aforementioned visual features are designed for photographic images; we should therefore tune them to perform reasonably on paintings as well. We use these features for classification tasks on fine-art paintings, which amounts to placing similar paintings close to each other. To measure similarity, we apply a set of metric learning approaches and identify the most suitable one. Metric learning is an active research area in the field of machine learning, and we encourage interested readers to consult the surveys on this topic. Formally, metric learning is the problem of finding a real-valued function that assigns a score to each pair of inputs, where a smaller score indicates less difference, i.e. higher similarity. For this paper, we consider the following metric learning approaches.

Neighborhood Component Analysis (NCA) [14]: This approach focuses on analyzing the nearest neighbors. The analysis is mainly based on placing neighbors of the same class (e.g. painting style in our study) close to each other.

Large Margin Nearest Neighbors (LMNN): LMNN [32] is an approach for learning a Mahalanobis distance, widely used because of its globally optimal solution and superior performance in practice. Learning this metric involves a set of constraints, all of which are defined locally: LMNN enforces that the k nearest neighbors of any training instance belong to the same class (these instances are called "target neighbors"), while all instances of other classes, referred to as "impostors", should be far from this point. For finding the target neighbors, the Euclidean distance is applied to each pair of samples. This metric learning approach is related in principle to Support Vector Machines (SVM), which theoretically motivates its use together with SVMs for the task of classification. Due to the popularity of LMNN, different variations of it have been introduced, including a non-linear version called gb-LMNN [32], which we used in our experiments as well. However, its performance on our classification tasks was worse than that of linear LMNN. We assume this poor performance is rooted in the nature of the visual features that we extract for paintings.

Boost Metric [27]: The idea behind this approach follows this intuition: instead of learning a universal metric that works best on all the data, it may be better to learn and combine a set of weaker metrics that are not universal but perform reasonably on a subset of the data. Shen et al. [27] exploit this fact and, instead of learning a metric directly, find a set of metrics that can be combined into the final metric. They treat each of these matrices as a weak learner, in the sense used in the literature on boosting methods. The resulting algorithm applies the idea of AdaBoost to the Mahalanobis distance and has been shown to be quite efficient in practice. This method is of particular interest to us because we can learn an individual metric for each style of painting and finally merge these metrics into a unique final metric. In principle, the final metric can then also capture similarities inside each style/genre of paintings.
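All of the Mahalanobis-based approaches here (LMNN and BoostMetric above, ITML below) learn a positive semidefinite matrix M. The following standard formulation, with the weak-learner decomposition of [27] in the last line, makes this shared machinery explicit:

```latex
d_M(x_i, x_j) = \sqrt{(x_i - x_j)^{\top} M \,(x_i - x_j)}, \qquad M \succeq 0 .
% Factoring M = L^{\top} L turns the metric into a linear projection:
d_M(x_i, x_j) = \lVert L x_i - L x_j \rVert_2 , \qquad x \mapsto L x .
% BoostMetric combines rank-one "weak" metrics with nonnegative weights:
M = \sum_{j} w_j \, Z_j , \qquad w_j \ge 0 , \quad \operatorname{rank}(Z_j) = 1 .
```

Learning M is thus equivalent to learning the linear projection x -> Lx that maps raw features into the task-optimized spaces of Section 3.2.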
Information-Theoretic Metric Learning (ITML) [11]: This metric learning algorithm is grounded in information theory rather than purely numerical distances. In other words, the learning of this metric is rooted in entropy measures and probabilistic models.

Metric Learning for Kernel Regression (MLKR): This approach is similar in spirit to NCA, which minimizes the classification error. Weinberger and Tesauro [31] learn a metric by optimizing the leave-one-out error for the task of kernel regression. In kernel regression, proper distances between points are essential, as they are used to weight the training samples; MLKR learns this distance by minimizing the leave-one-out regression error on the training data. Although this metric learning method is designed for kernel regression, the resulting distance function can be used in a variety of tasks.

4 Experiments

4.1 Experimental Setting

Visual Features: As explained in Section 3, we extract GIST features as low-level visual features, and Classemes, Picodes, and CNN-based features as high-level semantic features. We followed the original implementation of Oliva and Torralba [23] to obtain a 512-dimensional GIST feature vector. For Classemes and Picodes we used the implementation of Bergamo et al. [7], resulting in 2,659-dimensional Classemes features and 2,048-dimensional Picodes features. We used the implementation of Vedaldi and Lenc [30] to extract 1,000-dimensional feature vectors from the last layer of the CNN.

Object-based representations of the images produce feature vectors of much higher dimensionality than GIST descriptors. For the sake of a fair comparison of all feature types in metric learning, we transformed all feature vectors to the same size as GIST (512 dimensions). We did this by applying Principal Component Analysis (PCA) to each feature type and projecting the original features onto the first 512 eigenvectors (those with the largest eigenvalues). To verify the quality of this projection, we examined the eigenvalue coefficients of the PCA projections. Independent of the feature type, the value of these coefficients drops significantly after the first 500 eigenvectors. For example, Figure 4 plots these coefficients of the PCA projection for CNN features; the sum of the first 500 coefficients amounts to 95.88% of the total. This shows that our projections (onto 512 eigenvectors) capture the true underlying space of the original features. Using these reduced features speeds up the metric learning process as well; the sketch below illustrates the reduction.

Figure 4: PCA coefficients for CNN features.
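The reduction to 512 dimensions can be reproduced with any standard PCA implementation; the sketch below assumes scikit-learn, uses a placeholder feature file, and performs a variance check analogous to the coefficient analysis of Figure 4.

```python
# Minimal sketch of the dimensionality reduction: project each raw feature type
# onto its first 512 principal components to match the GIST dimensionality.
import numpy as np
from sklearn.decomposition import PCA

X = np.load("classeme_features.npy")   # placeholder: e.g. 2,659-d Classemes
pca = PCA(n_components=512)
X_512 = pca.fit_transform(X)           # samples projected onto top eigenvectors

# analogue of the Figure 4 check: dominance of the first 500 components
print(pca.explained_variance_ratio_[:500].sum())
```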
Metric Learning: We used the implementation of [32] to learn the LMNN metric (both the linear and non-linear versions) and MLKR. For BoostMetric we slightly adjusted the implementation of [27]. For NCA we adapted the implementation by Fowlkes to work smoothly on large-scale feature vectors. For ITML, we followed the authors' original implementation with the default settings. For the remaining methods, parameters were chosen through a grid search that minimizes the nearest-neighbor classification error.

Regarding training time, learning the ITML metric was the fastest, and learning NCA and LMNN were the slowest. Due to computational constraints, we set the parameters of the LMNN metric to reduce the size of the features to 100. The NCA metric reduces the dimensionality of the features to the number of categories for each task: 27 for style classification, 23 for artist classification, and 10 for genre classification. We randomly picked 3,000 samples for metric learning; these samples follow the same distribution as the original data and are not used in the classification experiments.

4.2 Classification Experiments

For the purpose of metric learning, we conducted experiments with labels for the three tasks of style, genre, and artist prediction. In the following sections, we investigate the performance of these metrics on different features for the classification of the aforementioned concepts. We learned all the metrics of Section 3 for all 27 styles of paintings in our dataset (e.g. Expressionism, Realism, etc.). However, we did not use all the genres for learning metrics: our dataset contains 45 genres, some of which have fewer than 20 images. This makes metric learning impractical and highly biased toward genres with a larger number of paintings. Because of this issue, we focus on the 10 genres with more than 1,500 paintings each; these genres are listed in Table 1. In all experiments we conducted 3-fold cross validation and report the average accuracy over all partitions. We found the best value for the penalty term of the SVM (C = 10) by three-fold cross validation. In the next three sections, we explain the settings and findings for each task independently.

Style Classification: Table 2 contains the results (accuracy percentage) of style classification (SVM) after applying different metrics to the set of features. Columns correspond to different features, and rows to the different metrics used to project the features before learning the style classifiers. To quantify the improvement gained by learning similarity metrics, we conducted a baseline experiment (first row of the table): for each type of feature, we learn a set of one-vs-all classifiers on the raw feature vectors. Generally, the Boost metric and ITML approaches give the highest accuracy for the task of style classification across the different visual features. However, the greatest improvement over the baseline is obtained by applying the Boost metric to Classeme features. The evaluation protocol is sketched below.
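The evaluation protocol can be summarized in a few lines; this sketch again assumes scikit-learn, with placeholder files for the metric-projected features and labels.

```python
# Minimal sketch of the evaluation: 3-fold cross validation of one-vs-all SVMs
# (penalty C = 10), followed by the confusion matrix analyzed in Figure 5.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, confusion_matrix

X_proj = np.load("projected_features.npy")   # placeholder: projected features
y = np.load("style_labels.npy")              # placeholder: ground-truth labels

clf = OneVsRestClassifier(LinearSVC(C=10))
y_pred = cross_val_predict(clf, X_proj, y, cv=3)   # 3-fold cross validation
print("mean accuracy:", accuracy_score(y, y_pred))
cm = confusion_matrix(y, y_pred)             # rows: true labels, cols: predicted
```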
We visualized the confusion matrix for the task of style classification when learning the Boost metric on Classeme features. Figure 5 shows this matrix, where red represents higher values. Further analysis of some of the confusions captured in this matrix yields interesting findings, a few of which we discuss here. First, there is considerable confusion between "Abstract Expressionism" (first row) and "Action painting" (second column). Art historians confirm that this confusion is meaningful and somewhat expected: "Action painting" is a type or subgenre of "Abstract Expressionism", characterized by paintings created through a much more active process: drips, flung paint, stepping on the canvas.

Figure 5: Confusion matrix for style classification (best seen in color).

Another confusion occurs between "Expressionism" (column 10) and "Fauvism" (row 11), which is indeed expected based on the art history literature. "Mannerism" (row 14) is a style of art from the (late) "Renaissance" (column 12); Mannerist works show unusual effects of scale and are less naturalistic than those of the "Early Renaissance". This similarity between "Mannerism" (row 14) and "Renaissance" (column 12) is captured by our system as well, where it results in confusion during style classification. "Minimalism" (column 15) and "Color Field painting" (6th row) are mostly confused with each other; this finding is understandable when one looks at members of these styles and notes their similarity in terms of simple forms and the distribution of colors. Lastly, some of the confusions are completely acceptable given the origins of these styles (art movements) as noted in the art history literature, for example "Renaissance" (column 18) and "Early Renaissance" (row 9); "Post Impressionism" (column 21) and "Impressionism" (row 13); "Cubism" (8th row) and "Synthetic Cubism" (column 26). Synthetic Cubism is the later phase of Cubism, with more color and continued use of collage and pasted papers, but less linear perspective.

Genre Classification: We narrowed down the list of all genres in our dataset (45 in total) to obtain a reasonable number of samples per genre (the 10 selected genres are listed in Table 1). We trained ten one-vs-all SVM classifiers and compare their performance in Table 3. In this table, columns represent different features and rows the different metrics used to compute the distances. As Table 3 shows, we achieved the best performance for genre classification by learning the Boost metric on top of Classeme features. Generally, the performance of these classifiers is better than that of the classifiers trained for style classification. This is expected, as the number of genres is smaller than the number of styles in our dataset.

Figure 6: Confusion matrix for genre classification (best seen in color).

Figure 6 shows the confusion matrix for genre classification when learning the Boost metric on Classeme features. Investigating the confusions in this matrix reveals interesting results. For example, our system confuses "Landscape" (5th row) with "Cityscape" (2nd column) and "Genre painting" (3rd column). This confusion is expected, as art historians can find common elements in these genres. On the one hand, "Landscape" paintings usually show rivers, mountains, and valleys, with no significant figures, and are frequently very similar to "Genre paintings", which capture daily life; the difference lies in the fact that, unlike "Genre paintings", "Landscape" paintings are idealized. On the other hand, "Landscape" and "Cityscape" paintings are very similar, as both depict open space and use realistic color tonalities.

Figure 7: Confusion matrix for artist classification (best seen in color).
Artist Classification: For the task of artist classification, we trained one-vs-all SVM classifiers for each of the 23 artists. For each test image, we determine its artist by finding the classifier that produces the maximum confidence. Table 4 shows the performance of different combinations of features and metrics for this task. In general, learning the Boost metric improves artist classification more than all other metrics, except in the case of CNN features, where learning the ITML metric gives the best performance.

We plotted the confusion matrix of this classification task in Figure 7. In this plot, some confusions between artists are clearly reasonable, and we investigated two cases. First, "Claude Monet" (5th row) and "Camille Pissarro" (3rd column): both are Impressionist artists who lived in the late nineteenth and early twentieth centuries. Interestingly, according to the art history literature, Monet and Pissarro became friends when they both attended the "Académie Suisse" in Paris; this friendship lasted a long time and resulted in some noticeable interactions between them. Second, paintings by "Childe Hassam" (4th row) are mostly confused with ones by "Monet" (5th column). This confusion is acceptable, as Hassam was an American Impressionist who declared himself influenced by the French Impressionists. Hassam called himself an "Extreme Impressionist" and painted some flag-themed artworks similar to Monet's.

Looking at the performances reported in Tables 2-4, we conclude that all three classification tasks benefit from learning the appropriate metric. This means that we can improve on the baseline classification accuracy by learning metrics, independent of the type of visual feature or of the concept by which we classify the paintings. The experimental results show that, independent of the task, the NCA and MLKR approaches perform worse than the other metrics, while the Boost metric always gives the best or second-best results for all classification tasks. Regarding the importance of the features, we can verify that Classeme and Picodes features are the better image representations for classification purposes. Based on these classification experiments, we claim that Classemes and Picodes perform better than CNN features. This is rooted in the fact that the amount of supervision used in training Classemes and Picodes is greater than in CNN training. Also, unlike Classemes and Picodes, the CNN feature is designed to categorize the object inside a given bounding box; in the case of paintings, however, we cannot assume that bounding boxes around the objects are given.

Integration of Features and Metrics: So far we have investigated the performance of different metric learning approaches and visual features individually. In the next step, we determine the best performance for the aforementioned classification tasks by combining different visual features. Toward this goal, we followed two strategies (a sketch of the first is given below). First, for a given metric, we project each type of visual feature with that metric and concatenate the projected features. Second, we fix the type of visual feature, project it with different metrics, and concatenate these projections. Given these larger feature vectors (from either strategy), we train SVM classifiers for the three tasks of style, genre, and artist classification. Table 6 shows the results of the experiments following the former strategy, and Table 5 shows the results of the latter.
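The first fusion strategy can be sketched as follows, once more with metric-learn and scikit-learn as stand-ins for our implementation; the four feature files are hypothetical placeholders for the raw GIST, Classemes, Picodes, and CNN matrices.

```python
# Minimal sketch of feature fusion: fix one metric learner (here LMNN), project
# each feature type with its own learned metric, then concatenate the results.
import numpy as np
from metric_learn import LMNN
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

y = np.load("style_labels.npy")               # placeholder labels
feature_files = ["gist.npy", "classemes.npy", "picodes.npy", "cnn.npy"]

projected = []
for path in feature_files:                    # one metric per feature type
    X = np.load(path)
    metric = LMNN().fit(X, y)
    projected.append(metric.transform(X))

X_fused = np.hstack(projected)                # e.g. four 100-d blocks -> 400-d
clf = OneVsRestClassifier(LinearSVC(C=10)).fit(X_fused, y)
```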
In general, we get better results by fixing the metric and concatenating the projected feature vectors (the first strategy). The work of Bar et al. [4] is the most similar to ours, and we compare the final results of these experiments with their reported performance. [4] only performed the task of style classification, on half of the images in our dataset, and achieved an accuracy of 43% using two variations of PiCoDes features and two layers of a CNN. We outperform their approach, achieving 45.97% accuracy for style classification when using the LMNN metric to project the GIST, Classeme, PiCoDes, and CNN features and concatenating them, as reported in the third column of Table 6.

Our contribution goes beyond outperforming the state-of-the-art, by learning a more compact feature representation. Our best performance for style classification is obtained by concatenating four 100-dimensional feature vectors, resulting in a 400-dimensional feature vector on top of which we train SVM classifiers. In contrast, [4] extract a 3,882-dimensional feature vector for their best reported performance. As a result, we not only outperform the state-of-the-art but also present a better image representation that reduces the required space by 90%. This efficient feature vector is an extremely useful image representation that attains the best classification accuracy, and we consider its application to the task of image retrieval as future work.

To qualitatively evaluate the extracted visual features and learned metrics, we built a prototype image search task. As feature fusion with the LMNN metric gives the best performance for style classification, we used this setting as our similarity measurement model. Figure 8 shows some sample outputs of this image search. For each pair, the image on the left is the query image, for which we find the closest match (image on the right) based on LMNN and feature fusion. However, we force the system to pick the closest match that does not belong to the same style as the query image. This verifies that although we learn the metric based on style labels, the learned projection can find similarity across styles. A sketch of this search is given below.
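The cross-style nearest-neighbor search behind Figure 8 can be sketched as follows; X_fused and styles are placeholders for the fused projected features and the per-image style labels.

```python
# Minimal sketch of the prototype image search: return the closest match to a
# query in the fused metric space, skipping images that share the query's style.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X_fused = np.load("fused_features.npy")   # placeholder: fused projected features
styles = np.load("style_labels.npy")      # placeholder: style label per image
q = 0                                     # index of the query painting

nn = NearestNeighbors(n_neighbors=50).fit(X_fused)
_, idx = nn.kneighbors(X_fused[q:q + 1])
# first neighbor with a different style (idx[0][0] is the query itself)
match = next(i for i in idx[0] if styles[i] != styles[q])
```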
5 Conclusion and Future Work

In this paper we investigated the applicability of metric learning approaches and the performance of different visual features for learning similarity in a collection of fine-art paintings. We implemented meaningful metrics for measuring the similarity between paintings. These metrics are learned in a supervised manner so as to place paintings of one concept close to each other and far from others; in this work we used three concepts: style, genre, and artist. We used these learned metrics to transform raw visual features into another space in which we can significantly improve the performance on the three important tasks of style, genre, and artist classification. We conducted our comparative experiments on the largest publicly available dataset of fine-art paintings. We conclude that:

1. Classeme features show superior performance for all three tasks of style, genre, and artist classification. This superior performance is independent of the type of metric that has been learned.

2. When working on an individual type of visual feature, the Boost metric and Information-Theoretic Metric Learning (ITML) approaches improve the accuracy of the classification tasks across all features.

3. When using different types of features together (feature fusion), Large-Margin Nearest-Neighbor (LMNN) metric learning achieves the best performance in all classification experiments.

4. By learning the LMNN metric on Classeme features, we find an optimized representation that not only outperforms the state-of-the-art for the task of style classification, but also reduces the size of the feature vector by 90%. We consider verifying the applicability of this representation to image retrieval and recommendation systems as future work.

As further future work, we would like to learn metrics based on other annotations (e.g. time period).

Bibliography

[1] A. E. Abdel-Hakim and A. A. Farag. CSIFT: A SIFT descriptor with color invariant characteristics. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2006.
[2] R. Arnheim. Visual Thinking. University of California Press, 1969.
[3] R. S. Arora and A. M. Elgammal. Towards automated classification of fine-art painting style: A comparative study. In ICPR, 2012.
[4] Y. Bar, N. Levy, and L. Wolf. Classification of artistic styles using binarized features derived from a deep neural network. 2014.
[5] A. Bentkowska-Kafel and J. Coddington. Computer Vision and Image Analysis of Art: Proceedings of the SPIE Electronic Imaging Symposium, San Jose Convention Center, 18-22 January 2010. Proceedings of SPIE, 2010.
[6] I. E. Berezhnoy, E. O. Postma, and H. J. van den Herik. Automatic extraction of brush stroke orientation from paintings. Machine Vision and Applications, 20(1):1–9, 2009.
[7] A. Bergamo and L. Torresani. Classemes and other classifier-based features for efficient object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[8] A. Bergamo, L. Torresani, and A. W. Fitzgibbon. Picodes: Learning a compact code for novel-category recognition. In Advances in Neural Information Processing Systems, pages 2088–2096, 2011.
[9] G. Carneiro, N. P. da Silva, A. D. Bue, and J. P. Costeira. Artistic image classification: An analysis on the PrintArt database. In ECCV, 2012.
[10] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In International Conference on Computer Vision & Pattern Recognition, volume 2, pages 886–893, June 2005.
[11] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, 2007.
[12] F. Shahbaz Khan, J. van de Weijer, and M. Vanrell. Who painted this painting?, 2010.
[13] L. Fichner-Rathus. Foundations of Art and Design. Clark Baxter, 2008.
[14] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004.
[15] C. R. Johnson, E. Hendriks, I. J. Berezhnoy, E. Brevdo, S. M. Hughes, I. Daubechies, J. Li, E. Postma, and J. Z. Wang. Image processing for artist identification. IEEE Signal Processing Magazine, 25(4):37–48, 2008.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[18] J. Li and J. Z. Wang. Studying digital imagery of ancient paintings by mixtures of stochastic models. IEEE Transactions on Image Processing, 13(3):340–353, 2004.
[19] J. Li, L. Yao, E. Hendriks, and J. Z. Wang. Rhythmic brushstrokes distinguish van Gogh from his contemporaries: Findings via automated brushstroke extraction. IEEE Trans. Pattern Anal. Mach. Intell., 2012.
[20] T. E. Lombardi. The classification of style in fine-art painting. ETD Collection for Pace University, Paper AAI3189084, 2005.
[21] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004.
[22] S. Lyu, D. Rockmore, and H. Farid. A digital technique for art authentication. Proceedings of the National Academy of Sciences of the United States of America, 101(49):17006–17010, 2004.
[23] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
[24] G. Polatkan, S. Jafarpour, A. Brasoveanu, S. Hughes, and I. Daubechies. Detection of forgery in paintings using supervised learning. In 16th IEEE International Conference on Image Processing (ICIP), 2009.
[25] R. Sablatnig, P. Kammerer, and E. Zolda. Hierarchical classification of paintings using face- and brush stroke models. 1998.
[26] B. Saleh, K. Abe, and A. Elgammal. Knowledge discovery of artistic influences: A metric learning approach. In ICCC, 2014.
[27] C. Shen, J. Kim, L. Wang, and A. van den Hengel. Positive semidefinite metric learning using boosting-like algorithms. Journal of Machine Learning Research, 13:1007–1036, 2012.
[28] D. G. Stork. Computer vision and computer graphics analysis of paintings and drawings: An introduction to the literature. In Computer Analysis of Images and Patterns, pages 9–24. Springer, 2009.
[29] L. Torresani, M. Szummer, and A. Fitzgibbon. Efficient object category recognition using classemes. In ECCV, 2010.
[30] A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. CoRR, abs/1412.4564, 2014.
[31] K. Weinberger and G. Tesauro. Metric learning for kernel regression. In Eleventh International Conference on Artificial Intelligence and Statistics, pages 608–615, 2007.
[32] K. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 2009.

Tables

Table 1: List of styles, genres, and artists in our collection of fine-art paintings. Numbers in parentheses are the indices of the corresponding rows/columns in the confusion matrices of Figures 5, 6, and 7.

Table 2: Accuracy (%) for the task of style classification.

Metric / Feature   GIST    Classemes   Picodes   CNN     Dimension
Baseline           10.83   22.62       20.76     12.32   512
Boost              16.07   31.77       28.58     15.18   512
ITML               13.02   30.67       28.42     15.34   512
LMNN               12.54   27.00       24.14     16.83   100
MLKR               12.65   24.12       14.86     12.63   512
NCA                13.29   28.19       24.84     16.37   27

Table 3: Accuracy (%) for the task of genre classification.

Metric / Feature   GIST    Classemes   Picodes   CNN     Dimension
Baseline           28.10   49.98       49.63     35.14   512
Boost              31.01   57.87       57.35     46.14   512
ITML               33.10   57.86       57.28     46.80   512
LMNN               39.06   54.96       54.42     49.98   100
MLKR               32.81   54.29       42.79     45.02   512
NCA                30.39   51.38       52.74     49.26   10

Table 4: Accuracy (%) for the task of artist classification.

Metric / Feature   GIST    Classemes   Picodes   CNN     Dimension
Baseline           17.58   45.29       45.82     20.38   512
Boost              25.65   57.76       55.50     29.65   512
ITML               19.95   51.79       53.93     31.04   512
LMNN               20.41   53.99       53.92     30.92   100
MLKR               21.22   49.61       19.54     21.77   512
NCA                18.80   53.70       53.81     22.26   23

Table 5: Classification performance (%) for the metric fusion methodology.
Task / Feature   GIST    Classemes   Picodes   CNN
Style            20.21   37.33       33.27     21.99
Genre            35.94   58.29       56.09     47.05
Artist           30.37   59.37       55.65     33.62

Table 6: Classification performance (%) for the feature fusion methodology.

Task / Metric   Boost   ITML    LMNN    MLKR    NCA
Style           41.74   45.05   45.97   38.91   40.61
Genre           58.51   60.28   58.48   55.79   54.82
Artist          61.24   60.46   63.06   53.19   55.83

Table 7: Annotation of the paintings in Figure 8. Each row corresponds to one pair of images, labeled with the name of the painting, its style, and its artist. The first six rows correspond to the six pairs on the left in Figure 8, and the next six rows correspond to the pairs on the right.

Babak Saleh is a PhD candidate in the Department of Computer Science at Rutgers University, where he conducts research at the intersection of computer vision, machine learning, and human perception. Inspired by human visual perception, he has developed computational models for measuring the typicality of an image and for its application in learning more robust visual classifiers. He holds an MS in Computer Science and a second MS in Statistics from Rutgers University. He completed his undergraduate studies in Computer Science and Mathematics at Sharif University of Technology in Tehran, Iran. He is the recipient of an outstanding student paper award from AAAI 2016 and an NSF I-Corps award. His research has been covered by major media and press outlets, including NBC News, PBS, the New York Times, the Washington Post, WIRED, Fast Company, and IEEE MultiMedia.

Correspondence e-mail: babaks@cs.rutgers.edu

Dr. Ahmed Elgammal is an associate professor at the Department of Computer Science, Rutgers, the State University of New Jersey. He is a member of the Center for Computational Biomedicine Imaging and Modeling (CBIM) at Rutgers, an affiliate member of the Rutgers University Center for Cognitive Science (RUCCS), and the director of the Art and Artificial Intelligence Lab and the Human Motion Analysis Lab (HuMAn Lab) at Rutgers.

Correspondence e-mail: elgammal@cs.rutgers.edu