Large-scale Classification of Fine-Art Paintings: Learning The Right Metric on The Right Feature

Babak Saleh, Ahmed Elgammal

Abstract: In the past few years, the number of fine-art collections that are digitized and publicly available has been growing rapidly. With the availability of such large collections of digitized artworks comes the need to develop multimedia systems to archive and retrieve this pool of data. Measuring the visual similarity between artistic items is an essential step for such multimedia systems and can benefit higher-level multimedia tasks. In order to model this similarity between paintings, we need to extract the appropriate visual features for paintings and find the best approach for learning a similarity metric based on these features. We investigate a comprehensive list of visual features and metric learning approaches to learn an optimized similarity measure between paintings. We develop a machine that is able to make aesthetic-related semantic-level judgments, such as predicting a painting's style, genre, and artist, as well as providing similarity measures optimized based on the knowledge available in the domain of art historical interpretation. Our experiments show the value of using this similarity measure for the aforementioned prediction tasks.

Keywords: similarity metric, visual features, metric learning, convolutional neural networks, style, genre, artist

Figure 1: Illustration of our system for the classification of fine-art paintings. We investigated a variety of visual features and metric learning approaches to recognize the style, genre, and artist of a painting.

1 Introduction

In the past few years, the number of fine-art collections that are digitized and publicly available has been growing rapidly. Such collections span classical, modern, and contemporary artworks. With the availability of such large collections of digitized artworks comes the need to develop multimedia systems to archive and retrieve this pool of data. Typically, these collections, in particular early modern ones, come with metadata in the form of annotations by art historians and curators, including information about each painting's artist, style, date, genre, etc. For online galleries displaying contemporary artwork, there is a need to develop automated recommendation systems that can retrieve "similar" paintings that the user might like to buy. This highlights the need to investigate metrics of visual similarity among digitized paintings that are optimized for the domain of painting.

The field of computer vision has made significant leaps in getting digital systems to recognize and categorize objects and scenes in images and videos. These advances have been driven by a widespread need for the technology, since cameras are everywhere now. However, a person looking at a painting can make sophisticated inferences beyond just recognizing a tree, a chair, or the figure of Christ. Even individuals without specific art historical training can make assumptions about a painting's genre (portrait or landscape), its style (impressionist or abstract), in what century it was created, the artists who likely created the work, and so on. Obviously, the accuracy of such assumptions depends on the viewer's level of knowledge and exposure to art history.
Learning and judging such complex visual concepts is an impressive ability of human perception [2]. The ultimate goal of our research is to develop a machine that is able to make aesthetic-related semantic-level judgments, such as predicting a painting's style, genre, and artist, as well as providing similarity measures optimized based on the knowledge available in the domain of art historical interpretation. Immediate questions that arise include, but are not limited to: What visual features should be used to encode information in images of paintings? How does one weigh different visual features to achieve a useful similarity measure? What type of art historical knowledge should be used to optimize such similarity measures? In this paper we address these questions and aim to provide answers that can benefit researchers in the area of computer-based analysis of art. Our work is based on a systematic methodology and a comprehensive evaluation on one of the largest available digitized art datasets.

Artists use different concepts to describe paintings. In particular, stylistic elements, such as space, texture, form, shape, color, tone, and line, are used. Other principles include movement, unity, harmony, variety, balance, contrast, proportion, and pattern. To these might be added physical attributes, like brush strokes, as well as subject matter and other descriptive concepts [13]. For the task of computer analysis of art, researchers have engineered and investigated various visual features that encode some of these artistic concepts, in particular brush strokes and color, which are encoded as low-level features such as texture statistics and color histograms (e.g. [19, 20]). Color and texture are highly prone to variations during the digitization of paintings; color is also affected by a painting's age. The effect of digitization on the computational analysis of paintings is investigated in great depth by Polatkan et al. [24]. This highlights the need to carefully design visual features that are suitable for the analysis of paintings.

Clearly, it would be a cumbersome process to engineer visual features that encode all the aforementioned artistic concepts. Recent advances in computer vision, using deep neural networks, have shown the advantage of "learning" features from data instead of engineering them. However, it would also be impractical to learn visual features that encode such artistic concepts, since that would require extensive annotation of these concepts in each image within a large training and testing dataset. Obtaining such annotations requires expertise in the field of art history that cannot be expected of typical crowd-sourcing annotators.

Given the aforementioned challenges to engineering or learning suitable visual features for paintings, in this paper we follow an alternative strategy. We investigate different state-of-the-art visual features, ranging from low-level to semantic-level, and then use metric learning to achieve similarity metrics between paintings that are optimized for specific prediction tasks, namely style, genre, and artist classification. We chose these tasks to optimize and evaluate the metrics since, ultimately, the goal of any art recommendation system would be to retrieve artworks that are similar along the directions of these high-level semantic concepts.
Moreover, annotations for these tasks are widely available and more often agreed upon by art historians and critics, which facilitates training and testing the metrics. In this paper we investigate a large space of visual features and learning methodologies for the aforementioned prediction tasks. We propose and compare three learning methodologies to optimize such tasks. We present the results of a comprehensive comparative study that spans four state-of-the-art visual features, five metric learning approaches, and the proposed three learning methodologies, evaluated on the aforementioned three artistic prediction tasks.

2 Related Work

On the subject of painting, computers have been used for a diverse set of tasks. Traditionally, image processing techniques have been used to provide art historians with quantification tools, such as pigmentation analysis, statistical quantification of brush strokes, etc. We refer the reader to [28, 5] for comprehensive surveys on this subject.

Several studies have addressed the question of which features should be used to encode information in paintings. Most of the research concerning the classification of paintings utilizes low-level features encoding color, shadow, texture, and edges. For example, Lombardi [20] presented a study of the performance of these types of features for the task of artist classification among a small set of artists, using several supervised and unsupervised learning methodologies. In that paper the style of the painting was identified as a result of recognizing the artist.

Since brushstrokes provide a signature that can help identify the artist, designing visual features that encode brushstrokes has been widely adopted (e.g. [25, 18, 22, 15, 6, 19]). Typically, texture statistics are used for that purpose. However, as mentioned earlier, texture features are highly affected by the digitization resolution. Researchers have also investigated the use of features based on local edge orientation histograms, such as SIFT [21] and HOG [10]. For example, [12] used SIFT features within a bag-of-words pipeline to discriminate among a set of eight artists. Arora et al. [3] presented a comparative study for the task of style classification, which evaluated low-level features, such as SIFT and Color SIFT [1], against semantic-level features, namely Classemes [29], which encode object presence in the image. It was found that semantic-level features significantly outperform low-level features for this task. However, the evaluation was conducted on a small dataset of 7 styles, with 70 paintings in each style. Carneiro et al. [9] also concluded that low-level texture and color features are not effective because of the inconsistent color and texture patterns that describe the visual classes in paintings.

More recently, Saleh et al. [26] used metric learning approaches for finding influence paths between painters based on their paintings. They evaluated three metric learning approaches to optimize a metric over low-level HOG features. In contrast to that work, the evaluation presented in this paper is much wider in scope, since we address three tasks (style, genre, and artist prediction), we cover features spanning from low-level to semantic-level, and we evaluate five metric learning approaches. Moreover, the dataset of [26] has only 1,710 images from 66 artists, while we conducted our experiments on 81,449 images painted by 1,119 artists.
Bar et al. [4] proposed an approach for style classification based on features obtained from a convolutional neural network pre-trained on an image categorization task. In contrast, we show that we can achieve better results with much lower-dimensional features that are directly optimized for style and genre classification. Lower dimensionality of the features is preferred for indexing large image collections.

3 Methodology

In this section we explain the methodology that we follow to find the most appropriate combination of visual features and metrics to produce accurate similarity measurements. We design these measurements to mimic the art historian's ability to categorize paintings based on their style, genre, and the artist who made them. In the first step, we extract visual features from the image. These visual features range from low-level (e.g. edges) to high-level (e.g. objects in the painting). More importantly, in the next step we learn how to adjust these features for different classification tasks by learning the appropriate metrics. Given the learned metric, we are able to project paintings from a high-dimensional space of raw visual information into a meaningful space of much lower dimensionality. Additionally, learning a classifier in this low-dimensional space can easily be scaled up to large collections. In the rest of this section, we first introduce our collection of fine-art paintings and explain the tasks that we target in this work. Then we explore the methodologies that we consider for finding the most accurate system for the aforementioned tasks. Finally, we explain the different types of visual features that we use to represent images of paintings and discuss the metric learning approaches that we applied to find a proper notion of similarity between paintings.

3.1 Dataset and Proposed Tasks

To gather our collection of fine-art paintings, we used the publicly available "Wikiart paintings" dataset, which, to the best of our knowledge, is the largest online public collection of digitized artworks. This collection has images of 81,449 fine-art paintings from 1,119 artists, spanning fifteen centuries of art up to contemporary artists. These paintings come from 27 different styles (Abstract, Byzantine, Baroque, etc.) and 45 different genres (Interior, Landscape, etc.). Previous work [26, 9] used different resources and assembled smaller collections with limited variability in terms of style, genre, and artists. The work of [4] is the closest to ours in terms of the data collection procedure, but the number of images in their collection is half of ours.

We target the automatic classification of paintings based on their style, genre, and artist, using visual features that are automatically extracted with computer vision algorithms. Each of these tasks has its own challenges and limitations. For example, there are large variations in visual appearance among paintings of one specific style, whereas this variation is much more limited for paintings by one artist. These larger intra-class variations suggest that style classification based on visual features is more challenging than artist classification.

For each of the tasks we selected a subset of the data that ensures enough samples for training and testing. In particular, for style classification we use a subset of the data with 27 styles, where each style has at least 1,500 paintings, with no restriction on genre or artist, for a total of 78,449 images. For genre classification we use a subset with 10 genre classes, where each genre has at least 1,500 paintings, with no restriction on style or artist, for a total of 63,691 images. Similarly, for artist classification we use a subset of 23 artists, where each of them has at least 500 paintings, for a total of 18,599 images. Table 1 lists the sets of style, genre, and artist labels. A sketch of this subset selection is shown below.
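Concretely, this subset selection is a frequency filter over the collection's metadata. The following minimal Python sketch illustrates it; the file name and column names are hypothetical placeholders, not part of the Wikiart dataset itself.

```python
# Hypothetical sketch of the per-task subset selection: keep only labels that
# have enough paintings for training and testing.
import pandas as pd

MIN_PAINTINGS = 1500  # threshold used for the style and genre subsets

meta = pd.read_csv("wikiart_metadata.csv")       # hypothetical metadata file
counts = meta["style"].value_counts()            # paintings per style label
kept = counts[counts >= MIN_PAINTINGS].index     # 27 styles pass in our data
style_subset = meta[meta["style"].isin(kept)]    # 78,449 paintings remain
print(f"{len(style_subset)} paintings across {len(kept)} styles")
```

The same filter applied to a "genre" column (threshold 1,500) and an "artist" column (threshold 500) would yield the other two subsets.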
3.2 Classification Methodology

To classify paintings based on their style, genre, or artist, we followed three methodologies.

Metric Learning: First, as depicted in Figure 1, we extract visual features from images of paintings. For each of the prediction tasks, we learn a similarity metric optimized for it, i.e. a style-optimized metric, a genre-optimized metric, and an artist-optimized metric. Each metric induces a projection to a corresponding feature space optimized for the corresponding task. Having learned the metric, we project the raw visual features into the new optimized feature space and learn classifiers for the corresponding prediction task. For that purpose, we learn a set of one-vs-all SVM classifiers for each of the labels in Table 1 for each of the tasks (a code sketch of this pipeline is given after Figure 3). While this first strategy focuses on classification based on combinations of one metric and one visual feature, the next two methodologies fuse different features or different metrics.

Feature fusion: The second methodology that we used for classification is depicted in Figure 2. In this case, we extract different types of visual features (four types, as explained below). Based on the prediction task (e.g. style), we learn the metric for each type of feature as before. After projecting these features separately, we concatenate them to form the final feature vector. Classification is then based on training classifiers on these final features. This feature fusion is important, as we want to capture different types of visual information by using different types of features. Concatenating all raw features and learning a metric on top of this huge feature vector would be computationally intractable; instead, we learn a metric on each feature type separately and, after projecting the features with these metrics, concatenate them for classification purposes.

Figure 2: Illustration of our second methodology – Feature Fusion.

Metric fusion: The third methodology (Figure 3) projects each visual feature using multiple metrics (in our experiments we used five metrics, as explained below) and then fuses the resulting optimized feature spaces to obtain a final feature vector for classification. This is an important strategy, because each of the metric learning approaches uses a different criterion to learn the similarity measurement. By learning all metrics individually (on the same type of feature), we make sure that we take all criteria into account (e.g. information theory along with neighborhood analysis).

Figure 3: Illustration of our third methodology – Metric Fusion.
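As an illustration of the first methodology, the sketch below chains a learned metric and one-vs-all SVM classifiers, using the open-source metric-learn and scikit-learn packages as stand-ins for our implementation; X and y are placeholders for a raw feature matrix and the per-painting task labels.

```python
# Minimal sketch of the metric learning methodology: learn a task-optimized
# metric, project the raw features, then train one-vs-all SVMs in that space.
import numpy as np
from metric_learn import LMNN                  # one of the five metrics we study
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X = np.load("features.npy")                    # placeholder: raw visual features
y = np.load("style_labels.npy")                # placeholder: e.g. style labels

metric = LMNN()                                # style-optimized Mahalanobis metric
metric.fit(X, y)
X_proj = metric.transform(X)                   # projection induced by the metric

clf = OneVsRestClassifier(LinearSVC(C=10))     # one one-vs-all SVM per label
clf.fit(X_proj, y)
```

The same pipeline applies to the genre- and artist-optimized metrics by swapping the label array.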
3.3 Visual Features

Visual features in the computer vision literature are either engineered and extracted in an unsupervised way (e.g. HOG, GIST) or learned by optimizing a specific task, typically the categorization of objects or scenes (e.g. CNN-based features). This results in high-dimensional feature vectors whose dimensions do not necessarily correspond to nameable (semantic-level) characteristics of an image. Based on this interpretability, visual features can be categorized into low-level and high-level. Low-level features are visual descriptors for which there is no explicit meaning attached to each dimension, while high-level visual features are designed to capture certain notions (usually objects). For this work, we investigated state-of-the-art representatives of these two categories.

Low-level Features, GIST: Human observers can rapidly capture the "gist" of a scene in a quick feed-forward sweep. Therefore, a computational model of "gist" is a reasonably essential tool for rapid scene classification. Gist has been modelled as average pooling of low-level biologically-inspired features (i.e. Gabor-like features) over non-overlapping subregions arranged on a fixed grid. The term "spatial envelope" has also been used to refer to this very low-dimensional representation of the scene [23]. Indeed, the gist model bypasses the procedures that are usually applied in scene classification, such as segmentation and the processing of individual objects. The dominant spatial structure of a scene is represented in a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness), which the gist model estimates using spectral and coarsely localized information. To calculate the gist features, each image is divided into 16 bins, and oriented Gabor filters (in 8 orientations) are applied over different scales (4 scales) in each bin. Finally, the average filter energy in each bin is calculated [24]. We followed this procedure and extracted a 512-dimensional GIST feature vector for each image; the sketch below illustrates the computation.
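For concreteness, the following schematic Python sketch mirrors the gist computation described above (4 scales x 8 orientations x 16 grid cells = 512 dimensions). It is not the reference implementation of Oliva and Torralba; the working image size and the frequency schedule are illustrative assumptions.

```python
# Schematic GIST: average Gabor filter energy over a 4x4 grid of non-overlapping
# subregions, for 4 scales and 8 orientations (4 * 8 * 16 = 512 dimensions).
import numpy as np
from skimage.filters import gabor
from skimage.transform import resize

def gist_descriptor(gray, n_scales=4, n_orients=8, grid=4):
    img = resize(gray, (128, 128), anti_aliasing=True)  # assumed working size
    feats = []
    for s in range(n_scales):
        freq = 0.25 / (2 ** s)               # assumed coarse-to-fine schedule
        for o in range(n_orients):
            theta = np.pi * o / n_orients    # filter orientation
            real, imag = gabor(img, frequency=freq, theta=theta)
            energy = np.hypot(real, imag)    # Gabor filter energy
            h, w = energy.shape[0] // grid, energy.shape[1] // grid
            for i in range(grid):            # average pooling per grid cell
                for j in range(grid):
                    feats.append(energy[i*h:(i+1)*h, j*w:(j+1)*w].mean())
    return np.asarray(feats)                 # 512-dimensional descriptor
```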
Learned Semantic-level Features: For the purpose of semantic representation of the images, we extracted three object-based representations: Classemes [29], Picodes [8], and CNN-based features [16]. In all three, each element of the feature vector represents the confidence of the presence of an object category in the image, so they provide a semantic encoding of the image. The list of object categories is user-specified and does not cover all object categories in the real world. Despite the limited number of categories in this type of modeling, these semantic encodings of images have shown remarkable results for the task of image search. However, when these features are learned, the object categories are generic, geared mostly toward realistic photographs, and not specifically designed for the purposes of art.

The first two features are designed to capture the presence of a set of basic-level object categories as follows: a list of entry-level categories (e.g. horse and cross) is used to download a large collection of images from the web. For each image a comprehensive set of low-level visual features is extracted, and one classifier is learned per category. For a given test image, these classifiers are applied to the image and their responses (confidences) form the final feature vector. We followed the implementation of [7] and extracted, for each image, a 2,659-dimensional real-valued Classeme feature vector and a 2,048-dimensional binary-valued Picodes feature vector.

Convolutional Neural Networks (CNN) [17] have shown remarkable performance for the task of large-scale image categorization [16]. The CNN of [16] has five convolutional layers followed by three fully connected layers. Bar et al. [4] showed that a combination of the outputs of these fully connected layers achieves superior performance for the task of style classification of paintings. Following this observation, we used the last layer of a pre-trained CNN [16] (1,000-dimensional real-valued vectors) as another feature vector.

3.4 Metric Learning

The aforementioned visual features are designed for photographic images; we should therefore tune them to perform reasonably on paintings as well. We use these features for classification tasks on fine-art paintings, which amounts to placing similar paintings close to each other. To measure similarity, we apply a set of metric learning approaches and identify the most suitable one. Metric learning is an active research area in the field of machine learning, and we encourage interested readers to consult the surveys on this topic. Formally, metric learning is the problem of finding a real-valued function that assigns a score to each pair of inputs, where a smaller score indicates less difference, i.e. higher similarity. For this paper, we consider the following metric learning approaches.

Neighborhood Component Analysis (NCA) [14]: This approach focuses on analyzing the nearest neighbors. The analysis is mainly based on placing neighbors of the same class (e.g. painting style in our study) close to each other.

Large Margin Nearest Neighbors (LMNN): LMNN [32] is an approach for learning a Mahalanobis distance, widely used because of its globally optimal solution and superior performance in practice. Learning this metric involves a set of constraints, all of which are defined locally: LMNN enforces that the k nearest neighbors of any training instance belong to the same class (these instances are called "target neighbors"), while all instances of other classes, referred to as "impostors", should be far from this point. For finding the target neighbors, the Euclidean distance is applied to each pair of samples. This metric learning approach is related in principle to Support Vector Machines (SVM), which theoretically motivates its use together with SVMs for the task of classification. Due to the popularity of LMNN, different variations of it have been introduced, including a non-linear version called gb-LMNN [32], which we used in our experiments as well. However, its performance on our classification tasks was worse than that of linear LMNN. We assume this poor performance is rooted in the nature of the visual features that we extract for paintings.

Boost Metric [27]: The idea behind this approach follows this intuition: instead of learning a universal metric that works best on all the data, it may be better to learn and combine a set of weaker metrics that are not universal but perform reasonably on a subset of the data. Shen et al. [27] exploit this fact and, instead of learning a metric directly, find a set of metrics that can be combined into the final metric. They treat each of these matrices as a weak learner, in the sense used in the literature on boosting methods. The resulting algorithm applies the idea of AdaBoost to the Mahalanobis distance and has been shown to be quite efficient in practice. This method is of particular interest to us because we can learn an individual metric for each style of painting and finally merge these metrics into a unique final metric. In principle, the final metric can then also capture similarities inside each style/genre of paintings.
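All of the Mahalanobis-based approaches here (LMNN and BoostMetric above, ITML below) learn a positive semidefinite matrix M. The following standard formulation, with the weak-learner decomposition of [27] in the last line, makes this shared machinery explicit:

```latex
d_M(x_i, x_j) = \sqrt{(x_i - x_j)^{\top} M \,(x_i - x_j)}, \qquad M \succeq 0 .
% Factoring M = L^{\top} L turns the metric into a linear projection:
d_M(x_i, x_j) = \lVert L x_i - L x_j \rVert_2 , \qquad x \mapsto L x .
% BoostMetric combines rank-one "weak" metrics with nonnegative weights:
M = \sum_{j} w_j \, Z_j , \qquad w_j \ge 0 , \quad \operatorname{rank}(Z_j) = 1 .
```

Learning M is thus equivalent to learning the linear projection x -> Lx that maps raw features into the task-optimized spaces of Section 3.2.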
Information-Theoretic Metric Learning (ITML) [11]: This metric learning algorithm is grounded in information theory rather than purely numerical distances. In other words, the learning of this metric is rooted in entropy measures and probabilistic models.

Metric Learning for Kernel Regression (MLKR): This approach is similar in spirit to NCA, which minimizes the classification error. Weinberger and Tesauro [31] learn a metric by optimizing the leave-one-out error for the task of kernel regression. In kernel regression, proper distances between points are essential, as they are used to weight the training samples; MLKR learns this distance by minimizing the leave-one-out regression error on the training data. Although this metric learning method is designed for kernel regression, the resulting distance function can be used in a variety of tasks.

4 Experiments

4.1 Experimental Setting

Visual Features: As explained in Section 3, we extract GIST features as low-level visual features, and Classemes, Picodes, and CNN-based features as high-level semantic features. We followed the original implementation of Oliva and Torralba [23] to obtain a 512-dimensional GIST feature vector. For Classemes and Picodes we used the implementation of Bergamo et al. [7], resulting in 2,659-dimensional Classemes features and 2,048-dimensional Picodes features. We used the implementation of Vedaldi and Lenc [30] to extract 1,000-dimensional feature vectors from the last layer of the CNN.

Object-based representations of the images produce feature vectors of much higher dimensionality than GIST descriptors. For the sake of a fair comparison of all feature types in metric learning, we transformed all feature vectors to the same size as GIST (512 dimensions). We did this by applying Principal Component Analysis (PCA) to each feature type and projecting the original features onto the first 512 eigenvectors (those with the largest eigenvalues). To verify the quality of this projection, we examined the eigenvalue coefficients of the PCA projections. Independent of the feature type, the value of these coefficients drops significantly after the first 500 eigenvectors. For example, Figure 4 plots these coefficients of the PCA projection for CNN features; the sum of the first 500 coefficients amounts to 95.88% of the total. This shows that our projections (onto 512 eigenvectors) capture the true underlying space of the original features. Using these reduced features speeds up the metric learning process as well; the sketch below illustrates the reduction.

Figure 4: PCA coefficients for CNN features.
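The reduction to 512 dimensions can be reproduced with any standard PCA implementation; the sketch below assumes scikit-learn, uses a placeholder feature file, and performs a variance check analogous to the coefficient analysis of Figure 4.

```python
# Minimal sketch of the dimensionality reduction: project each raw feature type
# onto its first 512 principal components to match the GIST dimensionality.
import numpy as np
from sklearn.decomposition import PCA

X = np.load("classeme_features.npy")   # placeholder: e.g. 2,659-d Classemes
pca = PCA(n_components=512)
X_512 = pca.fit_transform(X)           # samples projected onto top eigenvectors

# analogue of the Figure 4 check: dominance of the first 500 components
print(pca.explained_variance_ratio_[:500].sum())
```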
Metric Learning: We used the implementation of [32] to learn the LMNN metric (both the linear and non-linear versions) and MLKR. For BoostMetric we slightly adjusted the implementation of [27]. For NCA we adapted the implementation by Fowlkes to work smoothly on large-scale feature vectors. For ITML, we followed the authors' original implementation with the default settings. For the remaining methods, parameters were chosen through a grid search that minimizes the nearest-neighbor classification error.

Regarding training time, learning the ITML metric was the fastest, and learning NCA and LMNN were the slowest. Due to computational constraints, we set the parameters of the LMNN metric to reduce the size of the features to 100. The NCA metric reduces the dimensionality of the features to the number of categories for each task: 27 for style classification, 23 for artist classification, and 10 for genre classification. We randomly picked 3,000 samples for metric learning; these samples follow the same distribution as the original data and are not used in the classification experiments.

4.2 Classification Experiments

For the purpose of metric learning, we conducted experiments with labels for the three tasks of style, genre, and artist prediction. In the following sections, we investigate the performance of these metrics on different features for the classification of the aforementioned concepts. We learned all the metrics of Section 3 for all 27 styles of paintings in our dataset (e.g. Expressionism, Realism, etc.). However, we did not use all the genres for learning metrics: our dataset contains 45 genres, some of which have fewer than 20 images. This makes metric learning impractical and highly biased toward genres with a larger number of paintings. Because of this issue, we focus on the 10 genres with more than 1,500 paintings each; these genres are listed in Table 1. In all experiments we conducted 3-fold cross validation and report the average accuracy over all partitions. We found the best value for the penalty term of the SVM (C = 10) by three-fold cross validation. In the next three sections, we explain the settings and findings for each task independently.

Style Classification: Table 2 contains the results (accuracy percentage) of style classification (SVM) after applying different metrics to the set of features. Columns correspond to different features, and rows to the different metrics used to project the features before learning the style classifiers. To quantify the improvement gained by learning similarity metrics, we conducted a baseline experiment (first row of the table): for each type of feature, we learn a set of one-vs-all classifiers on the raw feature vectors. Generally, the Boost metric and ITML approaches give the highest accuracy for the task of style classification across the different visual features. However, the greatest improvement over the baseline is obtained by applying the Boost metric to Classeme features. The evaluation protocol is sketched below.
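The evaluation protocol can be summarized in a few lines; this sketch again assumes scikit-learn, with placeholder files for the metric-projected features and labels.

```python
# Minimal sketch of the evaluation: 3-fold cross validation of one-vs-all SVMs
# (penalty C = 10), followed by the confusion matrix analyzed in Figure 5.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, confusion_matrix

X_proj = np.load("projected_features.npy")   # placeholder: projected features
y = np.load("style_labels.npy")              # placeholder: ground-truth labels

clf = OneVsRestClassifier(LinearSVC(C=10))
y_pred = cross_val_predict(clf, X_proj, y, cv=3)   # 3-fold cross validation
print("mean accuracy:", accuracy_score(y, y_pred))
cm = confusion_matrix(y, y_pred)             # rows: true labels, cols: predicted
```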
We visualized the confusion matrix for the task of style classification when learning the Boost metric on Classeme features. Figure 5 shows this matrix, where red represents higher values. Further analysis of some of the confusions captured in this matrix yields interesting findings, a few of which we discuss here. First, there is considerable confusion between "Abstract Expressionism" (first row) and "Action painting" (second column). Art historians confirm that this confusion is meaningful and somewhat expected: "Action painting" is a type or subgenre of "Abstract Expressionism", characterized by paintings created through a much more active process: drips, flung paint, stepping on the canvas.

Figure 5: Confusion matrix for style classification (best seen in color).

Another confusion occurs between "Expressionism" (column 10) and "Fauvism" (row 11), which is indeed expected based on the art history literature. "Mannerism" (row 14) is a style of art from the (late) "Renaissance" (column 12); Mannerist works show unusual effects of scale and are less naturalistic than those of the "Early Renaissance". This similarity between "Mannerism" (row 14) and "Renaissance" (column 12) is captured by our system as well, where it results in confusion during style classification. "Minimalism" (column 15) and "Color Field painting" (6th row) are mostly confused with each other; this finding is understandable when one looks at members of these styles and notes their similarity in terms of simple forms and the distribution of colors. Lastly, some of the confusions are completely acceptable given the origins of these styles (art movements) as noted in the art history literature, for example "Renaissance" (column 18) and "Early Renaissance" (row 9); "Post Impressionism" (column 21) and "Impressionism" (row 13); "Cubism" (8th row) and "Synthetic Cubism" (column 26). Synthetic Cubism is the later phase of Cubism, with more color and continued use of collage and pasted papers, but less linear perspective.

Genre Classification: We narrowed down the list of all genres in our dataset (45 in total) to obtain a reasonable number of samples per genre (the 10 selected genres are listed in Table 1). We trained ten one-vs-all SVM classifiers and compare their performance in Table 3. In this table, columns represent different features and rows the different metrics used to compute the distances. As Table 3 shows, we achieved the best performance for genre classification by learning the Boost metric on top of Classeme features. Generally, the performance of these classifiers is better than that of the classifiers trained for style classification. This is expected, as the number of genres is smaller than the number of styles in our dataset.

Figure 6: Confusion matrix for genre classification (best seen in color).

Figure 6 shows the confusion matrix for genre classification when learning the Boost metric on Classeme features. Investigating the confusions in this matrix reveals interesting results. For example, our system confuses "Landscape" (5th row) with "Cityscape" (2nd column) and "Genre painting" (3rd column). This confusion is expected, as art historians can find common elements in these genres. On the one hand, "Landscape" paintings usually show rivers, mountains, and valleys, with no significant figures, and are frequently very similar to "Genre paintings", which capture daily life; the difference lies in the fact that, unlike "Genre paintings", "Landscape" paintings are idealized. On the other hand, "Landscape" and "Cityscape" paintings are very similar, as both depict open space and use realistic color tonalities.

Figure 7: Confusion matrix for artist classification (best seen in color).
Artist Classification: For the task of artist classification, we trained one-vs-all SVM classifiers for each of the 23 artists. For each test image, we determine its artist by finding the classifier that produces the maximum confidence. Table 4 shows the performance of different combinations of features and metrics for this task. In general, learning the Boost metric improves artist classification more than all other metrics, except in the case of CNN features, where learning the ITML metric gives the best performance.

We plotted the confusion matrix of this classification task in Figure 7. In this plot, some confusions between artists are clearly reasonable, and we investigated two cases. First, "Claude Monet" (5th row) and "Camille Pissarro" (3rd column): both are Impressionist artists who lived in the late nineteenth and early twentieth centuries. Interestingly, according to the art history literature, Monet and Pissarro became friends when they both attended the "Académie Suisse" in Paris; this friendship lasted a long time and resulted in some noticeable interactions between them. Second, paintings by "Childe Hassam" (4th row) are mostly confused with ones by "Monet" (5th column). This confusion is acceptable, as Hassam was an American Impressionist who declared himself influenced by the French Impressionists. Hassam called himself an "Extreme Impressionist" and painted some flag-themed artworks similar to Monet's.

Looking at the performances reported in Tables 2-4, we conclude that all three classification tasks benefit from learning the appropriate metric. This means that we can improve on the baseline classification accuracy by learning metrics, independent of the type of visual feature or of the concept by which we classify the paintings. The experimental results show that, independent of the task, the NCA and MLKR approaches perform worse than the other metrics, while the Boost metric always gives the best or second-best results for all classification tasks. Regarding the importance of the features, we can verify that Classeme and Picodes features are the better image representations for classification purposes. Based on these classification experiments, we claim that Classemes and Picodes perform better than CNN features. This is rooted in the fact that the amount of supervision used in training Classemes and Picodes is greater than in CNN training. Also, unlike Classemes and Picodes, the CNN feature is designed to categorize the object inside a given bounding box; in the case of paintings, however, we cannot assume that bounding boxes around the objects are given.

Integration of Features and Metrics: So far we have investigated the performance of different metric learning approaches and visual features individually. In the next step, we determine the best performance for the aforementioned classification tasks by combining different visual features. Toward this goal, we followed two strategies (a sketch of the first is given below). First, for a given metric, we project each type of visual feature with that metric and concatenate the projected features. Second, we fix the type of visual feature, project it with different metrics, and concatenate these projections. Given these larger feature vectors (from either strategy), we train SVM classifiers for the three tasks of style, genre, and artist classification. Table 6 shows the results of the experiments following the former strategy, and Table 5 shows the results of the latter.
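The first fusion strategy can be sketched as follows, once more with metric-learn and scikit-learn as stand-ins for our implementation; the four feature files are hypothetical placeholders for the raw GIST, Classemes, Picodes, and CNN matrices.

```python
# Minimal sketch of feature fusion: fix one metric learner (here LMNN), project
# each feature type with its own learned metric, then concatenate the results.
import numpy as np
from metric_learn import LMNN
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

y = np.load("style_labels.npy")               # placeholder labels
feature_files = ["gist.npy", "classemes.npy", "picodes.npy", "cnn.npy"]

projected = []
for path in feature_files:                    # one metric per feature type
    X = np.load(path)
    metric = LMNN().fit(X, y)
    projected.append(metric.transform(X))

X_fused = np.hstack(projected)                # e.g. four 100-d blocks -> 400-d
clf = OneVsRestClassifier(LinearSVC(C=10)).fit(X_fused, y)
```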
In general, we get better results by fixing the metric and concatenating the projected feature vectors (the first strategy). The work of Bar et al. [4] is the most similar to ours, and we compare the final results of these experiments with their reported performance. [4] only performed the task of style classification, on half of the images in our dataset, and achieved an accuracy of 43% using two variations of PiCoDes features and two layers of a CNN. We outperform their approach, achieving 45.97% accuracy for style classification when using the LMNN metric to project the GIST, Classeme, PiCoDes, and CNN features and concatenating them, as reported in the third column of Table 6.

Our contribution goes beyond outperforming the state-of-the-art, by learning a more compact feature representation. Our best performance for style classification is obtained by concatenating four 100-dimensional feature vectors, resulting in a 400-dimensional feature vector on top of which we train SVM classifiers. In contrast, [4] extract a 3,882-dimensional feature vector for their best reported performance. As a result, we not only outperform the state-of-the-art but also present a better image representation that reduces the required space by 90%. This efficient feature vector is an extremely useful image representation that attains the best classification accuracy, and we consider its application to the task of image retrieval as future work.

To qualitatively evaluate the extracted visual features and learned metrics, we built a prototype image search task. As feature fusion with the LMNN metric gives the best performance for style classification, we used this setting as our similarity measurement model. Figure 8 shows some sample outputs of this image search. For each pair, the image on the left is the query image, for which we find the closest match (image on the right) based on LMNN and feature fusion. However, we force the system to pick the closest match that does not belong to the same style as the query image. This verifies that although we learn the metric based on style labels, the learned projection can find similarity across styles. A sketch of this search is given below.
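The cross-style nearest-neighbor search behind Figure 8 can be sketched as follows; X_fused and styles are placeholders for the fused projected features and the per-image style labels.

```python
# Minimal sketch of the prototype image search: return the closest match to a
# query in the fused metric space, skipping images that share the query's style.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X_fused = np.load("fused_features.npy")   # placeholder: fused projected features
styles = np.load("style_labels.npy")      # placeholder: style label per image
q = 0                                     # index of the query painting

nn = NearestNeighbors(n_neighbors=50).fit(X_fused)
_, idx = nn.kneighbors(X_fused[q:q + 1])
# first neighbor with a different style (idx[0][0] is the query itself)
match = next(i for i in idx[0] if styles[i] != styles[q])
```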
5 Conclusion and Future Work

In this paper we investigated the applicability of metric learning approaches and the performance of different visual features for learning similarity in a collection of fine-art paintings. We implemented meaningful metrics for measuring the similarity between paintings. These metrics are learned in a supervised manner so as to place paintings of one concept close to each other and far from others; in this work we used three concepts: style, genre, and artist. We used these learned metrics to transform raw visual features into another space in which we can significantly improve the performance on the three important tasks of style, genre, and artist classification. We conducted our comparative experiments on the largest publicly available dataset of fine-art paintings. We conclude that:

1. Classeme features show superior performance for all three tasks of style, genre, and artist classification. This superior performance is independent of the type of metric that has been learned.

2. When working on an individual type of visual feature, the Boost metric and Information-Theoretic Metric Learning (ITML) approaches improve the accuracy of the classification tasks across all features.

3. When using different types of features together (feature fusion), Large-Margin Nearest-Neighbor (LMNN) metric learning achieves the best performance in all classification experiments.

4. By learning the LMNN metric on Classeme features, we find an optimized representation that not only outperforms the state-of-the-art for the task of style classification, but also reduces the size of the feature vector by 90%. We consider verifying the applicability of this representation to image retrieval and recommendation systems as future work.

As further future work, we would like to learn metrics based on other annotations (e.g. time period).

Bibliography

[1] A. E. Abdel-Hakim and A. A. Farag. CSIFT: A SIFT descriptor with color invariant characteristics. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2006.
[2] R. Arnheim. Visual Thinking. University of California Press, 1969.
[3] R. S. Arora and A. M. Elgammal. Towards automated classification of fine-art painting style: A comparative study. In ICPR, 2012.
[4] Y. Bar, N. Levy, and L. Wolf. Classification of artistic styles using binarized features derived from a deep neural network. 2014.
[5] A. Bentkowska-Kafel and J. Coddington. Computer Vision and Image Analysis of Art: Proceedings of the SPIE Electronic Imaging Symposium, San Jose Convention Center, 18-22 January 2010. Proceedings of SPIE, 2010.
[6] I. E. Berezhnoy, E. O. Postma, and H. J. van den Herik. Automatic extraction of brush stroke orientation from paintings. Machine Vision and Applications, 20(1):1–9, 2009.
[7] A. Bergamo and L. Torresani. Classemes and other classifier-based features for efficient object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[8] A. Bergamo, L. Torresani, and A. W. Fitzgibbon. Picodes: Learning a compact code for novel-category recognition. In Advances in Neural Information Processing Systems, pages 2088–2096, 2011.
[9] G. Carneiro, N. P. da Silva, A. D. Bue, and J. P. Costeira. Artistic image classification: An analysis on the PrintArt database. In ECCV, 2012.
[10] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In International Conference on Computer Vision & Pattern Recognition, volume 2, pages 886–893, June 2005.
[11] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, 2007.
[12] F. Shahbaz Khan, J. van de Weijer, and M. Vanrell. Who painted this painting?, 2010.
[13] L. Fichner-Rathus. Foundations of Art and Design. Clark Baxter, 2008.
[14] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004.
[15] C. R. Johnson, E. Hendriks, I. J. Berezhnoy, E. Brevdo, S. M. Hughes, I. Daubechies, J. Li, E. Postma, and J. Z. Wang. Image processing for artist identification. IEEE Signal Processing Magazine, 25(4):37–48, 2008.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[18] J. Li and J. Z. Wang. Studying digital imagery of ancient paintings by mixtures of stochastic models. IEEE Transactions on Image Processing, 13(3):340–353, 2004.
[19] J. Li, L. Yao, E. Hendriks, and J. Z. Wang. Rhythmic brushstrokes distinguish van Gogh from his contemporaries: Findings via automated brushstroke extraction. IEEE Trans. Pattern Anal. Mach. Intell., 2012.
[20] T. E. Lombardi. The classification of style in fine-art painting. ETD Collection for Pace University, Paper AAI3189084, 2005.
[21] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004.
[22] S. Lyu, D. Rockmore, and H. Farid. A digital technique for art authentication. Proceedings of the National Academy of Sciences of the United States of America, 101(49):17006–17010, 2004.
[23] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
[24] G. Polatkan, S. Jafarpour, A. Brasoveanu, S. Hughes, and I. Daubechies. Detection of forgery in paintings using supervised learning. In 16th IEEE International Conference on Image Processing (ICIP), 2009.
[25] R. Sablatnig, P. Kammerer, and E. Zolda. Hierarchical classification of paintings using face- and brush stroke models. 1998.
[26] B. Saleh, K. Abe, and A. Elgammal. Knowledge discovery of artistic influences: A metric learning approach. In ICCC, 2014.
[27] C. Shen, J. Kim, L. Wang, and A. van den Hengel. Positive semidefinite metric learning using boosting-like algorithms. Journal of Machine Learning Research, 13:1007–1036, 2012.
[28] D. G. Stork. Computer vision and computer graphics analysis of paintings and drawings: An introduction to the literature. In Computer Analysis of Images and Patterns, pages 9–24. Springer, 2009.
[29] L. Torresani, M. Szummer, and A. Fitzgibbon. Efficient object category recognition using classemes. In ECCV, 2010.
[30] A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. CoRR, abs/1412.4564, 2014.
[31] K. Weinberger and G. Tesauro. Metric learning for kernel regression. In Eleventh International Conference on Artificial Intelligence and Statistics, pages 608–615, 2007.
[32] K. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 2009.

Tables

Table 1: List of styles, genres, and artists in our collection of fine-art paintings. Numbers in parentheses are the indices of the corresponding rows/columns in the confusion matrices of Figures 5, 6, and 7.

Table 2: Accuracy (%) for the task of style classification.

Metric / Feature   GIST    Classemes   Picodes   CNN     Dimension
Baseline           10.83   22.62       20.76     12.32   512
Boost              16.07   31.77       28.58     15.18   512
ITML               13.02   30.67       28.42     15.34   512
LMNN               12.54   27.00       24.14     16.83   100
MLKR               12.65   24.12       14.86     12.63   512
NCA                13.29   28.19       24.84     16.37   27

Table 3: Accuracy (%) for the task of genre classification.

Metric / Feature   GIST    Classemes   Picodes   CNN     Dimension
Baseline           28.10   49.98       49.63     35.14   512
Boost              31.01   57.87       57.35     46.14   512
ITML               33.10   57.86       57.28     46.80   512
LMNN               39.06   54.96       54.42     49.98   100
MLKR               32.81   54.29       42.79     45.02   512
NCA                30.39   51.38       52.74     49.26   10

Table 4: Accuracy (%) for the task of artist classification.

Metric / Feature   GIST    Classemes   Picodes   CNN     Dimension
Baseline           17.58   45.29       45.82     20.38   512
Boost              25.65   57.76       55.50     29.65   512
ITML               19.95   51.79       53.93     31.04   512
LMNN               20.41   53.99       53.92     30.92   100
MLKR               21.22   49.61       19.54     21.77   512
NCA                18.80   53.70       53.81     22.26   23

Table 5: Classification performance (%) for the metric fusion methodology.
Task / Feature   GIST    Classemes   Picodes   CNN
Style            20.21   37.33       33.27     21.99
Genre            35.94   58.29       56.09     47.05
Artist           30.37   59.37       55.65     33.62

Table 6: Classification performance (%) for the feature fusion methodology.

Task / Metric   Boost   ITML    LMNN    MLKR    NCA
Style           41.74   45.05   45.97   38.91   40.61
Genre           58.51   60.28   58.48   55.79   54.82
Artist          61.24   60.46   63.06   53.19   55.83

Table 7: Annotation of the paintings in Figure 8. Each row corresponds to one pair of images, labeled with the name of the painting, its style, and its artist. The first six rows correspond to the six pairs on the left in Figure 8, and the next six rows correspond to the pairs on the right.

Babak Saleh is a PhD candidate in the Department of Computer Science at Rutgers University, where he conducts research at the intersection of computer vision, machine learning, and human perception. Inspired by human visual perception, he has developed computational models for measuring the typicality of an image and for its application in learning more robust visual classifiers. He holds an MS in Computer Science and a second MS in Statistics from Rutgers University. He completed his undergraduate studies in Computer Science and Mathematics at Sharif University of Technology in Tehran, Iran. He is the recipient of an outstanding student paper award from AAAI 2016 and an NSF I-Corps award. His research has been covered by major media and press outlets, including NBC News, PBS, the New York Times, the Washington Post, WIRED, Fast Company, and IEEE MultiMedia.

Correspondence e-mail: babaks@cs.rutgers.edu

Dr. Ahmed Elgammal is an associate professor at the Department of Computer Science, Rutgers, the State University of New Jersey. He is a member of the Center for Computational Biomedicine Imaging and Modeling (CBIM) at Rutgers, an affiliate member of the Rutgers University Center for Cognitive Science (RUCCS), and the director of the Art and Artificial Intelligence Lab and the Human Motion Analysis Lab (HuMAn Lab) at Rutgers.

Correspondence e-mail: elgammal@cs.rutgers.edu