Automatic Characterization of Music Complexity: A Multi-Faceted Approach

by Sebastian Streich

Submitted in partial fulfilment of the requirements for the degree of Diploma of Advanced Studies

Doctorate in Computer Science and Digital Communication
Department of Technology
Tutor: Dr. Xavier Serra

Universitat Pompeu Fabra
Barcelona, July 2005

“I think the next century will be the century of complexity.”
Stephen Hawking (January 2000)

Abstract

The aim of this work is to present the concept of a multi-faceted music complexity descriptor set. The complexity of music is one of the less intensively researched areas in music information retrieval. In particular, automated estimation based on the audio material itself has not been addressed by many researchers. However, it is not only a very interesting and challenging topic, it also allows for very practical and relevant applications in music information retrieval. After giving some background information about the motivation for this research, we will discuss examples of practical applications, such as collection visualization, playlist generation, and music recommendation. In a review of former work we will examine existing models for melodic, rhythmic, and harmonic complexity facets. Since these are not directly applicable to our needs, we will then give a set of operational definitions for the author’s approach to the problem. Preliminary results for rhythmic and acoustic complexity are reported and discussed. From these we sketch the steps for future work within and beyond the scope of a PhD thesis.

Acknowledgements

First of all I would like to thank Dr. Xavier Serra, my supervisor, for giving me the opportunity to work on this very interesting topic in his team. Also, and especially, I want to thank Perfecto Herrera for providing countless suggestions and constant support for my work in this research project. Further thanks go to my colleagues from Office 316, Beesuan Ong, Emilia Gómez, Fabien Gouyon, and Enric Guaus, for many fruitful discussions and important feedback. I also want to thank the other people – directly or indirectly – involved in the SIMAC research project for their appreciated cooperation. This research was funded by a scholarship from Universitat Pompeu Fabra and by the EU-FP6-IST-507142 project SIMAC. I am thankful for this support, which was a crucial economic basis for my research. Special thanks go to the nice people that are my family, whose support and encouragement was and is always very important to me.

Barcelona, July 13, 2005
Sebastian Streich

Contents

1 Introduction
  1.1 Background and Context
  1.2 Today’s Music Consumption Behaviour
    1.2.1 Music Tracks as Data Files
    1.2.2 Problems with Digital Collections
    1.2.3 Semantic Descriptors as a Perspective
  1.3 Thesis Outline
2 Working Thesis and Procedure
3 Music Complexity as a Semantic Descriptor
  3.1 What is Complexity?
    3.1.1 Complexity in Information Theory
    3.1.2 A Philosophical Point of View
    3.1.3 Perceptual Foundations of Music Complexity
  3.2 Applicability of Music Complexity
    3.2.1 Enhanced Visualization and Browsing
    3.2.2 Playlist Generation
    3.2.3 Song Retrieval
4 Review of Former Work
  4.1 The Preferred Level of Complexity
    4.1.1 Psychological Experiments
    4.1.2 Musicological Studies
    4.1.3 Conclusion
  4.2 Existing Models for Melodic Complexity
    4.2.1 The Implication-Realization Model
    4.2.2 Enhancements of the Model
    4.2.3 Conclusion
  4.3 Measures for Rhythmic Complexity
    4.3.1 The PS-Measure
    4.3.2 Danceability
    4.3.3 Conclusion
  4.4 Harmonic Complexity
    4.4.1 Preference Rule System
    4.4.2 Rewriting Rules
    4.4.3 Conclusion
  4.5 Other Related Work
    4.5.1 Information-based Complexity
    4.5.2 Complexity of Short Musical Excerpts
  4.6 Summary
5 Own Approaches
  5.1 Operational Definitions
    5.1.1 Acoustic Complexity
    5.1.2 Timbral Complexity
    5.1.3 Melodic Complexity
    5.1.4 Rhythmic Complexity
    5.1.5 Harmonic Complexity
    5.1.6 Structural Complexity
  5.2 Implementations
    5.2.1 The Dynamic Component of Acoustic Complexity
    5.2.2 Danceability
  5.3 Evaluation
    5.3.1 The Dataset
    5.3.2 Results
    5.3.3 Concluding Remarks
6 Future Work
  6.1 PhD Thesis
    6.1.1 Filling the Gaps
    6.1.2 Evaluation Strategies
    6.1.3 Efficiency Aspects
  6.2 Beyond the Horizon
Conclusion
List of Figures
Bibliography

Chapter 1

Introduction

This chapter contains some remarks on the background, context, and motivation of the research presented in this document and of the researcher presenting it. Some clarifications about the terminology are also given, while a concise demarcation of the topic under research follows in chapter 2.
1.1 Background and Context

The author obtained the German academic degree of Diplom-Ingenieur (graduate engineer) in Media Technology from Ilmenau Technical University in 2002. Already during the last years of his studies he specialized in the field of digital audio signal processing. In his student research project in 2000 the author worked in a group that ported an audio watermark decoder to a DSP platform. Later, during his internship at NOKIA Research Center in Tampere, Finland, in 2001, he had the opportunity to get in touch with audio codec technology [VSV+03, WS02]. He graduated with a thesis on the design of a speaker identification system for the Electronic Media Technology team of the Fraunhofer Gesellschaft.

After graduation, the author was employed by the Fraunhofer Gesellschaft as a research engineer. His work was centered on metadata extraction from digital audio with a focus on melody and chord estimation. In September 2003 he joined the Music Technology Group (MTG, http://www.iua.upf.es/mtg/) of Pompeu Fabra University in Barcelona, Spain, where he is enrolled in the PhD program in Computer Science and Digital Communication and holds a university scholarship. The MTG consists of around 40 researchers, mostly PhD candidates, who are working on music signal processing, music information retrieval, musician-computer interaction, and music software engineering tasks. Dr. Xavier Serra, the head and founder of the group, has also supervised the research presented here.

Another very important part of the research context is the EU-FP6-IST-507142 project SIMAC (Semantic Interaction with Music Audio Contents). The project was initiated by the MTG and started in January 2004. As project leader, the MTG is heavily involved in administration and research. The author is one of six researchers inside the MTG who, led by the research manager Perfecto Herrera, are working full-time for SIMAC. The project’s main goal is the development of prototypes for the automatic generation of semantic descriptors and of prototypes for the exploration, recommendation, and retrieval of music. Further information on the project can be found on the internet at http://www.semanticaudio.org.

The author is also a hobby musician with a classical piano education. Apart from classical music for solo piano, he also enjoys playing ensemble music from Baroque to Bossa Nova jazz. In his music listening, generally speaking, the author has no particular preference for certain genres or artists, but likes to select individual tracks from a wide range of styles depending on the given mood and situation.

1.2 Today’s Music Consumption Behaviour

Music, in the western cultural world at least, is present in everybody’s life. It can be found not only at its “traditional” places like opera houses or discotheques, but appears in kitchens and nurseries, in cars and airports, bars and restaurants, parks and sports centers, and in many other places. With the increased mobility of music reproduction devices, nowadays everybody can bring their own personal music along and listen to it almost anywhere. But this enhanced availability of music also means that concentrated and active listening has become rare. Often music serves as a background while the listener is doing something else; it is used as an acoustical “gap filler” or as an emotional trigger [NH97c].
New ways of disseminating music have arisen, and never before has it been as easy as it is today to get access to such a huge amount of music (more than a million titles are available from a single music portal). Yet it is not always easy to find what one is looking for.

1.2.1 Music Tracks as Data Files

With a broader public becoming aware of efficient audio coding techniques during the last decade, the amount of music being stored and collected in digital formats increased rapidly. While in former times a music collection consisted of a conglomeration of physical media in the form of vinyl discs, analogue or digital tapes, and later also compact discs, a new kind of collection emerged due to the new formats. No longer is the music in these collections attached to a physical medium; it exists merely as a string of bits that can, very easily and without degradation in quality, be moved and copied from any digital storage device to any other. This property, apart from being very convenient, again reinforced the rapid spread of the new formats by making it easier than ever before to share and exchange music with virtually anybody on the planet (at least with the roughly 13% of our world’s population who have access to the internet).

By now, the music industry has managed to hinder the wild exchange of music by filing lawsuits against operators and private users of such content exchange networks. But digital music collections already exist in large numbers, and the concept of storing music tracks as digital files with all its advantages is an established fact for many music consumers. Lately, legal options to buy digital music online and receive it by file transfer over the World Wide Web are emerging, often combined with some sort of digital rights management that is supposed to prevent consumers from “shamelessly” copying and sharing their new acquisitions. Last but not least, alternative models of copyrighting and licensing musical content are emerging (e. g. Creative Commons, http://creativecommons.org/) and contribute a small but growing part of the music that can be found and obtained from the internet. David Kusek and Gerd Leonhard, two music futurists, even coin the term “music as water” [KL05] to describe the change of attitude towards music, music consumption, and music ownership that is in the process of establishing itself: “Imagine a world where music flows all around us, like water, or like electricity, and where access to music becomes a kind of ‘utility’. Not for free, per se, but certainly for what feels like free.”

1.2.2 Problems with Digital Collections

Despite the many advantages that came with this “digital revolution”, we can observe some negative effects, too. The flexibility that music tracks in the form of individual files provide goes along with a demand for proper storage organization and labelling. While in a physical collection an item might be declared lost when it cannot be found, in a large virtual collection this can be said already when there are no means to search for it. The freedom to create personalized playlists and to listen to music tracks independently of concepts like albums or compilations requires, on the other hand, a good knowledge of the collection at hand in order to arrive at satisfying results.
When typing errors in filenames, a lack of system, and inconsistencies in the manually assigned labels occur in such a collection, navigation and utilization become difficult and limited.

Another aspect of bit strings representing music is their volatility and reproducibility. With the flick of a finger they can be erased, and they are gone without leaving any trace. Or they can be replicated with very little effort, basically an infinite number of times, at no cost other than the disk space to store them. Especially the latter has an impact on the value that is assigned to such an item. According to the theory of supply and demand, a commodity with finite demand but infinite supply has a price of zero (of course this cannot be applied directly to music). As an illustration: if I go into a record shop to buy an album and the salesman gives me two copies for the price of one, I will consider this a true benefit, because I can give one copy to a friend or resell it. If I buy the same album from an online shop and they offer me to download it twice while paying only once, I would just find it silly. This gives a good hint for understanding why people are so “generous” as to offer their complete collection of music files for anybody to download without any charge. It also helps in understanding why people are not shy about downloading or copying music from others without paying and without feeling they are doing something wrong. Especially in the times of free peer-to-peer music sharing networks, this configuration boosted the dissemination and establishment of digital music formats, which are a matter of fact nowadays. On the other hand, this excess of digital tracks decreased the appreciation for the individual item. So some private collections have been extended for the sake of collecting rather than because of a particular interest in the material.

Summarizing, we can say that to date probably many hard disks exist which contain a large collection of digital music whose owner is unaware of its full potential. Even if he or she is not, it takes a lot of effort to fully exploit such a collection, and sooner rather than later the currently available tools reach their limits. With the new way of music dissemination through online portals like iTunes (http://www.apple.com/itunes/) or Y!music (http://launch.yahoo.com/), the problem of searching, browsing, and navigating an (unfamiliar) music collection is brought to an even larger scale.

1.2.3 Semantic Descriptors as a Perspective

Slowly, tools and technologies are starting to spread that intend to enhance and facilitate the interaction with digital music collections. The key component of such tools is the assignment of semantic descriptors to the tracks. These descriptors capture certain properties and attributes of the music content, for example the tempo, the instrumentation, or the lead singer’s gender. These properties usually reveal themselves automatically to a human who is listening to a music track and hence are potentially relevant information. Therefore a link between the pure digital audio data and the semantic concepts describing the content offers much more natural ways of searching in music collections than is currently possible. Instead of being limited to titles, artists, and genres as the means of a query, even very subtle or abstract aspects of music could be used. Apart from the enhanced querying, an extended interaction also becomes possible.
Browsing through a collection, be it one’s own or someone else’s, where different musical aspects are visualized (see figure 1.1) is not only amusing, but may also lead to a better understanding of the music. Similarities between different styles or artists can be discovered, the evolution of a band can be tracked, extreme examples can be identified. This playful and educational side effect could then again lead to a more attentive way of music listening, increasing pleasure and appreciation.

But semantic descriptors do not bring additional opportunities only for the human. If they are organized in a machine-readable way, information processing devices gain access to the properties of music which are relevant for a human listener. Basically, this allows computers to mimic a human’s music perception behaviour. So instead of the human browsing the collection, the computer could do this automatically and identify, for example, clusters of similar tracks. Thus the computer is enabled to generate playlists according to given criteria or to recommend to the user tracks similar to a selected example track. So we see that providing machine-readable semantic descriptors for the tracks in a collection opens the door for a large variety of interesting methods of sorting, searching, filtering, clustering, classifying, and visualizing.

Figure 1.1: Screenshot from a music browser interface displaying part of a song collection organized by danceability and dynamic complexity.

But how can we arrive there? Different ways exist to assign semantic descriptors to music. It is possible (although expensive) to have a group of music professionals annotate them. This is of course usually not an option for private collections. Still, the metadata, once annotated, could be stored in a public database and would then be associated through a fingerprinting service with the files in a private collection. To some extent the descriptors can also be assigned by a whole community of listeners by majority vote or by using collaborative filtering techniques. This requires considerable coordination effort and furthermore involves a delay until reliable data for a new track becomes available. The most practical and versatile option, however, is the automatic computation of descriptors based on the audio file itself. This way an objective, consistent, inexpensive, and detailed annotation can be accomplished. The research presented here is about this third option of descriptor extraction.

1.3 Thesis Outline

After this introduction, a very compact description of the research procedure is given in chapter 2, defining the scope of this research. In chapter 3 we then look at different views on complexity and discuss the details of the underlying concept of music complexity as we want to approach it. Afterwards, a review of related research is presented in chapter 4. We will see that several models capturing aspects of music complexity have already been proposed in the literature, but that their application to the task we have in mind is problematic or at least not straightforward. In chapter 5 we first consider in more detail how the different facets of music complexity can be approached from the computational perspective. A description of two original contributions of the author follows.
The implementations of the acoustic complexity and rhythmic complexity (danceability) models are explained, and first results are discussed. In chapter 6 the roadmap for the PhD thesis and some ideas reaching beyond it are laid out. The document concludes with a summary of the research presented here.

Chapter 2

Working Thesis and Procedure

It is the goal of this research work to provide operational models for the automated computation of music complexity as it is perceived by human listeners. (Throughout this document we will use the term “perception”, although “cognition” might be considered more appropriate in some places. However, we want to omit this distinction here and consistently use “perceptual” as the term contrasting with “technical”.) We regard the complexity of music as a high-level, intuitive attribute which can be experienced directly or indirectly by the active listener, so that it can be estimated by empirical methods. In particular, we define complexity as that property of a musical unit which determines how much effort the listener has to put into following and understanding it (see also section 3.1). The proposed models are intended to provide a compact description of the musical content which can be utilized to facilitate the user interaction with music databases. Since music has different and partly independent facets, we will address these individually with separate models (see also section 5.1). We consider the following musical facets, which are also illustrated in figure 2.1:

- Melody
- Harmony
- Rhythm
- Timbre
- Acoustics (spatial/dynamic)
- Structure

Figure 2.1: Circle of music complexity facets.

The descriptors are going to be designed for the track level. That means a complete track of music is considered the unit on which the models work. The segmentation into tracks is assumed to be given already, so the application to a stream (as in broadcast monitoring) is not addressed.

Neither will we address the individual characteristics of users’ complexity perceptions. Instead, the goal is to provide models for the “common sense” in music complexity rating (see section 3.1.3). We want to focus on a naïve complexity estimation without taking expert knowledge about style-specific musical peculiarities into account. That means we are interested in a classification on a large but rough scale rather than a precise ranking that takes the finest nuances into account. For the latter we would have to focus in detail on a particular user’s musical background and listening habits. The former can be achieved by assuming a common background for a certain group of users and then restricting the validity of the provided models to this group. We chose the group of non-expert music listeners with a western cultural background as the intended users for our models, since it forms a large fraction of the users dealing with the music databases mentioned above.
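To make the shape of this descriptor set concrete, the following is a minimal sketch of a track-level record with one scalar per facet. This is purely illustrative: the field names and the assumption of a common numeric scale are ours, not definitions given in this document.

```python
from dataclasses import dataclass

@dataclass
class ComplexityDescriptors:
    """One scalar per complexity facet for a complete music track.

    The field names and the assumed common [0, 1] scale are
    illustrative choices, not definitions from this document.
    """
    melodic: float
    harmonic: float
    rhythmic: float
    timbral: float
    acoustic: float     # spatial/dynamic aspects
    structural: float

# Example: a hypothetical track with simple harmony but intricate rhythm.
track = ComplexityDescriptors(melodic=0.4, harmonic=0.2, rhythmic=0.8,
                              timbral=0.5, acoustic=0.3, structural=0.4)
```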
Chapter 3

Music Complexity as a Semantic Descriptor

Semantic descriptors for music content can come in many forms, because music is rich in aspects and levels of abstraction. This said, we immediately arrive at a new problem: which semantic descriptors are the useful ones? There is no obvious solution to this problem, and most likely there is more than just one answer, depending on the application and maybe even on the content itself. Obviously, the extracted descriptors should capture features of the music that are relevant to human listeners. Furthermore, it counts in a descriptor’s favour if it is compact in its representation and its meaning is easy to comprehend. Of course there are a lot of descriptors fulfilling these demands. In the following we will mainly consider one small fraction of them: the multiple facets of music complexity.

3.1 What is Complexity?

The term complexity is understood differently in different contexts. It is a term that everybody seems to know, but still a unified definition is difficult to find. Furthermore, it is a fancy term to use in order to raise interest, which Edmonds in [Edm95] describes with the following metaphor: “The label of ‘complexity’ often performs much the same role as that of the name of a desirable residential area in Estate Agent’s advertisements. It is applied to many items beyond the original area but which are still somewhere in the vicinity.” So, although a compact working definition for use in this research work has already been given in chapter 2, we dedicate some space in this section to covering the scientific background of the term and to clarifying our understanding in the context of music.

3.1.1 Complexity in Information Theory

In information theory there exist different measures that can be associated with complexity. The most “traditional” one is probably the concept of source entropy, introduced in 1948 by Shannon [Sha48]. Entropy measures the randomness of the output of an information source by specifying the average amount of information for each symbol s_i emitted by the source according to the following formula:

H_0 = -\sum_{i=1}^{N} p(s_i) \cdot \log_2 p(s_i),    (3.1)

where N is the number of distinct symbols and p(s_i) is the probability that symbol s_i is emitted. High entropy values would be associated with higher complexity, since the information rate is higher. Equation 3.1 assumes a memoryless source, like the flipping of a coin or the rolling of a die, where the preceding symbol in the sequence has no influence at all on the symbols to follow. This assumption is not justified in many practical applications. For structured data such as texts, for example, the probability of a symbol depends strongly on its predecessors. In information theory the terms Markovian source and Markov chain are used in these cases. Equation 3.1 can be adapted for a Markov chain by taking the corresponding conditional probabilities into account. If only the direct predecessor is relevant (as in a first-order Markov chain), the equation takes the following form:

H_1 = -\sum_{i=1}^{N} p(s_i) \sum_{j=1}^{N} p(s_j|s_i) \cdot \log_2 p(s_j|s_i),    (3.2)

where p(s_j|s_i) denotes the conditional probability of symbol s_j appearing after symbol s_i in the sequence. H_0 is the upper bound for H_1 and for any higher-order entropy of a given source, since the structural regularities captured by the conditional probabilities can only decrease randomness.
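As a concrete illustration, the following Python sketch estimates both entropies from a discrete symbol sequence. The toy note string is our own example; as discussed below, obtaining such a symbol sequence from an audio signal in the first place is the actual difficulty.

```python
from collections import Counter
from math import log2

def h0(seq):
    """Zeroth-order entropy, equation (3.1): assumes a memoryless source."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def h1(seq):
    """First-order entropy, equation (3.2): conditions each symbol on its
    direct predecessor (first-order Markov chain), with all probabilities
    estimated from the sequence itself."""
    pair_counts = Counter(zip(seq, seq[1:]))   # counts of (s_i, s_j) pairs
    pred_counts = Counter(seq[:-1])            # counts of predecessors s_i
    n = len(seq) - 1
    h = 0.0
    for (si, sj), c in pair_counts.items():
        p_si = pred_counts[si] / n             # p(s_i)
        p_cond = c / pred_counts[si]           # p(s_j | s_i)
        h -= p_si * p_cond * log2(p_cond)
    return h

notes = "CCGGAAGFFEEDDC"                       # a toy symbol sequence
print(h0(notes), h1(notes))                    # H1 <= H0, as noted above
```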
The most widespread definition of complexity in the information theory field originates from the theory of algorithmic information. Named after one of the main scientists behind it, this definition is usually called Kolmogorov complexity (see [GV99] for a short introduction). In contrast to entropy, it addresses the absolute information content of an object, which is the amount of data that needs to be transmitted in the absence of any other a priori knowledge. It is defined as the length of the shortest program p (in bits) that can generate the sequence s. Formally we can write:

K(s) = \min\{|p| : T(p) = s\},    (3.3)

where T(p) refers to the output of a universal Turing machine executing p. Thanks to its general nature, Kolmogorov complexity has also been applied in the digital audio domain (see [Sch01]) as a way to prove mathematically the advantages of structured audio in generalized audio coding.

The striking quality of both concepts is the objectivity of the measures. Kolmogorov complexity and Shannon entropy both exist as exclusive properties of the object and are completely independent of an observer. But when considering musical audio signals we face problems. First, the entropy measure needs a finite set of discrete symbols being emitted from the source, but this is not what we have. The signal would need to be converted somehow into symbols like the letters in a written text, which is not very straightforward. Although music can be generated with a keyboard, it is usually much more than the output of a “typewriter”. (This can be experienced easily by comparing a monophonic mobile phone ringtone with a live performance of the same piece.)

Secondly, both measures are rather technical, focusing on an efficient encoding, while we are interested in the reception by humans. A practical example illustrating this difference very well is given by Standish in [Sta01]. He points out that a random sequence of numbers will always yield a maximum complexity although it does not contain any meaningful information for a human. Comparing a random string of characters with the manuscript of a theater play, the random sequence would yield the higher Kolmogorov complexity value, because there is no program shorter than the sequence itself. However, for a human the order would be exactly reversed, because the random string is perceived as meaningless noise, while the theater play is recognized as a sophisticated, multi-layered setting (at least if it is a good one). The apparent objectivity of the measure makes it meaningless when context has to be considered.

Standish suggests the use of equivalence classes to overcome this. An equivalence class for him is the set of all mutations of a sequence that are equivalent in a given context. So for example different random sequences could hardly be distinguished by a human observer and would therefore form a large equivalence class. On the other hand, for a written text carrying a meaning, only relatively few mutations exist that would be judged as equivalent. This judgement depends not only on the data itself but also on the context in which it is observed. If the equivalence classes are considered in the complexity computation, we arrive at more meaningful results from a human’s point of view. But still we do not have a practical solution. The decision of what is equivalent and what is not does not seem to be very straightforward, especially when it comes to music. Furthermore, there is no closed mathematical solution for computing the Kolmogorov complexity, so it remains a rather theoretical measure.
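In practice, the length of a losslessly compressed encoding is often used as a computable stand-in for this theoretical measure: a short compressed form plays the role of a short description. The following sketch is our own illustration (zlib is an arbitrary choice of compressor, not one proposed in the cited works) and also reproduces Standish’s point: the random byte string scores highest although it is meaningless to a listener.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed length relative to original length: a crude,
    computable upper-bound proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9)) / len(data)

noise = os.urandom(10_000)                 # incompressible random bytes
play = b"to be or not to be, " * 500       # highly regular text
print(compression_ratio(noise))            # close to 1.0 (or slightly above)
print(compression_ratio(play))             # far below 1.0
```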
3.1.2 A Philosophical Point of View

Edmonds, who reviews the matter of complexity from the philosophical point of view [Edm95], gives us the following definition: complexity is “[t]hat property of a language expression which makes it difficult to formulate its overall behaviour, even when given almost complete information about its atomic components and their inter-relations.”

This is a very general statement, leaving (purposely) a lot of room for interpretation depending on the given context where it is to be applied. Although it is not explicit in the statement, Edmonds considers a subjective element of complexity, because what “makes it difficult” can vary with the context and the observer. We can agree on this definition also for the case of musical complexity; yet it does not bring us much closer to an implementable model. How do we identify the aspects of music that make the formulation of the overall behaviour difficult? What are the atomic components we should consider in this context? However, we can take away at least two ideas from this statement. First, we can think of complexity as a property that varies along an axis between easiness and difficulty. Second, it is a kind of meta-descriptor, because it describes a certain property of lower-level descriptors (atomic components) when they are arranged to form a bigger unit. So by nature a complexity descriptor is compact, even though it might contain information about a wide-stretched temporal process.

In the case of music it is not unambiguous to speak of a unit. Music can be structured on different levels. For the applications we have in mind it is advantageous to consider the music tracks that can be found in a collection as the units whose complexity is to be described. This way we can avoid additional problems of segmentation (still an unsolved problem for musical audio [OH04]) and simply rely on the assumption that one digital file corresponds to one musical unit. It is worth mentioning that it could also make sense to go below the track level and provide complexity estimates for smaller fractions. In fact, an instantaneous computation of complexity estimates in a sliding-window fashion could itself be helpful in segment identification or other tasks related to very high-level aspects of music. However, we save these considerations for the chapter on future work.

3.1.3 Perceptual Foundations of Music Complexity

If we move our focus from the general towards the specific, we can find the following remark on music complexity by Finnäs. In [Fin89] he states that “unusual harmonies and timbres, irregular tempi and rhythms, unexpected tone sequences and variations in volume” raise the level of perceived complexity. This statement is neither exhaustive nor precise. However, it brings some aspects into play which are special to music and therefore interesting for us. We can see that Finnäs mentions very different facets of music (harmony, rhythm, volume, . . . ) that are at least partly independent from each other. Nevertheless, they all are or can be of relevance for the level of musical complexity. For example, one music track can contain very sophisticated rhythm patterns, but no melodic voice at all. Another one might have unexpected changes in volume and timbre, but very straightforward melody and chord sequences. Which one should be rated more complex?
Looking at the problem this way calls for a multi-faceted approach with an individual complexity estimation for each facet, as we already proposed in chapter 2. A joint measure of global complexity does not necessarily make sense under these circumstances.

The terms “unusual”, “irregular”, and “unexpected” again lead to context dependency and subjectivity. Whether an event is expected or unexpected is not completely immanent in the event itself. In extreme cases both could be true, depending on who we ask. This is of course not very encouraging for somebody who is supposed to develop a computer model that gets only the event description (in our case obtained from the audio data) as an input. It is obvious that not everybody will share the same opinion about the complexity of a particular piece of music, depending on his or her individual musical knowledge and familiarity with particular musical styles. However, if we restrict ourselves – as proposed in chapter 2 – to only deal with individuals from a certain cultural heritage (i. e. eliminating influences from different tuning or scale systems, etc.), we can still identify something like a musical “common sense”. It can be understood, for example, as the ability to judge whether a singer is singing in tune or not, to clap on the beat, or to distinguish consonant from dissonant chords. Neurological experiments, for example by Barbara Tillmann et al., have demonstrated the existence of such effects (e. g. in [TBB00]). Tillmann refers to the cause as implicit learning of music by mere exposure. So simply by frequent exposure to music in their everyday life, humans unconsciously learn certain immanent regularities or rules of the music. Violation of these rules then increases the effort a listener has to put into “decoding” or processing the music and thus increases the perceived complexity according to our definition.

At this point there is also a link to Gestalt theory, which assumes the existence of certain universal principles like proximity, continuation, closure, etc. that are supposed to be “hardwired” in our perception (see e. g. [Bre90]). These principles have very similar effects as the implicitly learned rules, but they are present from birth and hence do not have to be induced through frequent exposure to stimuli of a certain type. While learned rules are stored in long-term memory, the Gestalt principles operate directly in short-term memory due to the organization and design of our “operating system”. This implies that the Gestalt principles do not even depend on cultural background and environment, but form – in the widest sense of the word – a “common sense” for the perception of stimuli. For example, the implication-realization model for melodic complexity by Narmour [Nar90], which will be reviewed in section 4.2, is built on these principles.

We want to consider this “common sense” or implicit knowledge as the basis for our complexity estimations. So rather than adapting to individual peculiarities of single music listeners, we want to provide a general model which is able to make predictions on a large scale that everybody can agree on within a small error tolerance. It is, however, very interesting to follow the thought of a learning complexity agent system adapting to its owner’s preferences and advancing in its estimations along with him. We will get back to this idea in chapter 6.
3.2 Applicability of Music Complexity

We have now explained what we mean by the term “music complexity”. After these rather academic considerations we turn to the practical motivation of the described concept. Why should musical complexity descriptors be particularly interesting for the normal music consumer?

The answer has two parts. First, it is not too far-fetched that certain facets of complexity might be directly relevant for the listener. For example, if I am interested in finding danceable music for a party, the rhythmic complexity already provides a useful parameter for my search. Or if I am looking for “easy listening” music, I might restrict my search to tracks at the lower end of the complexity scale on one or on several dimensions.

For the second part of the answer we have to go back to the year 1971, in which we find a publication by Daniel Berlyne [Ber71]. In this publication he states that an individual’s preference for a certain piece of music is related to the amount of activity it produces in the listener’s brain, to which he refers as the arousal potential. According to this theory there is an optimal arousal potential that causes the maximum liking, while both too low and too high an arousal potential result in a decrease of liking. He illustrates this behaviour by an inverted U-shaped curve (see figure 3.1), which was originally introduced in the 19th century by Wundt [Wun74] to display the interrelation between pleasure and stimulus intensity. Berlyne identifies three different categories of variables affecting arousal (see [Ber71] for details). As the most significant he regards the collative variables, containing among others complexity, novelty/familiarity, and the surprise effect of the stimulus. Since we intend to model exactly these aspects of music with our descriptor, it should be very well suited for reflecting the potential liking of a certain piece of music. We will return to this topic and review several related experiments in section 4.1.

Figure 3.1: The Wundt curve for the relation between music complexity and preference.

This said, we can now flesh out the motivations for the use of semantic descriptors given in section 1.2.3. When a user is interacting with a music database or a music collection, three major tasks can be identified:

1. Providing an appropriate interface for navigation and exploration.
2. The generation of a program (playlist) based on the user’s input.
3. The retrieval of songs that match the user’s desires.

We can identify applications of complexity descriptors in all three tasks, as discussed in the following three sections.

3.2.1 Enhanced Visualization and Browsing

For smooth navigation and exploration of databases, a well-designed visualization of the contents is a crucial condition. This is a very difficult task when it comes to large amounts of complex data like music tracks. One example of such a visualization is the Islands of Music application, developed by Elias Pampalk [Pam01]. This application uses the metaphor of islands and sea to display the similarity of songs in a collection. Similar songs are grouped together and represented by an island, while dissimilar ones are separated from them by “the sea”. The application uses features that are motivated by psychoacoustic insights and processes them through a self-organizing map (SOM).
In order to compute similarity between songs, the sequence of instantaneous descriptor values extracted from each song has to be shrunk down to one number. Pampalk does this by taking the median. He reports satisfying results, but at the same time states that the median is not a good representation for songs with changing properties (e. g. bimodal feature distributions). Here the complexity descriptors have a clear advantage, because by design they consist of only one number representing the whole track and thus do not need to be further reduced by basic statistical measures like the mean or the median. Obviously, complexity represents a self-contained concept and is not intended to form an alternative to the use of such measures. As pointed out at the beginning of this section, the different complexity descriptors reflect specific characteristics of the music that are potentially of direct relevance for the listener. The descriptors are therefore very well suited to facilitate the visualization of musical properties the user might want to explore. It is straightforward to plot the whole collection in a plane showing, for example, rhythmic complexity versus loudness complexity, without the need to specify a similarity metric (see also figure 1.1).

Another aspect is the possibility of a more “musicological” way of interaction with a music collection. By providing the link between the actual audio and the musical content description, a user might increase his knowledge about the music in his own or a different collection. Common properties of music from different artists or different genres might be discovered. The changes in musical characteristics over time for a particular band can be made visible. Here, too, the complexity descriptors form an interesting addition, which opens new opportunities that are still to be explored.
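As a minimal sketch of such a complexity-plane view (our illustration; the descriptor values below are invented placeholders, assuming one pre-computed scalar per facet and track):

```python
import matplotlib.pyplot as plt

# Assumed input: one pre-computed scalar per facet and track; the values
# are invented placeholders, not results from this work.
collection = [
    {"title": "Track A", "rhythmic": 0.2, "dynamic": 0.7},
    {"title": "Track B", "rhythmic": 0.8, "dynamic": 0.4},
    {"title": "Track C", "rhythmic": 0.5, "dynamic": 0.5},
]

x = [t["rhythmic"] for t in collection]
y = [t["dynamic"] for t in collection]
plt.scatter(x, y)
for t, xi, yi in zip(collection, x, y):
    plt.annotate(t["title"], (xi, yi))     # label each point with its title
plt.xlabel("rhythmic complexity")
plt.ylabel("dynamic (loudness) complexity")
plt.show()
```

Note that no similarity metric or dimensionality reduction is needed here: each descriptor already is one axis.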
3.2.2 Playlist Generation

A playlist is a list of titles to be played, like a musical program. A user interacting with a database might ask for the automated generation of such a list. As Pachet, Roy, and Cazaly point out in [PRC00], the creation of such a list has to be taken seriously, since “[t]he craft of music programming is precisely to build coherent sequences, rather than just select individual titles.” A first step towards coherence is to set certain criteria the songs have to fulfill in order to be grouped into one playlist. The user could be asked to provide a seed song for the playlist, and the computer would try to find tracks from the database which have similar descriptor values. Pachet, Roy, and Cazaly go further and look at an even more advanced way of playlist generation capturing the two contradictory aspects of repetition and surprise. Listeners have a desire for both, as they state, since constant repetition of already known songs will cause boredom, but permanent surprise by unknown songs will probably cause stress. In their experiments Pachet, Roy, and Cazaly use a hand-edited database containing, among others, attributes like type of melody or music setup. We can see a correspondence here to melodic and timbral complexity, which encourages the utilization of complexity descriptors for playlist generation.

An alternative way of playlist generation, which gives more control to the user, is that of using user-specified high-level concepts such as party music or music for workout. Inside the SIMAC project, methods are explored to arrive at such a functionality. A playlist could then be easily compiled by selecting tracks with the corresponding label. The bottleneck here is the labelling of the tracks, which might be a lot of work in a big collection. Since the labels are personalized and may only have validity for the user who invented them, there is no way to obtain them from a centralized metadata service. Instead, the user can try to train the system to automatically classify his tracks and assign the personalized labels. For this process, semantic descriptors are needed that help in distinguishing whether a track should be assigned a certain label or not. It depends of course very much on the nature of the label which descriptors are significant for this distinction. In any case, the complexity descriptors certainly have the potential to be useful here, as can be seen from the examples at the beginning of this section.

3.2.3 Song Retrieval

For song retrieval there are different possibilities in a music database. The most obvious one is the direct specification of parameters by the user. Since the complexity descriptors consist of only one value per track, they can be used very easily in queries. The user can specify constraints only for those facets he is interested in and narrow down the set of results. This way it is very straightforward to find music that, for example, does not change much in loudness level over time, or contains sophisticated chord patterns.

A second way of querying is the so-called query-by-example approach. The user presents one or several songs to the database and wants to find similar ones. So, as explained for the visualization using similarity measures, the complexity descriptors can easily be integrated into the computation here as well. The weighting and/or the tolerance for the different descriptors could be specified by the user directly, extracted from the provided example, or taken from a pre-computed user profile. Such a user profile would be established by monitoring the user’s listening habits (i. e. songs he/she has in his/her collection, songs he/she listens to very frequently, etc.), as in the recommender application developed in the SIMAC project.

Finally, we can think of a music recommender that does not even need an example song. If the user’s common listening behaviour is known, it should be possible to establish a “complexity profile”. This can be understood, for example, as a histogram of the complexity values where either the number of tracks or the listening frequency is monitored. From such a histogram it should be possible to identify the user’s optimal complexity level in Berlyne’s sense (see figure 3.1). Tracks matching this level could then be selected as recommendations for the user, imitating the recommendations of friends with a similar musical taste; a sketch of this idea is given below. It should be stated that complexity is of course not the only criterion that should be used here, but according to Berlyne and others it plays an important role for potential preference. Some experiments on this relationship are reviewed in section 4.1.
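A minimal sketch of the complexity-profile idea (our illustration: the binning, the assumed [0, 1] value range, and the tolerance are arbitrary choices; it simply takes the histogram peak as the optimal level in Berlyne’s sense):

```python
import numpy as np

def optimal_level(listened_values, bins=10):
    """Estimate a user's preferred complexity level as the peak of a
    histogram over the complexity values of the tracks he or she plays."""
    hist, edges = np.histogram(listened_values, bins=bins, range=(0.0, 1.0))
    peak = int(np.argmax(hist))
    return (edges[peak] + edges[peak + 1]) / 2   # center of the peak bin

def recommend(candidates, level, tolerance=0.1):
    """Return titles of candidate (title, complexity) pairs near the optimum."""
    return [title for title, c in candidates if abs(c - level) <= tolerance]

history = [0.55, 0.60, 0.62, 0.58, 0.30, 0.65, 0.61]  # played tracks' values
level = optimal_level(history)
print(recommend([("A", 0.59), ("B", 0.90)], level))   # -> ['A']
```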
Chapter 4

Review of Former Work

After the subject of our research has been explained and specified, and the intended applications have been described, we will now have a closer look at the work that has already been done in this area. We will first review several experimental findings in the context of Berlyne’s theory of arousal potential. Then we address the different complexity facets for which former work is already available: melodic, rhythmic, and harmonic complexity. Finally, we take a brief look at two alternative concepts of complexity.

4.1 The Preferred Level of Complexity

We pointed out in section 3.2 that, according to Berlyne’s theory of arousal potential [Ber71], the level of perceived complexity of a piece of music can be associated with the preference for it. Our research does not intend to provide evidence for the validity of this theory. Rather than exploring the psychological relationship between perceived complexity and preference, we focus on the development of a computational model of complexity for music. However, the former aspect plays an important role in the motivation for our research. Hence, we want to give some room here to report on studies that have been published on the matter.

4.1.1 Psychological Experiments

Berlyne himself conducted several studies during the 1960s and 1970s ([Ber60], [Ber71], and [Ber74]) on the connection between arousal potential and hedonic value (liking) of artistic stimuli. He considered not only auditory, but also visual and combined stimuli in his experiments. He found strong evidence for the existence of the inverted-U relationship as depicted in figure 3.1, which led to his theory of arousal potential. As mentioned above, he identified different types of variables contributing to arousal potential ([Ber60] pp. 170–179):

- Intensive variables (e. g. pitch, color, energy)
- Affective variables (e. g. anxiety, pain, gratification)
- Collative variables (e. g. novelty, surprisingness, complexity)

Berlyne considered the collative variables the most important of the three types.

With respect to music complexity in particular, Heyduk [Hey75] reports on an experiment with four custom-built piano compositions of different complexity levels. Complexity was varied in two facets: chord structure and syncopation. He found strong evidence for Berlyne’s theory by analyzing the ratings for preference and complexity given by a group of 120 subjects. While Berlyne used the term arousal potential for the combination of different factors determining preference, Heyduk follows the terminology of Walker [Wal73] and talks about psychological complexity. It is important to note the distinction from pure stimulus complexity here. The latter would be fixed and objective, very much like the Kolmogorov complexity mentioned in section 3.1.1. The former includes the latter, but is also determined by attributes like novelty, uncertainty, and arousal properties. It is a subjective property, because it only becomes manifest in the encounter of an individual with the stimulus.

Steck and Machotka [SM75] conducted an experiment using random sequences of sinusoidal tones as stimuli. All tones were equally loud and had equal duration within one “composition” of approximately 10 s length. 16 different levels of complexity were constructed by varying the tone duration (and thus the tone rate) from 2 s down to 60 ms. The analysis of preference ratings from 60 subjects revealed a clear inverted-U relationship over the objective complexity levels. However, when presenting only subsets of the test samples taken from adjacent complexity levels, an inverted-U relationship was again found within each subset. Even more interesting is their observation that the
relative position of maximal preference inside each subset was fixed. That means the preferred level of complexity was not absolute, but relative to the presented stimuli.

Hargreaves and North point to two potential problems with respect to Berlyne’s theory: the influence of the listener’s mood and intention when selecting music, and the dependence on the appropriateness of the music for the listening situation [NH97a]. It seems obvious that the same listener will prefer different types of music depending on whether he is driving a car, relaxing on the sofa, or dancing in the club. So when they asked subjects to rate the preference for the music in a cafeteria, they indeed found an inverted-U shaped relation between complexity and preference. However, the effects were mediated by musical style. Organ music of the same complexity level as New Age music was liked less in this listening situation. In another study [NH97b] they show that preference and pleasantness of music have to be distinguished. Preference and arousal potential were again found to relate through the inverted-U curve in this study. But North and Hargreaves argue that a subject with an antipathy against brass music, for instance, will not be pleased by this music, whether it matches the optimal complexity level or not. In this sense, optimal complexity only provides the potential for a maximum of pleasure a subject might experience, but does not determine it completely. Conversely, a subject might find pleasure in listening to music that does not possess the optimal level of complexity. However, pleasure would reach a maximum when the right level of complexity and a general liking coincide in a piece of music.

A quite recent study by Orr and Ohlsson [OO05] addressed the influence of musical expertise on preference. They used natural stimuli, bluegrass and jazz solos, at different complexity levels, which were purposely performed and recorded for their experiments. Four different groups of subjects were asked to rate perceived complexity and liking for the stimuli. One group consisted of subjects with no musical training, a second one was composed of subjects with moderate training in music, and the third and fourth groups consisted of professional jazz and bluegrass musicians. The results reported by Orr and Ohlsson indicate two things. First, it seems that the significance of complexity for preference decreases with increasing musical expertise, unless complexity itself is learned as an aesthetic criterion. This can be seen from the fact that for the group of professional jazz musicians an inverted-U relationship could not be identified. For the professional bluegrass musicians the inverted-U relationship only appeared for their ratings of bluegrass music. The authors interpret this effect as indicating that complexity might represent an explicit aesthetic criterion in bluegrass music, which has been learned by the professionals. The group of untrained subjects was the one where the inverted-U was most apparent for both musical styles. The moderately trained group revealed this effect only in the case of the bluegrass samples. So, secondly, the importance of optimal complexity seems also to depend on the music style.

We have so far reviewed only psychological studies using selected stimuli where the complexity was purposely varied in a controlled test environment.
It is also interesting to consider studies that were conceived the other way around, that is, observing preference indicators of real music and then relating these to complexity estimations.

4.1.2 Musicological Studies

Eerola and North [EN00] report on their analysis of 128 Beatles songs, all written by and for the Beatles in the years between 1962 and 1970. From MIDI transcriptions of the songs they extracted the melodies and analyzed the melodic complexity with their expectancy-based model (see section 4.2). A highly significant increasing trend was found for melodic complexity over the time period. Secondly, the authors compared the complexity values with indicators of commercial success of the songs and albums. The chart position and the time in the charts were both negatively correlated with melodic complexity. So the higher the complexity, the less popular were the songs. Although this is a clear sign of a relationship between complexity and popularity, the authors point out that other factors of a social, cultural, or commercial kind certainly have an influence as well. Since these were not considered in the study, care has to be taken in drawing conclusions. However, they make reference to Simonton [Sim94a], who also found a clear connection between melodic complexity and popularity. His results stem from an extensive study of 15,618 melodic themes of classical music.

A study by Parry [Par04] comes to similar conclusions. He analyzed the melodic and rhythmic complexity of 10 songs that were listed in the Billboard Modern Rock Top 40 within the period from January to June of 1996. The complexity was estimated using MIDI transcriptions of the songs and is based on a very basic self-similarity measure. As indicators of chart performance, the number of weeks in the charts, average weekly change in rank, peak ranking, and debut ranking were considered. Parry found the number of weeks in the charts to be positively correlated with both rhythmic and melodic complexity. The former was also positively correlated with the peak ranking. For the average change in rank, a negative correlation was found with melodic complexity, indicating that higher melodic complexity inhibited rapid changes. The debut ranking revealed no statistically significant correlation with the two complexity measures.

4.1.3 Conclusion

In conclusion we can say that a certain relationship between preference and complexity of music cannot be denied. Based on the rather scarce evidence, however, the correlation seems not to be as simple as we might guess from Berlyne’s theory. Other factors, as mentioned, also show effects and might under certain circumstances completely overrule the influence of complexity on preference. Complexity descriptors should therefore not be expected to be the holy grail of music recommendation. On the other hand, the reported findings clearly demonstrate the relevance of complexity in music listening, especially for non-expert listeners. Providing complexity descriptors for music should therefore be able to enhance human interaction with music collections, which is the goal of the research presented here.

4.2 Existing Models for Melodic Complexity

4.2.1 The Implication-Realization Model

Already back in 1990, Eugene Narmour proposed a model for melodic complexity. This Implication-Realization model, as he calls it, is extensively described in [Nar90].
The model hierarchically builds up larger structures from smaller elements and thus possesses different levels. The predictions are most clearly specified on the lowest level, the tone-to-tone level. Any melodic interval that is perceived as "open" (incomplete sounding) is said to create an implication on the listener's side. That means it raises certain expectations in the listener about how the sequence is supposed to continue. Figure 4.1 illustrates this process for the tone-to-tone level: the first interval implies a second one to follow, while the relative thickness of the arrows indicates that different continuations can be implied with different strength.

Figure 4.1: Illustration of the Implication-Realization process (from [SAPM02]).

Factors that contribute to closure (i.e. the opposite of openness) at the single-interval (tone-to-tone) level are [SAPM02]:

- a longer duration of the second tone than of the first one (e.g. an eighth note followed by a quarter note),
- a higher stability of the second tone in the established musical key (e.g. ti followed by do),
- a stronger metrical emphasis on the second tone (e.g. at the first beat of a measure).

Frequent realization of these expectations reveals a low level of complexity; frequent disappointment reveals a high level of complexity, because the structure of the melody is hard to decode. It has to be pointed out that the development of this model was guided by gestalt principles only, implying a validity completely independent of the listener's cultural or musical background. Narmour himself states about the principles that they are "innate, hardwired, bottom-up, brute, automatic, subconscious, panstylistic, and resistant to learning" (as cited by [SAPM02]).

4.2.2 Enhancements of the Model

The IR-model inspired many experiments. It was reduced by Schellenberg to its basic principles, and he showed experimentally that his simplified version had the same explanatory power as the original [Sch96]. Eerola and North report in [EN00] on an experiment in which they assessed an enhanced version of the IR-model. They point out that the traditional information-theoretic view of complexity does "not address the role of the listener's perceptual system in organising the structural characteristics of music". Therefore they propose an expectancy-based model (EBM) to estimate complexity. Their model for melodic complexity is also inspired by Narmour, but they include some additional features derived from a symbolic representation of the music. In their evaluation they found that tonality (modified by metrical position and tone duration), the intervallic principles of registral direction and intervallic difference, and the rhythmic principles of syncopation and rhythmic variability showed a significant capability to predict listeners' complexity judgements. Comparing the prediction accuracy of this model with an information-theoretic and a transition-probability model ([Sim94b]), they found it to be the best one.

Eerola, Toiviainen, and Krumhansl conducted experiments showing how the accuracy of the model can be further improved by taking a larger melodic context into account [ETK02]. Since the focus of the original model is limited to only two notes at a time, it neglects the impact of the longer-term melodic evolution (e.g. the repetition of motives) on the listeners' predictions of continuation.
However, their modifications were aimed more at a real-time modelling of listeners' continuation predictions than at a more accurate estimation of the overall complexity.

4.2.3 Conclusion

As a conclusion we can say that the problem of melodic complexity estimation is largely solved on the symbolic level. With the extended model by Eerola we already have an automatic complexity estimation for melodies that fits our needs. Eerola even made the implementation of his and several other models available in the MIDI Toolbox for MATLAB [ET04]. However, there is one problem: this model, as well as all the other variants presented in this section, requires an accurate symbolic representation of the melody in order to arrive at valid results. Usually, digital music files do not have such a symbolic description attached to them. The key problem thus remains the automatic extraction of the melody from the audio stream, so that the existing complexity models can be applied to the transcription in a second step. But current methods are still far from producing such transcriptions from arbitrary music tracks with sufficient accuracy: the best algorithm for melodic pitch tracking in polyphonic signals in the 2004 ISMIR contest achieved only an average accuracy of 75% [GSO+05], a figure that does not even include note segmentation. Yet only a few wrong notes in the melody can completely mislead the complexity estimation. So, unfortunately, it is currently not possible to apply the existing models directly in the application context of our research.

4.3 Measures for Rhythmic Complexity

4.3.1 The PS-Measure

Shmulevich and Povel introduced the PS-measure for rhythmic complexity in [SP00]. They state, with reference to [Ess95], that a listener tries to establish an internal clock when hearing rhythmic music. According to this clock the listener then segments the rhythm pattern and tries to code the segments. A rhythmic pattern will thus appear complex when either the induction of the internal clock is weak or absent, or when the coding of the segments is difficult. The algorithm therefore combines these two aspects for the complexity estimation.

The clock induction strength is a weighted combination of the number of clock ticks that coincide with silence and with unaccented events. This already shows that a rather high abstraction level is assumed here, since the clock grid is taken to be known and the musical events have to be classified as accented, unaccented, or silent.

The coding complexity is estimated by assigning different weights to four different types of segments that can appear when splitting up the sequence of events according to the clock grid. The segment types are empty segments, equally subdivided segments, unequally subdivided segments, and segments beginning with silence. An additional weight is added when two consecutive segments differ from each other. The coding complexity is then simply the sum of all weights for the whole sequence.

It must be noted that the clock induction strength and the coding complexity, computed this way, tend to give higher complexity ratings to longer sequences, because there is no normalization. This might be appropriate when isolated rhythm patterns are evaluated, since a longer pattern has a higher complexity potential than a short one. For complete music tracks, however, some adaptation would be needed.
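To make the segment coding more concrete, the following Python sketch (not part of the thesis implementations) computes a coding cost for a symbolic pattern on a known clock grid. The segment weights are illustrative placeholders, not the values published by Shmulevich and Povel.

# Sketch of the coding-complexity part of the PS-measure, assuming the
# clock grid is known. The weights are illustrative, not those of [SP00].
def coding_complexity(pattern, clock=4, w=None):
    """pattern: one symbol per grid position ('a' accented, 'u' unaccented,
    '.' silent); clock: number of grid positions per clock segment."""
    if w is None:
        w = {"empty": 1, "equal": 2, "unequal": 3,
             "silence_start": 4, "different_neighbour": 1}  # assumed weights
    segments = [pattern[i:i + clock] for i in range(0, len(pattern), clock)]

    def seg_type(seg):
        onsets = [i for i, s in enumerate(seg) if s != "."]
        if not onsets:
            return "empty"
        if seg[0] == ".":
            return "silence_start"
        # crude test for an equal subdivision of the segment
        gaps = {onsets[i + 1] - onsets[i] for i in range(len(onsets) - 1)}
        return "equal" if len(gaps) <= 1 else "unequal"

    cost, prev = 0, None
    for seg in segments:
        cost += w[seg_type(seg)]
        if prev is not None and seg != prev:   # consecutive segments differ
            cost += w["different_neighbour"]
        prev = seg
    return cost

print(coding_complexity(list("a.u.a.u."), clock=4))  # regular pattern, low cost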
Although rhythm pattern extraction might seem an easier task than melody extraction, universal methods providing sufficient precision are still lacking to date (a recent review can be found in [GD05]). The problem is that even the correct localization of rhythmic events, which is already a challenge, does not suffice for the application of the PS-measure. In addition, these events have to be seen in relation to the rhythm grid, which also has to be established first from the audio file.

4.3.2 Danceability

Recently, Jennings et al. published an article on musical genre classification [JIM+04]. What makes this article stand out is the fact that the authors exclusively used a feature which had not been considered in music information retrieval research so far. They refer to it as the "Detrended Variance Fluctuation Exponent", since it originates from a technique called "Detrended Fluctuation Analysis" (DFA). The technique was introduced by Peng et al. [PBH+94] and is intended for the analysis of chaotic time series. It was first applied by Peng in a biomedical context, on nucleotide and heartbeat-rate time series. Other applications include financial time series (as reported in [Aus00]).

The method aims to identify the scaling behaviour of a time series. This is best explained by considering an unbounded time series (i.e. one with no upper limit for its values). If we plot a portion of such a series over the time axis, we will need a certain range r1 of values on the y-axis in order to visualize all the elements falling into the temporal segment τ1. Imagine now that we zoom in on the plot. This zoom can be expressed as a factor on each axis; for example, we can say that the new portion has 0.5 times the length of the previous one, such that τ2 = τ1/2. In this case we can expect that the range of values on the y-axis has changed as well, from r1 to r2. By taking the quotient

\alpha = \frac{\log(r_1/r_2)}{\log(\tau_1/\tau_2)}

we obtain a scaling exponent α. This exponent can be calculated for different time scales (i.e. for different values of τ1 and τ2). If we systematically increase or decrease the time scales and calculate the scaling exponent for each step, we obtain a function α(τ). From this function we can get an insight into the scaling properties of the time series. For example, the scaling can be stable (a constant level of α) or it might expose irregularities. The general level of α is significant as well: for white noise it is 0.5, for pink noise it is 1.0, and for brown noise it is 1.5. α levels below 0.5 indicate that the signal is anti-correlated ([Mat01] pp. 41–43).

However original the descriptor of Jennings et al. might be, it is not obvious how it relates to the rhythmic complexity we are interested in. Since their paper focusses on genre classification, this aspect was not directly considered. But they state that the strong periodic trends in dance music (such as techno or Brazilian Forró) make it easily distinguishable from high art music by using this feature, because the average α value is clearly lower for the former. "Jazz, Rock and Roll, and Brazilian popular music may occupy an intermediary position between high art music and dance music: complex enough to listen to, but periodic enough to dance to," they speculate [JIM+04]. Hence, we could think of this feature as a measure of "danceability", which certainly is an aspect of rhythmic complexity.
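As a quick, self-contained check of the reference values quoted above, the following Python toy (ours, not from [JIM+04]) estimates the scaling exponent of white and of Brown noise from the detrended fluctuation at two window sizes; it should yield roughly 0.5 and 1.5, respectively.

import numpy as np

def fluctuation(series, tau):
    """RMS of linearly detrended windows of length tau (toy DFA step)."""
    y = np.cumsum(series - series.mean())          # integrate the series
    t = np.arange(tau)
    res = []
    for k in range(len(y) // tau):
        w = y[k * tau:(k + 1) * tau]
        a, b = np.polyfit(t, w, 1)                 # linear trend of the window
        res.append(np.mean((w - (a * t + b)) ** 2))
    return np.sqrt(np.mean(res))

rng = np.random.default_rng(0)
white = rng.standard_normal(100_000)
brown = np.cumsum(white)                           # integrated white noise
for name, s in [("white", white), ("brown", brown)]:
    alpha = np.log2(fluctuation(s, 256) / fluctuation(s, 128))
    print(name, round(alpha, 2))   # approx. 0.5 for white, 1.5 for brown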
In the next chapter we will look at further details about first experiments with this descriptor.

4.3.3 Conclusion

With the PS-measure we once more face the problem that an existing model for complexity cannot be utilized directly for the intended applications. As with the melodic complexity models, we are again left with incompatible representations of the music: the symbolic, abstract rhythm pattern is needed as an input for the model, while the digitized waveform is what we have at hand in the audio file.

The DFA exponent, on the other hand, is a feature that can be applied directly to the audio file and does not rely on the extraction of other mid- or high-level descriptors as a preprocessing step. Although its validity has not been proven and it certainly does not cover all aspects of rhythmic complexity, it appears to be an interesting and very feasible approach.

4.4 Harmonic Complexity

4.4.1 Preference Rule System

The theory of harmony in music has a very long tradition. Still, in contrast to melodic complexity, no dominant and well-studied models for harmonic complexity could be identified by the author. Research has been done on the expectations evoked in listeners by harmonic progressions, especially in the field of classical music [Sch89]. It turned out that listeners usually predict a chord that results from a transition considered common in the given musical context. Yet, to our knowledge, no tests have been carried out that correlate the perceived harmonic complexity with the fulfilment or disappointment of these expectations.

Temperley supposes that the scores computed with his preference rule system could provide an estimate of the tension in music ([Tem01] section 11.5, pp. 307–317). The mapping of achieved scores would range from incomprehensible (breaking all rules) through tense and calm to boring (all rules obeyed). Although he does not use the term complexity, this basically reflects what we are looking for. He names four different aspects of this harmonic complexity:

1. The rate at which harmonies change.
2. The amount of harmonic change on weak beats.
3. The amount of dissonant (ornamental) notes.
4. The distance between consecutive harmonies in a music-theoretical sense.

However, his system not only needs a transcription of the chords, it also assumes that the metrical grid is known. Since this high-level information is not available in our case, the direct application of Temperley's system is not feasible. Furthermore, the fourth point has to be addressed carefully when extending the scope from classical music to the different types of modern popular music. Temperley himself considers this tension estimation to be consistent only for particular styles (see [Tem01] figure 11.11, p. 314). He also refers to Lerdahl, who in [Ler96] proposed an even more high-level approach to the estimation of tension, taking larger structural levels into account as well. But since the application of Temperley's model to musical audio signals is already problematic, Lerdahl's proposal appears even less feasible.

4.4.2 Rewriting Rules

A different approach is taken by Pachet, who proposes the application of rewriting rules [Pac99]. He addresses the effect of harmonic surprise in jazz music. His argument is that the "rich algebraic structure underlying Jazz chord sequences" has to be considered when talking about expectation and surprise.
His model is based on two ingredients: a set of typical patterns, which he relates to the characteristics of the musical style, and a set of rewriting rules according to which a given chord sequence can be transformed. The basic idea is that even a chord sequence that has never been heard before might not be very surprising to a listener who is familiar with these two sets, as long as they are sufficient to generate it. Instead of creating the two sets manually, Pachet applies machine learning techniques to obtain them. He uses a string compression algorithm [ZL78] to extract the typical patterns. To find the rewriting rules he then uses a table method, utilizing the likelihood of occurrence of a hypothesized rule as a basis for selecting the best rewriting rules. Although he is not directly aiming at a complexity estimation, the idea is not too far from our needs: a highly predictable (little surprising) chord sequence could be identified with a low harmonic complexity and vice versa.

4.4.3 Conclusion

The two ideas for harmonic complexity estimation presented above have two drawbacks. First, as with the melodic complexity models, we again need a transcription of the chords before we can start to analyze the complexity. Many approaches towards chord extraction from musical audio exist (see e.g. [Kla04], [Bel03]), but so far there are no satisfying solutions. Secondly, each of the cited methods seems appropriate only for a limited set of musical styles: the preference rule system more for classical music with traditional rules from harmony theory, the rewriting rules more for jazz music, where a diverse set of chord extensions is very common. So neither of them is optimal as a general approach to harmonic complexity, because we want to deal with many different musical styles at the same time.

4.5 Other Related Work

In this section we take a short look at other approaches to music complexity that address the topic from a slightly different angle.

4.5.1 Information-based Complexity

Pressing in [Pre98] gives a very short outline of three different types of complexity, which he names hierarchical, adaptive, and generative or information-based complexity. With reference to music, the first focusses on the structure of a composition. Pressing mentions Johann Sebastian Bach's Kunst der Fuge as an example where the composer exploits hierarchical complexity: "[N]otes function as elements of linear processes operating at different time scales, and are at the same time compatible with a vertically-oriented chordal progression process." The second type of complexity refers to the temporal changes in a musical work, including aspects like the adaptation to unpredictable conditions or the anticipation of changes in oneself or in the environment. Here Pressing mentions improvisatory performance as an example.

The third type, information-based complexity, is elaborated in some more detail by Pressing. It is inspired by Kolmogorov complexity, but focuses more on the production of music by a human than on the generation of a string by a computer program. Pressing acknowledges that a pure information-theoretic measure falls short in measuring music complexity, because it will always rate random sequences as the most complex ones, which does not go along with human perception; the short sketch below illustrates this effect.
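As a quick illustration of this shortcoming, the following Python sketch (ours, not Pressing's) uses a general-purpose compressor as a crude Kolmogorov-style complexity estimate: the random byte sequence receives the highest per-symbol description length, while a repetitive, "musical" pattern compresses to a fraction of its size.

import random
import zlib

def compression_complexity(s: bytes) -> float:
    """Crude Kolmogorov-style complexity: compressed size per symbol."""
    return len(zlib.compress(s, 9)) / len(s)

random.seed(1)
rand = bytes(random.randrange(256) for _ in range(4096))   # "noise"
loop = bytes([60, 64, 67, 72] * 1024)                      # repetitive pattern
print(compression_complexity(rand))   # close to 1.0: rated most complex
print(compression_complexity(loop))   # much smaller: rated simple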
He argues that humans process complexity by developing routines and heuristics that try to overcome the limitations of memory and attention. His approach to complexity is based on the concept that these routines have different difficulty levels, because some are easier to learn than others. Conversely, then, the complexity of a stimulus is determined by the difficulty assigned to the routines and heuristics needed to produce it.

He demonstrates his concept by estimating the complexity of six very short rhythmic patterns. This is achieved by simply applying a processing-cost function (cognitive cost) to the symbolic-level attribute syncopation on the quarter-note and eighth-note level. His approach shows some similarity to the coding complexity used in the PS-measure for rhythmic complexity (section 4.3.1). Again, the sequence of events is broken up into segments (this time according to two different grids) and a cost is assigned to each segment depending on the type of syncopation that can be found. This approach is very interesting, because it seems a convincing combination of information-theoretic principles and cognitive effects. However, it is purely theoretical and was not evaluated in any experiment with humans. Moreover, the identification of the routines and heuristics needed to generate the stimulus, as well as the assignment of cognitive costs to them, remains a crucial point. In the case of a complex musical audio signal this is a very hard task to solve.

4.5.2 Complexity of Short Musical Excerpts

In [SWV00] Scheirer directly utilizes the statistical properties of psychoacoustic features of short musical excerpts to model perceived complexity. The excerpts were only 5 s long, so abstract levels like melodic or even structural complexity are not accessible. Consequently, Scheirer only considers a joint complexity of the excerpts and does not address individual facets. The following low-level features were identified by him experimentally as the most useful ones for the prediction of human complexity ratings:

1. coherence of the spectral assignment to auditory streams
2. variance of the number of auditory streams
3. loudness of the loudest moment
4. most likely tempo
5. variance of the time between beats

The computation of these features is extensively described in [Sch00] (chapters 4–6). Scheirer compared the mean complexity ratings of a group of 30 human listeners to the output of his descriptors. He reports that, using linear regression on these features, they were strongly significant in predicting the rated complexities; roughly 20% of the variance in the mean complexity ratings was explained by the five features (R² = 0.186, p < 0.001). If we compare this to our idea of different complexity facets, features one and two can be assigned to timbral complexity, feature three to acoustic complexity, and features four and five to rhythmic complexity.

However, it is questionable whether this approach is still successful when complete tracks rather than short excerpts are considered. As Scheirer states himself, he is more concerned with complexity as a musical surface feature. He defines the musical surface to be "the set of representations and processes that result from immediate, preconscious, perceptual organization of an acoustic musical stimulus and that enable a behavioral response". This musical surface might then help in timbre or genre classification tasks.
4.6 Summary

In this chapter we have seen some evidence for the usefulness of complexity descriptors for music. However, many open issues remain before a fully operational set of algorithms for automatic complexity estimation from musical audio signals can be developed. While for the melodic and the rhythmic facets tested models exist in the symbolic domain, they cannot be applied directly due to the lack of reliable extraction algorithms. For harmonic complexity only theoretical suggestions were found, while the timbral, the acoustic, and the structural facets have not been addressed at all so far. In the following two chapters we will therefore proceed with a complete set of operational complexity definitions, a report on first results with practical implementations, and the plans for future steps towards a satisfying solution.

Chapter 5

Own Approaches

In this chapter some of the author's own work on the topic is described. Several of the approaches in section 5.1 have not yet been converted into fully functional implementations, but are presented as a guideline for the direction of future research. Hence, there is some overlap with the next chapter, which is explicitly dedicated to future work.

5.1 Operational Definitions

5.1.1 Acoustic Complexity

Under acoustic complexity we want to capture two different aspects of a musical audio track: its dynamic and its spatial properties. The former are related to the loudness evolution throughout a musical recording, while the latter correspond to the rendering of the auditory scene. From these explanations it is already clear that acoustic complexity is not completely intrinsic to the music, but rather to the recording and the performance. Nevertheless, we found it worthwhile to include this complexity facet, because digital music tracks only exist as recorded performances¹ of music. So it is not possible to listen to one without noticing characteristics of the other.

¹For electronic music the concepts of "recording" and "performing" have to be understood in a slightly wider sense here.

Dynamic Component

The dynamic complexity component relates to the properties of the loudness evolution within a musical track. We refer to it in terms of the abruptness and frequency of changes in dynamic level. The dynamic range has to be considered here as well, since a sudden jump by 6 dB will be perceived as more surprising than one by only 1 dB. A major design decision for the complexity model is the definition of the time scope. With a small frame size one can distinguish dynamically compressed from uncompressed material; the former has less rapid changes in loudness than the latter and would therefore be considered less complex. With longer windows one can detect fades and dynamic changes between larger segments. The regularity of dynamic changes has to be observed as well. An uncompressed drum loop, for example, will have many abrupt changes in short-term dynamic level due to the sharp attacks in front of a very low background loudness level. But because of their periodicity these will not be found very complex compared to random (and thus unpredictable) drum hits, if we expect the "good continuation" gestalt principle to hold.

The first step towards the computation of dynamic complexity is the calculation of an accurate estimate of the perceived loudness.
Psychoacoustics is a well-studied field, and quite reliable measures exist in the literature for the loudness estimation of isolated sounds like sinusoids or noise ([ZF90], [MGB97]). For the complex sounds that form a musical performance, however, it is a very complicated task. Not only temporal and spectral masking have an influence here; subjective components also play a role [SN04]. Finally, the playback level of a digital track cannot be known, so the loudness can only be approximated.

Spatial Component

For the spatial complexity component we consider only stereo recordings and no advanced multi-channel formats. Currently, two-channel stereo tracks form by far the majority of items in digital music file databases. A straightforward example of spatial complexity thus is the disparity of the stereo channels: a quasi-mono situation with similar channels reveals less complexity than a recording with only little correlation between the two channels. But more advanced aspects are to be considered as well, such as the movement of the acoustical center of effect within the stereo image, and sound effects changing the spatial impression (e.g. delay and reverberation effects).

Yet, from the computational point of view, these advanced properties are not trivial to measure. While methods exist to calculate the position of a single sound source in space when it is recorded by an array of microphones [SYSPI05], the identification of the center of effect in the stereo panorama is a somewhat different problem. Since usually complex mixtures of sounds are involved in each channel, some kind of source separation technique would need to be applied, but the current state of the art in auditory scene analysis does not yet provide the necessary tools for that.

The measurement of reverberation, on the other hand, has a solid tradition in room acoustics. An overview of several measures of spaciousness in room acoustics can be found in [Gri99]. Usually these measures take the room impulse response as their input and are thus not suited for a continuous signal. An exception is the InterAural Difference (IAD) introduced by Griesinger, which, as he states, can also be computed as a continuous function for music signals. It is computed according to equation 5.1, where L(t) and R(t) refer to the signals in the left and the right channel, and the equalization eq consists of a low-frequency enhancement of 6 dB per octave below 300 Hz.

IAD = 10 \cdot \log_{10} \left( \frac{eq(L(t) - R(t))^2}{L(t)^2 + R(t)^2} \right)    (5.1)

Still, it must be stated again that the originally intended use of these measures is somewhat different from our needs. The measures are supposed to reveal information about the acoustic properties of a real room when a sound is played back inside it. The two channels in equation 5.1 correspond to the signals recorded by two microphones (e.g. inside an artificial head) in the room. We, on the contrary, already have a stereo recording, which was produced with very different techniques, but would simply have to treat it as the recording of the dummy head in a "virtual" room. An extensive discussion of this problem and some experimental results can be found in [Mas02].
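For illustration, equation 5.1 could be evaluated block-wise as in the following Python sketch. The one-pole emphasis filter used here is our own crude approximation of the 6 dB-per-octave low-frequency boost, not Griesinger's exact equalization, and the 200 ms block length is likewise an assumption.

import numpy as np

def iad(left, right, fs, fc=300.0, block=0.2):
    """Block-wise InterAural Difference (eq. 5.1), a rough sketch.
    The low-frequency emphasis below fc is approximated by adding a
    leaky running integral of the difference signal (an assumption)."""
    d = left - right
    k = 2 * np.pi * fc / fs
    rho = 0.999
    acc, eq = 0.0, np.empty_like(d)
    for n, x in enumerate(d):          # eq(d) ~ d + k * leaky_integral(d)
        acc = rho * acc + x
        eq[n] = x + k * acc
    n = int(block * fs)
    out = []
    for i in range(0, len(d) - n, n):  # one IAD value per block
        num = np.mean(eq[i:i + n] ** 2)
        den = np.mean(left[i:i + n] ** 2 + right[i:i + n] ** 2) + 1e-12
        out.append(10 * np.log10(num / den + 1e-12))
    return np.array(out)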
5.1.2 Timbral Complexity

There is no clear and precise definition of timbre that could be regarded as a common agreement in the music analysis field. The American Standards Association ([Ass60] p. 45) released the following statement: "[Timbre is] that attribute of auditory sensation in terms of which a listener can judge that two sounds similarly presented and having the same loudness and pitch are dissimilar." For our purpose we think of timbre as the entity most tightly knit with sound production (i.e. the source of the sound and the way this source is excited). However, this concept of source should not be considered in a strictly physical sense here. In the case of a group of 20 violinists playing in unison, for example, we would refer to the group as the sound source rather than to every individual violin.

We can then derive several specifications of the general ideas of complexity itemized in section 3.1.3. This gives us features like the number of distinguishable instruments or sound textures present in the music, the rate at which the leading instruments change, or the amount of modulation of the sound sources. As reported in [HPD03], source separation and instrument recognition systems for arbitrary polyphonic music signals are not yet available. Nevertheless, since the exact classification of individual instruments is not necessary for our purposes, machine-learning and clustering techniques might be applicable. Aucouturier and Sandler report on applying HMMs for music segmentation [AS01] in a manner that might be useful for timbre complexity estimation as well. If a finite set of timbre states can be assigned to the music, it is also possible to apply techniques from information theory like entropy estimation or compression algorithms. For example, the LZ77 algorithm by Ziv and Lempel [ZL77] could be a good choice, since its limited memory buffer resembles, in a simplified way, human short-term memory when processing a musical input [Sny00].
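A minimal sketch of this symbolization idea, assuming per-frame timbre features (e.g. MFCCs) are already given as a frames-by-coefficients array, could combine clustering with a standard compressor (Python; illustrative only, not a validated model):

import numpy as np
import zlib
from sklearn.cluster import KMeans

def timbral_complexity(frames, n_states=16):
    """Map per-frame timbre features to a finite alphabet of timbre states
    and use the compression ratio of the symbol string as a crude
    complexity estimate (sketch of the idea described above)."""
    labels = KMeans(n_clusters=n_states, n_init=10,
                    random_state=0).fit_predict(frames)
    symbols = labels.astype(np.uint8).tobytes()
    return len(zlib.compress(symbols, 9)) / len(symbols)

# A track that keeps revisiting the same few timbre states compresses well
# (low estimate); one that wanders through feature space does not.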
5.1.3 Melodic Complexity

Despite the fact that well-studied models are already available for the assessment of melodic complexity based on a symbolic representation, when it comes to musical audio this can be considered the most difficult of the facets. This is due to melody being a very abstract description that is hard to access from the audio signal. As pointed out in section 4.2.3, there is no extraction algorithm available yet that can reliably transcribe the melody of an arbitrary musical audio file.

One strategy could be to focus on melody-related features that are more feasible to extract and then to try to develop a new model that approximates melodic complexity based only on these features. A possible candidate would be, for example, the harmonic pitch class profile (HPCP) [Gom04]. However, this is not a very straightforward approach, because the HPCP contains information about all tonal content, not only about the melodic voice. Furthermore, the octave information is lost, and from that representation it is still a long way to anything close to a note segmentation. A valid solution thus would be to focus instead on the melody extraction task and to work towards a usable melody transcription, in order to be able to apply the already available complexity models. Yet this is a PhD topic by itself and will not be addressed within the scope of the research presented here.

5.1.4 Rhythmic Complexity

For rhythmic complexity we find a similar picture as for the melodic facet. With the PS-measure a model exists that has at least been assessed preliminarily. But again, the model operates on an abstraction level which is not easy to reach when starting from the audio signal. In this case algorithms exist that are already closer to a reliable extraction than in the case of melody extraction (e.g. [DBDS03], [Kla03]), but still the results are not perfect.

Despite being a high-level concept, rhythm is not as abstract as melody. For rhythmic complexity estimation it is therefore more promising to find a way around an explicit and complete transcription than in the case of melodic complexity. The danceability descriptor based on detrended fluctuation analysis is one example (section 4.3.2). Others are the rhythmic features used by Scheirer for the complexity estimation of short musical excerpts (section 4.5.2). It seems better to consider the variances in the relative positions of extracted rhythmic events than to rely on their absolute positions with respect to a metrical grid, which also has to be determined first. The extraction is already error-prone, and the determination of the grid, based on the extracted events, introduces a second source of error. This second error is much more severe for the complexity estimation, since all events are evaluated with respect to the grid. Several wrongly estimated onsets, in contrast, will not affect the global result as drastically.

5.1.5 Harmonic Complexity

In section 4.4.3 we already discussed the two drawbacks of the presented methods. As with rhythmic complexity, it seems more advantageous to explore alternative options that avoid the need for a precise transcription. A possibility would be the application of the HPCP that we mentioned in section 5.1.3. As explained, this feature is strongly related to the harmonic content of the music and therefore an attractive candidate for our purposes. Several approaches towards chord segmentation and recognition rely on this or similar features (e.g. [Fuj99], [SE03]). It is also interesting from the perceptual point of view, because the concept of pitch classes (i.e. the association of frequencies with the twelve pitch chroma steps in western music) is an important step in the process of harmony perception.

A common and important problem (among several others) in chord recognition is the detection of the chord boundaries. However, since we are not interested in a transcription, it might be possible to avoid this step in the harmonic complexity estimation. Instead, the harmonic content represented by the HPCP could be analyzed continuously. It could be mapped into the spiral array proposed by Chew, which defines a three-dimensional space of harmonic instances [Che03]. The spiral array has the property that spatial proximity also reflects, to some extent, the musical proximity of harmonies. As the song evolves, the path through this space could be recorded and then analyzed; the average speed of movement within this space could be taken as an indicator of harmonic complexity.

A second aspect is the clarity of the harmonies that appear in the music. In the context of key estimation, Gomez defines the tonal strength through the strength of the correlation between the mean HPCP profile of the track and the closest key profile [GH04]. In a similar manner she uses the correlation of instantaneous HPCP vectors with prototypical chord profiles as a chord strength measure. Both measures are related to our concept of harmonic complexity and could be combined with the spiral array approach.
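The following Python sketch illustrates two raw indicators in the spirit of this section, assuming HPCP vectors are already extracted. The binary triad templates are our simplification of the prototypical chord profiles, and the frame-to-frame distance stands in for the movement speed in a harmonic space; neither is a validated model.

import numpy as np

def chord_templates():
    """24 binary major/minor triad profiles over the 12 pitch classes."""
    tpl = []
    for root in range(12):
        for chord in ([0, 4, 7], [0, 3, 7]):   # major, minor
            v = np.zeros(12)
            v[[(root + i) % 12 for i in chord]] = 1.0
            tpl.append(v / np.linalg.norm(v))
    return np.array(tpl)

def harmonic_indicators(hpcp):
    """hpcp: frames-by-12 array of pitch class profiles. Returns the mean
    chord strength (clarity of harmonies) and the mean frame-to-frame
    movement speed (rate of harmonic change)."""
    tpl = chord_templates()
    norm = hpcp / (np.linalg.norm(hpcp, axis=1, keepdims=True) + 1e-12)
    strength = (norm @ tpl.T).max(axis=1).mean()
    speed = np.linalg.norm(np.diff(norm, axis=0), axis=1).mean()
    return strength, speed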
5.1.6 Structural Complexity

Musical structure forms one of the highest levels of abstraction in content analysis. It is unique compared to the other facets in the sense that all of them are potentially relevant for its computation. We refer to structure on a rather macroscopic level (i.e. in terms of intro, verse, and chorus rather than motive or theme).

Along with our remarks in section 3.1.3, we can identify attributes of structural complexity such as the number of distinguishable parts or the level of periodicity of their appearance. It would also be desirable to estimate the dissimilarity of consecutive parts: very contrasting parts following each other would be surprising and thus probably enhance the perceived complexity.

Once more we have to face the fact that before we can perform any structural complexity processing, the structure itself has to be extracted first. Various approaches to this problem have been taken and are still being explored (see e.g. [CV03], [SBM+02]). A general-purpose solution has yet to be found.

5.2 Implementations

Following the review of approaches for computing different complexity facets of musical audio, in this section two implementations by the author are described in detail. Both have been programmed in MATLAB and were tested on a large music collection. The evaluation results are reported in section 5.3.

5.2.1 The Dynamic Component of Acoustic Complexity

As mentioned in section 5.1.1, the computation of instantaneous loudness is a key issue for estimating the dynamic component of acoustic complexity. For reasons of computational efficiency, the implementation described here uses a simplified loudness model described by Earl Vickers in [Vic01]. Some modifications and additions have been made in order to make the algorithm fit for the desired task.

As a first step the algorithm applies a very simplified "B" weighting function to the audio signal in order to account for the human ear's transfer function. For efficiency reasons this weighting is done with a first-order Butterworth high-pass filter with a cut-off frequency of 200 Hz, so actually only the low end of the weighting function is approximated. In the case of music signals this is tolerable, since they have much more energy in the bass and mid-range compared to the high frequencies.

The pre-emphasized signal x_pre(n) is then fed into a root-mean-square level detector with an integrated smoothing function. This level detector consists of a running average V_ms (eq. 5.2) that is downsampled according to the chosen frame length N.

V_{ms}(n) = c \cdot V_{ms}(n-1) + (1-c) \cdot x_{pre}^2(n), \quad \text{with} \quad c = e^{-1/(\tau F_s)}    (5.2)

F_s corresponds to the sampling frequency and τ is the time constant for the smoothing (35 ms in this implementation). The downsampling is done according to equation 5.3. We chose a frame size of 200 ms, corresponding to N = 8820 at a sampling rate of 44.1 kHz. This time span roughly resembles the energy integration function of the human hearing system.

V_{rms}(i) = \sqrt{V_{ms}(N \cdot i + N - 1)}    (5.3)

The instantaneous level is then converted to dB by calculating

V_{dB}(i) = 20 \cdot \log_{10}(V_{rms}(i)).    (5.4)

In order to avoid silence at the beginning or the end of the track having an effect on the complexity estimation, successive frames with a level below -90 dB are removed when they appear at either end. Afterwards, the global loudness level L according to Vickers is calculated.
L is a weighted average of all M instantaneous level estimates, where the louder ones are assigned a higher weight:

L = \sum_{i=0}^{M-1} w(i) \cdot V_{dB}(i), \quad \text{with} \quad w(i) = \frac{u(i)}{\sum_{j=0}^{M-1} u(j)} \quad \text{and} \quad u(j) = 0.9^{-V_{dB}(j)}    (5.5)

The emphasis on the loud frames is grounded in psychoacoustic findings. Zwicker and Fastl [ZF90], for example, suggest that the loudness of a dynamically changing sound can be characterized by the loudness level that only 5% of the frames exceed.

Figure 5.1: Instantaneous loudness (dots), global loudness (grey solid line), and average distance margin (dashed lines) for four example tracks: a) heavy metal, b) pop song, c) jazz guitar solo, d) classical symphony.

In this implementation a variation of Vickers' dynamic spread is used as the final dynamic complexity measure. It is simplified in the sense that periodic loudness variation and the suddenness of changes are not considered. Instead, it basically comprises the mean distance from the global loudness level (eq. 5.6), so that high values correspond to higher complexity and vice versa.

C_{dyn} = \frac{1}{M} \sum_{i=0}^{M-1} |V_{dB}(i) - L|    (5.6)

Figure 5.1 shows the computational results of the algorithm for four prototypical music tracks. In figure 5.1 a we see a highly saturated heavy metal track with a basically flat dynamic level; the global loudness was estimated as -14.9 dB, and the average deviation from this level is only 1.37 dB. Figure 5.1 d shows the other extreme, a recording of classical music that varies in loudness between -50 dB and -20 dB. Here the algorithm estimates a clearly lower global loudness of -24.8 dB and an average deviation of 8.05 dB. In between lies a pop song (figure 5.1 b) moderately varying in loudness, with a global level of -16.4 dB and an average deviation of 3.71 dB. The jazz guitar solo from figure 5.1 c is recorded at a very low volume; the estimated global loudness is only -25 dB. Despite some singular drops down to -50 dB due to short pauses between played notes, the average deviation amounts to only 4.5 dB and is thus considerably smaller than that of the classical recording.
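Putting equations 5.2–5.6 together, the descriptor can be summarized in a few lines. The following Python sketch mirrors the MATLAB implementation described above; the Butterworth design call is scipy's, and the silence-trimming edge cases are simplified.

import numpy as np
from scipy.signal import butter, lfilter

def dynamic_complexity(x, fs=44100, tau=0.035, frame=0.2):
    """Sketch of eqs. 5.2-5.6: smoothed RMS level, Vickers-style weighted
    global loudness, and the mean absolute deviation from it."""
    b, a = butter(1, 200.0 / (fs / 2), "highpass")   # simplified "B" weighting
    xpre = lfilter(b, a, x)
    c = np.exp(-1.0 / (tau * fs))
    vms = lfilter([1 - c], [1, -c], xpre ** 2)       # running mean square (5.2)
    N = int(frame * fs)
    vrms = np.sqrt(vms[N - 1::N])                    # downsampling (5.3)
    vdb = 20 * np.log10(vrms + 1e-12)                # instantaneous level (5.4)
    keep = np.where(vdb >= -90)[0]                   # trim silent ends
    vdb = vdb[keep[0]:keep[-1] + 1]
    u = 0.9 ** (-vdb)
    L = np.sum(u / u.sum() * vdb)                    # global loudness (5.5)
    return np.mean(np.abs(vdb - L))                  # dynamic complexity (5.6)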
5.2.2 Danceability

The implementation described here follows the description by Jennings et al. in [JIM+04]. Where exact specifications were missing, the author tried to integrate reasonable solutions. The experimental findings obtained with this implementation on a large music database were reported at the 118th AES Convention [SH05].

As a first step the audio signal is segmented into non-overlapping blocks of 10 ms length. For each block the standard deviation s(n) of the amplitude is computed. The values s(n) form a bounded, non-stationary time series, which can be associated with the averaged physical intensity of the audio signal in each block (see figure 5.2).

Figure 5.2: Excerpts from the time series s(n) for three example pieces from different musical genres (classic, techno, pop).

In order to obtain the unbounded time series y(m), s(n) is integrated:

y(m) = \sum_{n=1}^{m} s(n)    (5.7)

This integration step is crucial in the process of the DFA computation, because for bounded time series the DFA exponent (our final feature) would always be 0 when time scales of greater size are considered. This effect is explained in more detail in [Pen05]. The series y(m) can be thought of as a random walk in one dimension.

y(m) is now again segmented into blocks of τ elements length. This time we advance only by one sample from one block to the next, in the manner of a sliding window. There are two reasons for this extreme overlap. First, we obtain more blocks from the signal, which is of interest since a larger number of blocks yields better statistics. Secondly, we avoid a possible synchronization with the rhythmical structure of the audio signal, which would lead to arbitrary results depending on the offset we happen to have. However, performing the computation in this manner increases the number of operations enormously.

From each block we now remove the linear trend ŷ_k and compute D(k, τ), the mean of the squared residual:

D(k, \tau) = \frac{1}{\tau} \sum_{m=0}^{\tau-1} \left( y(k+m) - \hat{y}_k(m) \right)^2    (5.8)

We then obtain the detrended fluctuation F(τ) of the time series by computing the square root of the mean of D(k, τ) over all K blocks:

F(\tau) = \sqrt{\frac{1}{K} \sum_{k=1}^{K} D(k, \tau)}    (5.9)

As indicated, the fluctuation F is a function of τ (i.e. of the time scale in focus). The goal of DFA is to reveal correlation properties on different time scales. We therefore repeat the process above for different values of τ within the range of our interest. Jennings et al. [JIM+04] use a range from 310 ms (τ = 31) to 10 s, not specifying the step size in their paper. Related to the musical signal, these time scales reach from the beat level through the bar level up to the level of simple rhythm patterns.

The DFA exponent α is defined as the slope on a double-log graph of F over τ (eq. 5.10), as shown in figure 5.3 for the three example tracks. It therefore makes sense to increase τ by a constant multiplication factor rather than by a fixed step size. Apart from giving equally spaced supporting points on the logarithmic axis, this also reduces the number of computational operations without gravely affecting the accuracy. We chose a factor of 1.1, giving us 36 different values of τ covering time scales from 310 ms to 8.8 s.

For small values of τ an adjustment of the denominator is needed when computing α (see [BGH+95]), giving us the following formula for the DFA exponent:

\alpha(i) = \frac{\log_{10}\left( F(\tau_{i+1}) / F(\tau_i) \right)}{\log_{10}\left( (\tau_{i+1} + 3) / (\tau_i + 3) \right)}    (5.10)

As τ grows, the influence of the correction becomes negligible. In case the time series has stable fractal scaling properties within the examined range, the double-log graph of F over τ is a straight line, making α(i) a constant function. We find a constant value of 0.5 for a completely random series (white noise), a value of 1 for a series with 1/f-type noise, and 1.5 for a Brown noise series (integrated white noise) [Pen05].

Figure 5.3: Double logarithmic plots of the mean residual over time scales (classic, techno, pop).

Figure 5.4: DFA exponent functions for the three example tracks from figure 5.2.

For music signals we normally do not have stable scaling properties (see figure 5.4).
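Before turning to the behaviour of α on real music, the following Python sketch summarizes the whole procedure described above (the thesis implementation is in MATLAB; the hop parameter is our addition to keep the sketch tractable, with hop = 1 reproducing the one-sample sliding window):

import numpy as np

def danceability(x, fs=44100, factor=1.1, n_scales=36, hop=1):
    """Sketch of the danceability descriptor: 10 ms intensity series,
    integration (5.7), detrended fluctuation (5.8, 5.9), DFA exponent
    with small-scale correction (5.10), averaged over all scales."""
    n = int(0.01 * fs)                                   # 10 ms blocks
    nblk = len(x) // n
    s = x[:nblk * n].reshape(nblk, n).std(axis=1)        # intensity s(n)
    y = np.cumsum(s)                                     # integration (5.7)
    taus = [int(round(31 * factor ** i)) for i in range(n_scales)]
    F = []
    for tau in taus:
        t = np.arange(tau)
        D = []
        for k in range(0, len(y) - tau, hop):            # sliding windows
            w = y[k:k + tau]
            a, b = np.polyfit(t, w, 1)                   # linear trend
            D.append(np.mean((w - (a * t + b)) ** 2))    # residual (5.8)
        F.append(np.sqrt(np.mean(D)))                    # fluctuation (5.9)
    F, taus = np.array(F), np.array(taus, dtype=float)
    alpha = (np.log10(F[1:] / F[:-1])
             / np.log10((taus[1:] + 3) / (taus[:-1] + 3)))   # (5.10)
    return alpha.mean()   # low mean alpha: danceable; high: not danceable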
Unlike heart rate time series, for example, music shows much more variance in α(i). Nevertheless, we find that music with sudden jumps in intensity generally yields a lower level of α(i) than music with a more smoothly varying series of intensity values. Thus, music with pronounced percussion events and emphasized note onsets shows lower α values than music of a more floating, steady nature. There is also a relationship between strong periodic trends and the α function.

Figure 5.4 shows the evolution of α over the different time scales for three musical pieces. As can be seen, the DFA exponent varies significantly within each single piece. The most stable scaling behaviour is found for the classical piece at short time scales; in contrast, the pop piece shows an intermediate and the techno piece a high instability. This is due to the presence of a strong and regular beat pattern in the two latter cases (see figure 5.2). In the techno piece the periodic beat dominates the intensity fluctuation completely, since intensity variations on larger time scales are negligible in comparison. This strong periodic trend deteriorates the scaling properties of the series and causes α to drop significantly. Towards larger time scales, however, the influence of the periodic intensity variation fades off and α rises back towards its normal level. In the pop piece there is also a regular beat, but it is less dominant than in the techno piece. As can be seen in figure 5.2, there are also some noticeable changes in intensity on a larger time scale. Still, α is clearly decreased by the periodic trend; towards larger time scales we can observe the same effect as in the techno piece. For the classical piece no dominant, regular beat pattern can be identified in the time series. Thus, the scaling properties are not affected in the corresponding range. But in contrast to the other two examples, the series reveals a larger-scale pattern in some parts, which can also be seen in figure 5.2. This causes α to drop in the upper range.

In order to arrive at an indicator for danceability, and thus a certain aspect of rhythmic complexity, the α values have to be reduced further. While different ways of doing this are conceivable, in the first implementation simply the average α level was computed for each track. A high value corresponds to a high complexity (not danceable), a low value to a low complexity (highly danceable).

5.3 Evaluation

The evaluation of complexity descriptors is not easy. The ideal condition for evaluation is, of course, a solid ground-truth annotation against which the performance of the extraction algorithms can be measured. Unfortunately, such ground-truth annotations are rather troublesome to come by. Especially for complexity we have the problem that, unlike for genre or tempo, available annotations usually do not cover this issue. The manual annotation of one's own testing material takes a lot of time and resources and might not be feasible in some cases.

The two descriptor implementations described above were evaluated by subjective means on the basis of a large music collection. This was done by randomly picking tracks at different complexity levels and judging their danceability and dynamic complexity in direct comparison by listening. A formal user study has not been carried out yet. For the danceability descriptor a more objective evaluation was done.
General statistical methods and machine learning methods were applied in order to explore relations between the semantic labels, or certain artists, and the DFA exponent. The rationale behind this is to demonstrate a systematic variation of the DFA exponent subject to certain semantic attributes assigned to the music. More details about the methodology and the findings are provided in the following sections.

5.3.1 The Dataset

The two descriptor implementations described above were computed on a large music collection. A dataset of 7750 tracks from MTG-DB [CKF+04], the digital music collection of the MTG lab, was used in the experiment. Each track refers to a full piece of music. The dataset also contained annotated semantic labels for each item, which were obtained from music experts and aficionados and had been manually assigned to the tracks. In our experiments we used the artist names and also "tone" labels consisting of abstract attributes associated with the music, such as "Rousing", "Sentimental", or "Theatrical".

The list of "tone" labels comprises a total of 172 different entries. In the statistical analysis only a subset of 136 labels was considered, because the remaining ones appeared less than 100 times each. It must be noted that these labels are originally assigned to the artists and not to the individual tracks. Therefore a certain degree of fuzziness has to be accepted with these descriptions when working on the track level. The dataset contained very different, mostly popular styles of music from a total of 289 different artists. A maximum of 34 labels was assigned to a single artist, while the average was 11 labels per artist. Figure 5.5 shows a bar plot of the eight labels that were assigned to the highest number of artists. The average number of artists sharing a label was 18.

Figure 5.5: Top eight labels with the highest number of assigned artists.

5.3.2 Results

By manual random checks it was found that the complexity estimations at the extreme ends were the most consistent ones. Comparing the tracks from these regions with each other and with the intermediate ones, the underlying concept of the descriptors immediately became apparent. This effect can easily be seen in figure 5.6, where 60 highly danceable techno tracks can be almost perfectly separated from 60 non-danceable film score tracks by considering only their average α value. For the dynamic complexity, the low end contained tracks that were either highly saturated or simply low-quality recordings, while at the other end we found mostly high art music recordings and professionally mastered tracks. The fine-grained ranking within a local region, however, did not appear comprehensible in many cases. This was especially noticeable in the very dense area of intermediate values. So rather than a continuous ordering of the entire collection, a coarse classification into 3–5 complexity levels was achieved. This is a comfortable number of discrete classes for a semantic descriptor, considering human memory capabilities.

Figure 5.6: α levels for 60 techno (o) and 60 film score tracks (x), unordered.

The results of the statistical tests for the danceability descriptor sustain the findings from the manual random evaluation. Strong coherence of high statistical significance was found for several of the "tone" labels that are semantically close to the concepts "danceable" or "not danceable" respectively.
For example, the labels "Party/Celebratory" and "Energetic" have a clear relation with danceability in the context of music, whereas "Plaintive" and "Reflective" appear to be more appropriate descriptions for music that is not well suited for dancing. The results reveal a consistency on a high abstraction level, even exceeding the aspect of danceability. Figure 5.7 shows how some labels are distributed over the deciles, from the lowest to the highest α values in the collection. A strong skew is apparent here, with certain labels being highly over-represented in either the highest or the lowest deciles.

Figure 5.7: Distributions over deciles for the twelve labels with the most significant deviation from an equal distribution (solid horizontal lines): Clinical, Hypnotic, Detached, Wintry, Wistful, Party/Celebratory, Romantic, Reflective, Plaintive, Ominous, Intimate, and Autumnal.

The distribution of α values over the whole collection was normal, with a mean of 0.863 and a standard deviation of 0.022. The tracks assigned to each label were tested for significant deviations from this distribution with the generalized t-test (eq. 5.11). Only labels with normally distributed α values were considered. Of these, 24 showed a significantly higher and 35 a significantly lower mean value.

-2.58 < \frac{0.863 - \bar{\alpha}_{label}}{\sqrt{\frac{0.022^2}{7750} + \frac{\sigma_{label}^2}{n_{label}}}} < 2.58    (5.11)

The value ᾱ_label is the mean α value for the considered label, σ²_label is the corresponding variance, and n_label is the number of tracks having this label assigned. Table 5.1 shows the ten labels that yielded the highest significance in their deviation from the global distribution in either direction.

Label               ᾱ       n    |  Label            ᾱ       n
Party/Celebratory   0.796   706  |  Romantic         0.911   1399
Clinical            0.761   489  |  Wistful          0.909   1308
Hypnotic            0.804   951  |  Plaintive        0.922    659
Energetic           0.820   898  |  Reflective       0.901   1805
Visceral            0.808   422  |  Calm/Peaceful    0.908   1102
Trippy              0.824   998  |  Autumnal         0.916    604
Outrageous          0.781   102  |  Intimate         0.897   1709
Exuberant           0.839  1383  |  Stately          0.908    730
Irreverent          0.830   657  |  Gentle           0.892   1327
Sparkling           0.790   116  |  Elegant          0.886   2506

Table 5.1: The ten most significantly deviating labels in each direction (lower ᾱ on the left, higher ᾱ on the right).

Looking at the two lists of labels, a certain affinity can be noted in many cases on either side. The group of labels for higher α evokes associations of softness, warmness, tranquility, and melancholy. For the others we might form two subgroups, one around terms like exuberance and vehemence, the other around tedium and coldness. Comparing labels from both lists with each other, we can identify several almost antonymous pairs, for example: "Clinical" – "Intimate", "Outrageous" – "Refined/Mannered", "Boisterous" – "Calm/Peaceful", "Carefree" – "Melancholic".
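For reference, the test of equation 5.11 translates directly into a few lines of Python (a sketch of the procedure described above, with the global statistics passed in as defaults):

import numpy as np

def label_deviates(alpha_label, mean_all=0.863, sd_all=0.022,
                   n_all=7750, z_crit=2.58):
    """Generalized t-test of eq. 5.11: does the mean DFA exponent of the
    tracks carrying one label deviate significantly from the global
    distribution? alpha_label: array of per-track alpha values."""
    n = len(alpha_label)
    z = (mean_all - np.mean(alpha_label)) / np.sqrt(
        sd_all ** 2 / n_all + np.var(alpha_label, ddof=1) / n)
    return abs(z) > z_crit, z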
In the machine learning experiments two artist classification tasks were tried. It must be stated again here that the "tone" labels mentioned above originally belong to the artists and thus only indirectly to the individual tracks. In a two-class decision experiment we used 238 Frank Sinatra tracks and 238 tracks from nine other artists who had either the label "Brittle" or "Outrageous" assigned to them. For the artist "Sinatra" a total of 18 labels were listed in our dataset, among them "Romantic", "Calm/Peaceful", and "Sentimental". From the results of the statistical analysis we would expect the Sinatra songs to be distributed around a greater value of α than the others; the classes should therefore be separable up to a certain degree. It must be noted that among the nine selected artists we also find assigned labels like "Wistful" and "Reflective", which are linked to higher α values as well (see table 5.1). The classification with a decision table yielded a success rate of 73%, which is clearly above the 50% chance level.

In a second experiment we used three different classes: 108 tracks composed by Henry Mancini, 65 tracks composed by Bernard Herrmann, and 52 tracks of dance music with a strong beat. We purposely did not use the labels for selecting the artists in this case. Mancini and Herrmann were both film music composers, but while Herrmann mainly uses "classical" orchestration, Mancini often arranges his music in a jazz-like style. We had to select dance music from different artists, because there was no single one with a sufficient number of tracks in the collection. In terms of the DFA exponent we would expect to find the highest values associated with Herrmann's music, because it is the least danceable in general terms. Intermediate values can be expected for Mancini's tracks, since many of them at least possess a pronounced beat. The lowest values should be found for the dance music, which has strong and regular beat patterns. With a success rate of 67% the classification again reached clearly above the chance level (48%). Furthermore, the confusion matrix shows exactly the expected situation:

                    Predicted class
True class     Herrmann   Mancini   Dance
Herrmann             41        23       1
Mancini              18        84       6
Dance                 1        25      26

While the classes "Herrmann" and "Dance" are almost perfectly separable, "Mancini" takes an intermediate position, showing a considerable overlap with both neighbouring classes. The decision table identified the optimal thresholds between the classes to be α_HM = 1.0 and α_MD = 0.7.

5.3.3 Concluding Remarks

Summarizing, we can say that there is strong evidence for the danceability descriptor being related to semantic properties of the music tracks it was computed on. From our experiments and observations the hypothesis put forward by Jennings et al. seems to be valid, and the DFA exponent can be considered a good indicator for the danceability of music. However, this should be understood in a broad sense, classifying music into a small number of categories from "extremely easy" over "moderately easy" to "very difficult" to dance to. Currently, a fine-grained ordering by the DFA exponent inside such a category is not beneficial. Due to subjectivity effects such an ordering might not prove useful anyway. By averaging the DFA exponent function we used an extremely compact and simple representation in the experiments. It should be possible to improve the results by a more sophisticated reduction of the function α. It can further be concluded from our results that the DFA exponent is meaningful also in revealing even higher-level attributes of music. It thus might form a valuable addition for different music classification tasks, like artist identification or musical genre recognition.
5.3.3 Concluding Remarks

Summarizing, we can say that there is strong evidence for the danceability descriptor being related to semantic properties of the music tracks it was computed on. From our experiments and observations the hypothesis put forward by Jennings et al. seems to be valid, and the DFA exponent can be considered a good indicator for the danceability of music. However, this should be understood in a broad sense, classifying music into a small number of categories from “extremely easy” over “moderately easy” to “very difficult” to dance to. Currently, a fine-grained ordering by the DFA exponent inside such a category is not beneficial. Due to subjectivity effects such an ordering might not prove useful anyway. By averaging the DFA exponent function we used an extremely compact and simple representation in the experiments. It should be possible to improve the results by a more sophisticated reduction of the function α. It can further be concluded from our results that the DFA exponent is also meaningful in revealing even higher-level attributes of music. It thus might form a valuable addition for different music classification tasks, like artist identification or musical genre recognition.

Chapter 6

Future Work

We have seen in the preceding chapters that for a complete solution to the problem of automatic complexity estimation many parts are still incomplete or missing. In this chapter we will describe a more detailed road map of planned and possible future research. In several cases the author has already done initial tests, which can be used as a basis for further exploitation. We will first look at the tasks that are to be covered by the PhD thesis and then list a few issues that are out of scope for the thesis, but still worthwhile to explore scientifically.

6.1 PhD Thesis

6.1.1 Filling the Gaps

With the algorithms described in section 5.2 we have already partly covered two complexity facets: the rhythmic and the acoustic one. Nevertheless, the two algorithms are not yet in their final version and might be further enhanced during the research work for the PhD thesis. In a similar manner as in section 5.1 we will now go through the different complexity facets while putting emphasis on the concrete implementation steps.

Acoustic Complexity

The current implementation of the dynamic component already captures a certain part of what was listed as relevant in chapter 5. Lacking so far are the aspect of abruptness in loudness changes and the consideration of periodicities. A possible way to address this would be to perform a linear prediction on the loudness envelope and to estimate the average prediction error as a measure of complexity. Also, the derivative of the loudness envelope could be considered for further analysis. Furthermore, it has to be tested how the different values are finally combined, because the goal is a single number reflecting the dynamic component of acoustic complexity.

Considering the spatial component, we already made reference to Griesinger [Gri99] and the InterAural Difference measure (eq. 5.1 on page 38). A second option is a measure based on the magnitude squared coherence (MSC), which is proposed by Wittkop ([Wit01], chapter 4) as a method for estimating the diffusiveness of acoustical situations with binaural hearing aids. The MSC for a two-channel signal (channels x and y) was originally proposed by Allen et al. [ABB77] as

    MSC(f, n) = ( |Φ_xy(f, n)| / √(Φ_xx(f, n) Φ_yy(f, n)) )²,        (6.1)

where Φ denotes the smoothed Fourier transform of the correlation function¹, and f and n are the frequency and time-frame indexes of the short-term spectra derived from the signal. The smoothing is done by a first-order low-pass filter along the time dimension. Wittkop proposes to collapse the frequency resolution to critical bands and introduces weighting and transformation functions in order to arrive at a compact and stable long-term coherence estimation.

¹ The correlation is computed in the frequency domain, while the final step, the inverse transformation back to the time domain, is left out (see [PFTV92], pp. 545–546).
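As a rough illustration of eq. 6.1, the following sketch estimates the MSC per spectral bin with recursive first-order low-pass smoothing of the cross- and auto-spectra along the time dimension; the frame length, hop size, and smoothing coefficient are assumptions for illustration, not the values used in [Wit01].

    import numpy as np

    def msc(x, y, frame=1024, hop=512, lam=0.9):
        """Magnitude squared coherence per bin (sketch of eq. 6.1).
        x, y: the two channels; lam: first-order low-pass coefficient."""
        win = np.hanning(frame)
        n_frames = (len(x) - frame) // hop + 1
        phi_xy = np.zeros(frame // 2 + 1, dtype=complex)
        phi_xx = np.zeros(frame // 2 + 1)
        phi_yy = np.zeros(frame // 2 + 1)
        coh = np.zeros((n_frames, frame // 2 + 1))
        for i in range(n_frames):
            X = np.fft.rfft(win * x[i * hop : i * hop + frame])
            Y = np.fft.rfft(win * y[i * hop : i * hop + frame])
            # recursive smoothing along the time dimension
            phi_xy = lam * phi_xy + (1 - lam) * X * np.conj(Y)
            phi_xx = lam * phi_xx + (1 - lam) * np.abs(X) ** 2
            phi_yy = lam * phi_yy + (1 - lam) * np.abs(Y) ** 2
            coh[i] = np.abs(phi_xy) ** 2 / (phi_xx * phi_yy + 1e-12)
        return coh  # values near 1: coherent image; near 0: diffuse

Averaging such frame-wise coherence values over time and frequency (possibly after collapsing to critical bands, as Wittkop suggests) would then yield a single spatial figure per track.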
Rhythmic Complexity

With the danceability descriptor we have already covered one part of the rhythmic complexity facet. As mentioned, the danceability estimation can still be optimized, especially with respect to the intermediate range. One approach is the optimization of the parameters in the algorithm, mainly the step size and the range of the time scales τ. Another option is a more detailed analysis of the α function: instead of simply averaging all the values, there might be more intelligent ways to reduce the representation and arrive at a danceability estimate. Different pre-processing can also be considered as an option. The algorithm could be applied in different frequency bands, or, instead of the intensity envelope, a different time series related to rhythmical events might be used (e.g. the synthesized output of an onset detector).

Apart from the danceability estimation, which operates on a very low abstraction level, it would probably be beneficial to explore approaches that take a simple but perceptually relevant form of rhythmic transcription as an input. While the PS-measure demands an already too abstract representation, in the form of notated, isolated rhythm patterns, other methods could be applied on an intermediate level. Assuming a reasonably reliable onset detection algorithm as a preprocessing unit, it would be interesting to explore the variation of the inter-onset intervals. Also, an entropy coding method can be applied to the extracted onset series, presumably best after a temporal quantization. Recurring onset patterns would then decrease the entropy and indicate a lower complexity.

Timbral Complexity

Entropy coding could also be the solution for the timbral complexity estimation. At least after some initial experiments with unsupervised clustering and Hidden Markov Models, the methods from information theory appeared the most promising in comparison. The key problem here is to find the right audio features and to transform them into a string of discrete symbols from a finite alphabet. Since the number of features that are “somehow related” to timbre is enormous, the difficulty lies in selecting the combination that best resembles the human perception of timbre. At the same time the selection should cover all perceptually relevant parts and remain very compact in order to keep the resulting alphabet size reasonable. For the quantization process (which is unavoidable, because the timbre features will be continuous) a hysteresis function has proved useful in initial experiments. By having different thresholds for the upward and downward transitions between two partitions, oscillation between symbols for values close to the borderline is avoided.

A systematic problem with timbral complexity is furthermore that of a dominant singing voice or speech. In terms of signal analysis, a singer who produces different vowels is also producing different timbres. In fact, vowels (or phonemes in general) can only be differentiated by their different timbres. On the other hand, a human listener who clearly perceives the different timbres will not realize it, because he recognizes that they encode speech. We can illustrate this effect by comparing the timbral complexity of a monotonous voice reading a text with that of, for example, a filtered sawtooth waveform whose cutoff frequency is slowly oscillating. A naïve human will tend to attribute the higher complexity to the latter in direct comparison, since the frequent timbre changes in the former are not explicitly recognized as such. For a computer this judgement is difficult to reproduce, because instead of speech the former is a sequence of very different, fast-changing timbres, whereas the latter is very predictable and only moderately changing. By using a low temporal resolution this effect can be avoided to a certain degree. However, a true solution would require the identification and isolation of language by the computer even within the polyphonic mix of a musical recording, which is still quite far from becoming feasible. More realistic is at least the automatic distinction between passages with singing voice and those without (e.g. [NW04]). Based on this discrimination, the timbre changes due to the voice could be disregarded by simply skipping these segments or by assigning them a fixed timbre symbol.
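The following minimal sketch combines the two ideas just outlined: a continuous feature track is quantized into discrete symbols with hysteresis, and the zeroth-order entropy of the resulting string serves as a simple complexity estimate. The partition borders, the margin, and the input feature are placeholders, not values from the initial experiments.

    import math
    from collections import Counter

    def hysteresis_quantize(values, edges, margin=0.05):
        """Map a continuous feature track to discrete partition symbols.
        A partition change is only accepted once the value passes the
        relevant border by `margin`, suppressing oscillation near it."""
        def plain(v):
            return sum(v > e for e in edges)
        state = plain(values[0])
        symbols = [state]
        for v in values[1:]:
            cand = plain(v)
            if cand > state and v > edges[cand - 1] + margin:
                state = cand
            elif cand < state and v < edges[cand] - margin:
                state = cand
            symbols.append(state)
        return symbols

    def entropy(symbols):
        """Zeroth-order entropy in bits per symbol of a symbol string."""
        n = len(symbols)
        return -sum(c / n * math.log2(c / n)
                    for c in Counter(symbols).values())

For the onset series of the rhythmic variant, a block-based or Lempel-Ziv style estimate [ZL77] would capture recurring patterns better than the zeroth-order version shown here.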
Harmonic Complexity

For harmonic complexity it is also safer not to rely on a very high-level description on the input side. Although the automatic extraction of chords is making progress (see for example [HS05] or [DC04]), the results are still not accurate enough to allow for an automated complexity estimation based on high-level musical rules. What seems more appropriate instead is again the application of a complexity measure on an intermediate abstraction level. As described in section 5.1, the harmonic pitch-class profile (HPCP) seems a good candidate as an audio feature in this context.

We can identify two properties of this feature that could contribute to a harmonic complexity estimation. The first one is the clarity of the chords, with a pure triad being of lowest complexity, while extra notes like the seventh or the ninth increase complexity. As mentioned, this could be captured by using correlation or distance measures between the HPCP vectors and prototypical triad vectors, in the line of tonal strength [GH04].

The second property focuses on the harmonic changes between consecutive vectors. If the changes are expectable, because the perceived distance between consecutive chords is small, this leads to a low complexity level, whereas harmonic sequences containing perceptually large jumps would be considered more complex. So the difficulty here is the mapping of the HPCP vectors onto a representation that allows for an easy calculation of perceived distances between chords.

Figure 6.1: Construction of the chord spiral from the pitch spiral [Che00].

The spiral array model developed by Elaine Chew [Che00] seems to be an appropriate candidate for this task. Although the distances do not perfectly match the perceptual ones in all cases, the model already comes very close to our needs. Especially interesting here is the underlying hierarchical structure. As shown in figure 6.1, by starting with a spiral of pitch classes (ordered as in the circle of fifths) a second spiral representing the major triads can be constructed within the first one. This is done by taking the center of gravity of the triangle spanned by the tonic, the fifth, and the third as a reference point for the chord. The same can be done for minor triads. Chew examined different height gains of the spiral and different weights for tonic, fifth, and third in her dissertation in order to obtain satisfying distances.

The important point for our application is that the chord spiral is derived from the pitch-class spiral. Although we cannot expect to find ideal triads in the HPCP vector, we still have a pitch-class representation of the musical content in this feature. So by calculating the center of gravity for an HPCP vector in the pitch-class spiral we arrive at a certain chord representation. Furthermore, we can hope that the distances between these chord representations relate to some degree to the perceived distances. This way, by calculating the average speed of movement through the spiral, we can obtain an estimate of the harmonic complexity without the need for explicit chord recognition or even segmentation.
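A minimal sketch of this mapping, assuming 12-bin HPCP vectors in chromatic order; the spiral radius, the height gain per fifth, and the unweighted center of gravity are simplifying assumptions standing in for the calibrated parameters of [Che00]:

    import numpy as np

    R, H = 1.0, 0.4  # spiral radius and height per fifth step (assumed)

    # Position of each chromatic pitch class on the pitch spiral:
    # successive fifths are rotated by 90 degrees and raised by H
    # (one turn of the spiral; enharmonic wrapping is a simplification).
    _fifths = (7 * np.arange(12)) % 12
    _pos = np.stack([R * np.sin(_fifths * np.pi / 2),
                     R * np.cos(_fifths * np.pi / 2),
                     H * _fifths], axis=1)

    def hpcp_centroid(hpcp):
        """Center of gravity of one 12-bin HPCP vector in the spiral."""
        w = np.asarray(hpcp, dtype=float)
        return (w[:, None] * _pos).sum(axis=0) / (w.sum() + 1e-12)

    def harmonic_speed(hpcp_frames):
        """Average speed of movement through the spiral over an HPCP
        sequence, taken as a rough harmonic-complexity estimate."""
        pts = np.array([hpcp_centroid(h) for h in hpcp_frames])
        return np.linalg.norm(np.diff(pts, axis=0), axis=1).mean()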
Structural and Melodic Complexity

These last two facets of complexity will be treated with low priority only, since both have to rely on the extraction of a very abstract descriptor. Particularly for melodic complexity this “preprocessing” step is the most important missing link in the chain at the current state of the art. Considering the level of performance of today’s melody transcription algorithms, an application of the existing complexity models appears realistic only several years from now.

Regarding the extraction of music structure, we can hope to arrive at acceptable results earlier. While the number of different parts or the change rate are then very straightforward complexity estimations, the periodicity and the dissimilarity need to be addressed more carefully. For the periodicity again an entropy estimation could be used, but we have to expect very short sequences here. Hence, even the number of repetitions might already be a useful indicator. The dissimilarity forms a much harder problem. Since the problem of music similarity estimation has not been solved completely (despite a lot of research), the opposite, the absence of similarity, is also difficult to quantify. Furthermore, for our application an absolute measure would be needed, because finally we want to compare the dissimilarities between the segments of one track with those between the segments of other tracks. As a workaround it could be considered to simply use the novelty scores from the segmentation process that appeared at the segment transitions. However, the notion behind these is somewhat different from the real intention, because then only the local region around the transition and only the features used by the segmentation algorithm would have an influence.
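Assuming a segmentation stage that labels repeated parts with the same symbol (e.g. the hypothetical sequence A B A B C B), the straightforward indicators mentioned above can be sketched as:

    import math
    from collections import Counter

    def structure_indicators(segment_labels):
        """Simple structural cues from a labeled segment sequence."""
        n = len(segment_labels)
        counts = Counter(segment_labels)
        ent = -sum(c / n * math.log2(c / n) for c in counts.values())
        return {
            "parts": len(counts),            # number of different parts
            "repetitions": n - len(counts),  # how often parts recur
            "entropy": ent,                  # low for repetitive forms
        }

    print(structure_indicators(list("ABABCB")))
    # {'parts': 3, 'repetitions': 3, 'entropy': 1.459...}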
6.1.2 Evaluation Strategies

One important aspect of any research involving the development of models or algorithms is the evaluation of the achievements. This evaluation can be seen under two slightly different aspects. One is the direct validation of the model’s predictions against a reliable ground truth. The other is the proof that the application of the model or the algorithm yields a noticeable benefit in the desired context. As mentioned in section 5.3, it is very costly to obtain a solid ground truth for all proposed complexity facets. Several listening tests with a group of subjects and a sufficient number of tracks would be needed. Considering that we are aiming at a complexity estimation for full tracks and not for short excerpts gives rise to even more difficulties, since this prolongs the annotation process. For these reasons, this strategy will only be followed in singular cases during the PhD research work. Rather, the author plans to utilize indirect measurements by using relevant existing annotations together with machine learning as well as statistical analysis methods. As demonstrated with the danceability descriptor, this approach can give direct insights into the fitness of a particular descriptor for its application to music retrieval tasks. Apart from being more application oriented, another advantage is that the potential of the large music collection with its diversity of semantic annotations owned by the MTG can readily be exploited.

However, additional listening tests on a smaller scale might be considered. Instead of asking the listeners to annotate their complexity impressions, there is also the possibility to let them directly evaluate the performance of the algorithms. For example, clusters of tracks could be built according to the extracted complexity values, and the participants could then rate the consistency of the clusters or the comprehensibility of the underlying concept. While this is a nice way to validate the implemented model, it has the disadvantage that further optimization of the model requires another listening test in order to know whether a measurable improvement was achieved.

6.1.3 Efficiency Aspects

The descriptors developed in the course of this research are intended for use on large music collections. This is a point worth mentioning, because for big collections with a large number of tracks the advantages of the depicted interaction enhancements “kick in” noticeably. On the other hand, this means that the computational cost for calculating the descriptors has to be reasonable. File-based (off-line) computation, as opposed to stream-based computation, has the advantage that real-time processing speed is not the ultimate limit. However, scalability has to be considered when we are dealing with collections that contain a large number of tracks. Already a collection with 1000 titles would mean more than two full days of “real-time” computing if we assume an average track duration of 3 minutes. Hence, computational efficiency is a very relevant issue when the final usability is considered. Nevertheless, it will only play a minor role during the research within the PhD thesis scope. The main focus lies on providing functioning models rather than speed-optimized implementations. But efficiency will certainly be taken into account when alternative or simplified ways for the computation exist.

One example is the danceability computation. As described in section 5.2, the computation of this descriptor is relatively costly, because it involves a detrending operation in a sliding window. In order to reduce the number of operations it is therefore worthwhile to investigate whether a hop size bigger than 1 sample still leads to acceptable results. Apart from this option it is also of interest to compare the output of the described approach with one based on a method published by Willson and Francis [WF03]. They claim that the detrended fluctuation analysis (DFA) that forms the basis of the danceability descriptor is “essentially equivalent” to a special spectral analysis of the signal based on a conventional Fourier transformation. Since with the Fast Fourier Transform (FFT) a highly efficient implementation is at hand, this would mean a much faster computation of the descriptor.

Another aspect in this context are synergy effects. Because we address the different facets of complexity individually, it is possible that in some models the same intermediate step (e.g. an FFT or the extraction of a low-level descriptor) is part of the computation. It is therefore useful to identify these commonalities in order to avoid repeated processing. This does not end with the set of complexity facets. In a final application where other semantic descriptors are also involved, an intelligent data management can improve the efficiency considerably. However, this already reaches beyond the scope of the PhD thesis.
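For reference, a compact DFA sketch with an adjustable hop size between analysis windows, illustrating the trade-off discussed above. It returns a single global slope rather than the exponent function α(τ) of section 5.2, and the linear detrending and window handling are common textbook choices rather than the exact implementation used there.

    import numpy as np

    def dfa_exponent(series, scales, hop=1):
        """Detrended fluctuation analysis (sketch). `hop` is the step in
        samples between window positions; hop=1 matches the costly
        sliding-window variant, larger values reduce computation."""
        profile = np.cumsum(series - np.mean(series))  # integrated signal
        fluct = []
        for tau in scales:
            t = np.arange(tau)
            residuals = []
            for start in range(0, len(profile) - tau + 1, hop):
                seg = profile[start:start + tau]
                trend = np.polyval(np.polyfit(t, seg, 1), t)  # detrend
                residuals.append(np.mean((seg - trend) ** 2))
            fluct.append(np.sqrt(np.mean(residuals)))
        # the DFA exponent is the slope of log F(tau) over log tau
        return np.polyfit(np.log(scales), np.log(fluct), 1)[0]

Comparing the output and run time of this routine for hop = 1 against larger hops, or against the FFT-based formulation of [WF03], would directly address the efficiency questions raised above.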
6.2 Beyond the Horizon

Once a reliable complexity estimation is available there are several ways to proceed. One interesting topic is the learning of user preferences from listening habits or from the composition of a personal music collection. As pointed out in section 3.2, this could lead to the establishment of user “complexity profiles” capturing the preferred levels of complexity for the different facets. With Berlyne’s theory [Ber71] of optimal arousal potential and the reported experimental findings from section 4.1 in mind, such a profile could contribute to the prediction of musical preference for new music. In combination with other descriptors a completely automatic music recommendation engine could be developed, which would not even need to be triggered actively by the user. The selection of the appropriate descriptors (which might very well vary from one user to the other) and their weighting are the key points here. This is one task future research could address.

A second aspect is the refinement of the complexity estimations. As stated in chapters 2 and 3, the models developed in the context of this research are meant to reflect a “common sense” judgement of complexity facets on a rather rough scale; nuances remain unrevealed. However, these nuances might also be useful to consider, for example if a collection is otherwise very homogeneous. Future research could try to optimize and enhance the models towards this goal. It is likely, though, that this optimization is only useful when the (limited) universality constraint is given up or at least relaxed. In order to arrive at a more detailed level of complexity estimation it would probably be necessary to take an individual user’s background into account instead of assuming a generalized one. We would thereby give up the objectivity and comparability of the descriptors: the complexity descriptors of the same track might no longer be the same when computed with the complexity models of different users. But on the other hand we could think of virtual music agents that gather musical knowledge together with the user and adapt their complexity judgements to his knowledge. They could give truly personalized and individual recommendations. This approach, however, has little in common with the defined goal of the presented research. The complexity models would need to be purposely designed for adaptation to individual musical experience and knowledge; a mere variation of parameters would not do the job here. So this is definitely a task that future research has to take care of.

Finally, it should be mentioned that complexity models could even be usefully integrated within certain algorithms that aim at a high-level semantic description of a different kind. One example is music segmentation, which could make use of changes in an instantaneous complexity description. But also for melody extraction the complexity evaluation of different alternatives might make sense when there are several candidates for the continuation. In a Viterbi-like manner the algorithm could try to estimate the resulting complexity for different “paths” and give preference to one with a certain level of complexity (e.g. avoiding extreme values). Similar ideas are conceivable for tempo tracking or chord transcription. In fact, Pressing and Lawrence [PL93] use a very similar approach in their auto-notation program “Transcribe”. The program tries to visualize a MIDI or audio input in the optimal way with respect to cognitive criteria, such as the cost functions for producing syncopated rhythmic patterns.
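As a toy sketch of this path-preference idea (all names are placeholders; any of the facet estimators discussed in this chapter could stand in for the complexity function):

    def prefer_moderate_complexity(paths, estimate_complexity, target):
        """Among alternative continuation 'paths' (e.g. melody candidates),
        prefer the one whose estimated complexity is closest to a target
        level, thereby avoiding extreme values."""
        return min(paths, key=lambda p: abs(estimate_complexity(p) - target))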
Conclusion

In this document the concept of a multi-faceted music complexity descriptor set has been presented. We saw that there are useful applications for this type of descriptor in the field of digital music collection interaction. Some psychological and musicological studies have been reviewed that give evidence of the significance of music complexity with respect to music preference. This alone already makes music complexity an interesting means to navigate through a collection or to select tracks for a user. Other practical aspects also speak in favor of the proposed complexity descriptors, such as their compactness in representation (storage and visualization) and their intuitiveness (querying).

By reviewing former research in this field it was shown that no satisfying solution for the automatic estimation of music complexity exists to date. The models for melodic, harmonic, and rhythmic complexity that can be found in the literature rely on an accurate symbolic representation of the music, which at present cannot be obtained automatically from the musical audio signal. While some previous work can be adapted and several related findings can be combined and enhanced, there are still many gaps to fill in order to arrive at a complete and fully functional solution that satisfies our needs. Therefore, a set of operational definitions for the different complexity facets has been presented in order to clarify in which direction and by which means the author is working on the topic. Preliminary implementations of the rhythmic and the acoustic facet of complexity have been described and experimental results with them were reported. Especially for the tested rhythmic complexity measure they appear very encouraging. The implemented danceability descriptor clearly reveals a link to high-level musical attributes, which was demonstrated by statistical analysis and machine learning experiments.

In the chapter on future work the next steps to be taken were depicted. The focus will lie on filling in the missing parts, especially regarding the acoustic, rhythmic, timbral, and harmonic complexity facets. For each of these facets a practical and realistic outline of an algorithmic approach has been given, taking experiences from preliminary experiments into account. Since the evaluation against an annotated ground truth is problematic for this descriptor set due to the high cost of manual annotation, a more application-driven approach has been given preference. As in the mentioned experiments with the implementation of the danceability measure, the focus will lie more on statistical analysis and machine learning methods than on formal listening tests with a group of subjects. However, in individual cases the latter will also be considered when feasible. The computational efficiency of the proposed methods will also be considered in the future development, since for large collections this point is highly relevant. Yet, as we will only be able to take the first steps in this domain, performance in terms of processing speed will take a subordinate position to performance in terms of functionality.

List of Figures

1.1 Screenshot from a music browser interface displaying part of a song collection organized by danceability and dynamic complexity . . . 6
2.1 Circle of music complexity facets . . . 9
3.1 The Wundt curve for the relation between music complexity and preference . . . 17
4.1 Illustration of the Implication-Realization process (from [SAPM02]) . . . 26
5.1 Instantaneous loudness (dots), global loudness (grey solid line), and average distance margin (dashed lines) for four example tracks . . . 44
5.2 Excerpts from the time series s(n) for three example pieces from different musical genres . . . 46
5.3 Double logarithmic plots of mean residual over time scales . . . 48
5.4 DFA exponent functions for the three example tracks from figure 5.2 . . . 48
5.5 Top-eight labels with the highest number of assigned artists . . . 51
5.6 α-levels for 60 techno (o) and 60 film score tracks (x), unordered . . . 52
5.7 Distributions on deciles for the twelve labels with most significant deviation from equal distribution (solid horizontal lines) . . . 53
6.1 Construction of the chord spiral from the pitch spiral [Che00] . . . 61

Bibliography

[ABB77] J. B. Allen, D. A. Berkley, and J. Blauert. Multimicrophone signal-processing technique to remove room reverberation from speech signals. Journal of the Acoustical Society of America, 62(4):912–915, 1977.
[AS01] J. J. Acouturier and M. Sandler. Segmentation of music signals using hidden Markov models. In Proceedings of the 110th AES Convention, Amsterdam, NL, 2001.
[Ass60] American Standards Association. American standard acoustical terminology. Definition 12.9, 1960.
[Aus00] M. Ausloos. Statistical physics in foreign exchange currency and stock markets. Physica A, 285:48–65, 2000.
[Bel03] J. P. Bello. Towards the Automated Analysis of Simple Polyphonic Music: A Knowledge-based Approach. PhD thesis, Queen Mary University of London, 2003.
[Ber60] D. E. Berlyne. Conflict, Arousal, and Curiosity. McGraw-Hill, New York, Toronto, London, 1960.
[Ber71] D. E. Berlyne. Aesthetics and Psychobiology. Appleton-Century-Crofts, New York, 1971.
[Ber74] D. E. Berlyne. The new experimental aesthetics. In Studies in the New Experimental Aesthetics: Steps Towards an Objective Psychology of Aesthetic Appreciation. Halsted Press, New York, 1974.
[BGH+95] S. V. Buldyrev, A. L. Goldberger, S. Havlin, R. N. Mantegna, M. E. Matsa, C.-K. Peng, M. Simons, and H. E. Stanley. Long-range correlation properties of coding and noncoding DNA sequences: GenBank analysis. Physical Review E, 51(5):5084–5091, 1995.
[Bre90] A. S. Bregman. Auditory Scene Analysis. The MIT Press, Cambridge, London, 1990.
[Che00] E. Chew. Towards a Mathematical Model of Tonality. PhD thesis, Massachusetts Institute of Technology, Cambridge, USA, February 2000.
[Che03] E. Chew. Thinking Out of the Grid and Inside the Spiral – Geometric Interpretations of and Comparisons with the Spiral Array Model. Technical Report 03-002, University of Southern California, 2003.
[CKF+04] P. Cano, M. Koppenberger, S. Ferradans, A. Martinez, F. Gouyon, V. Sandvold, V. Tarasov, and N. Wack. MTG-DB: A repository for music audio processing. In Proceedings of the 4th International Conference on Web Delivering of Music, Barcelona, Spain, 2004.
[CV03] W. Chai and B. Vercoe. Structural analysis of musical signals for indexing and thumbnailing. In Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, Houston, Texas, 2003.
[DBDS03] C. Duxbury, J. P. Bello, M. Davies, and M. Sandler. Complex domain onset detection for musical signals. In Proceedings of the 6th International Conference on Digital Audio Effects, London, UK, 2003.
[DC04] C. Derboven and M. Cremer. A system for harmonic analysis of polyphonic music. In Proceedings of the AES 25th International Conference on Metadata for Audio, London, UK, 2004.
[Edm95] B. Edmonds. What is complexity? – The philosophy of complexity per se with application to some examples in evolution. In Heylighen and Aerts, editors, The Evolution of Complexity. Kluwer, Dordrecht, 1995.
[EN00] T. Eerola and A. C. North. Cognitive complexity and the structure of musical patterns. In Proceedings of the 6th International Conference on Music Perception and Cognition, Newcastle, UK, 2000.
[Ess95] P. Essens. Structuring temporal sequences: Comparison of models and factors of complexity. Perception and Psychophysics, 57(4):519–532, 1995.
[ET04] T. Eerola and P. Toiviainen. MIDI Toolbox: MATLAB tools for music research. http://www.jyu.fi/musica/miditoolbox/, 2004.
[ETK02] T. Eerola, P. Toiviainen, and C. L. Krumhansl. Real-time prediction of melodies: Continuous predictability judgements and dynamic models. In Proceedings of the 7th International Conference on Music Perception and Cognition, Sydney, Australia, 2002.
[Fin89] L. Finnäs. How can musical preference be modified? A research review. Bulletin of the Council for Research in Music Education, 102:1–58, 1989.
[Fuj99] T. Fujishima. Realtime chord recognition of musical sound: A system using Common Lisp Music. In Proceedings of the International Computer Music Conference, Beijing, China, 1999.
[GD05] F. Gouyon and S. Dixon. A review of automatic rhythm transcription systems. Computer Music Journal, 29(1):34–54, 2005.
[GH04] E. Gomez and P. Herrera. Automatic extraction of tonal metadata from polyphonic audio recordings. In Proceedings of the AES 25th International Conference on Metadata for Audio, London, UK, 2004.
[Gom04] E. Gomez. Tonal description of polyphonic audio for music content processing. INFORMS Journal on Computing, Special Cluster on Music Computing, 2004.
[Gri99] D. Griesinger. Objective measures of spaciousness and envelopment. In Proceedings of the AES 16th International Conference on Spatial Sound Reproduction, Rovaniemi, Finland, 1999.
[GSO+05] E. Gomez, S. Streich, B. Ong, R. P. Paiva, S. Tappert, J.-M. Batke, G. Poliner, D. Ellis, and J. P. Bello. A quantitative comparison of different approaches for melody extraction from polyphonic audio recordings. Submitted to IEEE Transactions on Speech and Audio Processing, 2005.
[GV99] A. Gammerman and V. Vovk. Kolmogorov complexity: Sources, theory and applications. Computer Journal, 42(4):252–255, 1999.
[Hey75] R. G. Heyduk. Rated preference for musical compositions as it relates to complexity and exposure frequency. Perception and Psychophysics, 17(1):84–91, 1975.
[HPD03] P. Herrera, G. Peeters, and S. Dubnov. Automatic classification of musical instrument sounds. Journal of New Music Research, 32(1), 2003.
[HS05] C. Harte and M. Sandler. Automatic chord identification using a quantized chromagram. In Proceedings of the AES 118th Convention, Barcelona, Spain, 2005.
[JIM+04] H. D. Jennings, P. Ch. Ivanov, A. M. Martins, P. C. da Silva, and G. M. Viswanathan. Variance fluctuations in nonstationary time series: A comparative study of music genres. Physica A: Statistical and Theoretical Physics, 336(3–4):585–594, May 2004.
[KL05] D. Kusek and G. Leonhard. The Future of Music. Berklee Press, 2005.
[Kla03] A. Klapuri. Musical meter estimation and music transcription. Presented at the Cambridge Music Colloquium, 2003.
[Kla04] A. Klapuri. Signal Processing Methods for the Automatic Transcription of Music. PhD thesis, Tampere University of Technology, 2004.
[Ler96] F. Lerdahl. Calculating tonal tension. Music Perception, 13(3):319–363, 1996.
[Mas02] R. Mason. Elicitation and measurement of auditory spatial attributes in reproduced sound. PhD thesis, University of Surrey, February 2002.
[Mat01] L. Matassini. Signal analysis and modelling of non-linear non-stationary phenomena. PhD thesis, Bergische Universität Gesamthochschule Wuppertal, Wuppertal, Germany, July 2001.
[MGB97] B. C. Moore, B. R. Glasberg, and T. Baer. A model for the prediction of thresholds, loudness, and partial loudness. Journal of the Audio Engineering Society, 45(4):224–239, April 1997.
[Nar90] E. Narmour. The Analysis and Cognition of Basic Melodic Structures. The University of Chicago Press, Chicago and London, 1990.
[NH97a] A. C. North and D. J. Hargreaves. Experimental aesthetics and everyday music listening. In The Social Psychology of Music, pages 84–103. Oxford University Press, Oxford, 1997.
[NH97b] A. C. North and D. J. Hargreaves. Liking, arousal potential, and emotions expressed by music. Scandinavian Journal of Psychology, 38:45–53, 1997.
[NH97c] A. C. North and D. J. Hargreaves. Music and consumer behaviour. In The Social Psychology of Music, pages 268–289. Oxford University Press, Oxford, 1997.
[NW04] T. L. Nwe and Y. Wang. Automatic detection of vocal segments in popular songs. In Proceedings of the 2004 International Symposium on Music Information Retrieval, Barcelona, Spain, 2004.
[OH04] B. Ong and P. Herrera. Computing structural descriptions of music through the identification of representative excerpts from audio files. In Proceedings of the AES 25th International Conference on Metadata for Audio, London, UK, 2004.
[OO05] M. G. Orr and S. Ohlsson. Relationships between complexity and liking as a function of expertise. Music Perception, 22(4):583–611, 2005.
[Pac99] F. Pachet. Surprising harmonies. International Journal of Computing Anticipatory Systems, 4, February 1999.
[Pam01] E. Pampalk. Islands of music: Analysis, organization, and visualization of music archives. Master's thesis, Vienna University of Technology, Vienna, Austria, 2001.
[Par04] R. M. Parry. Musical complexity and top 40 chart performance. Technical report, College of Computing, Georgia Institute of Technology, 2004.
[PBH+94] C.-K. Peng, S. V. Buldyrev, S. Havlin, M. Simons, H. E. Stanley, and A. L. Goldberger. Mosaic organization of DNA nucleotides. Physical Review E, 49:1685–1689, 1994.
[Pen05] C.-K. Peng. Fractal mechanics in neural control: Human heartbeat and gait dynamics in health and disease. Online tutorial, 2005. http://www.physionet.org/tutorials/fmnc/.
[PFTV92] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge, 1992.
[PL93] J. Pressing and P. Lawrence. Transcribe: A comprehensive autotranscription program. In Proceedings of the 1993 International Computer Music Conference, pages 343–345, Tokyo, Japan, 1993.
[PRC00] F. Pachet, P. Roy, and D. Cazaly. A combinatorial approach to content-based music selection. IEEE MultiMedia, 7(1):44–51, 2000.
[Pre98] J. Pressing. Cognitive complexity and the structure of musical patterns. In Proceedings of the 4th Conference of the Australasian Cognitive Science Society, Newcastle, Australia, 1998.
[SAPM02] E. G. Schellenberg, M. Adachi, K. T. Purdy, and M. C. McKinnon. Expectancy in melody: Tests of children and adults. Journal of Experimental Psychology, 131(4):511–537, 2002.
[SBM+02] D. van Steelant, B. de Baets, H. de Meyer, M. Leman, J.-P. Martens, L. Clarisse, and M. Lesaffre. Discovering structure and repetition in musical audio. In Proceedings of the Eurofuse Workshop, Varenna, Italy, 2002.
[Sch89] M. Schmuckler. Expectation in music: Investigation of melodic and harmonic processes. Music Perception, 7:109–150, 1989.
[Sch96] E. G. Schellenberg. Expectancy in melody: Tests of the implication-realization model. Cognition, 58(1):75–125, 1996.
[Sch00] E. D. Scheirer. Music-Listening Systems. PhD thesis, Massachusetts Institute of Technology, Cambridge, USA, 2000.
[Sch01] E. D. Scheirer. Structured audio, Kolmogorov complexity, and generalized audio coding. IEEE Transactions on Speech and Audio Processing, 9(8):914–931, 2001.
[SE03] A. Sheh and D. Ellis. Chord segmentation and recognition using EM-trained hidden Markov models. In Proceedings of the International Symposium on Music Information Retrieval, Baltimore, USA, 2003.
[SH05] S. Streich and P. Herrera. Detrended fluctuation analysis of music signals: Danceability estimation and further semantic characterization. In Proceedings of the AES 118th Convention, Barcelona, Spain, 2005.
[Sha48] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27, 1948.
[Sim94a] D. K. Simonton. Computer content analysis of melodic structure: Classical composers and their compositions. Psychology of Music, 22:31–43, 1994.
[Sim94b] D. K. Simonton. Melodic structure and note transition probabilities: A content analysis of 15,618 classical themes. Psychology of Music, 22:31–43, 1994.
[SM75] L. Steck and P. Machotka. Preference for musical complexity: Effects of context. Journal of Experimental Psychology: Human Perception and Performance, 104(2):170–174, 1975.
[SN04] E. Skovenborg and S. H. Nielsen. Evaluation of different loudness models with music and speech material. In Proceedings of the AES 117th Convention, San Francisco, USA, 2004.
[Sny00] B. Snyder. Music and Memory. The MIT Press, Cambridge, London, 2000.
[SP00] I. Shmulevich and D.-J. Povel. Measures of temporal pattern complexity. Journal of New Music Research, 29(1):61–69, 2000.
[Sta01] R. K. Standish. On complexity and emergence. Complexity International, 9, 2001. http://journal-ci.csse.monash.edu.au/ci/vol09/standi09/.
[SWV00] E. D. Scheirer, R. B. Watson, and B. L. Vercoe. On the perceived complexity of short musical segments. In Proceedings of the 2000 International Conference on Music Perception and Cognition, Keele, UK, 2000.
[SYSPI05] H. F. Silverman, Y. Yu, J. M. Sachar, and W. R. Patterson III. Performance of real-time source-location estimators for a large-aperture microphone array. IEEE Transactions on Speech and Audio Processing, 2005. Accepted for publication.
[TBB00] B. Tillmann, J. J. Bharucha, and E. Bigand. Implicit learning of music: A self-organizing approach. Psychological Review, 107:885–913, 2000.
[Tem01] D. Temperley. The Cognition of Basic Musical Structures. The MIT Press, Cambridge, London, 2001.
[Vic01] E. Vickers. Automatic long-term loudness and dynamics matching. In Proceedings of the AES 111th Convention, New York, 2001.
[VSV+03] M. Vilermo, S. Streich, M. Väänänen, K. Linzmeier, B. Grill, and Y. Wang. Perceptual optimization of the frequency selective switch in scalable audio coding. In Proceedings of the AES 114th Convention, Amsterdam, Netherlands, 2003.
[Wal73] E. L. Walker. Psychological complexity and preference: A hedgehog theory of behaviour. In D. E. Berlyne and K. B. Madsen, editors, Pleasure, Reward, Preference. Academic Press, New York, 1973.
[WF03] K. Willson and D. P. Francis. A direct analytical demonstration of the essential equivalence of detrended fluctuation analysis and spectral analysis of RR interval variability. Physiological Measurement, 24:N1–N7, 2003.
[Wit01] T. Wittkop. Two-channel noise reduction algorithms motivated by models of binaural interaction. PhD thesis, Universität Oldenburg, Oldenburg, Germany, 2001.
[WS02] Y. Wang and S. Streich. A drumbeat-pattern based error concealment method for music streaming applications. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Orlando, USA, 2002.
[Wun74] W. Wundt. Grundzüge der physiologischen Psychologie. Engelmann, Leipzig, 1874.
[ZF90] E. Zwicker and H. Fastl. Psychoacoustics – Facts and Models. Springer, Berlin, Germany, 1990.
[ZL77] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.
[ZL78] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.