
RESEARCH Open Access
Multi-label classification of music by emotion
Konstantinos Trohidis^1, Grigorios Tsoumakas^2, George Kalliris^1 and Ioannis Vlahavas^2
Abstract
This work studies the task of automatic emotion detection in music. Music may evoke more than one emotion at the same time, a multiplicity that single-label classification and regression cannot model. Therefore, this work focuses on multi-label classification approaches, where a piece of music may simultaneously belong to more than one class. Seven algorithms are experimentally compared for this task. Furthermore, the predictive power of several audio features is evaluated using a new multi-label feature selection method. Experiments are conducted on a set of 593 songs with six clusters of emotions based on the Tellegen-Watson-Clark model of affect. Results show that multi-label modeling is successful and provide interesting insights into the predictive quality of the
algorithms and features.
Keywords: multi-label classification, feature selection, music information retrieval
1. Introduction
Humans, by nature, are emotionally affected by music. Who can argue against the famous quote of the German philosopher Friedrich Nietzsche, who said that 'without music, life would be a mistake'? As music databases grow in size and number, the retrieval of music by emotion is becoming an important task for various applications, such as song selection in mobile devices [1], music recommendation systems [2], TV and radio programs^a, and music therapy.
Past approaches towards automated detection of emotions in music modeled the learning problem as a single-label classification [3,4], regression [5], or multi-label classification [6-9] task. Music may evoke more than one emotion at the same time. Single-label classification and regression cannot model this multiplicity. Therefore, the focus of this article is on multi-label classification methods. The primary aim of this article is twofold:
• The experimental evaluation of seven multi-label
classification algorithms using a variety of evaluation
measures. Previous work experimented with just a
single algorithm. We employ some recent develop-
ments in multi-label classification and show which
algorithms perform better for musical data.
• The creation of a new multi-label dataset with 72
music features for 593 songs categorized into one or
more out of 6 classes of emotions. The dataset is released to the public^b, in order to allow comparative experiments by other researchers. Publicly available multi-label music datasets are rare, hindering the progress of research in this area.
The remainder of this article is structured as follows. Section 2 reviews related work, and Sections 3 and 4 provide background material on multi-label classification and emotion modeling, respectively. Section 5 presents the details of the dataset used in this work. Section 6

presents experimental results comparing the seven
multi-label classification algorithms. Finally, conclusions
are drawn and future work is proposed in Section 7.
2. Related work
This section discusses past efforts on emotion detection
in music, mainly in terms of emotion model, extracted
features, and the kind of modeling of the learning pro-
blem: (a) single label classification, (b) regression, and
(c) multi-label classification.
2.1. Single-label classification
The four main emotion classes of Thayer's model were used as the emotion model in [3]. Three different feature sets were adopted for music representation, namely intensity, timbre, and rhythm. Gaussian mixture models were used to model each of the four classes. An
interesting contribution of this work was a hierarchical classification process, which first classifies a song into high/low energy (the vertical axis of Thayer's model) and then into one of the two high/low stress classes.
The same emotion classes were used in [4]. The authors experimented with two fuzzy classifiers, using the 15 features proposed in [10]. They also experimented with a feature selection method, which improved the overall accuracy (to around 78%), but they do not mention which features were selected.
The classification of songs into a single cluster of emotions was a new category in the 2007 MIREX (Music Information Retrieval Evaluation eXchange) competition^c. The top two submissions of the competition were based on support vector machines (SVMs). The model of mood used in the competition included five clusters of moods proposed in [11], which was compiled based on a statistical analysis of the relationship of mood with genre, artist, and usage metadata. Among the many interesting conclusions of the competition was the difficulty of discerning between certain clusters of moods, due to their semantic overlap. A multi-label classification approach could overcome this problem by allowing the specification of multiple finer-grained emotion classes.
2.2. Regression
Emotion recognition is modeled as a regression task in [5]. Volunteers rated a training collection of songs in terms of arousal and valence on an ordinal scale of 11 values from -1 to 1 with a step of 0.2. The authors then trained regression models using a variety of algorithms (with SVMs having the best performance) and a variety of extracted features. Finally, a user could retrieve a song by selecting a point in the two-dimensional arousal and valence mood plane of Thayer.
Furthermore, the authors used a feature selection algorithm, leading to an increase in predictive performance. However, it is not clear whether the authors ran the feature selection process on all input data or separately within each fold of the 10-fold cross-validation used to evaluate the regressors. If the former is true, then their results may be optimistic, as the feature selection algorithm had access to the test data. A similar pitfall of feature selection in music classification is discussed in [12].
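To make the distinction concrete, the following sketch (an illustration in Python with scikit-learn, not the actual setup of [5] or [12]) keeps feature selection inside each cross-validation fold by wrapping selector and regressor in a single pipeline, so the selector never sees held-out data; the feature matrix and arousal ratings are synthetic placeholders.

    # Illustrative sketch: feature selection refit inside every CV fold (no leakage).
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.svm import SVR
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 72))        # placeholder audio features
    y = rng.uniform(-1.0, 1.0, size=200)  # placeholder arousal ratings in [-1, 1]

    # Leaky variant (avoid): SelectKBest fitted on all data before cross-validation.
    # Correct variant: the selector below is refit on the training part of each fold.
    model = Pipeline([
        ("select", SelectKBest(f_regression, k=20)),
        ("svr", SVR(kernel="linear", C=1.0)),
    ])
    scores = cross_val_score(model, X, y, cv=10, scoring="r2")
    print("10-fold R^2: %.3f +/- %.3f" % (scores.mean(), scores.std()))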
2.3. Multi-label classification
Both regression and single-label classification methods
suffer from the same problem: two different (clusters of)
emotions cannot be simultaneously predicted. Multi-
label classification allows for a natural modeling of this
issue.
Li and Ogihara [6] used two emotion models: (a) the ten adjective clusters of Farnsworth (extended with three clusters of adjectives proposed by the labeler) and (b) a further clustering of those into six super-clusters. They only experimented with the BR multi-label classification method, using SVMs as the underlying single-label base classifier. In terms of features, they used Marsyas [13] to extract 30 features related to timbral texture, rhythm, and pitch. The predictive performance was low for the clusters and better for the super-clusters. In addition, they found evidence that genre is correlated with emotions.
In an extension of their work, Li and Ogihara [7] considered three bipolar adjective pairs: Cheerful vs. Depressing, Relaxing vs. Exciting, and Comforting vs. Disturbing. Each track was initially labeled by two subjects on a scale ranging from -4 to +4 and then converted to a binary (positive/negative) label. The learning approach was the same as in [6]. The feature set was expanded with a new extraction method, called Daubechies wavelet coefficient histograms. The authors report an accuracy of around 60%.
The same 13 clusters as in [6] were used in [8], where
the authors modified the k Nearest Neighbors algorithm
in order to handle multi-label data directly. They found
that the predictive performance was low, too. Recently,
Pachet and Roy [14] used stacked binary relevance (2BR) for the multi-label classification of music samples into a large number of labels (632).
Compared to our work, none of the aforementioned approaches discusses feature selection from multi-label data, compares different multi-label classification algorithms, or uses a variety of multi-label evaluation measures in its empirical study.
3. Multi-label classification
Traditional single-label classification is concerned with learning from a set of examples that are associated with a single label λ from a set of disjoint labels L, |L| > 1. In multi-label classification, the examples are associated with a set of labels Y ⊆ L.
3.1. Learning algorithms
Multi-label classification algorithms can be categorized into two different groups [15]: (i) problem transformation methods, and (ii) algorithm adaptation methods.

The first group includes methods that are algorithm
independent. They transform the multi-label classifica-
tion task into one or m ore single-label classification,
regression, or ranking tasks. The second group includes
methods that extend specific learning algorithms in
order to handle multi-label data directly.
We next present the methods that are used in the experimental part of this work. For the formal description of these methods, we will use L = {λj : j = 1, ..., M} to denote the finite set of labels in a multi-label learning
task and D = {(xi, Yi), i = 1, ..., N} to denote a set of multi-label training examples, where xi is the feature vector and Yi ⊆ L the set of labels of the i-th example.
Binary relevance (BR) is a popular problem transformation method that learns M binary classifiers, one for each different label in L. It transforms the original dataset into M datasets Dλj, j = 1, ..., M, each containing all examples of the original dataset, labeled positively if the label set of the original example contained λj and negatively otherwise. For the classification of a new instance, BR outputs the union of the labels λj that are positively predicted by the M classifiers.
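The BR transformation is simple to express in code. The following is a minimal sketch in Python with scikit-learn (not the Mulan implementation used in the experiments), assuming the labels are given as an N x M 0/1 indicator matrix; the random data at the end is only a placeholder.

    # Minimal binary relevance (BR) sketch: one binary SVM per label column of Y.
    import numpy as np
    from sklearn.svm import SVC

    def br_train(X, Y):
        """Train M binary classifiers, one per label (column of Y)."""
        return [SVC(kernel="linear", C=1.0).fit(X, Y[:, j]) for j in range(Y.shape[1])]

    def br_predict(models, X):
        """Union of positively predicted labels, as an |X| x M 0/1 matrix."""
        return np.column_stack([m.predict(X) for m in models])

    # Placeholder example with 6 labels, as in the emotion dataset.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 72))
    Y = (rng.random((100, 6)) < 0.3).astype(int)
    print(br_predict(br_train(X, Y), X[:3]))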
BR is criticized because it does not take label correlations into account and may fail to accurately predict label combinations or to rank labels according to relevance for a new instance. One approach that has been proposed in the past to deal with this problem of BR works generally as follows: it learns a second (or meta) level of binary models (one for each label) that consider as input the outputs of all first (or base) level binary models. It will be called 2BR, as it uses the BR method twice, in two consecutive levels. 2BR follows the paradigm of stacked generalization [16], a method for the fusion of heterogeneous classifiers widely known as stacking. One of the earliest accounts of 2BR is [17], where 2BR was part of the SVM-HF method, an SVM-based algorithm for training the binary models of both levels. The abstraction of SVM-HF irrespective of SVMs and its relation to stacking was pointed out in [18]. 2BR was very recently applied to the analysis of musical titles [14].
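A compact sketch of this stacking scheme follows (again in scikit-learn, rather than the SVM-HF or Mulan implementations). Here the meta-level features are out-of-fold base-level predictions, which mirrors the cross-validation-based meta-data generation described later in Section 6.2; other 2BR variants also append the original features at the meta level.

    # Sketch of 2BR (stacked binary relevance): meta-level binary models take the
    # out-of-fold outputs of all base-level binary models as their inputs.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_predict

    def twobr_train(X, Y, cv=10):
        M = Y.shape[1]
        base = [SVC(kernel="linear", C=1.0).fit(X, Y[:, j]) for j in range(M)]
        # Out-of-fold base predictions become the meta-level training features.
        Z = np.column_stack([
            cross_val_predict(SVC(kernel="linear", C=1.0), X, Y[:, j], cv=cv)
            for j in range(M)
        ])
        meta = [SVC(kernel="linear", C=1.0).fit(Z, Y[:, j]) for j in range(M)]
        return base, meta

    def twobr_predict(base, meta, X):
        Z = np.column_stack([m.predict(X) for m in base])
        return np.column_stack([m.predict(Z) for m in meta])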
Label powerset (LP) is a simple but effective problem transformation method that works as follows: it considers each unique set of labels that exists in a multi-label training set as one of the classes of a new single-label classification task. Given a new instance, the single-label classifier of LP outputs the most probable class, which is actually a set of labels.
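A minimal sketch of this transformation, under the same assumption of an N x M 0/1 label matrix and a linear SVM base classifier (an illustration, not the Mulan implementation):

    # Label powerset (LP) sketch: each distinct label set becomes one class of a
    # single-label problem; predicted classes are mapped back to label sets.
    import numpy as np
    from sklearn.svm import SVC

    def lp_train(X, Y):
        keys = [tuple(row) for row in Y]              # label sets as tuples
        classes = sorted(set(keys))                   # distinct label combinations
        to_id = {c: i for i, c in enumerate(classes)}
        y_single = np.array([to_id[k] for k in keys])
        # Multi-class handling inside SVC is one-against-one, as in Section 6.2.
        clf = SVC(kernel="linear", C=1.0).fit(X, y_single)
        return clf, classes

    def lp_predict(clf, classes, X):
        return np.array([classes[c] for c in clf.predict(X)])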
The computational complexity of LP with respect to M depends on the complexity of the base classifier with respect to the number of classes, which is equal to the number of distinct label sets in the training set. This number is upper bounded by min(N, 2^M); although it is typically much smaller, it still poses an important complexity problem, especially for large values of N and M. Furthermore, the large number of classes, many of which are associated with very few examples, makes the learning process difficult as well.
The random k-labelsets (RAkEL) method [19] constructs an ensemble of LP classifiers. Each LP classifier is trained using a different small random subset of the set of labels. This way RAkEL manages to take label correlations into account, while avoiding LP's problems. A ranking of the labels is produced by averaging the zero-one predictions of each model per considered label. Thresholding is then used to produce a classification as well.
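The ensemble construction and vote averaging can be sketched as follows, reusing the lp_train and lp_predict helpers from the LP sketch above; the subset size of 3, ensemble size of 2M, and 0.5 threshold mirror the parameter settings reported in Section 6.2, but this is an illustration rather than the Mulan implementation.

    # RAkEL sketch: an ensemble of LP classifiers on random k-labelsets; per-label
    # votes are averaged and thresholded (uses lp_train / lp_predict from above).
    import numpy as np

    def rakel_train(X, Y, k=3, n_models=None, seed=0):
        M = Y.shape[1]
        n_models = n_models or 2 * M
        rng = np.random.default_rng(seed)
        ensemble = []
        for _ in range(n_models):
            subset = rng.choice(M, size=k, replace=False)    # a random k-labelset
            ensemble.append((subset, lp_train(X, Y[:, subset])))
        return ensemble

    def rakel_predict(ensemble, X, M, threshold=0.5):
        votes, counts = np.zeros((X.shape[0], M)), np.zeros(M)
        for subset, (clf, classes) in ensemble:
            votes[:, subset] += lp_predict(clf, classes, X)  # 0/1 votes per label
            counts[subset] += 1
        scores = votes / np.maximum(counts, 1)               # averaged votes = ranking
        return (scores >= threshold).astype(int), scores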
Ranking by pairwise comparison (RPC) [20] transforms the multi-label dataset into M(M - 1)/2 binary label datasets, one for each pair of labels (λi, λj), 1 ≤ i < j ≤ M. Each dataset contains those examples of D that are annotated by at least one of the two corresponding labels, but not both. A binary classifier that learns to discriminate between the two labels is trained from each of these datasets. Given a new instance, all binary classifiers are invoked, and a ranking is obtained by counting the votes received by each label.
Calibrated label ranking (CLR) [21] extends RPC by introducing an additional virtual label, which acts as a natural breaking point of the ranking into relevant and irrelevant sets of labels. This way, CLR manages to perform multi-label classification.
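The pairwise voting and the role of the virtual label can be illustrated as follows (a simplified sketch, not the implementation of [20,21]); note that each pair involving the virtual label reduces to an ordinary binary model for the corresponding real label.

    # RPC voting with CLR's virtual label: pairwise models vote for one of their two
    # labels; a real label is predicted relevant if it gathers more votes than the
    # virtual label (column index M below).
    import numpy as np
    from itertools import combinations
    from sklearn.svm import SVC

    def clr_train(X, Y):
        M = Y.shape[1]
        models = {}
        for i, j in combinations(range(M), 2):
            mask = Y[:, i] != Y[:, j]             # exactly one of the two labels applies
            if mask.any() and len(set(Y[mask, i])) > 1:    # skip degenerate pairs
                models[(i, j)] = SVC(kernel="linear", C=1.0).fit(X[mask], Y[mask, i])
        for i in range(M):                        # pair (label i, virtual label)
            models[(i, M)] = SVC(kernel="linear", C=1.0).fit(X, Y[:, i])
        return models, M

    def clr_predict(models, M, X):
        votes = np.zeros((X.shape[0], M + 1))
        for (i, j), clf in models.items():
            wins_i = clf.predict(X)               # 1 means label i beats label j
            votes[:, i] += wins_i
            votes[:, j] += 1 - wins_i
        ranking = np.argsort(-votes[:, :M], axis=1)        # labels ordered by votes
        relevant = (votes[:, :M] > votes[:, [M]]).astype(int)
        return ranking, relevant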
Multi-label back-propagation (BP-MLL) [22] is an adaptation of the popular back-propagation algorithm for multi-label learning. The main modification to the algorithm is the introduction of a new error function that takes multiple labels into account.
Multi-label k-nearest neighbor (ML-kNN) [23] extends the popular k nearest neighbors (kNN) lazy learning algorithm using a Bayesian approach. It uses the maximum a posteriori principle to determine the label set of a test instance, based on prior and posterior probabilities for the frequency of each label within the k nearest neighbors.
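A simplified sketch of this estimation and decision rule follows; it tracks the description in [23] (with smoothing constant s, as in the setup of Section 6.2) but omits the ranking output and tie handling, and it is not the Mulan implementation.

    # Simplified ML-kNN sketch: priors and posteriors over neighbour label counts,
    # combined by the maximum a posteriori rule for each label.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def mlknn_train(X, Y, k=10, s=1.0):
        N, M = Y.shape
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: a point is its own neighbour
        _, idx = nn.kneighbors(X)
        C = np.stack([Y[idx[i, 1:]].sum(axis=0) for i in range(N)])  # N x M neighbour counts
        prior1 = (s + Y.sum(axis=0)) / (2 * s + N)
        post1 = np.zeros((M, k + 1))                       # P(count = c | label relevant)
        post0 = np.zeros((M, k + 1))                       # P(count = c | label irrelevant)
        for l in range(M):
            n1 = Y[:, l].sum()
            for c in range(k + 1):
                post1[l, c] = (s + np.sum((Y[:, l] == 1) & (C[:, l] == c))) / (s * (k + 1) + n1)
                post0[l, c] = (s + np.sum((Y[:, l] == 0) & (C[:, l] == c))) / (s * (k + 1) + N - n1)
        return dict(nn=nn, Y=Y, k=k, prior1=prior1, post1=post1, post0=post0)

    def mlknn_predict(model, X):
        nn, Y, k = model["nn"], model["Y"], model["k"]
        _, idx = nn.kneighbors(X, n_neighbors=k)
        C = np.stack([Y[idx[i]].sum(axis=0) for i in range(X.shape[0])])
        out = np.zeros_like(C)
        for l in range(Y.shape[1]):
            p1 = model["prior1"][l] * model["post1"][l, C[:, l]]
            p0 = (1 - model["prior1"][l]) * model["post0"][l, C[:, l]]
            out[:, l] = (p1 > p0).astype(int)
        return out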
3.2. Evaluation measures
Multi-label classification requires different evaluation measures than traditional single-label classification. A taxonomy of multi-label classification evaluation measures is given in [19], which considers two main categories: example-based and label-based measures. A third category of measures, which is not directly related to multi-label classification but is often used in the literature, is ranking-based measures, which are nicely presented in [23].
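As an illustration of the example-based category, the following sketch computes a few of the measures reported in Section 6 from 0/1 matrices of true labels Y and predicted labels Z, following their standard definitions (examples where both label sets are empty are scored as 1 here, a convention that particular implementations may handle differently).

    # Example-based measures for 0/1 label matrices Y (true) and Z (predicted).
    import numpy as np

    def hamming_loss(Y, Z):
        return np.mean(Y != Z)                        # fraction of wrong label decisions

    def subset_accuracy(Y, Z):
        return np.mean(np.all(Y == Z, axis=1))        # exact match of the whole label set

    def example_accuracy(Y, Z):                       # Jaccard similarity per example
        inter = np.sum((Y == 1) & (Z == 1), axis=1)
        union = np.sum((Y == 1) | (Z == 1), axis=1)
        return np.mean(np.where(union == 0, 1.0, inter / np.maximum(union, 1)))

    def example_f1(Y, Z):
        inter = np.sum((Y == 1) & (Z == 1), axis=1)
        denom = Y.sum(axis=1) + Z.sum(axis=1)
        return np.mean(np.where(denom == 0, 1.0, 2 * inter / np.maximum(denom, 1)))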
4. Emotions and music
4.1. Emotional models
Emotions that are experienced and perceived while listening to music are somehow different from those induced in everyday life. Many studies indicate the important distinction between one's perception of the emotion(s) expressed by music and the emotion(s) induced by music. Studies of the distinctions between perception and induction of emotion have demonstrated that both can be subject not only to the social context of the listening experience, but also to personal motivation [24]. There are different approaches as to how
emotion can be conceptualized and described. The main
approaches that exist in the literature are the categori-
cal, the dimensional, and the prototype approach
[25,26].
According to the categorical approach, emotions are conceptualized as discrete, unique entities. According to several discrete emotion theories, there is a certain basic number of emotion categories from which all emotion states are derived, such as happiness, sadness, anger, fear, and disgust [27-32]. Basic emotions are characterized by features having distinct functions; they are found in all cultures, are associated with distinct physiological patterns, are experienced as unique feeling states, and appear early in human development [27,29-32]. In studies investigating music and emotion, the categorical model of emotions has been modified to better represent the emotions induced by music. Emotions such as disgust are often replaced with the emotion of tenderness, which is more suitable in the context of music.
While the categorical approach focuses on the distinct characteristics that distinguish emotions from each other, in the dimensional approach emotions are expressed in a two-dimensional system according to two axes, such as valence and arousal. This type of model was first proposed by Izard [29,30] and later modified by Wundt [33].
The dimensional approach includes Russell's [34] circumplex model of affect, where all affective states arise from two independent systems. One is related to arousal (activation-deactivation) and the other to valence (pleasure-displeasure), and emotions can be perceived as varying degrees of arousal and valence. Thayer [35] suggested that the two dimensions of affect are two arousal dimensions: tension and energetic arousal. The dimensional models have been criticized in the past for their lack of differentiation between neighboring emotions in the valence and arousal dimensions, such as anger and fear [36].
In our study, the Tellegen-Watson-Clark model was employed. This model (depicted in Figure 1) extends
previous dimensional models emphasizing the value of a
hierarchical perspective by integrating existing models
of emotional expressivity.
It analyses a three-level hierarchy incorporating at the
highest level a general bipolar happiness vs. unhappiness
dimension, an independent positive affect versus nega-
tive affect dimension at the second order level below it,
and discrete expressivity factors of joy, sadness, hostility, guilt/shame, and fear at the base. Similarly, a
three-level hierarchical model of affect links the basic
factors of affect at different levels of abstraction and
integrates previous models into a single scheme. The
key to t his hierarchical structure is the recognition that
the general bipolar factor o f happiness and independent
dimensions of PA and NA are better viewed as different
levels of abstraction within a hierarchical model, rather
than as competing models at the same level of abstrac-
tion. At the highest level of this model, the general
bipolar factor of happiness accounts for the tendency
for PA and NA to be moderately negatively correlated.
Therefore, the hierarchical model of affect accounted for
both the bipolarity of pleasantness-unpleasantness and
the independence of PA and NA, effectively resolving a
debate that occupied the literature for decades.
Over the years, a number of different dimensions have been proposed. Wundt [33] proposed a three-dimensional scheme with the dimensions of pleasure-displeasure, arousal-calmness, and tension-relaxation. Schlosberg [37] proposed a three-dimensional model with three main dimensions expressing arousal, valence, and control. A similar model was proposed by Mehrabian [38], who tried to define a three-dimensional model with three basic principles related to pleasure, arousal, and dominance.

Figure 1. The Tellegen-Watson-Clark model of mood (figure reproduced from [51]). The figure arranges emotion adjectives (amazed/surprised, angry/distressed/fearful, discouraged/sad, sleepy/tired, quiet/still, calm/relaxed, happy/joyful, delighted/alert) around the high/low positive affect, high/low negative affect, and pleasantness-unpleasantness axes.
Finally, the prototype approach is based on the fact that language and knowledge structures are associated with how people conceptualize information [39]. The prototype approach effectively combines the categorical and dimensional approaches, providing the individual contents of emotions and the hierarchical relationships among them.
5. Dataset
The dataset used for this work consists of 100 songs
from each of the following 7 different genres: Classical,
Reggae, Rock, Pop, Hip-Hop, Techno, and Jazz. The col-
lection was created from 233 musical albums choosing
three songs from each album. From each song, a period of 30 s after the initial 30 s was extracted.
The resulting sound clips were stored and converted into wave files with a 22,050 Hz sampling rate, 16 bits per sample, and a single (mono) channel. The following subsections present the features that were extracted from each wave file and the emotion labeling process.
5.1. Feature extraction
For the feature extraction process, the Marsyas tool [13] was used, which is a free software framework. It is modular, fast, and flexible for the rapid development and evaluation of computer audition applications and has been commonly used for music emotion classification and MIR tasks.
The extracted features fall into two categories: rhythmic and timbre. We selected these categories of temporal and spectral features due to their high correlation with the valence and arousal dimensions of emotion. For example, songs with a fast tempo are often perceived as having high arousal. A fluent, flowing rhythm is usually associated with positive valence, whereas a firm rhythm is associated with negative valence. Similarly, high arousal often correlates with bright timbre, and low arousal with soft timbre.
(1) Rhythmic features: The rhythmic features were derived by extracting periodic changes from a beat histogram. An algorithm that identifies peaks using autocorrelation was implemented. We selected the two highest peaks and computed their amplitudes, their BPMs (beats per minute), and the high-to-low ratio of their BPMs. In addition, three features were calculated by summing the histogram bins between 40 and 90, 90 and 140, and 140 and 250 BPM, respectively. The whole process led to a total of eight rhythmic features.
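The following sketch shows how such an eight-dimensional rhythmic descriptor could be assembled, assuming a beat histogram indexed by BPM is already available; building the histogram itself (onset envelope and autocorrelation, as in Marsyas) is not shown, and the simple maximum-based selection below stands in for proper peak picking.

    # Eight rhythmic features from a beat histogram `hist`, where hist[b] is the
    # salience of tempo b in BPM (a sketch, not the Marsyas implementation).
    import numpy as np

    def rhythmic_features(hist, lo=40, hi=250):
        bpm = np.arange(len(hist))
        candidates = np.where((bpm >= lo) & (bpm <= hi))[0]
        top2 = candidates[np.argsort(hist[candidates])[-2:]][::-1]   # two highest bins
        p1, p2 = top2
        a1, a2 = hist[p1], hist[p2]                                  # peak amplitudes
        ratio = max(p1, p2) / min(p1, p2)                            # high-to-low BPM ratio
        sums = [hist[(bpm >= a) & (bpm < b)].sum()                   # energy in tempo bands
                for a, b in [(40, 90), (90, 140), (140, 250)]]
        return np.array([a1, a2, float(p1), float(p2), ratio, *sums])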
(2) Timbre features: Mel frequency cepstral coefficients (MFCCs) are used for speech recognition and music modeling [40]. To derive the MFCC features, the signal was divided into frames and the amplitude spectrum was calculated for each frame. Next, its logarithm was taken and converted to the Mel scale. Finally, the discrete cosine transform was applied. We selected the first 13 MFCCs.
Another set of three features related to timbre texture was extracted from the short-time Fourier transform (STFT): spectral centroid, spectral rolloff, and spectral flux. These features model spectral properties of the signal such as the distribution of the amplitude spectrum, brightness, and spectral change.
For each of the 16 aforementioned frame-level features (13 MFCCs and 3 STFT-based features), we calculated the mean, standard deviation (std), mean standard deviation (mean std), and standard deviation of standard deviation (std std) over all frames. This led to a total of 64 timbre features which, together with the 8 rhythmic features, yield the 72 features used in this work.
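A rough sketch of assembling this 72-dimensional vector is shown below, using librosa for the frame-level descriptors rather than Marsyas; the "mean std" and "std std" statistics are approximated as the mean and standard deviation of per-window standard deviations, which is an assumption about the exact Marsyas definition, and the spectral flux is computed by hand from the magnitude spectrogram.

    # Sketch: 13 MFCCs + spectral centroid, rolloff, and flux per frame, summarised
    # by mean, std, mean of windowed stds, and std of windowed stds, then joined
    # with the 8 rhythmic features (16 * 4 + 8 = 72).
    import numpy as np
    import librosa

    def song_features(path, rhythmic, win=43):        # win ~ 1 s at the default 512 hop
        y, sr = librosa.load(path, sr=22050, mono=True, offset=30.0, duration=30.0)
        S = np.abs(librosa.stft(y))
        flux = np.concatenate([[0.0], np.sum(np.diff(S, axis=1).clip(min=0) ** 2, axis=0)])
        frame_feats = np.vstack([
            librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
            librosa.feature.spectral_centroid(S=S),
            librosa.feature.spectral_rolloff(S=S),
            flux[None, :],
        ])                                            # shape: 16 x n_frames
        win_std = np.stack([frame_feats[:, i:i + win].std(axis=1)
                            for i in range(0, frame_feats.shape[1] - win, win)], axis=1)
        stats = np.concatenate([frame_feats.mean(axis=1), frame_feats.std(axis=1),
                                win_std.mean(axis=1), win_std.std(axis=1)])
        return np.concatenate([stats, rhythmic])      # 72-dimensional feature vector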
5.2. Emotion labeling
The Tellegen-Watson-Clark model was employed for labeling the data with emotions. We decided to use this particular model because it presents a powerful way of organizing emotions in terms of their affect appraisals, such as pleasant and unpleasant, and psychological reactions, such as arousal. It is also especially useful for capturing the continuous changes in emotional expression occurring during a piece of music.
The emotional space of music is abstract, with many emotions, and a music application based on mood should combine a series of moods and emotions. To achieve this goal without using an excessive number of labels, we reached a compromise, retaining only six main emotional clusters from this model. The corresponding labels are presented in Table 1.
The sound clips were annotated by a panel of three experts, aged 20, 25, and 30, from the School of Music Studies of our University. All experts had a strong musical background. During the annotation process, all experts were encouraged to mark as many emotion labels as were induced by the music. According to studies of Kivy [41], listeners make a fundamental attribution error in that they habitually take the expressive properties of music for what they feel. This argument is strongly supported by other studies [42] in which listeners were instructed to describe both what they perceived and what they felt in response to different music genres. Meyer [43] argues that when a listener reports that he felt an emotion, he describes the emotion that a passage of music is supposed to indicate rather than what he experienced. Taking this notion into account, we instructed the subjects to label the sound clips according to what they felt rather than what the music produced.

Table 1. Description of emotion clusters

  Label   Description         # of Examples
  L1      Amazed-surprised    173
  L2      Happy-pleased       166
  L3      Relaxing-calm       264
  L4      Quiet-still         148
  L5      Sad-lonely          168
  L6      Angry-fearful       189
Only the songs with completely identical labelings from at least two experts were kept for subsequent experimentation. This process led to a final annotated dataset of 593 songs. Potential reasons for this unexpectedly high agreement of the experts are the short track length and their common background. The last column of Table 1 indicates the number of examples annotated with each label. Out of the 593 songs, 178 were annotated with a single label, 315 with two labels, and 100 with three labels.
6. Empirical comparison of algorithms
6.1. Multi-label classification algorithms
We compare the following multi-label classification algorithms that were introduced in Section 3: BR, 2BR, LP, RAkEL, CLR, ML-kNN, and BP-MLL.
The first two approaches were selected as they are the most basic approaches for multi-label classification tasks. RAkEL and CLR were selected as two recent methods that have been shown to be more effective than the first two. Finally, ML-kNN and BP-MLL were selected as two recent high-performance representatives of algorithm adaptation methods. Apart from BR, none of the other algorithms has been evaluated on music data in the past, to the best of our knowledge.
6.2. Experimental setup
LP, BR, RAkEL, and CLR were run using an SVM as the base classifier. The SVM was trained with a linear kernel and the complexity constant C equal to 1. An SVM with the same setup was used for training both the first-level and the meta-level models of 2BR. The one-against-one strategy was used for dealing with multi-class tasks in the case of LP and RAkEL. RAkEL was run with a subset size equal to 3, a number of models equal to twice the number of labels (12), and a threshold of 0.5, which corresponds to a default parameter setting. 10-fold cross-validation was used for creating the necessary meta-data for 2BR. The number of neighbors in ML-kNN was set to 10 and the smoothing factor to 1, as recommended in [23]. As recommended in [22], BP-MLL was run with a 0.05 learning rate, 100 epochs, and a number of hidden units equal to 20% of the input units.
Ten different 10-fold cross-validation experiments were run for evaluation. The results that follow are averages over these 100 runs of the different algorithms. Experiments were conducted with the aid of the Mulan software library for multi-label classification [44], which includes implementations of all algorithms and evaluation measures. Mulan runs on top of the Weka [45] machine learning library.
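For readers without access to Mulan, the following rough scikit-learn analogue reproduces the general protocol for the BR baseline only (a linear SVM with C = 1 wrapped in one-vs-rest, evaluated over ten repetitions of 10-fold cross-validation); it is an approximation for orientation, not the exact Mulan/Weka setup behind the reported numbers.

    # Approximate scikit-learn analogue of the BR + linear SVM (C = 1) baseline,
    # evaluated with ten repetitions of 10-fold cross-validation.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.model_selection import RepeatedKFold
    from sklearn.metrics import hamming_loss, f1_score

    def evaluate_br(X, Y, repeats=10, folds=10, seed=0):
        cv = RepeatedKFold(n_splits=folds, n_repeats=repeats, random_state=seed)
        hl, f1 = [], []
        for train, test in cv.split(X):
            clf = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X[train], Y[train])
            Z = clf.predict(X[test])
            hl.append(hamming_loss(Y[test], Z))
            f1.append(f1_score(Y[test], Z, average="samples", zero_division=0))
        return np.mean(hl), np.mean(f1)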
6.3. Results
Table 2 shows the predictive performance of the seven competing multi-label classification algorithms using a variety of measures. We evaluate the seven algorithms using three categories of multi-label evaluation measures, namely example-based, label-based, and ranking-based measures. Example-based measures include Hamming loss, accuracy, precision, recall, F1-measure, and subset accuracy. These measures are calculated based on the average differences between the actual and the predicted sets of labels over all test examples.
Label-based measures include micro- and macro-averaged precision, recall, F1-measure, and area under the ROC curve (AUC). Finally, ranking-based measures include one-error, coverage, ranking loss, and average precision.
6.3.1. Example-based
As far as the example-based measures are concerned, RAkEL has a quite competitive performance, being best in Hamming loss, second best in accuracy behind LP, best in the combination of precision and recall (F1), and second best in subset accuracy, behind LP again.
Table 2. Predictive performance of the seven different multi-label algorithms based on a variety of measures

  Measure         BR      LP      RAkEL   2BR     CLR     ML-kNN  BP-MLL
  Hamming loss    0.1943  0.1964  0.1849  0.1953  0.1930  0.2616  0.2064
  Accuracy        0.5185  0.5887  0.5876  0.5293  0.5271  0.3427  0.5626
  Precision       0.6677  0.6840  0.7071  0.6895  0.6649  0.5184  0.6457
  Recall          0.5938  0.7065  0.6962  0.6004  0.6142  0.3802  0.7234
  F1              0.6278  0.6945  0.7009  0.6411  0.6378  0.4379  0.6814
  Subset acc.     0.2759  0.3511  0.3395  0.2839  0.2830  0.1315  0.2869
  Micro prec.     0.7351  0.6760  0.7081  0.7280  0.7270  0.6366  0.6541
  Micro rec.      0.5890  0.7101  0.6925  0.5958  0.6103  0.3803  0.7189
  Micro F1        0.6526  0.6921  0.6993  0.6540  0.6622  0.4741  0.6840
  Micro AUC       0.7465  0.8052  0.8241  0.7475  0.8529  0.7540  0.8474
  Macro prec.     0.6877  0.6727  0.7059  0.6349  0.7036  0.4608  0.6535
  Macro rec.      0.5707  0.7018  0.6765  0.5722  0.5933  0.3471  0.7060
  Macro F1        0.6001  0.6782  0.6768  0.5881  0.6212  0.3716  0.6681
  Macro AUC       0.7343  0.8161  0.8115  0.7317  0.8374  0.7185  0.8344
  One-error       0.3038  0.3389  0.2593  0.2964  0.2512  0.3894  0.2946
  Coverage        2.4378  1.9300  1.9983  2.4482  1.6914  2.2715  1.7664
  Ranking loss    0.2776  0.1867  0.1902  0.2770  0.1456  0.2603  0.1635
  Avg. precision  0.7378  0.7632  0.7983  0.7392  0.8167  0.7104  0.7961
Example-based measures evaluate how well an algorithm calculates a bipartition of the emotions into relevant and irrelevant, given a music title. LP directly models the combinations of labels and manages to perform well in predicting the actual set of relevant labels. RAkEL is based on an ensemble of LP classifiers and, as expected, further improves the good performance of LP. One of the reasons for the good performance of LP is the relatively small number of labels (six emotional clusters). As mentioned in Section 3, LP has problems scaling to large numbers of labels, but RAkEL does not suffer from such scalability issues.
6.3.2. Label-based
As far as the micro- and macro-averaged measures are concerned, LP and RAkEL again excel in the combination of precision and recall (F1), achieving the first two places among their competitors, while BP-MLL immediately follows as third best. The macro F1 measure evaluates the ability of the algorithms to correctly identify the relevance of each label by averaging the performance of individual labels, while the micro F1 measure takes a more holistic approach by summing the distributions of all labels first and then computing a single measure. Both measures evaluate in this case the retrieval of relevant music titles by emotion.
6.3.3. Ranking-based
A first clear pattern that can be noticed is the superiority of CLR as far as the ranking measures are concerned. Based on the pairwise comparisons of labels, it effectively ranks relevant labels higher than irrelevant labels. Therefore, if the goal of a music application were to present an ordered set of emotions for a music title, then CLR should definitely be the algorithm to employ. Such an application, for example, could be one that recommends emotions to human annotators, in order to assist them in their labor-intensive task. The good probability estimates that CLR obtains for the relevance of each label through the voting of all pairwise models are also indicated by the top performance of CLR in the micro- and macro-averaged AUC measures, which are probability based. BP-MLL is also quite good in the ranking measures (apart from one-error) and in the micro- and macro-averaged AUC measures, which indicates that it also computes good estimates of the probability of relevance for each label.
6.3.4. Label prediction accuracy
Table 3 shows the classification accuracy of the algorithms for each label (as if they were independently predicted), along with the average accuracy in the last column. We notice that, based on the ease of prediction, we can rank the labels in the following descending order: L4 (quiet-still), L6 (angry-fearful), L5 (sad-lonely), L1 (amazed-surprised), L3 (relaxing-calm), and L2 (happy-pleased). L4 is the easiest, with a mean accuracy of approximately 88%, followed by L6, L5, L1, and L3 with mean accuracies of approximately 81, 80, 79, and 77%, respectively. The hardest label is L2, with a mean accuracy of approximately 72%.
Based on the results, one can see that the classification model performs better for emotional labels such as L4 (quiet-still) than for L2 (happy-pleased). This is not at all in agreement with past research [46,47] claiming that a happy emotional tone tends to be among the easiest to communicate in music.
An explanation for this result is that happiness is a measure of positive valence and high activity. Expressive cues describing the happiness emotion are fast tempo, small tempo variability, staccato articulation, high sound level, bright timbre, and fast tone attacks, which are more difficult to model using the extracted musical features. On the other hand, the quiet emotion is just a measure of energy, corresponding to the activity dimension only; thus, it can be more successfully described and represented by the features employed.
7. Conclusions and future work
This article investigated the task of multi-label map-
ping of music into emotions. An evaluation of seven
multi-label classification algorithms was performed on
a collection of 593 songs. Among these algorithms,
CLR was the most effective in ranking the emotions
according to relevance to a given song, while RAkEL
was very competitive in providing a bipartition of the
labels into relevant and irrelevant for a given song, as
well as retrieving relevant songs given an emotion. The
overall predictive performance was high and
encourages further investigation of multi-label meth-
ods. The performance per label varied. The subjectivity of the label may influence how well it can be predicted.
Multi-label classifiers such as CLR and RAkEL could be used for the automated annotation of large music collections with multiple emotions. This in turn would support the implementation of music information retrieval systems that query music collections by emotion.
Such a querying capability would be useful for song
selection in various applications.
Table 3. Accuracy of the seven multi-label classification algorithms per label

  Label  BR      LP      RAkEL   2BR     CLR     ML-kNN  BP-MLL  Avg.
  L1     0.7900  0.7907  0.7976  0.7900  0.7954  0.7446  0.7871  0.7851
  L2     0.7115  0.7380  0.7584  0.7113  0.7137  0.7195  0.7161  0.7241
  L3     0.7720  0.7705  0.7804  0.7661  0.7735  0.7221  0.7712  0.7651
  L4     0.8997  0.8992  0.9019  0.9002  0.8970  0.7969  0.8923  0.8839
  L5     0.8287  0.8093  0.8250  0.8283  0.8295  0.7051  0.7894  0.8022
  L6     0.8322  0.8142  0.8275  0.8320  0.8325  0.7422  0.8054  0.8123
Interesting future work directions are the incorporation of features based on song lyrics [48,49], as well as experimentation with hierarchical multi-label classification approaches [50], based on a hierarchical organization of emotions.
Endnotes
a.
b.
c.
/>List of abbreviations
AUC: area under the ROC curve; BP-MLL: multi-label back-propagation; BR: binary relevance; CLR: calibrated label ranking; kNN: k nearest neighbors; LP: label powerset; ML-kNN: multi-label k-nearest neighbor; RAkEL: random k-labelsets; RPC: ranking by pairwise comparison; SVM: support vector machine.
Author details
^1 Department of Journalism and Mass Communication, Aristotle University of Thessaloniki, Thessaloniki, 54124, Greece. ^2 Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, 54124, Greece.

Competing interests
The authors declare that they have no competing interests.
Received: 17 January 2011 Accepted: 18 September 2011
Published: 18 September 2011
References
1. M Tolos, R Tato, T Kemp, Mood-based navigation through large collections
of musical data, in Proceedings of the 2nd IEEE Consumer Communications
and Networking Conference (CCNC 2005),71–75, (3-6 January 2005)
2. R Cai, C Zhang, C Wang, L Zhang, W-Y Ma, Musicsense: contextual music
recommendation using emotional allocation modeling, in Proceedings of
the 15th International Conference on Multimedia, 553–556 (2007)
3. L Lu, D Liu, H-J Zhang, Automatic mood detection and tracking of music
audio signals. IEEE Trans Audio Speech Lang Process. 14(1), 5–18 (2006)
4. Y-H Yang, C-C Liu, H-H Chen, Music emotion classification: a fuzzy
approach, in Proceedings of ACM Multimedia 2006 (MM’06),81–84 (2006)
5. Y-H Yang, Y-C Lin, Y-F Su, H-H Chen, A regression approach to music
emotion recognition. IEEE Trans Audio Speech Lang Process. 16(2), 448–457
(2008)
6. T Li, M Ogihara, Detecting emotion in music, in Proceedings of the
International Symposium on Music Information Retrieval, USA, 239–240
7. T Li, M Ogihara, Toward intelligent music information retrieval. IEEE Trans
Multimedia 8(3), 564–574 (2006)
8. A Wieczorkowska, P Synak, ZW Ras, Multi-label classification of emotions in
music, in Proceedings of the 2006 International Conference on Intelligent
Information Processing and Web Mining (IIPWM’06), 307–315 (2006)
9. K Trohidis, G Tsoumakas, G Kalliris, I Vlahavas, Multilabel classification of
music into emotions, in Proceedings of the 9th International Conference on
Music Information Retrieval (ISMIR 2008), USA (2008)
10. E Schubert, Measurement and Time Series Analysis of Emotion in Music. PhD
thesis, University of New South Wales, (1999)

11. X Hu, JS Downie, Exploring mood metadata: relationships with genre, artist
and usage metadata, in Proceedings of the 8th International Conference on
Music Information Retrieval (ISMIR 2007),67–72 (2007)
12. R Fiebrink, I Fujinaga, Feature selection pitfalls and music classification, in
Proceedings of the International Conference on Music Information Retrieval
(ISMIR 2006), 340–341 (2006)
13. G Tzanetakis, P Cook, Musical genre classification of audio signals. IEEE
Trans Speech Audio Process. 10(5), 293–302 (2002). doi:10.1109/
TSA.2002.800560
14. F Pachet, P Roy, Improving multilabel analysis of music titles: A large-scale
validation of the correction approach. IEEE Trans Audio Speech Lang
Process. 17(2), 335–343 (2009)
15. G Tsoumakas, I Katakis, I Vlahavas, Mining multi-label data, in Data Mining and Knowledge Discovery Handbook, 2nd edn., ed. by O Maimon, L Rokach (Springer, 2010), pp. 667–685
16. DH Wolpert, Stacked generalization. Neural Netw. 5, 241–259 (1992).
doi:10.1016/S0893-6080(05)80023-1
17. S Godbole, S Sarawagi, Discriminative methods for multi-labeled classification, in Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2004), 22–30 (2004)
18. G Tsoumakas, I Katakis, Multi-label classification: an overview. Int J Data
Warehousing Mining 3(3), 1–13 (2007)
19. G Tsoumakas, I Vlahavas, Random k-labelsets: An ensemble method for
multilabel classification, in Proceedings of the 18th European Conference on
Machine Learning (ECML 2007), Poland, 406–417 (2007)
20. E Hüllermeier, J Fürnkranz, W Cheng, K Brinker, Label ranking by learning pairwise preferences. Artif Intell. 172(16-17), 1897–1916 (2008). doi:10.1016/j.artint.2008.08.002
21. J Fürnkranz, E Hüllermeier, EL Mencia, K Brinker, Multilabel classification via

calibrated label ranking. Mach Learn. 73(2), 133–153 (2008). doi:10.1007/
s10994-008-5064-8
22. M-L Zhang, Z-H Zhou, Multi-label neural networks with applications to
functional genomics and text categorization. IEEE Trans Knowl Data Eng.
18(10), 1338–1351 (2006)
23. M-L Zhang, Z-H Zhou, ML-KNN: a lazy learning approach to multi-label
learning. Pattern Recog. 40(7), 2038–2048 (2007). doi:10.1016/j.
patcog.2006.12.019
24. PN Juslin, P Laukka, Expression, perception and induction of musical emotions: a review and a questionnaire study of everyday listening. J New Music Res. 33, 217–238 (2004). doi:10.1080/0929821042000317813
25. PN Juslin, JA Sloboda (Eds.), Music and Emotion: Theory and Research.
(Oxford University Press, New York, 2001)
26. T Eerola, JK Vuoskoski, A comparison of the discrete and dimensional
models of emotion in music. Psychol Music. 39(1), 18–49 (2011).
doi:10.1177/0305735610362821
27. P Ekman, An argument for basic emotions, Cognition Emotion. 6, 169–200
(1992)
28. PR Farnsworth, A study of the Hevner adjective list. J Aesth Art Crit. 13,
97–103 (1954). doi:10.2307/427021
29. CE Izard, The Emotions (Plenum Press, New York, 1977)
30. R Plutchik, The Psychology and Biology of Emotion (Harper Collins, New York,
1994)
31. J Panksepp, A critical role for affective neuroscience in resolving what is
basic about basic emotions. Psychol Rev. 99, 554–560 (1992)
32. K Oatley, Best laid schemes. The Psychology of Emotions (Harvard University
Press, MA, 1992)
33. WM Wundt, Outlines of Psychology (Wilhelm Engelmann, Leipzig, 1897)
(Translated by CH Judd)
34. JA Russell, A circumplex model of affect. J Pers Soc Psychol. 39, 1161–1178 (1980)
35. RE Thayer, The Biopsychology of Mood and Arousal (Oxford University Press, 1989)
36. A Tellegen, D Watson, LA Clark, On the dimensional and hierarchical structure of affect. Psychol Sci. 10(4), 297–303 (1999). doi:10.1111/1467-9280.00157
37. H Schlosberg, Three dimensions of emotion. Psychol Rev. 61(2), 81–88
(1954)
38. A Mehrabian, Pleasure-arousal-dominance: a general framework for
describing and measuring individual. Curr Psychol. 14(4), 261–292 (1996).
doi:10.1007/BF02686918
39. E Rosch, Principles of categorization, in Cognition and Categorization, ed. by E Rosch, BB Lloyd (Hillsdale, NJ, 1978), pp. 27–48
40. B Logan, Mel frequency cepstral coefficients for music modeling, in
Proceedings of the 1st International Symposium on Music Information Retrieval
(ISMIR 2000), Plymouth, Massachusetts (2000)
41. P Kivy, Sound Sentiment. An Essay on the Musical Emotions (Temple
University Press, Philadelphia, PA, 1989)
42. MR Zentner, S Meylan, KR Scherer, Exploring ‘musical emotions’ across five
genres of music, in Proceedings of 6th International Conference of society for
Music Perception and Cognition (ICMPC) (2000)
43. LB Meyer, Emotion and Meaning in Music (University of Chicago Press,
Chicago, 1956)
44. G Tsoumakas, E Spyromitros-Xioufis, J Vilcek, I Vlahavas, Mulan: a Java library for multi-label learning. J Mach Learn Res. 12, 2411–2414 (2011)
45. IH Witten, E Frank, Data Mining: Practical Machine Learning Tools and

Techniques. (Morgan Kaufmann, 2011)
46. A Gabrielsson, PN Juslin, Emotional expression in music performance: between the performer's intention and the listener's experience. Psychol Music. 24(1), 68–91 (1996). doi:10.1177/0305735696241007
47. CL Krumhansl, An exploratory study of musical emotions and psychophysiology. Can J Exp Psychol. 51, 336–353 (1997)
48. C Laurier, J Grivolla, P Herrera, Multimodal music mood classification using
audio and lyrics, in Proceedings of the International Conference on Machine
Learning and Applications, USA (2008)
49. Y-H Yang, Y-C Lin, H-T Cheng, I-B Liao, Y-C Ho, H-H Chen, Toward multi-
modal music emotion classification, in Proceedings of the 9th Pacific Rim
Conference on Multimedia (PCM 2008), pp. 70–79 (2008)
50. C Vens, J Struyf, L Schietgat, S Džeroski, H Blockeel, Decision trees for
hierarchical multi-label classification. Mach Learn. 73(2), 185–214 (2008).
doi:10.1007/s10994-008-5077-3
51. D Yang, W Lee, Disambiguating music emotion using software agents. in
Proceedings of the 5th International Conference on Music Information Retrieval
(ISMIR’04), Barcelona, Spain, (2004)
doi:10.1186/1687-4722-2011-426793
Cite this article as: Trohidis et al.: Multi-label classification of music by
emotion. EURASIP Journal on Audio, Speech, and Music Processing 2011
2011:4.