key: cord-0679257-10x96dvd authors: Vaswani, Kunal; Agrawal, Yudhik; Alluri, Vinoo title: Multimodal Fusion Based Attentive Networks for Sequential Music Recommendation date: 2021-10-03 journal: nan DOI: nan sha: 2e8ebc46a4ee3916a0389a3082e62a988b134249 doc_id: 679257 cord_uid: 10x96dvd

Music has the power to evoke intense emotional experiences and regulate the mood of an individual. With the advent of online streaming services, research in music recommendation has seen tremendous progress. Modern methods that leverage users' listening histories for session-based song recommendation have overlooked the significance of features extracted from lyrics and acoustic content. We address the task of song prediction through multiple modalities, including tags, lyrics, and acoustic content. In this paper, we propose a novel deep learning approach that refines Attentive Neural Networks using representations derived via a Transformer model for lyrics and a Variational Autoencoder for acoustic features. Our model achieves a significant improvement in performance over existing state-of-the-art models using lyrical and acoustic features alone. Furthermore, we conduct a study to investigate the impact of users' psychological health on our model's performance.

Music subscription platforms have seen a massive increase in subscriptions amid the COVID-19 outbreak. With the rise of isolation, music has given societies a means to cope with adversity. One of the primary areas these platforms have targeted to enhance user engagement is Big Data. Big Music Data includes massive song libraries and the temporal engagement obtained from users' listening patterns. Thanks to the enormous amount of data produced by music streaming services, researchers in music recommendation have refined their algorithms to create a more compelling user experience. Recommendation systems (RS) play a critical role in improving the user experience and increasing user growth on these platforms. Collaborative Filtering (CF) and Content-Based (CB) approaches have gained prominence in music RS. CF techniques used in the past rely on combining users' preferences through user-item interactions [1] but suffer from cold-start problems, such as handling new items or users. On the other hand, CB methods have gained popularity in recent years since they use track-based content for suggestions and therefore cope better with cold-start problems. Music is predominantly represented in three modalities: acoustic content, user-defined tags comprising song/track descriptors (e.g., artist, genre, mood, instruments), and lyrics. Music RS developed in the field of Music Information Retrieval (MIR) research typically utilize information from either a single modality (e.g., only tags) [2] or at most two modalities (e.g., acoustic features + text) [3]. However, few studies have created a multimodal system that capitalizes on joint information across modalities. The majority of music RS have relied on acoustic features for their design [4], [5]. Acoustic features also play a vital role in delivering enriched musical representations, as shown in studies such as emotion recognition [6]. Furthermore, the acoustic characteristics of users' preferred musical genres are known to influence the songs they download even in their non-preferred genres [7]. Emotions and user traits are essential factors in musical taste and hence necessary for music recommendation [8]-[10].
We provide a novel method for incorporating acoustic features into downstream tasks by generating latent representations with a variational autoencoder. Recent MIR studies extracting emotions from lyrics [11] have revealed the depth of information embedded in lyrics. Advanced neural techniques such as transformers, which are not limited to capturing short-range patterns in lyrical data, are the critical reason for progress in these approaches. Such techniques have been significantly under-explored in music RS; we address this by embedding information from lyrics using sentence transformers. In addition to lyrics and acoustic features, another modality, user-defined tags, has provided a means to consolidate user interests in music RS [2]. Surana et al. [12], [13] demonstrate how user-specific states, represented by psychological well-being, modulate musical choices characterized by tags and acoustic features on online streaming platforms. However, as mentioned before, there is a dearth of studies that combine these modalities to model user behavior and create highly personalized music RS. A key step in the design of a music RS is to dynamically predict the user's current song preferences based on previous listening history. Aside from the multimodal characteristics that can be used to build a music RS, integrating users' temporal engagement is a challenging task in itself. Deep learning algorithms have become influential in predicting songs based on a user's listening history. Although other architectures, such as CNNs, have been shown to do well in deep learning tasks, we prefer Attentive Networks with recurrent neural units because of their superior ability to capture sequential information. With the increase in modalities for our task, we chose the path of multimodal fusion [14], [15], where individual models first focus on extracting features from individual modalities, which are subsequently combined via a deep learning architecture to predict song preferences. To demonstrate the effect of individual states on the performance of our model, we perform, for the first time, a case study evaluating our model on users categorized as at risk for depression. This work thereby also highlights the importance of incorporating individual differences in future music RS. We evaluate the effectiveness of our approach using a large dataset from Last.fm, which includes 413k unique songs from 541 users, and compare it with state-of-the-art models.

Collaborative recommendation techniques rely on users' ratings of various tracks in the system. The cold-start problem is one of the primary challenges that early collaborative systems [16] in music recommendation encounter. When a new track is uploaded to the system, the algorithm finds it challenging to propose it to users, as there are few interactions between users and the track. The problem becomes especially difficult on large platforms such as Spotify, which hosts over 70 million tracks while users interact with only a tiny percentage of them. Owing to the cold-start problems faced by collaborative filtering, content-based approaches have gained popularity in music RS. These approaches use the data from music content to produce latent item vectors, which are then used in collaborative filtering methods or to create hybrid models. For example, convolutional neural networks were used by van den Oord et al. [17] to predict latent factors from audio signals and compared with traditional bag-of-words models.
In their experiments, CNNs show superior performance on a music recommendation task, demonstrating the advantage of modern deep learning techniques over traditional approaches. Currently, a commonly used way to obtain acoustic features is Spotify's API. For example, Zangerle et al. [4] integrate Spotify audio features in their model to describe the musical preferences of users. Audio features have also been used in conjunction with other modalities; Oramas et al. [3] use CNNs to create track embeddings from audio signals and fuse them with artist embeddings created from artist biographies. They show a performance improvement compared to using only artist embeddings or track embeddings, demonstrating the importance of fusing modalities.

The fast-growing advancements in Natural Language Processing techniques for creating representations of lyrics and other text data (e.g., user reviews, artist biographies) are gaining importance. For example, Lin et al. [18] use textual embeddings in their architecture, but their approach is limited to traditional paragraph vectors [19]. Similarly, Gossi and Gunes [20] use early TF-IDF methods to model lyrical data. Vystrčilová and Peška [21] provide a comparative study on the usage of various NLP techniques to capture lyrical embeddings. Though Vystrčilová and Peška use some of the latest techniques for extracting lyrical features, they do not use deep learning architectures for song prediction. Nonetheless, advanced NLP techniques such as transformer models [22], [23], which can capture long-term dependencies in text and are therefore better suited to handling lyrical data, remain unexplored in deep learning-based music RS.

We tend to listen to music in a certain order interspersed with periods of inactivity. The listening periods of users, surrounded by periods of inactivity, are called sessions. A user's states and traits influence these patterns. For example, individuals scoring high on trait neuroticism, who are also characterized by high psychological distress, have been found to engage in repetitive listening patterns across sessions [13]. Hence, the temporal evolution of music consumption captures relevant user-specific preferences. Sequence-aware RS is an apt choice for this since it incorporates temporal dependencies and benefits from session-based patterns. Session-based approaches have been used to model sequential data; they work by feeding user clicks into RNNs and generating predictions based on the user's previous clicks [24]. Furthermore, Attentive Neural Networks are recognized to enhance efficiency over these by paying attention to the user's session history: attention is applied to the RNN outputs produced by the sessions, as observed in the experiments conducted by Sachdeva et al. [2] and Lin et al. [18]. Sachdeva et al. focus only on the tag modality and use one-hot encoded inputs for it. They generate tag representations while performing song predictions and train their architecture end-to-end. This poses difficulties when working with Big Data and limits the scalability of the approach. Experiments by Lin et al. provide a direction for resolving these issues using latent representations from graphic, textual, and visual data. However, they do not use more relevant data, such as acoustic content, for extracting song features. Also, we demonstrate how more advanced techniques can be used to generate representations for the modalities they utilize, such as lyrics.
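As an aside, transformer-based sentence encoders of the kind referred to above can produce lyric representations in a few lines; the snippet below is only an illustration, with the model name, placeholder lyrics, and dimensionality reduction being our assumptions (our actual embedding pipeline is described later in the paper).

```python
# Hypothetical example of transformer-based lyric embeddings; not the paper's exact setup.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

lyrics = ["first song lyrics ...", "second song lyrics ..."]  # placeholder catalogue
encoder = SentenceTransformer("all-mpnet-base-v2")            # an SBERT model producing 768-d vectors
embeddings_768 = encoder.encode(lyrics)                       # (num_songs, 768)
# With a full catalogue, reduce to a smaller dimensionality (guarded here for the tiny placeholder list).
embeddings_reduced = PCA(n_components=min(150, len(lyrics))).fit_transform(embeddings_768)
```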
Based on the existing research mentioned above, we focus on utilizing various song representations in a multimodal fusion approach. First, we leverage sessions instead of users' entire listening histories to generate sequential representations. Further, we take advantage of individual modalities, such as acoustic features and the undervalued lyrics, to show their impact, and conduct song-prediction experiments by combining them with the sequential representations. Eventually, we fuse the information coming from all the modalities to obtain a multimodal architecture. We also present a comparative study in which we examine the value of the information from each model and whether any complementary information exists that could benefit their fusion. Finally, since user states modulate musical choices, we further evaluate our model on two groups of individuals: those at risk for depression (At-risk) and those not at risk (No-risk). Owing to the aforementioned repetitive listening patterns exhibited by at-risk individuals, we predict that our model will demonstrate higher prediction accuracy for them when compared to the No-risk group.

Data was obtained from a previous study [12] comprising the listening histories of 541 Last.fm users (82 females, mean age = 25.4 years, std = 7.3 years). Most of them were from the United States and the United Kingdom, accounting for about 30% and 10% of the participants, respectively. Every other country contributed less than 5% of the total participants. The participants' listening histories were extracted for a duration of 6 months around the time they took part in the survey, which collected their well-being scores. The respective users' well-being was assessed using a standard diagnostic questionnaire (Kessler's Psychological Distress Scale, K-10) [25] along with personality information assessed using the Big Five model. K-10 is a distress scale used to evaluate depression and anxiety symptoms. To assess depression risk, individuals were divided into two groups, "At-risk" and "No-risk," based on the K-10 scale. Those classified as "At-risk" have a K-10 score of 29 or more, while those classified as "No-risk" have a score of less than 20. Additional measures include musical engagement, which describes music consumption behavior and is yet another indirect measure of well-being. Personality data and musical engagement are not used in the current analysis, but they were used in the original study to assess internal consistency, which was found to be high.

Sessions: We used sessions as the basis for time resolution since we wanted to capture variations in users' temporal engagement. A session is defined as a period of continuous listening activity, surrounded by periods of inactivity of at least two hours. The concept of sessions was used since people's preferences may differ across sessions, and recommendations that ignore these differences may not be useful [26]. Sessions with fewer than five songs were discarded. Further statistics, such as the number of sessions, the number of unique songs, and the average length of a session, are included in Table I.

Track Embeddings: Each track (we use the terms song and track interchangeably) in a session consists of a user ID, song name, artist name, and timestamp. The listening histories of all the users are used to prepare track embeddings, which are used as the initialization of our architecture's embedding layer. Our approach for obtaining these is based on the CBOW model by Mikolov et al. [27], as sketched below.
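A minimal sketch of this preprocessing, assuming listening histories are available as pandas DataFrames (the column names and the all_histories container are hypothetical): sessions are cut at inactivity gaps of at least two hours, short sessions are dropped, and gensim's Word2Vec in CBOW mode is trained on sessions treated as "sentences" of track IDs.

```python
# Sketch of session segmentation and Word2Vec (CBOW) track embeddings.
# Column names and the `all_histories` dict are assumptions, not the authors' code.
import pandas as pd
from gensim.models import Word2Vec  # gensim >= 4.x API

def to_sessions(history: pd.DataFrame, gap_hours: int = 2, min_len: int = 5):
    """history: one user's plays with a datetime 'timestamp' and a string 'track_id' column."""
    history = history.sort_values("timestamp")
    new_session = history["timestamp"].diff() > pd.Timedelta(hours=gap_hours)
    session_id = new_session.cumsum()
    sessions = [g["track_id"].tolist() for _, g in history.groupby(session_id)]
    return [s for s in sessions if len(s) >= min_len]  # discard sessions with fewer than 5 songs

# all_histories: {user_id: DataFrame} (hypothetical container holding every user's history)
corpus = [session for df in all_histories.values() for session in to_sessions(df)]

# sg=0 selects CBOW; vector_size is set to 150 to match the embedding size reported later,
# with the remaining parameters left at gensim defaults.
track2vec = Word2Vec(sentences=corpus, vector_size=150, sg=0)
track_vectors = track2vec.wv  # track_vectors["<track_id>"] -> 150-d track embedding
```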
We employ a strategy similar to Word2Vec's by grouping songs that users frequently listen to together. The objective is to place songs that are likely to co-occur in a session close together in the embedding space before passing them to the GRU network. A variety of models pursuing this approach have shown performance improvements [28]. We use the gensim library [29] with its default parameters for Word2Vec.

Acoustic Embeddings: For each track, 11 audio features are obtained using the Spotify API. These features are Acousticness (probability that a track is acoustic); Danceability (how suitable a track is for dancing); Duration (of the track); Energy (perceptual measure of intensity and activity in a track); Instrumentalness (measure of a track containing no vocals); Liveness (probability that the track was performed in the presence of an audience); Loudness (average loudness of the track in decibels); Speechiness (presence of spoken words in a track); Tempo (pace of the track in beats per minute (BPM)); Valence (pleasantness conveyed by a track); and Mode (major or minor). Instead of passing the acoustic features to linear layers and learning the weights of those layers end-to-end, we generate latent vectors for them using an unsupervised learning task. A Variational Autoencoder (VAE) is used to extract the acoustic embeddings. VAEs improve upon conventional autoencoders by producing a continuous latent space, so decoding any point from this space yields an acceptable representation that resembles the input [30]. The significance of this for our architecture is discussed in Section III-C. For any given track, we project its 11-dimensional feature vector s_i to a 150-dimensional latent vector z_i, which is used as the acoustic embedding for the song. We establish some notation for later use: the encoder and decoder shown in Fig. 1 are represented by the functions G_e and G_d, respectively. A loss is computed between the reconstructed vector y_i and the original vector s_i and used for training.

Lyrical Embeddings: To obtain lyrical features, we first extract the lyrics of each track using the Genius API. The API query requires the correct artist and track name, so we use a web crawler to obtain the Genius website URL for a song's lyrics instead of hard-coding the artist and track name in the Genius API. We use the Sentence-BERT [31] model to compute lyrical embeddings. In addition to the benefits of transformer models, Sentence-BERT employs siamese and triplet networks to give semantically meaningful sentence embeddings. 768-dimensional embeddings are produced for the song lyrics and then reduced to 150 dimensions using Principal Component Analysis. Because some songs contain only acoustic content and others lack lyrical data, lyrics were accessible for only roughly 80% of the tracks. We pass a zero vector as the embedding for songs without lyrics.

Tag Embeddings: Last.fm tags were extracted via the Last.fm API for individual tracks. 300-dimensional FastText embeddings [32] were used for the individual words in the tags and reduced to 150 dimensions using linear layers. The embeddings of all the words in a track's tags were then averaged to create the song's tag embedding. Since tags were not available for all tracks in the dataset, we create a subset dataset containing only those tracks with available tags and report results on it.
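To make the acoustic-embedding step concrete, here is a minimal PyTorch sketch of a VAE of the kind described above, mapping the 11 Spotify features to a 150-dimensional latent vector; the hidden-layer size, the loss weighting, and all names are our assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of the acoustic VAE: encoder G_e maps the 11-d feature vector s_i to a
# 150-d latent z_i; decoder G_d reconstructs y_i. Hidden size is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticVAE(nn.Module):
    def __init__(self, in_dim=11, latent_dim=150, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())   # G_e
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, in_dim))               # G_d

    def forward(self, s):
        h = self.encoder(s)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        y = self.decoder(z)                                      # reconstruction y_i
        return y, mu, logvar

def vae_loss(y, s, mu, logvar):
    recon = F.mse_loss(y, s, reduction="sum")                     # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL divergence term
    return recon + kl

# After training on all tracks' feature vectors, the encoder output (e.g., mu) serves
# as the 150-d acoustic embedding z_i of each track.
```

The continuous latent space produced this way is what later allows an embedding layer initialised from these vectors to be fine-tuned without leaving a meaningful region of the space.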
Table I shows a description of our dataset along with the aforementioned subset dataset.

Song Prediction Task: Our models are fed a series of tracks in the order {x_1, x_2, ..., x_n}, along with their attributes (lyrical, acoustic, tags), and our goal is to predict relevant songs for the user. Fig. 2 illustrates our proposed architecture, explained in this section (using lyrical and acoustic embeddings). The architecture starts with one-hot encodings of the tracks {x_1, x_2, ..., x_n}, which are passed to our first embedding layer E_1. This embedding layer is initialized from the track embeddings {e_1, e_2, ..., e_n} obtained in Section III-B. These are then passed to a BiGRU to produce two bidirectional hidden states (which we add to obtain a single vector h) for each time step. Additive attention [33] is applied to the hidden states to obtain a set of attention weights {α_1, α_2, ..., α_n}. Our first context vector captures sequential information and is computed as the weighted sum of the hidden states using these weights. Further, the acoustic embeddings {z_1, z_2, ..., z_n} and the lyrical embeddings {l_1, l_2, ..., l_n} are incorporated into the architecture. These are used as the initializations of our second (E_2) and third (E_3) embedding layers, respectively. The initializations of all three embedding layers are fine-tuned during training. Here the importance of using a VAE becomes apparent: the fine-tuned representations produced by the embedding layer remain valid because they lie in the continuous latent space produced by the VAE [30]. Finally, separate context vectors are generated as weighted sums of these features using the previously produced attention weights. The context vectors are concatenated and then passed through a fully connected (FC) layer with a Leaky ReLU activation to obtain a lower-dimensional vector c. Finally, an output vector with a size equal to the number of songs in the vocabulary is obtained by passing c through another fully connected layer and a softmax operation. This output vector gives the probabilities for all tracks, which are used for recommending the next track.

In the results, various combinations of modalities are examined; as a reference, we define a set of notations for them here. The attention architecture using only the context vector from the track embeddings (obtained using Word2Vec) is denoted ANN-Word2Vec (ANNW). Next, the architectures in which the context vectors from acoustic and lyrical embeddings are combined individually are denoted ANNW + Acoustic and ANNW + Lyrics. Finally, we denote our complete architecture as ANNW + Acoustic + Lyrics. We use the procedure described in (5) to add another context vector for tag embeddings and concatenate it with the rest. This model's results are provided only for the subset dataset due to the unavailability of tags.

Network training: We use an Nvidia GTX 1080Ti with 11 GB of VRAM to train our models. The embeddings, GRU hidden vectors, and latent vectors generated by the VAE each have a size of 150. The size of the smaller hidden vector produced in (6) was kept at 256. The size of the one-hot encoded song inputs was kept equal to the number of tracks in our complete dataset. We use the Adam optimizer [34] with an initial learning rate of 1e-3 and Cross-Entropy Loss to train the complete architecture. A batch size of 32 and 0.2 dropout regularization were used for the embedding layers. PyTorch [35] was used for the implementation of the complete architecture.

Fig. 2. Overview of our proposed music recommendation architecture (ANN-Word2Vec + Acoustic + Lyrics). Given a series of tracks x_i, we first pass them through our first embedding layer E_1 (track embeddings), whose outputs are passed through a BiGRU to produce hidden states h_i. The attention layer then generates attention weights α_i from the hidden states. The VAE encoder and Sentence-BERT models are used to generate representations from the acoustic features (s_i) and the lyrics, which are used as the initializations of embedding layers E_2 and E_3, respectively. Concurrently, the sequence of tracks is passed through embedding layers E_2 (acoustic embeddings) and E_3 (lyrical embeddings) to generate z_i and l_i, respectively. Subsequently, individual context vectors are produced using the previously generated attention weights. The concatenated context vector is used for recommending the next track.
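As a complement to the caption above, the following is a minimal PyTorch-style sketch of the fusion forward pass as we read the description (the embedding size of 150 and fused size of 256 are as stated; class and variable names, and details such as how the bidirectional states are summed, are our assumptions and not the authors' released code).

```python
# Sketch of the ANNW + Acoustic + Lyrics forward pass (an interpretation, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalAttentiveRecommender(nn.Module):
    def __init__(self, num_tracks, track_init, acoustic_init, lyric_init,
                 emb_dim=150, fused_dim=256):
        super().__init__()
        # E_1/E_2/E_3: initialised from Word2Vec, VAE, and SBERT+PCA vectors,
        # each a FloatTensor of shape (num_tracks, emb_dim); fine-tuned (freeze=False).
        self.track_emb = nn.Embedding.from_pretrained(track_init, freeze=False)
        self.acoustic_emb = nn.Embedding.from_pretrained(acoustic_init, freeze=False)
        self.lyric_emb = nn.Embedding.from_pretrained(lyric_init, freeze=False)
        self.bigru = nn.GRU(emb_dim, emb_dim, batch_first=True, bidirectional=True)
        # additive (Bahdanau-style) attention over the summed bidirectional states
        self.att_w = nn.Linear(emb_dim, emb_dim)
        self.att_v = nn.Linear(emb_dim, 1, bias=False)
        self.fc = nn.Linear(3 * emb_dim, fused_dim)
        self.out = nn.Linear(fused_dim, num_tracks)

    def forward(self, track_ids):                         # track_ids: (batch, seq_len)
        e = self.track_emb(track_ids)                     # (B, T, 150)
        h, _ = self.bigru(e)                              # (B, T, 300)
        d = h.size(-1) // 2
        h = h[..., :d] + h[..., d:]                       # sum the two directions -> (B, T, 150)
        alpha = F.softmax(self.att_v(torch.tanh(self.att_w(h))), dim=1)   # (B, T, 1) weights
        ctx_seq = (alpha * h).sum(dim=1)                                  # sequential context
        ctx_acoustic = (alpha * self.acoustic_emb(track_ids)).sum(dim=1)  # acoustic context
        ctx_lyric = (alpha * self.lyric_emb(track_ids)).sum(dim=1)        # lyrical context
        c = F.leaky_relu(self.fc(torch.cat([ctx_seq, ctx_acoustic, ctx_lyric], dim=-1)))
        return self.out(c)  # logits over all tracks; softmax/cross-entropy applied at the loss
```

Training with torch.nn.CrossEntropyLoss on the index of the next track then corresponds to the described softmax over the full song vocabulary.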
Evaluation Metrics: To establish a fair comparison, we used the same measures as Sachdeva et al. [2]. The training data was formed from the first 70 percent of sessions for each user, in order of occurrence, and the remaining 30 percent was used for testing. We iterate through the listening histories of the users and use the next song for evaluation while giving the songs up to that point as input. The evaluation metric for all the models is HitRatio@k [36], where k is the number of songs predicted and a hit means that the required song is in the prediction set. Since we are attempting to predict the top-k items for users in the context of recommendation systems, we chose this metric to evaluate the sessions obtained from individual users' listening histories. The training sessions were used for pre-training the track embeddings.

In this section, we show a comprehensive evaluation of the proposed model and benchmark it against recent state-of-the-art optimization- and deep learning-based algorithms. All of the trained models and code shall be made publicly available. The following is a list of the baseline models we chose to compare our model against, along with the reasoning behind our choices.

1) GRU4REC: We use the implementation of the model proposed by Hidasi et al. [24]. Similar to our approach, they used GRU-based RNNs to model sessions of users' listening histories. The input to the model is the one-hot encoding of the tracks.

2) ANN: In line with the work done by Sachdeva et al. [2], a base architecture was created using only sequential knowledge from user histories. The network's input is a one-hot encoding of the songs, fed to an embedding layer that generates song embeddings. These embeddings are then passed to a BiGRU with an attention layer on top of it to obtain a context vector. We adopt this as our baseline model since their work on using Attentive Networks for song prediction is closest to ours.

3) ANN-LSA: To contrast the impact of the Word2Vec initializations, we compare them with another technique for creating track embeddings, sketched below. First, a session-track matrix is created for the combined set of all sessions of each user. Each row indicates the frequencies of songs in that session. Owing to the sparsity of this matrix, we perform latent semantic analysis (LSA), which finds a low-rank approximation of it using singular value decomposition. Finally, the column vectors of this reduced matrix are used as track embeddings in place of the embeddings obtained by Word2Vec.
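An illustrative sketch of this baseline's embedding step follows; the library choice, function name, and use of raw counts are our assumptions, since the paper does not specify the implementation.

```python
# Sketch of ANN-LSA track embeddings: session-track count matrix -> truncated SVD.
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

def lsa_track_embeddings(sessions, num_tracks, dim=150):
    """sessions: list of sessions, each a list of integer track indices."""
    rows, cols, vals = [], [], []
    for i, session in enumerate(sessions):
        for track in session:
            rows.append(i); cols.append(track); vals.append(1.0)  # duplicate entries sum to counts
    m = csr_matrix((vals, (rows, cols)), shape=(len(sessions), num_tracks))
    svd = TruncatedSVD(n_components=dim)   # low-rank approximation via SVD
    svd.fit(m)
    # each column of the reduced matrix corresponds to one track
    return svd.components_.T               # (num_tracks, dim) track embeddings
```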
The results in Table II demonstrate the superior performance of our method on the test data when compared to studies that have attempted the same task. Our model shows a 35% improvement over the state-of-the-art ANN approach [2]. The numbers in the table correspond to our evaluation metric (HitRatio@k); as k increases, the model's prediction accuracy improves due to the larger prediction sets. We investigate several combinations of the modalities to gain a better understanding of our proposed architecture. The results of these experiments on our evaluation metric (HitRatio@k) can be found in Table III.

Complete Dataset: To begin with, ANNW's better performance justifies its advantage in producing track embeddings over other techniques such as LSA. This can be explained by the fact that the Word2Vec embeddings used in ANNW naturally preserve more information about neighboring tracks, which is better suited to a sequential music recommendation task. Following that, we present individual findings for the lyrical and acoustic modalities, with lyrics outperforming acoustic features. The rise in accuracy of the individual models demonstrates the importance of integrating modalities into sequential music recommendation models, since the modalities encapsulate additional information absent from the song embeddings. Finally, the performance of models using individual modalities was inferior to a fusion model that incorporates all of them. This illustrates that, while the lyrical and acoustic modalities might have some overlapping information, they also provide complementary information that should be exploited in fusion models.

Subset Dataset: The experiments were also conducted on the subset dataset to fuse the tag-based modality. Incorporating yet another modality yielded even better results. We did not include tags in our final model, depicted in Fig. 2, because the workable dataset shrinks considerably when tags are required. Hence, we provide results that demonstrate the efficacy of the model both with and without tag-based information.

We perform a study to assess the influence of users' psychological health on the performance of our model; for this, we divide users into two groups: At-risk and No-risk. Following the approach used in Surana et al. [13], users with a K-10 score of less than 20 were classified as No-risk, while those with a K-10 score greater than 29 were classified as At-risk. This resulted in 142 At-risk users and 193 No-risk users. To provide a balanced comparison, we further show evaluations on 142 of the No-risk users, selected after sorting them in decreasing order of K-10 values. Owing to its superior performance, the ANNW + Acoustic + Lyrics model was trained separately on each set of users. The results of this study can be found in Table IV. As hypothesized, the model trained on At-risk users demonstrates higher prediction accuracy than both sets of No-risk users. This highlights the disparity in listening behavior between the two groups; future research might look for more definitive trends in this direction.

Our work in this paper focuses on using multiple modalities for sequential music recommendation and outperforms existing state-of-the-art models. When using embeddings from a single modality, the results show that lyrics alone perform better than the other modalities. Fusing modalities proves to be the best, as demonstrated by the performance of the ANNW + Acoustic + Lyrics + Tags model, albeit on the subset dataset.
There are certain limitations to designing a multimodal RS. Firstly, although the majority of tracks in an individual's listening history contain lyrics, not all do. Despite this, our approach demonstrates superior performance. Another limitation is that user-defined tags are not available for every track. One way to circumvent this is to use NLP techniques such as topic modelling and sentiment analysis to extract relevant information from lyrics. This further underlines the importance of incorporating lyrics when designing music RS. Furthermore, using acoustic features to identify genres and emotions can provide additional tags in cases of missing lyrics and tags. In the future, it remains to be seen whether tags generated via lyrics or acoustic features give comparable results. Finally, our case study on users categorized as At-risk for depression highlights the importance of individual differences in online music consumption. The higher accuracy can be attributed to the repetitive listening behavior associated with the At-risk group, as demonstrated by several studies [13], [37]. This emphasizes the importance of integrating user states and traits into recommendation systems in order to create more personalized recommendations.

References:
[1] Social information filtering: Algorithms for automating "word of mouth".
[2] Attentive neural architecture incorporating song features for music recommendation.
[3] A deep multimodal approach for cold-start music recommendation.
[4] Culture-aware music recommendation, ser. UMAP '18.
[5] A music recommendation system based on acoustic features and user personalities.
[6] Multi-modal music emotion recognition: A new dataset, methodology and comparative analysis.
[7] Acoustic features influence musical choices across multiple genres.
[8] PEIA: Personality and emotion integrated attentive model for music recommendation on social media platforms.
[9] Linking music listening on Spotify and personality.
[10] Personality traits and music genres: What do people prefer to listen to?, ser. UMAP '17.
[11] Transformer-based approach towards music emotion recognition from lyrics.
[12] Tag2Risk: Harnessing social music tags for characterizing depression risk.
[13] Static and dynamic measures of active music listening as indicators of depression risk.
[14] Multimodal emotion recognition in Polish (student consortium).
[15] Multidomain multimodal fusion for human action recognition using inertial sensors.
[16] Towards more conversational and collaborative recommender systems.
[17] Deep content-based music recommendation.
[18] Heterogeneous knowledge-based attentive neural networks for short-term music recommendations.
[19] Distributed representations of sentences and documents.
[20] Lyric-based music recommendation.
[21] Lyrics or audio for music recommendation.
[22] BERT: Pre-training of deep bidirectional transformers for language understanding.
[23] Attention is all you need.
[24] Session-based recommendations with recurrent neural networks.
[25] Short screening scales to monitor population prevalences and trends in non-specific psychological distress.
[26] Explicit modelling of the implicit short term user preferences for music recommendation.
[27] Efficient estimation of word representations in vector space.
[28] Contextual and sequential user embeddings for large-scale music recommendation.
[29] Software framework for topic modelling with large corpora.
[30] Auto-encoding variational Bayes.
[31] Sentence-BERT: Sentence embeddings using Siamese BERT-networks.
[32] Enriching word vectors with subword information.
[33] Neural machine translation by jointly learning to align and translate.
[34] Adam: A method for stochastic optimization.
[35] PyTorch: An imperative style, high-performance deep learning library.
[36] Exploiting contextual information from event logs for personalized recommendation.
[37] Development and validation of the Healthy-Unhealthy Music Scale.