Decoding Music from Brain Activity: Exploring the Neural Correlates of Music Perception
Matteo Ferrante*, Matteo Ciferri*, Nicola Toschi
Encoding Pipeline
fMRI experiment: 5 participants listened to 540 songs while brain activity was recorded with fMRI. An encoding model of brain activity was built to predict audio-responsive regions from audio features extracted with the CLAP model. These responsive regions were then used as inputs for decoding models to decode music from brain activity.
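A minimal sketch of the feature-extraction step follows, assuming the LAION CLAP checkpoint served through Hugging Face transformers; the poster does not specify which CLAP implementation, checkpoint, or audio preprocessing was actually used.

```python
import librosa
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def clap_embedding(wav_path: str) -> torch.Tensor:
    """Return one CLAP audio embedding for a 15 s music clip."""
    audio, _ = librosa.load(wav_path, sr=48_000)  # this checkpoint expects 48 kHz audio
    inputs = processor(audios=audio, sampling_rate=48_000, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_audio_features(**inputs)  # shape (1, 512) for this checkpoint
    return emb.squeeze(0)
```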
Decoding Pipeline
Decoding model: activity from the music-responsive regions is decoded with a retrieval system that outputs a musical genre and a candidate song.
Abstract
This study investigates the relationship between music and brain activity patterns, aiming to bridge the gap between music perception and its neural representation. We leverage the GTZAN music fMRI dataset, encompassing 5 subjects who listened to 540 tracks (15 s each) from 10 genres while undergoing 3T fMRI scans (TR = 1.5 s). Despite the limits of fMRI's temporal resolution, music elicits robust brain responses.
To explore this concept, we constructed a decoding pipeline capable of retrieving musical information from brain activity. Preprocessing was conducted with fMRIPrep. An encoding model was then built, using CLAP to transform audio into feature representations, followed by ridge regression to predict brain activity (averaged across 15 s listening blocks). This model served to identify brain regions responsive to music.
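A minimal sketch of the encoding and voxel-selection steps, assuming precomputed CLAP embeddings X of shape (n_blocks, embed_dim) and block-averaged BOLD responses Y of shape (n_blocks, n_voxels); the regularization grid, train/test split, and correlation threshold are illustrative assumptions, not the authors' reported settings.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def fit_encoding_model(X, Y, threshold=0.2):
    """Fit a CLAP-features -> voxels ridge regression and flag responsive voxels."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
    model = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X_tr, Y_tr)
    Y_hat = model.predict(X_te)
    # Per-voxel Pearson correlation between predicted and measured activity.
    r = np.array([np.corrcoef(Y_hat[:, v], Y_te[:, v])[0, 1]
                  for v in range(Y.shape[1])])
    responsive = r > threshold  # these voxels feed the decoding models
    return model, responsive
```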
Voxel values from these responsive regions were then used in three experiments, including retrieval of a musical genre and a candidate song from brain activity (see Decoding Pipeline).
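The retrieval experiment could then look roughly like the sketch below: a learned linear map from responsive voxels into CLAP space, followed by cosine-similarity ranking over candidate songs. Both the linear decoder and the similarity metric are assumptions, and all names here are hypothetical.

```python
import numpy as np

def retrieve_song(voxels, decoder_W, song_embeddings, song_genres):
    """voxels: (n_voxels,) pattern for one 15 s block.
    decoder_W: (n_voxels, embed_dim) brain-to-CLAP mapping (e.g., ridge weights).
    song_embeddings: (n_songs, embed_dim) CLAP embeddings of candidate songs.
    song_genres: n_songs genre labels."""
    pred = voxels @ decoder_W                      # predicted CLAP embedding
    pred /= np.linalg.norm(pred)
    cand = song_embeddings / np.linalg.norm(song_embeddings, axis=1, keepdims=True)
    best = int(np.argmax(cand @ pred))             # nearest-neighbor candidate
    return song_genres[best], best                 # decoded genre and song index
```

Because the top-ranked candidate carries a genre label, genre decoding falls out of the same retrieval step.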
These findings demonstrate the feasibility of decoding musical information from brain activity patterns. While further research is needed to refine the decoding process, this approach holds promise for advancing our understanding of music perception and its potential applications in music therapy and other domains.
Figure: example stimulus vs. decoded pairs for Jazz, Metal, Disco, and Pop clips.