Music Information Retrieval beyond Audio: A Vision-based Approach for Real-world Data

Alessio Bazzica

Research output: Thesis › Dissertation (TU Delft)


Abstract

Digital music platforms have recently become the primary revenue stream for recorded music, making record labels and content owners increasingly interested in developing new digital features for their users.
Besides listening to expert-curated playlists and automatically recommended music, users can also benefit from a more informative, non-linearly accessible experience accommodating multiple perspectives on the content.
To give some examples of such enriched experiences: an alternative version of a piece can be suggested automatically, users can skip through a long classical music piece guided by a visualization of its structure (e.g., movements, recurring themes), or they can switch viewpoints while watching a music video instead of sticking to the editor's choice.

Developing such features requires innovation of automated content-based methods that extract musical knowledge. Traditionally, Music Information Retrieval (Music IR) researchers have tackled this problem mostly from an audio-only perspective.
Several works have shown, however, that other types of data, such as social tags, listening behaviors, and symbolic music scores, can greatly improve the performance of audio-only algorithms, or even enable tasks that cannot be solved using audio alone.

In this thesis, we focus on the relatively unexplored field of vision-based Music IR, which studies how to analyze the visual channel accompanying a music recording in order to learn more about the music piece being performed.
Several existing methods require obtrusive setups, such as 3D motion capture systems, which are not applicable in professional environments (e.g., during a live classical music concert). Other methods rely instead on favorable viewpoints, static cameras, and uniform backgrounds to simplify the analysis of the musicians' movements.
In both cases, the devised algorithms may not be suitable for commercial music platforms, especially those dealing with real-world data, that is, unstructured and unconstrained music videos.
We therefore consider tasks, algorithms, and datasets with real-world data challenges in mind, advancing the state of the art in two ways: (i) we investigate how to process videos of a single musician, aiming to extract musically relevant cues that can be exploited to solve existing as well as new Music IR problems, and (ii) we address the challenging case of large ensembles, proposing a way to parse complex scenes and link musician-wise cues to identity and instrumental-part annotations.

In more detail, this thesis first presents a global motion feature that aims to represent musicians' movements over time.
While lightweight and instrument-generic, it shows limitations in the presence of camera motion.
For this reason, we switch to detecting "playing/non-playing" (P/NP) labels, which can be estimated from different viewpoints and at different scales, and which can be used to encode the instrumentation of a performance over time.
We first show the value of this semantic feature by demonstrating that it allows a symbolic music score to be roughly synchronized with a performance recording.
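As a minimal illustration of this idea, the sketch below aligns a score-derived P/NP matrix to one estimated from video using plain dynamic time warping; the function name, the Hamming-distance cost, and the input format are assumptions made for the example, not the exact procedure used in the thesis.

import numpy as np

def dtw_align(score_pnp, video_pnp):
    # score_pnp: (n_score_steps, n_players) binary P/NP states derived from the score
    # video_pnp: (n_video_steps, n_players) binary P/NP states estimated from the video
    n, m = len(score_pnp), len(video_pnp)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.sum(score_pnp[i - 1] != video_pnp[j - 1])  # Hamming distance between P/NP states
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]  # list of (score_step, video_step) correspondences

The returned correspondences give only a rough, instrumentation-level alignment, which is exactly the coarse synchronization that P/NP labels can support.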
We then focus on the visual analysis of videos of large classical music ensembles, presenting a semi-automatic framework for P/NP annotation.
The experiments show that video face clustering is a critical problem to solve; we therefore present a novel method that exploits the quasi-static scene properties of classical music videos to generate better face clusters by relying on an automatically built map of the scene.
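The snippet below gives a simplified, single-viewpoint impression of how spatial stability can support face clustering: per-track appearance descriptors and average face positions are blended into one distance matrix before hierarchical clustering. The weighting, threshold, and input representations are illustrative assumptions and do not reproduce the scene-map construction described in the thesis.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_face_tracks(appearance, positions, alpha=0.5, threshold=0.6):
    # appearance: (n_tracks, d) face descriptors, one per face track
    # positions:  (n_tracks, 2) mean face location in normalized frame coordinates,
    #             informative because musicians barely move in a quasi-static scene
    d_app = pdist(appearance, metric="cosine")        # appearance dissimilarity
    d_pos = pdist(positions, metric="euclidean")      # spatial dissimilarity
    dist = alpha * d_app + (1.0 - alpha) * d_pos      # blended condensed distance matrix
    tree = linkage(dist, method="average")            # average-linkage hierarchical clustering
    return fcluster(tree, t=threshold, criterion="distance")  # cluster label per track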
Finally, we address the challenging problem of detecting note onsets in clarinetist videos as a case study for woodwind and brass instruments. We propose a novel convolutional network architecture based on multiple streams and the absence of temporal pooling, aiming to capture the fine spatio-temporal information conveyed by finger movements.
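A rough sketch of such an architecture is given below, assuming PyTorch, two hypothetical input streams (for instance RGB hand crops and optical flow), and 3D convolutions whose pooling acts only on the spatial dimensions; the layer sizes and stream choices are placeholders, not the configuration used in the thesis.

import torch
import torch.nn as nn

class MultiStreamOnsetNet(nn.Module):
    """Illustrative multi-stream 3D CNN; pooling acts on height/width only,
    so the temporal resolution needed for per-frame onset scores is kept."""

    def __init__(self, stream_channels=(3, 2)):
        super().__init__()

        def make_stream(c_in):
            return nn.Sequential(
                nn.Conv3d(c_in, 16, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),      # spatial pooling only, no temporal pooling
                nn.Conv3d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d((None, 1, 1)),       # collapse height and width, keep all frames
            )

        self.streams = nn.ModuleList(make_stream(c) for c in stream_channels)
        self.head = nn.Conv1d(32 * len(stream_channels), 1, kernel_size=1)

    def forward(self, *clips):
        # each clip: (batch, channels, frames, height, width)
        feats = [s(x).flatten(start_dim=2) for s, x in zip(self.streams, clips)]
        fused = torch.cat(feats, dim=1)                    # (batch, features, frames)
        return torch.sigmoid(self.head(fused)).squeeze(1)  # per-frame onset probability

Keeping the frame axis intact through the network is the point of omitting temporal pooling: each output score stays tied to a specific video frame, which is what fine-grained onset detection requires.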

Our proposed methods, outcomes, and envisioned applications show that real-world music videos are an unexploited asset rather than a problem to avoid.
Furthermore, the light this thesis sheds on vision-based Music IR gives several indications of where future Computer Vision and Music IR research agendas can meet, bringing further innovation to the digital music platform market.
Original language: English
Awarding Institution
  • Delft University of Technology
Supervisors/Advisors
  • Hanjalic, A., Supervisor
  • Liem, C.C.S., Supervisor
Award date: 15 Dec 2017
Print ISBNs: 978-94-6299-807-0
DOIs
Publication status: Published - 2017

Keywords

  • music information retrieval
  • computer vision
  • cross-modal analysis
