Statistical Analysis of Results in Music Information Retrieval: Why and How

Julián Urbano, Arthur Flexer

Research output: Contribution to conference › Abstract › Scientific


Abstract

Nearly since the beginning, the ISMIR and MIREX communities have promoted rigor in experimentation through the creation of datasets and the practice of statistical hypothesis testing to determine the reliability of the improvements observed on those datasets. In fact, MIR researchers have adopted a particular way of going about statistical testing, namely non-parametric approaches like the Friedman test and multiple-comparisons corrections like Tukey's. In a way, these have become a standard for reporting and judging results among researchers, reviewers, committees, journal editors, etc. It is nowadays increasingly common to require statistically significant improvements over a baseline on a well-established dataset.
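
As an illustration of the kind of analysis the abstract refers to, the following is a minimal Python sketch of an omnibus Friedman test over per-track scores of several systems, followed by pairwise post-hoc comparisons. The data and system names are invented for illustration, and the post-hoc step here uses pairwise Wilcoxon signed-rank tests with a Holm correction as a stand-in; MIREX's actual procedure is based on Tukey's HSD-style comparisons on ranks.

```python
# Hedged sketch: Friedman omnibus test + pairwise post-hoc tests on
# hypothetical per-track performance scores (not the tutorial's own code).
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_tracks = 50
# Hypothetical per-track scores: one array per system, one value per track.
scores = {
    "baseline": rng.uniform(0.40, 0.70, n_tracks),
    "system_A": rng.uniform(0.45, 0.75, n_tracks),
    "system_B": rng.uniform(0.50, 0.80, n_tracks),
}

# Omnibus test: do the systems differ at all across tracks?
stat, p_omnibus = friedmanchisquare(*scores.values())
print(f"Friedman chi2 = {stat:.2f}, p = {p_omnibus:.4f}")

# Pairwise post-hoc tests, corrected for multiple comparisons (Holm).
pairs = list(combinations(scores, 2))
p_raw = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
reject, p_adj, _, _ = multipletests(p_raw, alpha=0.05, method="holm")
for (a, b), p, r in zip(pairs, p_adj, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, significant = {r}")
```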

But hypothesis testing can be very misleading if not well understood. To many researchers, especially newcomers, even the simplest analyses and tests appear as a black box into which one feeds performance scores and out of which comes a p-value that, as they are told, must be smaller than 0.05. Significance tests are therefore partly responsible for determining what gets published, which research lines are pursued, and which projects are funded, so it is very important to understand what they really mean and how they should be carried out and interpreted. We will also focus on experimental validity, and will show how a lack of internal or external validity can render even your best results invalid, even if the experiments are reliable and repeatable and hypothesis testing is done correctly. Problems discussed include adversarial examples and the lack of inter-rater agreement when annotating ground-truth data.
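
The inter-rater agreement issue can be made concrete with a standard chance-corrected agreement measure such as Cohen's kappa. The sketch below uses invented annotations and is only an illustration of the idea, not material from the tutorial: if annotators barely agree, no significance test on system scores can compensate for the unreliable ground truth.

```python
# Hedged sketch: Cohen's kappa between two hypothetical annotators of
# structural segment labels (data invented for illustration).
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["verse", "chorus", "chorus", "bridge", "verse", "chorus"]
annotator_2 = ["verse", "chorus", "verse",  "bridge", "verse", "bridge"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```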

The goal of this tutorial is to help MIR researchers and developers gain a better understanding of how these statistical methods work and how they should be interpreted. Starting from the very beginning of the evaluation process, it will show that statistical analysis is always required, but that too much focus on it, or an incorrect approach, is simply harmful. The tutorial will attempt to provide better insight into the statistical analysis of results, present better solutions and guidelines, and point attendees to the larger but often ignored problems of evaluation and reproducibility in MIR.
Original language: English
Number of pages: 4
Publication status: Published - 2018
Event: ISMIR 2018: 19th International Society for Music Information Retrieval Conference - Paris, France
Duration: 23 Sept 2018 - 27 Sept 2018
Conference number: 19

Conference

Conference: ISMIR 2018: 19th International Society for Music Information Retrieval Conference
Abbreviated title: ISMIR 2018
Country/Territory: France
City: Paris
Period: 23/09/18 - 27/09/18

Bibliographical note

Accepted author manuscript

