Statistical Analysis of Results in Music Information Retrieval: Why and How

Julián Urbano, Arthur Flexer

Research output: Contribution to conference › Abstract › Scientific


Abstract

Nearly since the beginning, the ISMIR and MIREX communities have promoted rigor in experimentation through the creation of datasets and the practice of statistical hypothesis testing to determine the reliability of the improvements observed on those datasets. In fact, MIR researchers have adopted a particular way of going about statistical testing, namely non-parametric approaches like the Friedman test and multiple-comparisons corrections like Tukey's. In a way, these have become a standard for reporting and judging results among researchers, reviewers, committees, journal editors, etc. It is nowadays increasingly common to require statistically significant improvements over a baseline on a well-established dataset.
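
As an illustration of the kind of analysis the abstract refers to, the following is a minimal Python sketch of an omnibus Friedman test over per-track scores of several systems, followed by pairwise post-hoc comparisons. The data and system names are invented for illustration, and the post-hoc step here uses pairwise Wilcoxon signed-rank tests with a Holm correction as a stand-in; MIREX's actual procedure is based on Tukey's HSD-style comparisons on ranks.

```python
# Hedged sketch: Friedman omnibus test + pairwise post-hoc tests on
# hypothetical per-track performance scores (not the tutorial's own code).
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_tracks = 50
# Hypothetical per-track scores: one array per system, one value per track.
scores = {
    "baseline": rng.uniform(0.40, 0.70, n_tracks),
    "system_A": rng.uniform(0.45, 0.75, n_tracks),
    "system_B": rng.uniform(0.50, 0.80, n_tracks),
}

# Omnibus test: do the systems differ at all across tracks?
stat, p_omnibus = friedmanchisquare(*scores.values())
print(f"Friedman chi2 = {stat:.2f}, p = {p_omnibus:.4f}")

# Pairwise post-hoc tests, corrected for multiple comparisons (Holm).
pairs = list(combinations(scores, 2))
p_raw = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
reject, p_adj, _, _ = multipletests(p_raw, alpha=0.05, method="holm")
for (a, b), p, r in zip(pairs, p_adj, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, significant = {r}")
```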

But hypothesis testing can be very misleading if not well understood. To many researchers, especially newcomers, even the simplest analyses and tests appear as a black box into which one feeds performance scores and out of which comes a p-value that, as they are told, must be smaller than 0.05. Significance tests are therefore partly responsible for determining what gets published, which research lines are pursued, and which projects are funded, so it is very important to understand what they really mean and how they should be carried out and interpreted. We will also focus on experimental validity, and will show how a lack of internal or external validity can render even your best results invalid, even if the experiments are reliable and repeatable and hypothesis testing is done correctly. Problems discussed include adversarial examples and the lack of inter-rater agreement when annotating ground-truth data.
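
The inter-rater agreement issue can be made concrete with a standard chance-corrected agreement measure such as Cohen's kappa. The sketch below uses invented annotations and is only an illustration of the idea, not material from the tutorial: if annotators barely agree, no significance test on system scores can compensate for the unreliable ground truth.

```python
# Hedged sketch: Cohen's kappa between two hypothetical annotators of
# structural segment labels (data invented for illustration).
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["verse", "chorus", "chorus", "bridge", "verse", "chorus"]
annotator_2 = ["verse", "chorus", "verse",  "bridge", "verse", "bridge"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```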

The goal of this tutorial is to help MIR researchers and developers gain a better understanding of how these statistical methods work and how they should be interpreted. Starting from the very beginning of the evaluation process, it will show that statistical analysis is always required, but that too much focus on it, or an incorrect approach, is simply harmful. The tutorial will attempt to provide better insight into the statistical analysis of results, present better solutions and guidelines, and point attendees to the larger but often ignored problems of evaluation and reproducibility in MIR.
Original language: English
Number of pages: 4
Publication status: Published - 2018
Event: ISMIR 2018: 19th International Society for Music Information Retrieval Conference - Paris, France
Duration: 23 Sept 2018 - 27 Sept 2018
Conference number: 19

Conference

Conference: ISMIR 2018: 19th International Society for Music Information Retrieval Conference
Abbreviated title: ISMIR 2018
Country/Territory: France
City: Paris
Period: 23/09/18 - 27/09/18

Bibliographical note

Accepted author manuscript

