Standard

A New Perspective on Score Standardization. / Urbano, Julián; De Lima, Harlley; Hanjalic, Alan.

SIGIR'19 Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM DL, 2019. p. 1061-1064.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Harvard

Urbano, J, De Lima, H & Hanjalic, A 2019, A New Perspective on Score Standardization. in SIGIR'19 Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM DL, New York, USA, pp. 1061-1064, SIGIR 2019: the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21/07/19. https://doi.org/10.1145/3331184.3331315

APA

Urbano, J., De Lima, H., & Hanjalic, A. (2019). A New Perspective on Score Standardization. In SIGIR'19 Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1061-1064). New York, USA: ACM DL. https://doi.org/10.1145/3331184.3331315

Vancouver

Urbano J, De Lima H, Hanjalic A. A New Perspective on Score Standardization. In SIGIR'19 Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM DL. 2019. p. 1061-1064. https://doi.org/10.1145/3331184.3331315

Author

Urbano, Julián ; De Lima, Harlley ; Hanjalic, Alan. / A New Perspective on Score Standardization. SIGIR'19 Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM DL, 2019. pp. 1061-1064

BibTeX

@inproceedings{d2ee6a8fb79d41b097311bdc333c831f,
title = "A New Perspective on Score Standardization",
abstract = "In test collection based evaluation of IR systems, score standardization has been proposed to compare systems across collections and minimize the effect of outlier runs on specific topics. The underlying idea is to account for the difficulty of topics, so that systems are scored relative to it. Webber et al. first proposed standardization through a non-linear transformation with the standard normal distribution, and recently Sakai proposed a simple linear transformation. In this paper, we show that both approaches are actually special cases of a simple standardization which assumes specific distributions for the per-topic scores. From this viewpoint, we argue that a transformation based on the empirical distribution is the most appropriate choice for this kind of standardization. Through a series of experiments on TREC data, we show the benefits of our proposal in terms of score stability and statistical test behavior.",
keywords = "Statistical significance, Student’s t-test, Wilcoxon test, Sign test, Bootstrap, Permutation, Simulation, Type I and Type II errors",
author = "Juli{\'a}n Urbano and {De Lima}, Harlley and Alan Hanjalic",
year = "2019",
doi = "10.1145/3331184.3331315",
language = "English",
isbn = "978-1-4503-6172-9",
pages = "1061--1064",
booktitle = "SIGIR'19 Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval",
publisher = "ACM DL",

}

RIS

TY - GEN

T1 - A New Perspective on Score Standardization

AU - Urbano, Julián

AU - De Lima, Harlley

AU - Hanjalic, Alan

PY - 2019

Y1 - 2019

N2 - In test collection based evaluation of IR systems, score standardization has been proposed to compare systems across collections and minimize the effect of outlier runs on specific topics. The underlying idea is to account for the difficulty of topics, so that systems are scored relative to it. Webber et al. first proposed standardization through a non-linear transformation with the standard normal distribution, and recently Sakai proposed a simple linear transformation. In this paper, we show that both approaches are actually special cases of a simple standardization which assumes specific distributions for the per-topic scores. From this viewpoint, we argue that a transformation based on the empirical distribution is the most appropriate choice for this kind of standardization. Through a series of experiments on TREC data, we show the benefits of our proposal in terms of score stability and statistical test behavior.

AB - In test collection based evaluation of IR systems, score standardization has been proposed to compare systems across collections and minimize the effect of outlier runs on specific topics. The underlying idea is to account for the difficulty of topics, so that systems are scored relative to it. Webber et al. first proposed standardization through a non-linear transformation with the standard normal distribution, and recently Sakai proposed a simple linear transformation. In this paper, we show that both approaches are actually special cases of a simple standardization which assumes specific distributions for the per-topic scores. From this viewpoint, we argue that a transformation based on the empirical distribution is the most appropriate choice for this kind of standardization. Through a series of experiments on TREC data, we show the benefits of our proposal in terms of score stability and statistical test behavior.

KW - Statistical significance

KW - Student’s t-test

KW - Wilcoxon test

KW - Sign test

KW - Bootstrap

KW - Permutation

KW - Simulation

KW - Type I and Type II errors

UR - https://github.com/julian-urbano/sigir2019-standardization

U2 - 10.1145/3331184.3331315

DO - 10.1145/3331184.3331315

M3 - Conference contribution

SN - 978-1-4503-6172-9

SP - 1061

EP - 1064

BT - SIGIR'19 Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

PB - ACM DL

CY - New York, USA

ER -

ID: 55636471
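The abstract above frames Webber et al.'s normal-CDF standardization and Sakai's linear transformation as special cases of per-topic standardization, and argues for the empirical distribution instead. A minimal sketch of the three variants for one topic's scores across systems, assuming a z-score base and illustrative constants (0.15, 0.5) for the linear clipping; the function name and constants are assumptions for illustration, not taken from the paper:

```python
import math
from statistics import mean, stdev

def standardize_topic(scores, method="empirical"):
    """Standardize one topic's scores across systems.

    scores: list of raw effectiveness scores, one per system.
    Returns standardized scores in [0, 1] for all three methods.
    """
    mu, sigma = mean(scores), stdev(scores)
    z = [(s - mu) / sigma for s in scores]          # per-topic z-scores
    if method == "normal":
        # Webber et al.: map z through the standard normal CDF
        return [0.5 * (1 + math.erf(v / math.sqrt(2))) for v in z]
    if method == "linear":
        # Sakai-style linear rescaling of z, clipped to [0, 1]
        # (0.15 and 0.5 are illustrative constants)
        return [min(1.0, max(0.0, 0.15 * v + 0.5)) for v in z]
    if method == "empirical":
        # empirical distribution: each score's rank among the topic's scores
        order = sorted(scores)
        n = len(scores)
        return [(order.index(s) + 1) / (n + 1) for s in scores]
    raise ValueError(f"unknown method: {method}")
```

For scores [0.2, 0.4, 0.6] the z-scores are [-1, 0, 1], so "linear" yields [0.35, 0.5, 0.65] and "empirical" yields [0.25, 0.5, 0.75] regardless of the raw scale, which is the collection-independence the abstract targets.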