
In test-collection-based evaluation of IR systems, score standardization has been proposed to compare systems across collections and to minimize the effect of outlier runs on specific topics. The underlying idea is to account for topic difficulty, so that systems are scored relative to it. Webber et al. first proposed standardization via a non-linear transformation based on the standard normal distribution, and more recently Sakai proposed a simple linear transformation. In this paper, we show that both approaches are special cases of a simple standardization scheme that assumes a specific distribution for the per-topic scores. From this viewpoint, we argue that a transformation based on the empirical distribution is the most appropriate choice for this kind of standardization. Through a series of experiments on TREC data, we show the benefits of our proposal in terms of score stability and statistical test behavior.
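As a rough illustration of the three transformations the abstract contrasts (a minimal sketch, not the paper's implementation; the function names and the use of the population standard deviation are assumptions), per-topic standardization can be written as:

```python
import math

def linear_standardize(scores):
    """Linear standardization (Sakai-style): per-topic z-scores,
    computed from the mean and standard deviation of all systems'
    scores on one topic."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    return [(s - mean) / sd for s in scores]

def normal_standardize(scores):
    """Non-linear standardization (Webber et al.-style): map each
    z-score through the standard normal CDF, yielding values in (0, 1)."""
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return [phi(z) for z in linear_standardize(scores)]

def empirical_standardize(scores):
    """Empirical-distribution standardization: each score becomes its
    fraction of observed per-topic scores at or below it, i.e. the
    empirical CDF evaluated at the score."""
    n = len(scores)
    return [sum(1 for t in scores if t <= s) / n for s in scores]
```

For example, for per-topic scores `[0.2, 0.5, 0.8]`, the linear version centers the middle system at 0, the normal-CDF version maps it to 0.5, and the empirical version assigns the ranks 1/3, 2/3, and 1. Both earlier proposals thus reduce to "standardize against an assumed score distribution", which is the viewpoint the paper exploits.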
Original language: English
Title of host publication: SIGIR'19 Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
Place of publication: New York, USA
Publisher: ACM DL
Pages: 1061–1064
Number of pages: 4
ISBN (Print): 978-1-4503-6172-9
Publication status: Published - 2019
Event: SIGIR 2019: the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval - Cité des Sciences, Paris, France
Duration: 21 Jul 2019 – 25 Jul 2019
https://sigir.org/sigir2019/

Conference

Conference: SIGIR 2019: the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
Abbreviated title: SIGIR '19
Country: France
City: Paris
Period: 21/07/19 – 25/07/19

Research areas

• Statistical significance, Student's t-test, Wilcoxon test, Sign test, Bootstrap, Permutation, Simulation, Type I and Type II errors
