Ontology matching systems are typically compared by averaging their performance over multiple datasets. Because averaging is statistically unsafe and inappropriate, this paper instead compares alignment systems using statistical inference. The statistical tests for comparing two or multiple alignment systems are reviewed theoretically and empirically. For comparing two systems, the Wilcoxon signed-rank test and McNemar's mid-p and asymptotic tests are recommended for their robustness and statistical safety in different circumstances. For comparing multiple systems, the Friedman and Quade tests and their corresponding post-hoc procedures are studied, and their advantages and disadvantages are discussed. The statistical methods are then applied to the benchmark and multifarm tracks of the Ontology Alignment Evaluation Initiative (OAEI) 2015, and the results are reported and visualized with critical difference diagrams.
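
As an illustration of the two-system comparison mentioned in the abstract, the following is a minimal sketch of McNemar's mid-p and asymptotic tests, written from their standard textbook definitions rather than taken from the paper. The function names and the discordant counts `b` and `c` are hypothetical: `b` is assumed to be the number of correspondences that only the first system gets right, and `c` the number that only the second system gets right.

```python
from math import comb, erfc, sqrt


def mcnemar_mid_p(b: int, c: int) -> float:
    """Two-sided McNemar mid-p value from the discordant counts b and c.

    Under the null hypothesis the smaller discordant count follows
    Binomial(b + c, 0.5); the mid-p value is twice the lower tail
    minus the point probability at the observed count.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    scale = 0.5 ** n
    pmf_k = comb(n, k) * scale
    cdf_k = sum(comb(n, x) for x in range(k + 1)) * scale
    return 2.0 * cdf_k - pmf_k


def mcnemar_asymptotic(b: int, c: int) -> float:
    """Asymptotic McNemar p-value (chi-squared, 1 degree of freedom,
    no continuity correction)."""
    n = b + c
    if n == 0:
        return 1.0
    chi2 = (b - c) ** 2 / n
    # Survival function of chi-squared with 1 df: erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2.0))


if __name__ == "__main__":
    # Illustrative counts only, not results from the paper.
    b, c = 10, 2
    print("mid-p:", mcnemar_mid_p(b, c))
    print("asymptotic:", mcnemar_asymptotic(b, c))
```

Only the discordant pairs enter either test, so correspondences that both systems get right, or both get wrong, do not affect the p-values.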

Original language: English
Pages (from-to): 1-14
Journal: IEEE Transactions on Knowledge and Data Engineering
Publication status: Accepted/In press - 29 May 2018

    Research areas

  • Benchmark testing, Bergmann, Friedman, Geoscience, Holm, McNemar, Nemenyi, Ontologies, Ontology alignment evaluation, paired t-test, post-hoc, Quade, Robustness, Shaffer, Statistical analysis, Task analysis, Wilcoxon signed-rank

ID: 45459279