Training machine learning (ML) models for natural language processing usually requires large amounts of data, often acquired through crowdsourcing. The way this data is collected and aggregated can affect the outputs of the trained model, for example by ignoring labels that differ from the majority. In this paper we investigate how label aggregation can bias ML results towards certain data samples, and we propose a methodology to highlight and mitigate this bias. Although our work is applicable to any kind of label aggregation for data subject to multiple interpretations, we focus on the bias introduced by majority voting on toxicity prediction over sentences. Our preliminary results indicate that we can mitigate the majority bias and increase prediction accuracy for minority opinions if we take the annotators' different labels into account when training adapted models, rather than relying on the aggregated labels.
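To make the contrast in the abstract concrete, the following is a minimal sketch, not the paper's implementation, of how majority voting discards minority labels while a per-label distribution keeps them. The sentence identifiers, the label names `toxic` / `not_toxic`, and the helper functions `majority_vote` and `label_distribution` are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


def label_distribution(labels):
    """Return the fraction of annotators who chose each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical crowdsourced annotations: five annotators per sentence.
annotations = {
    "sentence_a": ["toxic", "toxic", "toxic", "toxic", "toxic"],
    "sentence_b": ["toxic", "not_toxic", "not_toxic", "toxic", "not_toxic"],
}

for sentence_id, labels in annotations.items():
    # The aggregated label hides how contested the sentence was;
    # the distribution preserves the minority opinion.
    print(sentence_id, majority_vote(labels), label_distribution(labels))
```

In this toy data, `sentence_b` receives the single aggregated label `not_toxic` even though two of five annotators judged it toxic. A model trained on the distribution, or on the individual labels, would still see that disagreement.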
Original language: English
Title of host publication: Proceedings of the 1st Workshop on Subjectivity, Ambiguity and Disagreement in Crowdsourcing, and Short Paper Proceedings of the 1st Workshop on Disentangling the Relation Between Crowdsourcing and Bias Management
Editors: Lora Aroyo, Anca Dumitrache, Praveen Paritosh, Alex Quinn, Chris Welty, Alessandro Checco, Gianluca Demartini, Ujwal Gadiraju, Cristina Sarasua
Publisher: CEUR
Pages: 67-71
Number of pages: 5
Volume: 2276
Publication status: Published - 2018
Event: 1st Workshop on Subjectivity, Ambiguity and Disagreement in Crowdsourcing, and 1st Workshop on Disentangling the Relation Between Crowdsourcing and Bias Management - University of Zurich, Zurich, Switzerland
Duration: 5 Jul 2018 → …
https://sites.google.com/view/crowdbias

Publication series

Name: CEUR Workshop Proceedings
Volume: 2276
ISSN (Electronic): 1613-0073

Conference

Conference: 1st Workshop on Subjectivity, Ambiguity and Disagreement in Crowdsourcing, and 1st Workshop on Disentangling the Relation Between Crowdsourcing and Bias Management
Abbreviated title: SAD2018 CrowdBias2018
Country: Switzerland
City: Zurich
Period: 5/07/18 → …
Internet address

Research areas

• dataset bias, Machine Learning fairness, crowdsourcing, annotation aggregation

ID: 51570452