Oracle Issues in Machine Learning and Where to Find Them

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

6 Citations (Scopus)
219 Downloads (Pure)

Abstract

The rise in popularity of machine learning (ML), and deep learning in particular, has led both to optimism about the achievements of artificial intelligence and to concerns about possible weaknesses and vulnerabilities of ML pipelines. Within the software engineering community, this has led to a considerable body of work on ML testing techniques, including white- and black-box testing for ML models. This means the oracle problem needs to be addressed. For supervised ML applications, oracle information is indeed available in the form of dataset 'ground truth', which encodes input data with corresponding desired output labels. However, while ground truth forms a gold standard, there is still no guarantee that it is truly correct. Indeed, syntactic, semantic, and conceptual framing issues in the oracle may negatively affect the ML system's integrity. While syntactic issues may be verified and corrected automatically, the higher-level issues traditionally need human judgment and manual analysis.
In this paper, we employ two heuristics based on information entropy and semantic analysis on well-known computer vision models and benchmark data from ImageNet. The heuristics are used to semi-automatically uncover potential higher-level issues in (i) the label taxonomy used to define the ground truth oracle (labels), and (ii) data encoding and representation. In doing this, beyond existing ML testing efforts, we illustrate the need for software engineering strategies that especially target and assess the oracle.
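As a rough illustration of what an entropy-based heuristic could look like, the sketch below computes the Shannon entropy of a classifier's predicted class probabilities and flags samples whose predictions are spread across several labels. This is only a minimal sketch under stated assumptions: the abstract does not specify how the authors compute entropy, and the function names, the sample ids, the probability values, and the 1-bit threshold are all hypothetical.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in bits) of a probability vector; zero entries are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def flag_uncertain(predictions, threshold_bits=1.0):
    """Return ids of samples whose prediction entropy exceeds the threshold.

    `predictions` maps a sample id to a list of class probabilities
    (for example, a softmax output). The threshold is an assumed value.
    """
    return [
        sample_id
        for sample_id, probs in predictions.items()
        if prediction_entropy(probs) > threshold_bits
    ]


# Hypothetical softmax outputs over four classes (illustrative values only).
predictions = {
    "img_a": [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    "img_b": [0.40, 0.35, 0.20, 0.05],  # mass spread over labels, high entropy
}

flagged = flag_uncertain(predictions)
print(flagged)
```

Samples flagged this way are only candidates for manual inspection: high entropy may point at overlapping or ambiguous labels in the taxonomy, but it may equally reflect a weak model, so it does not prove an oracle issue by itself.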
Original language: English
Title of host publication: Proceedings of the 8th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE)
Pages: 483-488
Number of pages: 6
ISBN (Electronic): 978-1-4503-7963-2
Publication status: Published - 2020
Event: 42nd International Conference on Software Engineering: ICSE 2020 - Seoul, Korea, Republic of
Duration: 27 Jun 2020 → 19 Jul 2020

Publication series

Name: ICSEW'20

Conference

Conference: 42nd International Conference on Software Engineering
Country/Territory: Korea, Republic of
City: Seoul
Period: 27/06/20 → 19/07/20
Other: Virtual/online event due to COVID-19; online presentations

Bibliographical note

Virtual/online event due to COVID-19. Paper presented online 29-6-2020
