  • Yazhou Yang
In recent decades, the availability of large amounts of data has propelled the field of machine learning enormously. Machine learning, however, relies heavily on the availability of annotated data, typically labels indicating to which class a data instance belongs. With such huge amounts of data, this raises the question of how to annotate data efficiently, particularly when resources are limited. This thesis addresses the particular challenge of using as few annotations as possible while maintaining good learning performance. To that end, we utilize active learning, which iteratively chooses the most valuable instances for which to obtain labels from an oracle (e.g. a human expert). Though many studies have demonstrated that active learning can reduce the annotation cost, there are still several issues that limit its practical use. This thesis takes a further step toward making active learning more practical for real-world applications.
We first provide a benchmark and comparison of six different categories of active learning algorithms built on logistic regression. This work provides a better understanding of the underlying characteristics of various active learners and illustrates the potential benefits of using such techniques, but it also reveals many cases in which active learning fails to outperform passive learning (i.e. randomly selecting instances for labeling). Those failed cases motivate us to propose two novel active learning methods that show a clear advantage over passive learning. The first one weights the so-called retraining-based criteria with an uncertainty score that is measured by the estimated posterior probability. The second one measures the usefulness of unlabeled instances according to the variance of the predictive probability. This method takes an additional step towards practical active learning, clearly outperforming the current state of the art on binary and multi-class classification tasks.
We further consider two realistic issues that arise when applying active learning to real-world problems. One is how to find an initial set that contains at least one instance per class to start the active labeling cycle. The other is how to deal with the absence of human annotators in the interactive labeling loop. We propose new approaches to tackle these problems and observe good performance compared to existing methods. This thesis concludes with an analysis of the contributions and limitations of our work, as well as research directions that deserve further study.
We hope that this thesis also inspires others to make active learning more suitable for real-world applications.
Original language: English
Qualification: Doctor of Philosophy
Thesis sponsors
  • Chinese Scholarship Council
Award date: 20 Nov 2018
Print ISBNs: 978-94-6380-102-7
Publication status: Published - 2018

ID: 47384205