Towards Practical Active Learning for Classification

Yazhou Yang

doi:10.4233/uuid:12720d56-8a35-4b36-b287-2e301ae69bd0

Pattern Recognition and Bioinformatics

Research output: Thesis › Dissertation (TU Delft)

Abstract

In recent decades, the availability of large amounts of data has propelled the field of machine learning enormously. Machine learning, however, relies heavily on annotated data, typically labels indicating the class to which a data instance belongs. With such large amounts of data, this raises the question of how to annotate data efficiently, especially when resources are limited. This thesis addresses the particular challenge of using as few annotations as possible while maintaining good learning performance. To that end we use active learning, which iteratively chooses the most valuable instances and obtains their labels from an oracle (e.g. a human expert). Though many studies have demonstrated that active learning can reduce the annotation cost, several issues still limit its practical use. This thesis takes a further step towards making active learning more practical for real-world applications.
We first provide a benchmark and comparison of six different categories of active learning algorithms built on logistic regression. This work provides a better understanding of the underlying characteristics of various active learners and illustrates the potential benefits of such techniques, but it also reveals many cases in which active learning fails to outperform passive learning (i.e. randomly selecting instances for labeling). These failure cases motivate us to propose two novel active learning methods that show a clear advantage over passive learning. The first weights the so-called retraining-based criteria with an uncertainty score measured by the estimated posterior probability. The second measures the usefulness of unlabeled instances according to the variance of the predictive probability. This method takes an additional step towards practical active learning, clearly outperforming the current state of the art on binary and multi-class classification tasks.
We further consider two realistic issues that arise when applying active learning to real-world problems. One is how to find an initial set that contains at least one instance per class to start the active labeling cycle. The other is how to deal with the absence of human annotators in the interactive labeling loop. We propose new approaches to tackle these problems and observe good performance compared to existing methods. This thesis concludes with an analysis of the contributions and limitations of our work, as well as research directions that deserve further study.
We hope that this thesis also inspires others to make active learning more suitable for real-world applications.
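
As a rough illustration of the contrast drawn in the abstract between active and passive learning, the sketch below selects one instance from an unlabeled pool in two ways: by a generic least-confidence uncertainty score computed from estimated posterior probabilities, and uniformly at random. It is not the method proposed in the thesis. The function names, the least-confidence score, and the example posteriors are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


def uncertainty(posterior: Sequence[float]) -> float:
    """Least-confidence score: 1 minus the largest class probability."""
    return 1.0 - max(posterior)


def select_active(posteriors: List[Sequence[float]]) -> int:
    """Return the pool index whose estimated posterior is least confident."""
    return max(range(len(posteriors)), key=lambda i: uncertainty(posteriors[i]))


def select_passive(pool_size: int, rng: random.Random) -> int:
    """Passive learning baseline: pick a pool index uniformly at random."""
    return rng.randrange(pool_size)


# Hypothetical posteriors a classifier might assign to four unlabeled
# instances in a binary task (illustrative values, not from the thesis).
pool_posteriors = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.80, 0.20],
    [0.10, 0.90],
]

# Index of the instance that would be sent to the oracle for labeling.
active_choice = select_active(pool_posteriors)
passive_choice = select_passive(len(pool_posteriors), random.Random(0))
```

In a full labeling loop, the chosen instance would be labeled by the oracle, moved to the training set, and the classifier retrained before the next selection; that loop is omitted here to keep the sketch short.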

Original language	English
Qualification	Doctor of Philosophy
Awarding Institution	Delft University of Technology
Supervisors/Advisors	Loog, M. (Supervisor); Reinders, M.J.T. (Supervisor)
Thesis sponsors	Chinese Scholarship Council
Award date	20 Nov 2018
Print ISBNs	978-94-6380-102-7
DOIs	https://doi.org/10.4233/uuid:12720d56-8a35-4b36-b287-2e301ae69bd0
Publication status	Published - 2018

Access to Document

10.4233/uuid:12720d56-8a35-4b36-b287-2e301ae69bd0

YazhouYang_thesis_V2: Final published version, 2.23 MB

Cite this

@phdthesis{12720d568a354b36b2872e301ae69bd0,

title = "Towards Practical Active Learning for Classification",

author = "Yazhou Yang",

year = "2018",

doi = "10.4233/uuid:12720d56-8a35-4b36-b287-2e301ae69bd0",

language = "English",

isbn = "978-94-6380-102-7",

type = "Dissertation (TU Delft)",

school = "Delft University of Technology",

}

TY - THES

T1 - Towards Practical Active Learning for Classification

AU - Yang, Yazhou

PY - 2018

Y1 - 2018

U2 - 10.4233/uuid:12720d56-8a35-4b36-b287-2e301ae69bd0

DO - 10.4233/uuid:12720d56-8a35-4b36-b287-2e301ae69bd0

M3 - Dissertation (TU Delft)

SN - 978-94-6380-102-7

ER -
