Named Entity Recognition and Typing (NER/NET) is a
challenging task, especially with long-tail entities such as the ones found
in scientific publications. These entities (e.g. “WebKB”, “StatSnowball”)
are rare, often relevant only in specific knowledge domains, yet important
for retrieval and exploration purposes. State-of-the-art NER approaches
employ supervised machine learning models, trained on expensive type-labeled
data laboriously produced by human annotators. A common
workaround is the generation of labeled training data from knowledge
bases; this approach is not suitable for long-tail entity types that are, by
definition, scarcely represented in KBs.
This paper presents an iterative approach for training NER and NET
classifiers in scientific publications that relies on minimal human input,
namely a small seed set of instances for the targeted entity type. We
introduce different strategies for training data extraction, semantic expansion,
and result entity filtering. We evaluate our approach on scientific
publications, focusing on the long-tail entity types Datasets and Methods in
computer science publications, and Proteins in biomedical publications.
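
The iterative loop described above (seed instances, training data extraction, candidate recognition, result filtering) can be illustrated with a minimal sketch. This is not the paper's implementation: a toy context-pattern matcher stands in for the trained NER classifier, a capitalization check stands in for the filtering strategies, and semantic expansion is omitted. All function names and the example corpus are hypothetical.

```python
import re
from collections import Counter


def tokenize(sentence):
    # Simple word tokenizer; keeps hyphenated and alphanumeric names whole.
    return re.findall(r"[A-Za-z][A-Za-z0-9-]*", sentence)


def contexts(tokens):
    # Yield (token, (previous word, next word)) for every interior token.
    for i in range(1, len(tokens) - 1):
        yield tokens[i], (tokens[i - 1].lower(), tokens[i + 1].lower())


def extract_training_sentences(sentences, known):
    # Training data extraction: keep sentences mentioning a known entity.
    return [s for s in sentences if any(t in known for t in tokenize(s))]


def learn_patterns(train_sentences, known):
    # Toy stand-in for training a classifier: count the contexts
    # in which known entities occur.
    patterns = Counter()
    for sentence in train_sentences:
        for token, ctx in contexts(tokenize(sentence)):
            if token in known:
                patterns[ctx] += 1
    return patterns


def propose_candidates(sentences, patterns, known):
    # Recognition: any unknown token seen in a learned context is a candidate.
    candidates = set()
    for sentence in sentences:
        for token, ctx in contexts(tokenize(sentence)):
            if ctx in patterns and token not in known:
                candidates.add(token)
    return candidates


def filter_candidates(candidates):
    # Toy result filter (assumption): keep only capitalized tokens.
    return {c for c in candidates if c[0].isupper()}


def bootstrap(sentences, seeds, max_iterations=3):
    known = set(seeds)
    for _ in range(max_iterations):
        train = extract_training_sentences(sentences, known)
        patterns = learn_patterns(train, known)
        new = filter_candidates(propose_candidates(sentences, patterns, known))
        if not new:
            break  # nothing new was found, so stop iterating
        known |= new
    return known


corpus = [
    "We evaluate on the WebKB dataset.",
    "Results on the Cora dataset are similar.",
    "We train using the StatSnowball method.",
    "We split the training dataset randomly.",
]
found = bootstrap(corpus, {"WebKB"})
print(sorted(found))  # ['Cora', 'WebKB']
```

Starting from the single seed "WebKB", the sketch picks up "Cora" because it appears in the same context; "training" is proposed but removed by the filter, and "StatSnowball" is never proposed because its context belongs to a different entity type.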
Original language: English
Title of host publication: Int. Semantic Web Conference (ISWC)
Place of Publication: Monterey, USA
Publication status: Published - Oct 2018

ID: 45302869