A Semi-Automatic and Low-Cost Method to Learn Patterns for Named Entity Recognition

M. Marrero; J. Urbano

doi:10.1017/S135132491700016X

Multimedia Computing

Research output: Contribution to journal › Article › Scientific › peer-review

4 Citations (Scopus)

121 Downloads (Pure)

Abstract

Named Entity Recognition is a basic task in Information Extraction that aims at identifying entities of interest within full-text documents. The patterns used to recognize entities can be rule based, as in the popular JAPE system. However, hand-crafting effective patterns is often difficult, and yet there is little research devoted to methods capable of learning human-readable patterns, possibly with arbitrary sets of features. In this paper, we present a semi-automatic method to generate both regular expressions and a subset of the JAPE language. It does not need a corpus annotated beforehand. Instead, it employs active learning and combines clustering with an algorithm that finds alignments between symbols present in the entities discovered during the learning process. The method currently supports a fixed set of character features and an arbitrary set of token features, but it can incorporate other kinds of features as well. Through several experiments with an English corpus, we show the ability of the method to generate effective patterns at a low annotation cost, and how it can successfully help in the annotation of brand new corpora.
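
To make the idea of learning a human-readable pattern from example entities concrete, here is a minimal sketch. It is not the authors' method: it replaces the clustering and alignment steps described in the abstract with a much simpler grouping by character-class signature, and it omits active learning and JAPE output entirely. The function names (`char_class`, `generalize`, `learn_patterns`), the three character features, and the sample strings are all hypothetical.

```python
import re
from itertools import groupby


def char_class(ch):
    """Map one character to a coarse regex class (hypothetical feature set, ASCII-oriented)."""
    if ch.isdigit():
        return r"\d"
    if ch.isalpha() and ch.isupper():
        return "[A-Z]"
    if ch.isalpha() and ch.islower():
        return "[a-z]"
    # Anything else (punctuation, spaces) is kept as an escaped literal.
    return re.escape(ch)


def generalize(entity):
    """Turn an example entity into a regex by collapsing runs of the same class."""
    classes = [char_class(c) for c in entity]
    parts = []
    for cls, run in groupby(classes):
        # A run of two or more identical classes becomes "one or more".
        parts.append(cls + ("+" if len(list(run)) > 1 else ""))
    return "".join(parts)


def learn_patterns(examples):
    """Group examples by signature; each distinct signature is one candidate pattern."""
    return sorted({generalize(e) for e in examples})


# Hypothetical annotated entities: two flight-code-like strings and a date.
examples = ["KLM123", "BA4567", "12-05-2018"]
patterns = learn_patterns(examples)
```

In this sketch the two code-like examples collapse into a single pattern, which then also matches unseen strings of the same shape, for instance via `re.fullmatch(generalize("BA4567"), "XY99")`. A real system along the lines of the abstract would additionally have to merge near-identical signatures and ask an annotator to confirm or reject candidate matches.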

Original language	English
Pages (from-to)	39-75
Number of pages	37
Journal	Natural Language Engineering
Volume	24
Issue number	1
DOIs	https://doi.org/10.1017/S135132491700016X
Publication status	Published - 2018

Bibliographical note

Accepted author manuscript

Cite this

@article{deec8a9176914a24a4eaa3811dad487d,

title = "A Semi-Automatic and Low-Cost Method to Learn Patterns for Named Entity Recognition",

abstract = "Named Entity Recognition is a basic task in Information Extraction that aims at identifying entities of interest within full-text documents. The patterns used to recognize entities can be rule based, as in the popular JAPE system. However, hand-crafting effective patterns is often difficult, and yet there is little research devoted to methods capable of learning human-readable patterns, possibly with arbitrary sets of features. In this paper, we present a semi-automatic method to generate both regular expressions and a subset of the JAPE language. It does not need a corpus annotated beforehand. Instead, it employs active learning and combines clustering with an algorithm that finds alignments between symbols present in the entities discovered during the learning process. The method currently supports a fixed set of character features and an arbitrary set of token features, but it can incorporate other kinds of features as well. Through several experiments with an English corpus, we show the ability of the method to generate effective patterns at a low annotation cost, and how it can successfully help in the annotation of brand new corpora.",

author = "M. Marrero and J. Urbano",

note = "Accepted author manuscript",

year = "2018",

doi = "10.1017/S135132491700016X",

language = "English",

volume = "24",

pages = "39--75",

journal = "Natural Language Engineering",

issn = "1351-3249",

publisher = "Cambridge University Press",

number = "1",

}

TY - JOUR

T1 - A Semi-Automatic and Low-Cost Method to Learn Patterns for Named Entity Recognition

AU - Marrero, M.

AU - Urbano, J.

N1 - Accepted author manuscript

PY - 2018

Y1 - 2018

N2 - Named Entity Recognition is a basic task in Information Extraction that aims at identifying entities of interest within full-text documents. The patterns used to recognize entities can be rule based, as in the popular JAPE system. However, hand-crafting effective patterns is often difficult, and yet there is little research devoted to methods capable of learning human-readable patterns, possibly with arbitrary sets of features. In this paper, we present a semi-automatic method to generate both regular expressions and a subset of the JAPE language. It does not need a corpus annotated beforehand. Instead, it employs active learning and combines clustering with an algorithm that finds alignments between symbols present in the entities discovered during the learning process. The method currently supports a fixed set of character features and an arbitrary set of token features, but it can incorporate other kinds of features as well. Through several experiments with an English corpus, we show the ability of the method to generate effective patterns at a low annotation cost, and how it can successfully help in the annotation of brand new corpora.

AB - Named Entity Recognition is a basic task in Information Extraction that aims at identifying entities of interest within full-text documents. The patterns used to recognize entities can be rule based, as in the popular JAPE system. However, hand-crafting effective patterns is often difficult, and yet there is little research devoted to methods capable of learning human-readable patterns, possibly with arbitrary sets of features. In this paper, we present a semi-automatic method to generate both regular expressions and a subset of the JAPE language. It does not need a corpus annotated beforehand. Instead, it employs active learning and combines clustering with an algorithm that finds alignments between symbols present in the entities discovered during the learning process. The method currently supports a fixed set of character features and an arbitrary set of token features, but it can incorporate other kinds of features as well. Through several experiments with an English corpus, we show the ability of the method to generate effective patterns at a low annotation cost, and how it can successfully help in the annotation of brand new corpora.

UR - http://www.scopus.com/inward/record.url?scp=85020493457&partnerID=8YFLogxK

U2 - 10.1017/S135132491700016X

DO - 10.1017/S135132491700016X

M3 - Article

AN - SCOPUS:85020493457

SN - 1351-3249

VL - 24

SP - 39

EP - 75

JO - Natural Language Engineering

JF - Natural Language Engineering

IS - 1

ER -
