Effective crowdsourced generation of training data for chatbots natural language understanding

Rucha Bapat; Pavel Kucherbaev; Alessandro Bozzon

doi:10.1007/978-3-319-91662-0_8

Corresponding author: Pavel Kucherbaev

Web Information Systems

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

Chatbots are text-based conversational agents. Natural Language Understanding (NLU) models are used to extract meaning and intention from user messages sent to chatbots. The user experience of chatbots largely depends on the performance of the NLU model, which itself largely depends on the initial dataset the model is trained with. The training data should cover the diversity of real user requests the chatbot will receive. Obtaining such data is a challenging task even for big corporations. We introduce a generic approach to generating training data with the help of crowd workers, and we discuss the workflow of the approach and the design of crowdsourcing tasks that assure high quality. We evaluate the approach by running an experiment collecting data for 9 different intents. We use the collected training data to train a natural language understanding model. We analyse the performance of the model under different training set sizes for each intent. We provide recommendations on selecting an optimal confidence threshold for predicting intents, based on the cost model of incorrect and unknown predictions.

Original language	English
Title of host publication	Web Engineering - 18th International Conference, ICWE 2018, Proceedings
Publisher	Springer
Pages	114-128
Number of pages	15
ISBN (Electronic)	978-3-319-91662-0
ISBN (Print)	978-3-319-91661-3
DOIs	https://doi.org/10.1007/978-3-319-91662-0_8
Publication status	Published - 2018
Event	18th International Conference on Web Engineering, ICWE 2018, Caceres, Spain, 5 Jun 2018 → 8 Jun 2018

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	10845 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	18th International Conference on Web Engineering, ICWE 2018
Country/Territory	Spain
City	Caceres
Period	5/06/18 → 8/06/18

Bibliographical note

Accepted Author Manuscript

Keywords

Conversational agents
Crowdsourcing
Natural language understanding


Cite this

Bapat, R., Kucherbaev, P., & Bozzon, A. (2018). Effective crowdsourced generation of training data for chatbots natural language understanding. In Web Engineering - 18th International Conference, ICWE 2018, Proceedings (pp. 114-128). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10845 LNCS). Springer. https://doi.org/10.1007/978-3-319-91662-0_8

Bapat, Rucha ; Kucherbaev, Pavel ; Bozzon, Alessandro. / Effective crowdsourced generation of training data for chatbots natural language understanding. Web Engineering - 18th International Conference, ICWE 2018, Proceedings. Springer, 2018. pp. 114-128 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{93fdad6e7d974441b3caa1e9dde637d7,
  title = "Effective crowdsourced generation of training data for chatbots natural language understanding",
  abstract = "Chatbots are text-based conversational agents. Natural Language Understanding (NLU) models are used to extract meaning and intention from user messages sent to chatbots. The user experience of chatbots largely depends on the performance of the NLU model, which itself largely depends on the initial dataset the model is trained with. The training data should cover the diversity of real user requests the chatbot will receive. Obtaining such data is a challenging task even for big corporations. We introduce a generic approach to generate training data with the help of crowd workers, we discuss the approach workflow and the design of crowdsourcing tasks assuring high quality. We evaluate the approach by running an experiment collecting data for 9 different intents. We use the collected training data to train a natural language understanding model. We analyse the performance of the model under different training set sizes for each intent. We provide recommendations on selecting an optimal confidence threshold for predicting intents, based on the cost model of incorrect and unknown predictions.",
  keywords = "Conversational agents, Crowdsourcing, Natural language understanding",
  author = "Rucha Bapat and Pavel Kucherbaev and Alessandro Bozzon",
  note = "Accepted Author Manuscript; 18th International Conference on Web Engineering, ICWE 2018 ; Conference date: 05-06-2018 Through 08-06-2018",
  year = "2018",
  doi = "10.1007/978-3-319-91662-0_8",
  language = "English",
  isbn = "978-3-319-91661-3",
  series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
  publisher = "Springer",
  pages = "114--128",
  booktitle = "Web Engineering - 18th International Conference, ICWE 2018, Proceedings",
}

Bapat, R, Kucherbaev, P & Bozzon, A 2018, Effective crowdsourced generation of training data for chatbots natural language understanding. in Web Engineering - 18th International Conference, ICWE 2018, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10845 LNCS, Springer, pp. 114-128, 18th International Conference on Web Engineering, ICWE 2018, Caceres, Spain, 5/06/18. https://doi.org/10.1007/978-3-319-91662-0_8

Effective crowdsourced generation of training data for chatbots natural language understanding. / Bapat, Rucha; Kucherbaev, Pavel; Bozzon, Alessandro.
Web Engineering - 18th International Conference, ICWE 2018, Proceedings. Springer, 2018. p. 114-128 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10845 LNCS).

TY  - GEN
T1  - Effective crowdsourced generation of training data for chatbots natural language understanding
AU  - Bapat, Rucha
AU  - Kucherbaev, Pavel
AU  - Bozzon, Alessandro
N1  - Accepted Author Manuscript
PY  - 2018
Y1  - 2018
N2  - Chatbots are text-based conversational agents. Natural Language Understanding (NLU) models are used to extract meaning and intention from user messages sent to chatbots. The user experience of chatbots largely depends on the performance of the NLU model, which itself largely depends on the initial dataset the model is trained with. The training data should cover the diversity of real user requests the chatbot will receive. Obtaining such data is a challenging task even for big corporations. We introduce a generic approach to generate training data with the help of crowd workers, we discuss the approach workflow and the design of crowdsourcing tasks assuring high quality. We evaluate the approach by running an experiment collecting data for 9 different intents. We use the collected training data to train a natural language understanding model. We analyse the performance of the model under different training set sizes for each intent. We provide recommendations on selecting an optimal confidence threshold for predicting intents, based on the cost model of incorrect and unknown predictions.
AB  - Chatbots are text-based conversational agents. Natural Language Understanding (NLU) models are used to extract meaning and intention from user messages sent to chatbots. The user experience of chatbots largely depends on the performance of the NLU model, which itself largely depends on the initial dataset the model is trained with. The training data should cover the diversity of real user requests the chatbot will receive. Obtaining such data is a challenging task even for big corporations. We introduce a generic approach to generate training data with the help of crowd workers, we discuss the approach workflow and the design of crowdsourcing tasks assuring high quality. We evaluate the approach by running an experiment collecting data for 9 different intents. We use the collected training data to train a natural language understanding model. We analyse the performance of the model under different training set sizes for each intent. We provide recommendations on selecting an optimal confidence threshold for predicting intents, based on the cost model of incorrect and unknown predictions.
KW  - Conversational agents
KW  - Crowdsourcing
KW  - Natural language understanding
UR  - http://www.scopus.com/inward/record.url?scp=85047996275&partnerID=8YFLogxK
U2  - 10.1007/978-3-319-91662-0_8
DO  - 10.1007/978-3-319-91662-0_8
M3  - Conference contribution
SN  - 978-3-319-91661-3
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 114
EP  - 128
BT  - Web Engineering - 18th International Conference, ICWE 2018, Proceedings
PB  - Springer
T2  - 18th International Conference on Web Engineering, ICWE 2018
Y2  - 5 June 2018 through 8 June 2018
ER  -

Bapat R, Kucherbaev P, Bozzon A. Effective crowdsourced generation of training data for chatbots natural language understanding. In Web Engineering - 18th International Conference, ICWE 2018, Proceedings. Springer. 2018. p. 114-128. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-319-91662-0_8
