2021-09-13 | Recent Literature Reading Notes

End-to-End Neural Pipeline for Goal-Oriented Dialogue Systems using GPT-2

Donghoon Ham, Jeong-Gwan Lee, Youngsoo Jang, Kee-Eung Kim. KAIST. ACL 2020

Code can be found here

Highlights

  • It is trained to follow the traditional dialogue management pipeline, which makes the monolithic neural model more interpretable and easier to integrate with external systems
  • It is trained in an end-to-end fashion with simple gradient descent
  • It leverages GPT-2, a powerful pre-trained language model

Introduction

Traditional goal-oriented dialogue systems mostly adopt a pipelined modular architecture (a minimal sketch of the dataflow follows the list):

  • Natural Language Understanding (NLU) module that recognizes the user's intent and extracts values for slots
    • Input: user's utterance X_n
    • Output: U_n = (I_n, Z_n), where I_n is the intent and Z_n is the set of slot-value pairs
  • Dialogue State Tracking (DST) module that tracks the values of slots
    • Input: U_n, A_{n-1}, S_{n-1} (N-best list)
    • Output: S_{n}
  • Dialogue Policy (POL) module that decides the system action
    • Input: S_{n}
    • Output: A_{n}
  • Natural Language Generation (NLG) module that generates the utterance corresponding to the system action
    • Input: A_{n}
    • Output: Y_{n}
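
As a concrete illustration, here is a minimal sketch of that dataflow; the module interfaces (nlu, dst, pol, nlg) are hypothetical stand-ins, not the paper's API.

```python
# Hypothetical module interfaces illustrating the pipelined dataflow above.
def pipeline_turn(x_n, a_prev, s_prev, nlu, dst, pol, nlg):
    """Run one dialogue turn through the traditional modular pipeline."""
    i_n, z_n = nlu(x_n)                    # NLU: utterance X_n -> U_n = (intent I_n, slot-value pairs Z_n)
    s_n = dst((i_n, z_n), a_prev, s_prev)  # DST: track the dialogue state S_n
    a_n = pol(s_n)                         # POL: dialogue state -> system action A_n
    y_n = nlg(a_n)                         # NLG: system action -> system utterance Y_n
    return y_n, a_n, s_n                   # A_n and S_n feed into the next turn
```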

End-to-end methods instead build the dialogue system as a single model that takes the natural language context as input and generates a natural language response as output.

Dataset

The model is trained on the MultiWOZ dataset and evaluated with ConvLab.

An example of a single-domain dialogue in the MultiWOZ dataset

Each dialogue consists of ‘Goal’, ‘Database’ and ‘Dialogue turns’.

  • Goal is defined by the domain and the slots. The slots are divided into informable, requestable, and book slots (see the example after this list).
    • Informable slots represent user constraints
    • Requestable slots hold additional information that the user wants to obtain
    • Book slots are used to reserve a place recommended by the system
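
For example, a restaurant-domain goal could look like the dictionary below; the MultiWOZ-style keys ('info', 'reqt', 'book') map to informable, requestable, and book slots, and the slot values are invented for illustration.

```python
# Illustrative MultiWOZ-style goal; the values are made up for exposition.
goal = {
    "restaurant": {
        "info": {"food": "italian", "pricerange": "cheap"},           # informable: user constraints
        "reqt": ["phone", "address"],                                  # requestable: info the user wants
        "book": {"people": "4", "day": "saturday", "time": "18:00"},   # book: reservation slots
    }
}
```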

End-to-end neural dialogue model

The overall architecture, illustrated with a concrete example in the paper, proceeds in six steps (a code sketch follows the list):
  1. Predict the recent domain and the corresponding dialogue state conditioned on the dialogue history
  2. Predict the system action with delexicalized tokens conditioned on the dialogue history and dialogue state
  3. If the system action (e.g. ‘inform’, ‘book’) needs external information from the database, the query module retrieves the candidates and returns one of them
  4. Update the current system action when an empty query result is detected
  5. Generate the system response with delexicalized tokens conditioned on dialogue history, dialogue state, and system action
  6. Update the delexicalized tokens in the system response with the query result
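
Putting the six steps together, a hedged sketch of this decode-query-decode loop is below; model.decode_state / decode_action / decode_response, db.query, and the two helpers are hypothetical names standing in for the paper's implementation.

```python
import re
from typing import Optional

def needs_db(action: str) -> bool:
    """Step 3 guard: does the system action require a database lookup?"""
    return any(act in action for act in ("inform", "book"))

def lexicalize(response: str, result: Optional[dict]) -> str:
    """Step 6: replace placeholders like [hotel_name] with values from the query result."""
    if not result:
        return response
    return re.sub(r"\[(\w+)\]", lambda m: str(result.get(m.group(1), m.group(0))), response)

def generate_turn(model, history, db):
    state = model.decode_state(history)                       # step 1: domain + dialogue state
    action = model.decode_action(history, state)              # step 2: delexicalized system action
    result = db.query(state) if needs_db(action) else None    # step 3: retrieve DB candidates
    if needs_db(action) and not result:                       # step 4: empty query result fallback
        action = model.decode_action(history, state, empty_query=True)
    response = model.decode_response(history, state, action)  # step 5: delexicalized response
    return lexicalize(response, result)                       # step 6: fill in the placeholders
```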

Input

In the MultiWOZ dataset, the ‘metadata’ field is treated as the dialogue state and the ‘dialogue act’ field as the system action

Delimiter tokens:

  • <usr>: precedes a user utterance
  • <sys>: precedes a system response
  • <ds>: precedes the dialogue state
  • <sa>: precedes the system action

Special tokens:

  • domain and slot names
  • <nm> (not mentioned) and <dc> (don't care) for slot values

Input embedding = Token embedding + Speaker embedding + Positional embedding
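
As a sketch, one turn might be flattened into a single GPT-2 input sequence as below; the delimiter tokens follow the paper, while the helper function and the example strings are illustrative.

```python
# Illustrative flattening of one turn into a GPT-2 input string.
# The delimiters <usr>/<sys>/<ds>/<sa> follow the paper; the content strings are made up.
def build_input(history, state, action, response):
    parts = [("<usr> " if spk == "user" else "<sys> ") + utt for spk, utt in history]
    parts.append("<ds> " + state)       # flattened dialogue state
    parts.append("<sa> " + action)      # delexicalized system action
    parts.append("<sys> " + response)   # delexicalized system response (the training target)
    return " ".join(parts)

seq = build_input(
    history=[("user", "i need a cheap hotel in the north")],
    state="hotel pricerange cheap area north",     # unfilled slots would carry <nm> / <dc>
    action="hotel inform name",
    response="[hotel_name] is a cheap hotel in the north .",
)
# Each token's input embedding is the sum of its token, speaker, and positional embeddings.
```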

Training Objective

The objective function is the weighted sum of the language modeling (LM) and next-utterance classification (NC) objectives:

L_{\text{total}}(W) = \alpha_{LM} L_{LM}(W) + \alpha_{NC} L_{NC}(W)

  • For LM, L_{LM}(w_1, \ldots, w_n) = \sum_i \log P(w_i \mid w_1, \ldots, w_{i-1})
  • For NC, the model needs to distinguish the gold response (gold dialogue state+gold system action+gold system response) from a distractor (gold dialogue state+gold system action+fake system response), given the dialogue history
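
This two-headed setup is the same idea as HuggingFace's GPT2DoubleHeadsModel (an LM head plus a multiple-choice head). Below is a minimal sketch of the weighted loss with dummy tensors; the alpha values are illustrative, not the paper's hyperparameters.

```python
import torch
from transformers import GPT2DoubleHeadsModel

# Sketch of L_total = alpha_LM * L_LM + alpha_NC * L_NC on dummy inputs.
model = GPT2DoubleHeadsModel.from_pretrained("gpt2")
alpha_lm, alpha_nc = 2.0, 1.0                    # illustrative weights, not from the paper

batch, n_cands, seq_len = 1, 2, 16               # candidate 0 = gold sequence, 1 = distractor
input_ids = torch.randint(0, model.config.vocab_size, (batch, n_cands, seq_len))
mc_token_ids = torch.full((batch, n_cands), seq_len - 1, dtype=torch.long)  # classify at last token
lm_labels = input_ids.clone()                    # in practice, non-target positions are masked with -100
mc_labels = torch.zeros(batch, dtype=torch.long) # index of the gold candidate

out = model(input_ids=input_ids, mc_token_ids=mc_token_ids,
            labels=lm_labels, mc_labels=mc_labels)
loss = alpha_lm * out.loss + alpha_nc * out.mc_loss   # weighted sum of LM and NC objectives
loss.backward()
```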

Result

On DSTC8:

