[Experiment] SOM-DST Paper Summary #62

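When the operation the encoder predicts for a slot is Update, the generator predicts that slot's value.
SOM-DST's generator produces values only for $J^\prime_t$ slots rather than for all $J$ slots.

Since $J^\prime_t \ll J$ in most cases, the authors argue that this is more efficient.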
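As a rough illustration of the efficiency argument, the sketch below counts how many slots reach the generator in a single turn. The slot names and predicted operations are made-up values, the operation labels (`CARRYOVER`, `DELETE`, `DONTCARE`, `UPDATE`) are assumed to follow the four-operation setup in the paper, and `slots_to_generate` is a hypothetical helper rather than the authors' code.

```python
def slots_to_generate(predicted_ops):
    """Return the slots whose predicted operation is UPDATE.

    Only these J'_t slots are passed to the slot value generator.
    """
    return [slot for slot, op in predicted_ops.items() if op == "UPDATE"]


# Hypothetical predictions for one turn (J = 5 slots).
predicted_ops = {
    "hotel-area": "CARRYOVER",
    "hotel-name": "UPDATE",
    "hotel-parking": "DONTCARE",
    "taxi-destination": "CARRYOVER",
    "train-day": "DELETE",
}

update_slots = slots_to_generate(predicted_ops)
# J is the total number of slots, J'_t the number that need decoding.
print(f"J = {len(predicted_ops)}, J'_t = {len(update_slots)}")
```

In this toy turn the decoder runs once instead of five times, which is the saving the efficiency claim refers to.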
A GRU is used as the decoder.
- It takes the word embedding vector $e_t^{j,k}$ as input and recurrently updates the GRU hidden state vector $g_t^{j, k}$
- $g_t^{j, 0} = h_t^{\rm X}$ , $e_t^{j,0} = h_t^{[slot]j}$ : initial values fed to the GRU
- $g_t^{j, k} = GRU(g_t^{j, k-1}, e_t^{j,k})$
- Decoding continues until $e_t^{j,k}$ is the [EOS] token
- At the k-th decoding step, the hidden state $g_t^{j, k}$ is transformed into probability distributions over the vocabulary and over the words of the user utterance
  $P^{j, k}_{vcb, t} = softmax(Eg^{j, k}_t) \in \mathbb R^{d_{vcb}}$
  - $E \in \mathbb R^{d_{vcb}\times d}$ : word embedding matrix shared by the encoder and the decoder
    - $d_{vcb}$ : vocabulary size
  $P^{j, k}_{ctx, t} = softmax(H_t g_t^{j, k}) \in \mathbb R^{\left|X_t\right|}$
  - probability distribution over the words of the user utterance
  $P^{j, k}_{val, t} = \alpha P^{j, k}_{vcb, t} + (1-\alpha) P^{j, k}_{ctx, t}$ : final output distribution
  - $\alpha = sigmoid(W_1 [g_t^{j, k}; e_t^{j, k}; c_t^{j, k}])$ : scalar gate that weights the two distributions
    - $W_1 \in \mathbb R^{1\times (3d)}$ : learnable parameter
    - $c^{j, k}_t = P^{j, k}_{ctx, t} H_t \in \mathbb R^d$ : context vector
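The decoding-step equations above can be sketched in plain Python to make the shapes concrete. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy matrices are arbitrary, the gate is taken to be a sigmoid over the concatenation $[g; e; c]$, and `context_ids` (the vocabulary id of each input position) is a hypothetical argument used to map $P_{ctx}$ from input positions onto vocabulary entries so that it can be added to $P_{vcb}$.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def value_distribution(E, H, context_ids, g, e, W1):
    """One decoding step of the soft-gated copy mechanism.

    E: d_vcb x d embedding matrix, H: |X_t| x d encoder outputs,
    g, e: d-dimensional hidden state and input embedding, W1: 3d weights.
    """
    p_vcb = softmax(matvec(E, g))  # distribution over the vocabulary
    p_ctx = softmax(matvec(H, g))  # distribution over input positions
    d = len(g)
    # Context vector c = P_ctx H (weighted sum of encoder outputs).
    c = [sum(p_ctx[i] * H[i][n] for i in range(len(H))) for n in range(d)]
    alpha = sigmoid(sum(w * x for w, x in zip(W1, g + e + c)))
    # Assumption: scatter position probabilities onto their vocabulary ids.
    p_ctx_vocab = [0.0] * len(E)
    for pos, tok in enumerate(context_ids):
        p_ctx_vocab[tok] += p_ctx[pos]
    p_val = [alpha * v + (1 - alpha) * x for v, x in zip(p_vcb, p_ctx_vocab)]
    return p_val, alpha


# Toy sizes: d = 2, d_vcb = 4, |X_t| = 3.
E = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.1], [-0.2, 0.4]]
H = [[0.3, 0.1], [-0.1, 0.2], [0.4, 0.4]]
context_ids = [2, 0, 2]
p_val, alpha = value_distribution(E, H, context_ids, [0.5, -0.5], [0.1, 0.2], [0.1] * 6)
print(alpha, p_val)
```

Because both component distributions sum to 1 and $\alpha$ lies in (0, 1), the mixture is again a valid distribution.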

Objective Function

State operation predictor

Main Task

state operation classification

Auxiliary Task

domain classification

In addition to state operation classification, domain classification is used as an auxiliary task so that the model learns the correlation between slot operations and domain transitions across dialogue turns.

$P_{dom, t} = softmax(W_{dom} h_t^{\rm X})$

$W_{dom} \in \mathbb R^{d_{dom}\times d}$ : learnable parameter
$P_{dom, t} \in \mathbb R^{d_{dom}}$ : probability distribution over domains at turn t
- $d_{dom}$ : # of domains defined in the dataset
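The domain classifier is a single linear map followed by a softmax. The sketch below works through it on toy numbers. The weights and the pooled vector are invented values, and `domain_distribution` is a hypothetical helper name.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def domain_distribution(W_dom, h_X):
    """P_dom = softmax(W_dom h_X), with W_dom of shape d_dom x d."""
    logits = [sum(w * h for w, h in zip(row, h_X)) for row in W_dom]
    return softmax(logits)


# Toy sizes: d_dom = 2 domains, d = 3 hidden units.
W_dom = [[0.2, -0.1, 0.0], [0.5, 0.3, -0.2]]
h_X = [1.0, 2.0, 3.0]
p_dom = domain_distribution(W_dom, h_X)
print(p_dom)
```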

Average of the negative log-likelihood

$L_{opr, t} = -\frac{1}{J}\sum_{j=1}^{J}(Y_{opr, t}^j)^\top log(P^j_{opr, t})$

$L_{dom, t} = -(Y_{dom, t})^\top log(P_{dom, t})$

$Y_{dom, t} \in \mathbb R^{d_{dom}}$ : one-hot vector for the ground truth domain
$Y^j_{opr, t} \in \mathbb R^{\left| O\right|}$ : one-hot vector for the ground truth operation for the j-th slot
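The two predictor losses can be checked by hand with a small sketch. The probabilities are made-up, four operation classes are assumed, and the function names are hypothetical.

```python
import math


def nll(one_hot, probs):
    """-(Y)^T log(P); zero entries of Y are skipped, so log(0) is never hit."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y)


def operation_loss(Y_opr, P_opr):
    """L_opr: negative log-likelihood averaged over all J slots."""
    J = len(Y_opr)
    return sum(nll(y, p) for y, p in zip(Y_opr, P_opr)) / J


def domain_loss(Y_dom, P_dom):
    """L_dom: negative log-likelihood of the ground-truth domain."""
    return nll(Y_dom, P_dom)


# Toy turn with J = 2 slots and |O| = 4 operations.
Y_opr = [[1, 0, 0, 0], [0, 0, 0, 1]]
P_opr = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
# Toy domain target with d_dom = 3.
Y_dom = [0, 1, 0]
P_dom = [0.2, 0.5, 0.3]

L_opr = operation_loss(Y_opr, P_opr)
L_dom = domain_loss(Y_dom, P_dom)
print(L_opr, L_dom)
```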

Slot value generator

Average of the negative log-likelihood

$L_{svg, t} = -\frac{1}{\left|\mathbb U_t\right|}\sum_{j\in\mathbb U_t}^{}\left[\frac{1}{K^j_t}\sum_{k=1}^{K^j_t}(Y_{val, t}^{j, k})^{\top}log(P^{j, k}_{val, t})\right]$

$K_t^j$ : # of tokens of the ground truth value that needs to be generated for the j-th slot
$Y_{val, t}^{j, k} \in \mathbb R^{d_{vcb}}$ : one-hot vector for the ground truth token to be generated for the j-th slot at the k-th decoding step
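The nested average in $L_{svg, t}$ (first over the tokens of each value, then over slots) is easy to get wrong, so here is a minimal sketch. It assumes $\mathbb U_t$ is the set of slots whose operation is Update, replaces the one-hot targets with token ids (which selects the same probability), and uses invented slot names and distributions.

```python
import math


def slot_value_loss(targets, probs):
    """L_svg for one turn.

    targets: {slot: [ground-truth token id per decoding step]} for slots in U_t
    probs:   {slot: [P_val distribution per decoding step]}
    """
    total = 0.0
    for slot, token_ids in targets.items():
        K = len(token_ids)  # K_t^j: number of tokens to generate for this slot
        step_nll = [-math.log(probs[slot][k][tok]) for k, tok in enumerate(token_ids)]
        total += sum(step_nll) / K  # average over decoding steps
    return total / len(targets)  # average over the slots in U_t


# Toy turn: two slots need generation, vocabulary of size 3.
targets = {"hotel-name": [2, 1], "train-day": [0]}
probs = {
    "hotel-name": [[0.1, 0.1, 0.8], [0.2, 0.5, 0.3]],
    "train-day": [[0.5, 0.25, 0.25]],
}
L_svg = slot_value_loss(targets, probs)
print(L_svg)
```

Averaging per slot before averaging across slots keeps long values from dominating the loss.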

Final Loss

Minimize the joint loss $L_{joint, t} = L_{opr, t} + L_{dom, t} + L_{svg, t}$

Experimental Setup

Datasets

MultiWOZ 2.0 and MultiWOZ 2.1

Training

Encoder : Bert-base-uncased
Decoder : GRU
Hidden size : 768
Optimizer : BertAdam
Encoder LR and warmup : 4e-5, 0.1
Decoder LR and warmup : 1e-4, 0.1
Batch size : 32
Dropout : 0.1
Word dropout : each word is replaced with [UNK] with probability 0.1
Input max length : 256
Training Epoch : 30
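Word dropout as listed above can be sketched in a few lines. This is an illustrative version only: the summary does not say whether special tokens are exempt, so this sketch treats every token alike, and `word_dropout` and its parameters are hypothetical names.

```python
import random


def word_dropout(tokens, p=0.1, unk="[UNK]", rng=None):
    """Replace each token with `unk` independently with probability p."""
    rng = rng or random.Random()
    return [unk if rng.random() < p else tok for tok in tokens]


tokens = "i need a hotel in the north".split()
# A seeded generator keeps the example reproducible.
noisy = word_dropout(tokens, p=0.1, rng=random.Random(0))
print(noisy)
```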

Results

Joint Goal Accuracy

Domain-specific Accuracy

Latency

Evaluation

Achieves state-of-the-art or comparable performance on JGA and domain-specific accuracy
Performs well despite a very short inference time

changwoomon May 11, 2021 Maintainer

