Boostcamp AI Tech (Day 018)

My assignment: Training NMT model

Encoder-Decoder Architecture

Seq2Seq model
- many to many (sequences of words as input and output)
- composed of encoder and decoder
- 참고 자료
- 논문
Attention module
- encoder의 hidden state가 뒤로 갈수록 초반부의 정보를 유실하는 문제 해결
- encoder는 $h_1^e, h_2^e, \dots, h_n^e$를 decoder에 한번에 전달
- decoder는 각 time step에서 단어 생성 시 선별적으로 활용
- attention scores (encoder의 각 단어와 hidden vector의 내적, 유사도)를 바탕으로 가중 평균 후 attention output(context vector) 계산
- 장점
  - decoder가 소스의 특정 parts에 주목하게 함
  - bottleneck problem 해결
  - vanishing gradient problem 해결
  - interpretability 제공 (언제 어떤 단어에 집중하는지)
- 논문
Seq2Seq with Attention
- decoder에서 다음 단어 예측할 때 이전 단어에서 넘어온 hidden state와, encoder의 전체 단어와의 attention score 두 가지를 추가적인 입력으로 사용
- 유사도 $score(h_t, \bar h_s)$ 계산 방법
  
  $1. h_t^T \bar h_s \text{ (dot)}$
  
  $2. h_t^T W_a \bar h_s \text{ (general)}$
  
  $3. v_a^T \text{tanh} (W_a[h_t; \bar h_s]) \text{ (concat)}$

Beam search

Greedy decoding
- 현재 time step에서 가장 좋아보이는 단어 1개를 선택
- 문제점: undo할 방법이 없음 (번역 진행 순서 역행 불가)
Exhaustive search
- time step 마다 모든 조합을 고려
- find a length T translation y that maximizes $P(y \vert x) = P(y_1 \vert x)P(y_2 \vert y_1 ,x) … P(y_T \vert y_1, …, y_{t-1},x)$
  
  $= \prod_1^T P(y_t \vert y_1, …, y_{t-1},x)$
- 문제점: 연산량이 너무 큼
  
  $O(V^t)$
  
  $V$: vocabulary size
Beam search
- time step 마다 1개(greedy)도 $V^t$개(exhaustive)도 아닌, k개의 가능한 경우의 수 고려
- $k$: beam size (일반적으로 5~10) $score(y_1, \dots, y_t) = \log P_{LM}(y_1, \dots, y_t \vert x)$
  
  $= \sum_{i=1}^t \log P_{LM}(y_i \vert y_1, \dots, y_{i-1}, x)$
- exhaustive에 비해 훨씬 효율적임
- 참고 자료
BLEU score
- precision and recall
  - Reference: Half of my heart is in Havana ooh na na
  - Predicted: Half as my heart is in Obama ooh na
    
    $precision = \frac{\text{# correct words}}{\text{length of prediction}} = \frac{7}{9}$
    
    $recall = \frac{\text{# correct words}}{\text{length of reference}} = \frac{7}{10}$
  - 문제점: 단어 순서가 달라도 정확한 예측으로 계산
- BiLingual Evaluation Understudy (BLEU)
  - N-gram overlap: 연속된 n개의 단어를 기준으로 precision 계산
    
    $BLEU = \min(1, \frac{\text{length of prediction}}{\text{length of reference}}) (\prod_{i=1}^4 precision_i)^{\frac{1}{4}}$
    
    = (brevity penalty)(precision 기하평균)
  - brevity penalty: 너무 짧은 번역에 대한 패널티

내 질문 (Masterclass Q&A)
- Q. 반어법이나 돌려말하기 등 (같은 문장이라도) 실제로 어떤 문맥과 의도를 가지고 있는지를 분석하는 화용(pragmatic)분석에까지 높은 성능을 보이는 모델은 아직 못 본 것 같은데, 이런 연구가 어떻게 진행되고 있고 앞으로 어떻게 될지, 교수님은 어떤 생각을 가지고 계신지 궁금합니다.
- A. 교수님: 고전적 task인 sentiment analysis에서도 벌써 등장한 long standing problem. 여전히 challenging. GPT-3, BERT처럼 데이터 양과 모델 사이즈만으로 범용 문제를 해결하려는 접근도 아직 이런 부분은 돌파구를 찾지 못한 것 같음.

Encoder-Decoder Architecture

Seq2Seq model

Attention module

Seq2Seq with Attention

Beam search

Greedy decoding

Exhaustive search

Beam search

BLEU score

내 질문 (Masterclass Q&A)