Boostcamp AI Tech (Day 019)
My assignment: Byte pair encoding
Transformer
- Emergence of the Transformer (self-attention)
  - Bi-Directional RNN
    - uses two modules: a forward RNN and a backward RNN
  - Self-attention
    - completely replaces the seq2seq with attention model
    - learns representations of the input sentence/words using only attention operations, with no RNN computation
    - allows more parallel computation
    - trains faster
    - a sequence encoding technique that resolves the long-term dependency problem
  - Illustrated Transformer [original] [Korean translation]
  - Paper: Attention Is All You Need

Scaled dot-product attention
- Input
  - a query $q$
  - a set of key-value $(k, v)$ pairs
- Output
  - weighted sum of the values
- weight of each value: softmax of the inner products between the query and the corresponding keys
  - $A(q, K, V) = \sum_i \frac{\exp(q \cdot k_i)}{\sum_j \exp(q \cdot k_j)} v_i$
  - $d_k$: dimension of queries and keys
  - $d_v$: dimension of values
- Multiple queries
  - stack the queries into a matrix $Q$ and compute attention for all of them at once
  - $A(Q, K, V) = \text{softmax}(QK^T)V$
- Scaling by the dimension of the query/key vectors (see the sketch after this list)
  - as $d_k$ grows, the dot products grow in magnitude, the softmax gets peaked, and the gradients get smaller $\rightarrow$ scale by $\sqrt{d_k}$
  - $A(Q, K, V) = \text{softmax}\Big(\frac{QK^T}{\sqrt{d_k}}\Big)V$
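A minimal NumPy sketch of the scaled dot-product attention formula above; the function name and the toy shapes (3 queries, 4 key-value pairs, $d_k = 8$, $d_v = 6$) are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """A(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k): similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of the values, (n_q, d_v)

# toy example: 3 queries, 4 key-value pairs, d_k = 8, d_v = 6
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 6))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 6)
```
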
Multi-Head Attention
- Single attention
  - only one way for words to interact with one another
- Multi-head attention (see the sketch after this list)
  - run $h$ attention heads in parallel with different learned linear projections, then concatenate the results
  - $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$
  - where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
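A minimal sketch of the MultiHead formula above under assumed toy dimensions ($d_{model} = 16$, $h = 4$ heads, $d_k = d_v = 4$ per head); each head applies its own learned projections before the scaled dot-product attention from the previous sketch.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (same formula as the previous sketch)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,
    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(len(W_Q))]
    return np.concatenate(heads, axis=-1) @ W_O

# toy self-attention setup: Q = K = V = X
rng = np.random.default_rng(0)
d_model, h, d_head, n = 16, 4, 4, 5
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(h * d_head, d_model))
X = rng.normal(size=(n, d_model))
print(multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O).shape)  # (5, 16)
```
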
Layer Normalization
- normalize to zero mean & unit standard deviation
- statistics are computed per layer and per training point, i.e. across the feature dimension rather than across the batch (sketch below)
- Group Normalization
  - also normalizes to zero mean & unit standard deviation, but within groups of channels
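A minimal NumPy sketch of layer normalization as summarized above: each training point is normalized to zero mean and unit std across its own features, then rescaled by learnable parameters (the names gamma/beta are assumptions).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (one training point) over its feature dimension,
    then apply the learnable affine transform gamma * x_hat + beta."""
    mean = x.mean(axis=-1, keepdims=True)   # statistics per training point, not per batch
    std = x.std(axis=-1, keepdims=True)
    x_hat = (x - mean) / (std + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(2, 8))
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=-1), out.std(axis=-1))  # each row: mean ~ 0, std ~ 1
```
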
Positional Encoding
- injects order information into the Transformer model
- adds a unique constant vector to each position, built from periodic functions such as sine and cosine (sketch below)
  - $PE_{(pos, 2i)} = \sin\big(pos / 10000^{2i/d_{model}}\big)$, $PE_{(pos, 2i+1)} = \cos\big(pos / 10000^{2i/d_{model}}\big)$
  - for any fixed offset $k$, $PE_{pos + k}$ can be expressed as a linear function of $PE_{pos}$
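A minimal NumPy sketch of the sinusoidal positional encoding defined above; max_len and d_model are illustrative values, and the resulting matrix is simply added to the token embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even feature indices 2i
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16); added to the (50, 16) embedding matrix
```
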
Transformer: learning rate
- Warm-up learning rate scheduler (sketch below)
  - the learning rate increases linearly for the first warm-up steps, then decays with the inverse square root of the step number
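A minimal sketch of the warm-up schedule from the Attention Is All You Need paper, $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$: the rate grows linearly for warmup_steps steps and then decays with the inverse square root of the step number. The default values mirror the paper; the function name noam_lr is an assumption.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)                       # avoid step = 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 20000):
    print(s, round(noam_lr(s), 6))            # rises until step 4000, then decays
```
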
Masked self-attention
- used in the decoder
- applies a mask that excludes not-yet-generated words, which must not be visible at inference time, from the attention (sketch below)
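A minimal NumPy sketch of the decoder-side mask: positions after the current one are set to $-\infty$ before the softmax, so not-yet-generated words receive zero attention weight. The query/key/value projections are omitted here for brevity; this is a simplification, not the full decoder layer.

```python
import numpy as np

def masked_self_attention(X):
    """Causal self-attention: each position attends only to itself and earlier positions."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal = future tokens
    scores = np.where(mask, -np.inf, scores)          # -inf -> softmax weight of exactly 0
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

X = np.random.default_rng(0).normal(size=(4, 8))
print(masked_self_attention(X).shape)  # (4, 8)
```
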
Further topics