Peer Session Badge

My assignment: Byte pair encoding

Transformer

  • Emergence of the Transformer (self-attention)

    • Bi-Directional RNN
      • uses two modules: a forward RNN and a backward RNN
    • Self-attention
      • completely replaces the seq2seq with attention model
      • learns representations of the input sentence/words using only attention operations, without any RNN computation
      • allows more parallel computation
      • faster training
      • a sequence encoding technique that solves the long-term dependency problem
    • Illustrated Transformer [original] [Korean]
    • Paper: Attention Is All You Need
  • Scaled dot-product attention

    • Input
      • a query $q$
      • a set of key-value $(k,v)$ pairs
    • Output
      • weighted sum of values
      • weight of each value: softmax of the inner product between the query and the corresponding key

        $A(q,K,V) = \sum_i \frac{\exp(q \cdot k_i)}{\sum_j \exp(q \cdot k_j)} v_i$

        $d_k$: dimension of queries and keys

        $d_v$: dimension of values

    • Multiple queries
      • stack the queries into a matrix $Q$ and compute attention for all of them at once

        $A(Q,K,V) = \text{softmax}(QK^T)V$

    • Scaled by the square root of the query/key dimension $d_k$

      • for large $d_k$, the dot products grow in magnitude, the softmax becomes peaked, and its gradient becomes small $\rightarrow$ divide the scores by $\sqrt{d_k}$ (a code sketch follows the formula below)

        $A(Q,K,V) = \text{softmax} \Big(\frac{QK^T}{\sqrt{d_k}} \Big)V$
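
    • A minimal NumPy sketch of the scaled formula above; the matrix shapes ($Q$: `(n_q, d_k)`, $K$: `(n_k, d_k)`, $V$: `(n_k, d_v)`) are assumptions for illustration, not from the notes:

      ```python
      import numpy as np

      def scaled_dot_product_attention(Q, K, V):
          """Compute softmax(Q K^T / sqrt(d_k)) V."""
          d_k = K.shape[-1]
          scores = Q @ K.T / np.sqrt(d_k)                           # (n_q, n_k) similarity scores
          scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
          weights = np.exp(scores)
          weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax over keys
          return weights @ V                                        # weighted sum of values: (n_q, d_v)
      ```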

  • Multi-Head Attention

    • Single attention
      • only one way for words to interact with one another
    • Multi-head attention

      $\text{MultiHead}(Q,K,V) = \text{concat}(\text{head}_1, \dots, \text{head}_h) W^O$

      where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
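
    • A minimal NumPy sketch of the formulas above; `W_Q`, `W_K`, `W_V` are assumed to be lists of per-head projection matrices and `W_O` the output projection (how these weights are learned is outside this sketch):

      ```python
      import numpy as np

      def softmax(x, axis=-1):
          x = x - x.max(axis=axis, keepdims=True)
          e = np.exp(x)
          return e / e.sum(axis=axis, keepdims=True)

      def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
          """concat(head_1, ..., head_h) W^O with per-head scaled dot-product attention."""
          heads = []
          for W_q, W_k, W_v in zip(W_Q, W_K, W_V):        # one projection triple per head
              q, k, v = Q @ W_q, K @ W_k, V @ W_v
              scores = q @ k.T / np.sqrt(k.shape[-1])
              heads.append(softmax(scores) @ v)           # head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
          return np.concatenate(heads, axis=-1) @ W_O     # project the concatenated heads
      ```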

  • Layer Normalization
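
    • A minimal sketch (normalize each token's feature vector to zero mean and unit variance, then apply a learned scale and shift); the shapes and `eps` value here are assumptions:

      ```python
      import numpy as np

      def layer_norm(x, gamma, beta, eps=1e-5):
          """x: (seq_len, d_model); gamma, beta: (d_model,) learned scale and shift."""
          mean = x.mean(axis=-1, keepdims=True)           # per-token mean over features
          var = x.var(axis=-1, keepdims=True)             # per-token variance over features
          return gamma * (x - mean) / np.sqrt(var + eps) + beta
      ```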

  • Positional Encoding

    • injects order information into the Transformer model
    • adds a unique constant vector to each position (built from periodic functions such as sin and cos); a sketch of the sinusoidal encoding follows below

      for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$
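
    • A sketch of the sinusoidal encoding from the paper, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$; an even `d_model` is assumed:

      ```python
      import numpy as np

      def sinusoidal_positional_encoding(max_len, d_model):
          """Return a (max_len, d_model) matrix of constant vectors to add to the embeddings."""
          pos = np.arange(max_len)[:, None]               # positions 0 .. max_len-1
          idx = np.arange(0, d_model, 2)[None, :]         # even dimension indices 2i
          angle = pos / np.power(10000.0, idx / d_model)
          pe = np.zeros((max_len, d_model))
          pe[:, 0::2] = np.sin(angle)                     # sine on even dimensions
          pe[:, 1::2] = np.cos(angle)                     # cosine on odd dimensions
          return pe
      ```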

  • Transformer: learning rate

    • Warm-up learning rate scheduler
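
    • A sketch of the warm-up schedule from the paper (the learning rate grows linearly for the first `warmup_steps` steps, then decays as the inverse square root of the step number); `d_model = 512` and `warmup_steps = 4000` are the paper's base values:

      ```python
      def transformer_lr(step, d_model=512, warmup_steps=4000):
          """lr = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)"""
          step = max(step, 1)                             # the schedule starts at step 1
          return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
      ```
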
  • Masked self-attention

    • used in the decoder
    • applies a mask that excludes not-yet-generated words (which must not be visible at inference time); see the sketch below
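
    • A minimal NumPy sketch: a lower-triangular (causal) mask sets the scores of future positions to $-\infty$ before the softmax, so they receive zero weight; the projection matrices `W_q`, `W_k`, `W_v` are assumed inputs:

      ```python
      import numpy as np

      def masked_self_attention(X, W_q, W_k, W_v):
          """Decoder-style self-attention: position i attends only to positions <= i."""
          Q, K, V = X @ W_q, X @ W_k, X @ W_v
          scores = Q @ K.T / np.sqrt(K.shape[-1])
          mask = np.tril(np.ones(scores.shape, dtype=bool))    # causal (lower-triangular) mask
          scores = np.where(mask, scores, -np.inf)             # hide not-yet-generated positions
          scores = scores - scores.max(axis=-1, keepdims=True)
          weights = np.exp(scores)
          weights = weights / weights.sum(axis=-1, keepdims=True)
          return weights @ V
      ```
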
  • Further topics