Boostcamp AI Tech (Day 014)

My assignments: LSTM pytorch, SDPA & MHA pytorch

1. RNN (Recurrent Neural Networks)

Sequence data
- 소리, 문자열, 주가 등 이벤트의 발생 순서가 중요한 데이터
- 시계열(time-series) 데이터
- i.i.d.(독립동등분포) 가정을 자주 위배하기 때문에, 순서를 바꾸거나 과거 정보에 손실이 발생하면 데이터의 확률분포도 바뀜
Sequential model
- 이전 시퀀스의 정보를 가지고 앞으로 발생할 데이터의 확률분포를 다루기 때문에 조건부확률 이용
  
  $P(X_1,…,X_t) = P(X_t \vert X_1,…,X_{t-1})P(X_1,…,X_{t-1})$
  
  $= \prod_{s=1}^t P(X_s \vert X_1,…,X_{s-1})$
- $X_t \sim P(X_t \vert X_1,…,X_{t-1})$
- 과거의 모든 정보가 필요한 것은 아님. 맥락에 따라 잘 truncate하는게 중요
Sequential model 종류
1. 길이가 가변적인 데이터를 다룰 수 있는 모델
  - $X_t$는 t-1개, $X_{t+1}$은 t개의 정보 이용
2. AR(Autoregressive) Model (자기회귀 모델)
  - 고정된 길이 $\tau$ 만큼의 시퀀스만 이용하는 경우 (fix the past timespan) ex. Markov model
3. Latent AR Model (잠재자기회귀 모델)
  - 직전 정보와 잠재 변수(그 이전 데이터들) $H_t$ 고정된 두 개의 변수로 현재/미래 시점 예측. RNN이 여기에 속함
  - $H_t$: hidden state, summary of the past
Vanilla RNN
- MLP
  - $O = HW^{(2)} = b^{(2)}$
  - $H = \sigma(XW^{(1)} + b^{(1)})$
- 새로운 가중치 행렬 추가
  - $O_t = H_tW^{(2)} = b^{(2)}$
  - $H_t = \sigma(X_tW_X^{(1)} + H_{t-1}W_H^{(1)} + b^{(1)})$
  - (이전 시점의 잠재변수 $H_{t-1}$)
  - 역전파의 경우 BPTT (Backpropagation Through Time) 사용
- BPTT
  - $\partial w_h h_t = \partial w_h f(x_t, h_{t-1}, w_h) + \sum_{i=1}^{t-1} \Big( \prod_{j=i+1}^t \partial h_{j-1} f(x_j, h_{j-1}, w_h) \Big) \partial w_h f(x_i, h_{i-1}, w_h)$
  - 여기서 $\Big( \prod_{j=i+1}^t \partial h_{j-1} f(x_j, h_{j-1}, w_h) \Big)$ 이 항이, 시퀀스 길이가 길어질수록 불안정해짐 (vanishing/exploding gradient)
  - 이러한 기울기 소실을 해결하기위해 길이를 끊어주는 것이 필요 $\rightarrow$ truncated BPTT
- Short-term dependencies
  - Vanilla RNN의 경우, 먼 과거의 정보가 미래까지 살아남기 힘듦
  - 이를 해결하기 위해 Long Short-Term Memory(LSTM), GRU가 등장
LSTM

참고 자료
- Previous cell state $\rightarrow$ next cell state
  - 컨베이어벨트처럼, 유용한 정보를 계속 넘겨줌
- Previous hidden state (previous output), Input, Output (hidden state)
- Gates
  1. Forget gate
  - 어떤 정보를 버릴지 결정
    
    $f_t = \sigma (W_f \cdot \left[ h_{t-1}, x_t \right] + b_f)$
1. Input gate
  - 어떤 정보를 cell state에 저장할지 결정
    
    $i_t = \sigma (W_i \cdot \left[ h_{t-1},x_t \right] + b_i)$
    
    $\tilde C_t = tanh(W_C \cdot \left[ h_{t-1},x_t \right] + b_C)$
2. Update cell
  - Forget gate, Input gate의 값을 잘 취합해 cell state를 update
    
    $i_t = \sigma (W_i \cdot \left[ h_{t-1},x_t \right] + b_i)$
    
    $C_t = f_t * C_{t-1} + i_t * \tilde C_t$
3. Output gate
  - update 된 cell state로 output 출력
    
    $o_t = \sigma(W_o \cdot \left[ h_{t-1},x_t \right] + b_o)$
    
    $h_t = o_t * tanh(C_t)$
GRU (Gated Recurrent Unit)
- 2 gates
  - reset gate
  - update gate
- no cell state, just hidden state

2. Transformer

참고 자료

Problems of sequential models
- trim, omit, permute
Attention is all you need (NIPS, 2017)
- Transformer
  - attention 구조 사용
  - encoder가 n개의 단어를 한 번에 처리
  - encoders and decoders are stacked
  - n개의 단어 $\rightarrow$ enconder(self-attention $\rightarrow$ feed forward)
- Self-attention

1. RNN (Recurrent Neural Networks)

Sequence data

Sequential model

Sequential model 종류

Vanilla RNN

LSTM

Gates

GRU (Gated Recurrent Unit)

2. Transformer

Problems of sequential models

Attention is all you need (NIPS, 2017)