Boostcamp AI Tech (Day 016)

My assignment: NLP 전처리

NLP (Natural Language Processing)

Step 1. Constructing the vocabulary containing unique words
- “John really really loves this movie”, “Jane really loves this song”
- Voca: {“John”, “really”, “loves”, “this”, “moive”, “Jane”, “song”}
Step 2. Encoding unique words to one-hot vectors
- John: $[1,0,0,0,0,0,0]$, really: $[0,1,0,0,0,0,0]$, …
- 모든 단어 사이 유클리드 거리: $\sqrt{2}$
- 모든 단어 사이 코사인 유사도(cosine similarity, 내적값): 0 (모두가 동일한 관계)
- Sentence/document can be represented as the sum
  - John + really + really + loves: $[1,2,1,0,0,0,0]$
NaiveBayes classifier

$C_{MAP} = \underset {c \in C} {\arg max} P(c \vert d) = \underset {c \in C} {\arg max} \frac{P(d \vert c)P(c)}{P(d)}$

$= \underset {c \in C} {\arg max} P(d \vert c)P(c)$ (drop the denominator)
- $d$: document, $c$: class
- MAP: maximum a posteriori (most likely class)
- $P(d \vert c)$: 특정 카테고리 c가 고정되었을 때, 문서 d가 나타날 확률
  
  $= P(w_1, w_2, \dots, w_n \vert c)$ : 단어1 ~ 단어n의 동시 사건 확률
- $P(w_1, w_2, \dots, w_n \vert c)P(c) = P(c) \prod_{w_i \in W} P(w_i \vert c)$
  
  (c가 고정일 때 단어들의 확률이 서로 독립이라 가정)

Definition
- Express a word as a vector
- Similar words have similar vertor representations $\rightarrow$ short distance (ex. ‘cat’ and ‘kitty’)
Word2Vec
- Word2Vec visualization
- 가정: words in similar context (adjacent) will have similar meanings
- 학습 방법: “cat”이란 단어가 주어진다면, 그 주변 단어들을 숨긴 채 예측함으로써, 주변에 나타나는 단어들의 확률 분포를 예측(skip-gram)
  - CBOW (Continuous Bag Of Words): 주변단어들이 주어지면 중심 단어 예측
  - Skip-gram: 중심 단어가 주어지면 주변 단어들 예측
- 학습 결과
  - vec(queen) - vec(king) = vec(woman) - vec(man)
  - 한국-서울+도쿄 = 일본
  - 아이폰-스마트폰+노트북 = 아이패드
- Intrusion task
  - 주어진 단어들 중 가장 (평균에서) 거리가 먼 것 detect
- Word similarity, machine translation, PoS tagging, …
- 논문 (NIPS’13)
GloVe (Global Vectors for word representation)
- 각 단어쌍에 대해서 한 윈도우 내에서 총 몇 번 동시에 등장했는지 사전에 미리 계산
- 가충치 행렬의 내적값이 위의 (로그)값(ground_truth로 사용)과 가까워지도록 학습
  
  $J(\theta) = \frac{1}{2} \sum_{i,j=1}^W f(P_{ij})(u_i^T v_j - \log P_{ij})^2$
- 논문 (EMNLP’14)