Self-supervised Pre-training Models
-
GPT-1
- special tokens (see the input-formatting sketch after this block)
- start token
- extract token
- delimiter token
- Training
- trained on BookCorpus (800M words); batch size of 32,000 words; same learning rate (5e-5) for all fine-tuning experiments
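A minimal sketch of how GPT-1-style special tokens frame a fine-tuning input; the literal strings `<s>`, `$`, and `<e>` are illustrative placeholders for the start, delimiter, and extract tokens, not the exact vocabulary entries from the paper.

```python
# Illustrative sketch of GPT-1-style input formatting for fine-tuning.
# The token strings (<s>, $, <e>) are placeholders for the start / delimiter /
# extract tokens, not the actual vocabulary entries used by GPT-1.

def format_entailment(premise_tokens, hypothesis_tokens):
    """Pack a sentence pair into one sequence: start + premise + delim + hypothesis + extract.
    The hidden state at the extract token is fed to the task-specific classifier."""
    return ["<s>"] + premise_tokens + ["$"] + hypothesis_tokens + ["<e>"]

def format_classification(text_tokens):
    """Single-sentence tasks only need the start and extract tokens."""
    return ["<s>"] + text_tokens + ["<e>"]

if __name__ == "__main__":
    print(format_entailment(["a", "man", "is", "running"], ["someone", "is", "moving"]))
    print(format_classification(["this", "movie", "was", "great"]))
```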
-
BERT
- special tokens
- Training
- trained on BookCorpus (800M words) and Wikipedia (2,500M words); batch size of 128,000 words; chooses a task-specific fine-tuning learning rate
- masked language model
- resolves the cheating problem that can occur with a bi-directional language model (where each word could indirectly see itself)
- the percentage of words to mask (and predict) during training is an important hyperparameter (e.g. $k$ = 15%); see the masking sketch after this block
- Architecture
- BERT-BASE: L=12 (number of stacked self-attention blocks), H=768 (hidden size), A=12 (number of attention heads per layer)
- BERT-LARGE: L=24, H=1024, A=16
- Input representation (see the embedding-sum sketch after this block)
- WordPiece embeddings (subword units)
- learned positional embedding
- packed sentence embedding ([SEP] token separating sentences)
- segment embedding
- Transfer learning examples
- sentence pair classification task
- single sentence classification task
- question answering task
- single sentence tagging task
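A minimal sketch of the masked-LM corruption rule described above, assuming $k$ = 15% and the 80/10/10 split (mask / random token / keep) from the BERT paper; the toy vocabulary and function name are illustrative.

```python
import random

# Sketch of BERT's masked-LM corruption rule, assuming k = 15%:
# of the selected positions, 80% become [MASK], 10% a random token, 10% stay unchanged.
def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[MASK]"):
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: keep the original token
    return inputs, labels

if __name__ == "__main__":
    vocab = ["my", "dog", "is", "cute", "he", "likes", "play", "##ing"]
    print(mask_tokens(["my", "dog", "is", "cute"], vocab))
```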
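A minimal sketch of the input representation above: the model input is the element-wise sum of WordPiece token embeddings, learned position embeddings, and segment embeddings. The sizes follow BERT-BASE; the class name is an illustrative placeholder.

```python
import torch
import torch.nn as nn

# Sketch of BERT's input representation: token + position + segment embeddings, summed.
# Sizes (vocab 30522, hidden 768, max length 512) follow BERT-BASE.
class BertInputEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.position = nn.Embedding(max_len, hidden)    # learned, not sinusoidal
        self.segment = nn.Embedding(n_segments, hidden)  # sentence A vs. sentence B

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.token(token_ids) + self.position(positions) + self.segment(segment_ids)

if __name__ == "__main__":
    emb = BertInputEmbedding()
    token_ids = torch.randint(0, 30522, (1, 8))
    segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])
    print(emb(token_ids, segment_ids).shape)  # torch.Size([1, 8, 768])
```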
-
GPT-2
- trained on 40 GB of high-quality web text (WebText, pages linked from well-upvoted Reddit posts)
- performs down-stream tasks in a zero-shot setting (without any fine-tuning or architecture modification)
- The Natural Language Decathlon (decaNLP): multitask learning as question answering
- recasts many different tasks as QA tasks so they can be trained jointly
- byte pair encoding (BPE) for tokenization; see the merge sketch after this block
- How to Build OpenAI’s GPT-2
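A minimal sketch of BPE merge learning. GPT-2 actually uses a byte-level variant, but the core rule is the same: repeatedly merge the most frequent adjacent symbol pair. The toy corpus below is illustrative.

```python
from collections import Counter

# Character-level BPE merge learning: repeatedly fuse the most frequent adjacent pair.
def learn_bpe(corpus_words, num_merges):
    vocab = Counter(tuple(word) for word in corpus_words)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent symbol pair
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges

if __name__ == "__main__":
    words = ["low", "low", "lower", "newest", "newest", "widest"]
    print(learn_bpe(words, 5))
```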
-
GPT-3
- scaling up language models greatly improves task-agnostic, few-shot performance
- an autoregressive model with 175 billion parameters, evaluated in a few-shot setting (see the prompting sketch after this block)
- 96 attention layers, batch size 3.2M
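A minimal sketch of few-shot evaluation as used with GPT-3: no gradient updates, just a handful of labeled demonstrations placed in the prompt before the query. The prompt template here is an illustrative choice, not the exact format from the paper.

```python
# Sketch of few-shot prompting: instead of fine-tuning, k labeled examples are
# placed in the context and the model completes the final query.
def build_few_shot_prompt(task_description, examples, query):
    lines = [task_description]
    for text, label in examples:               # k in-context demonstrations
        lines.append(f"Input: {text}\nOutput: {label}")
    lines.append(f"Input: {query}\nOutput:")   # the model fills in the answer
    return "\n\n".join(lines)

if __name__ == "__main__":
    prompt = build_few_shot_prompt(
        "Classify the sentiment as positive or negative.",
        [("I loved this movie.", "positive"), ("What a waste of time.", "negative")],
        "The plot was dull and predictable.",
    )
    print(prompt)
```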
-
ALBERT
- A Lite BERT (for self-supervised learning of language representations)
- Problem
- memory limitation
- training speed
- Solution
- factorized embedding parameterization (see the sketch after this block)
- $V \times H \rightarrow (V \times E) \times (E \times H)$, so the parameter count drops from $VH$ to $VE + EH$
- example of the reduction: $500 \cdot 100 \rightarrow 500 \cdot 15 + 15 \cdot 100$ (with $V = 500$, $H = 100$, $E = 15$)
- cross-layer parameter sharing
- about 1/3 the parameters of BERT, with only a marginal drop in performance
- sentence order prediction (replaces next sentence prediction, for better performance)
- Paper
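A minimal sketch of the factorized embedding parameterization above, using the toy numbers $V=500$, $H=100$, $E=15$; the class name is an illustrative placeholder.

```python
import torch
import torch.nn as nn

# ALBERT-style factorized embedding: instead of one V x H embedding table,
# use a V x E table followed by an E x H projection (E << H).
class FactorizedEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)              # V x E
        self.project = nn.Linear(embed_dim, hidden_dim, bias=False)   # E x H

    def forward(self, token_ids):
        return self.project(self.embed(token_ids))                    # (..., H)

if __name__ == "__main__":
    fe = FactorizedEmbedding(vocab_size=500, embed_dim=15, hidden_dim=100)
    n_params = sum(p.numel() for p in fe.parameters())
    print(n_params)  # 500*15 + 15*100 = 9000, versus 500*100 = 50000 for a full V x H table
```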
-
ELECTRA
- Efficiently Learning an Encoder that Classifies Token Replacements Accurately
- GAN-style idea: replaced token detection (see the sketch after this block)
- a small BERT-like masked LM (generator) + ELECTRA (discriminator)
- Paper
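A conceptual sketch of ELECTRA's replaced-token-detection objective. The tiny generator and discriminator below are placeholder modules standing in for the real transformer architectures, and greedy sampling is used only to keep the example short.

```python
import torch
import torch.nn as nn

# Toy stand-ins: a generator that maps tokens to vocabulary logits (a masked LM),
# and a discriminator that maps each token to a single "was this replaced?" logit.
VOCAB, HIDDEN = 1000, 64
generator = nn.Sequential(nn.Embedding(VOCAB, HIDDEN), nn.Linear(HIDDEN, VOCAB))
discriminator = nn.Sequential(nn.Embedding(VOCAB, HIDDEN), nn.Linear(HIDDEN, 1))

def electra_step(input_ids, mask_positions, mask_token_id=0):
    masked_ids = torch.where(mask_positions, torch.full_like(input_ids, mask_token_id), input_ids)

    # 1) the generator proposes tokens at the masked positions (greedy sample for simplicity)
    sampled = generator(masked_ids).argmax(dim=-1)
    corrupted = torch.where(mask_positions, sampled, input_ids)

    # 2) the discriminator predicts, for every token, whether it was replaced
    targets = (corrupted != input_ids).float()
    logits = discriminator(corrupted).squeeze(-1)
    return nn.functional.binary_cross_entropy_with_logits(logits, targets)

if __name__ == "__main__":
    ids = torch.randint(1, VOCAB, (2, 8))
    mask = torch.rand(2, 8) < 0.15
    print(electra_step(ids, mask))
```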
-
Light-weight models
- lightweight models that keep most of the original performance
- DistilBERT
- distillation loss (see the loss sketch after this block)
- teacher model and student model
- TinyBERT
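A minimal sketch of the distillation idea behind DistilBERT: the student is trained to match the teacher's temperature-softened output distribution in addition to the usual hard-label loss. The loss weights and temperature below are illustrative, and the paper's cosine-embedding term is omitted.

```python
import torch
import torch.nn.functional as F

# Knowledge distillation loss: soft-target KL term (student mimics the teacher's
# temperature-scaled distribution) plus the ordinary hard-label cross-entropy.
def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)                       # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

if __name__ == "__main__":
    student = torch.randn(4, 10)
    teacher = torch.randn(4, 10)
    labels = torch.randint(0, 10, (4,))
    print(distillation_loss(student, teacher, labels))
```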
-
Fusing knowledge graphs into language models
- ERNIE
- Enhanced Language Representation with Informative Entities (2019)
- KagNet
- Knowledge-Aware Graph Networks for Commonsense Reasoning (2019)