
Self-supervised Pre-training Models

  • GPT-1

    • special tokens (see the input-format sketch at the end of this section)
      • start token
      • extract token
      • delimiter token
    • Training
      • trained on BookCorpus (800M words), with a batch size of 32,000 words and the same fine-tuning learning rate (5e-5) for all tasks
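    GPT-1 fine-tunes every task through the same transformer by rewriting each input as one token sequence. A minimal sketch of that input transformation, with illustrative placeholder token strings (not GPT-1's actual vocabulary):

    ```python
    # Placeholder special tokens (illustrative names, not GPT-1's real vocabulary)
    START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

    def format_classification(text):
        # single-sequence tasks: <s> text <e>
        return [START, *text.split(), EXTRACT]

    def format_entailment(premise, hypothesis):
        # two-sequence tasks: <s> premise <$> hypothesis <e>;
        # the transformer output at the extract token feeds the task classifier
        return [START, *premise.split(), DELIM, *hypothesis.split(), EXTRACT]

    print(format_entailment("a man is sleeping", "a person is awake"))
    ```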
  • BERT

    • special tokens
      • [CLS] token
      • [SEP] token
    • Training
      • trained on BooksCorpus and Wikipedia (2,500M words), with a batch size of 128,000 words, and chooses a task-specific fine-tuning learning rate on the dev set
    • masked language model (MLM)
    • solves the cheating problem that can arise in a bi-directional language model, where each word could indirectly see itself
    • what percentage of words to mask (and predict) during training is an important hyperparameter (e.g., $k$ = 15%); see the masking sketch at the end of this section
    • Architecture
      • BERT BASE: L=12 (number of stacked self-attention blocks), H=768 (hidden-state dimension), A=12 (number of attention heads per layer)
      • BERT LARGE: L=24, H=1024, A=16
    • Input representation
      • WordPiece embeddings (subword units)
      • learned positional embeddings
      • packed sentence embedding (two sentences packed into one sequence with [SEP])
      • segment embedding (see the input-representation sketch at the end of this section)
    • Transfer learning examples
      • sentence pair classification task
      • single sentence classification task
      • question answering task
      • single sentence tagging task
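    A minimal sketch of the masking rule referenced above, assuming the 80/10/10 split from the BERT paper (of the selected 15%, 80% become [MASK], 10% a random token, 10% stay unchanged); illustrative only, not the actual pre-training code:

    ```python
    import random

    def mask_tokens(tokens, vocab, k=0.15):
        inputs, labels = [], []
        for tok in tokens:
            if random.random() < k:      # selected for prediction
                labels.append(tok)       # the model must recover this token
                r = random.random()
                if r < 0.8:
                    inputs.append("[MASK]")              # 80%: replace with [MASK]
                elif r < 0.9:
                    inputs.append(random.choice(vocab))  # 10%: random token
                else:
                    inputs.append(tok)                   # 10%: keep unchanged
            else:
                inputs.append(tok)
                labels.append(None)      # not predicted
        return inputs, labels

    vocab = ["my", "dog", "is", "hairy", "cute"]
    print(mask_tokens("my dog is hairy".split(), vocab))
    ```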
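    And a minimal sketch of the input representation: the input embedding is the element-wise sum of WordPiece token, segment, and learned position embeddings (toy sizes, not BERT's actual dimensions):

    ```python
    import torch
    import torch.nn as nn

    vocab_size, max_len, hidden = 100, 16, 8
    tok_emb = nn.Embedding(vocab_size, hidden)  # WordPiece token embeddings
    seg_emb = nn.Embedding(2, hidden)           # segment A / segment B
    pos_emb = nn.Embedding(max_len, hidden)     # learned positional embeddings

    # a packed sentence pair "[CLS] s1 [SEP] s2 [SEP]" as toy token ids
    token_ids = torch.tensor([[1, 7, 8, 2, 9, 10, 2]])
    segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)

    x = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)
    print(x.shape)  # torch.Size([1, 7, 8])
    ```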
  • GPT-2

    • trained on 40GB of high-quality web text (WebText, built from outbound Reddit links)
    • performs downstream tasks in a zero-shot setting (without any parameter or architecture modification)
    • The Natural Language Decathlon (decaNLP): multitask learning as question answering
      • casts diverse NLP tasks as QA tasks and trains them jointly in a unified format
    • byte pair encoding (BPE, applied at the byte level; see the sketch at the end of this section)
    • How to Build OpenAI’s GPT-2
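    A minimal sketch of the BPE merge loop (Sennrich et al. style, over characters for readability; GPT-2 actually applies BPE to byte sequences):

    ```python
    from collections import Counter

    def pair_counts(vocab):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        # fuse every occurrence of the pair into a single symbol
        a, b = pair
        return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in vocab.items()}

    # words as space-separated symbols, with corpus frequencies
    vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
    for _ in range(3):  # run three merges
        pairs = pair_counts(vocab)
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        print(best, vocab)
    ```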
  • GPT-3

    • scaling up language models greatly improves task-agnostic, few-shot performance
    • an autoregressive language model with 175 billion parameters, evaluated in the few-shot setting (see the prompting sketch below)
    • 96 attention layers, batch size of 3.2M tokens
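    A minimal sketch of few-shot prompting (in-context learning): task demonstrations go directly into the prompt, and the model is conditioned on them with no gradient updates. The translation format follows the example in the GPT-3 paper:

    ```python
    examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
    query = "peppermint"

    prompt = "Translate English to French:\n"
    prompt += "".join(f"{en} => {fr}\n" for en, fr in examples)
    prompt += f"{query} =>"
    print(prompt)  # the model simply continues this text to produce the answer
    ```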
  • ALBERT

    • A Lite BERT
    • Problem
      • memory limitation
      • training speed
    • Solution
      • factorized embedding parameterization (see the sketch at the end of this section)
        • decompose the $V \times H$ embedding matrix into $(V \times E) \times (E \times H)$
        • parameter count drops: $500 \cdot 100 = 50{,}000 \rightarrow 500 \cdot 15 + 15 \cdot 100 = 9{,}000$
      • cross-layer parameter sharing
        • about 1/3 the parameters of BERT, with only a marginal drop in performance
      • sentence order prediction (replaces next sentence prediction; improves downstream performance)
    • Paper: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (Lan et al., 2019)
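    A minimal sketch of the factorized embedding parameterization with the toy numbers above (V=500, H=100, E=15), using PyTorch layers only to make the parameter counts concrete:

    ```python
    import torch.nn as nn

    V, H, E = 500, 100, 15  # vocab size, hidden size, embedding size (toy values)

    bert_style = nn.Embedding(V, H)           # V*H = 50,000 parameters
    albert_style = nn.Sequential(
        nn.Embedding(V, E),                   # V*E = 7,500 parameters
        nn.Linear(E, H, bias=False),          # E*H = 1,500 parameters
    )

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(bert_style), count(albert_style))  # 50000 9000
    ```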
  • ELECTRA

    • Efficiently Learning an Encoder that Classifies Token Replacements Accurately
    • GAN-style idea (see the sketch at the end of this section)
      • a small BERT-like masked language model (generator) + ELECTRA (discriminator), which detects which tokens were replaced
    • Paper: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (Clark et al., 2020)
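    A conceptual sketch of replaced token detection (the example sentence is from the ELECTRA paper); it only shows how the discriminator's per-token labels are derived, not the actual training loop:

    ```python
    original  = ["the", "chef", "cooked", "the", "meal"]
    corrupted = ["the", "chef", "ate", "the", "meal"]  # generator's sample for a masked position

    # discriminator targets: 1 wherever the generator changed the token
    labels = [int(o != c) for o, c in zip(original, corrupted)]
    print(labels)  # [0, 0, 1, 0, 0]
    ```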
  • Light-weight models

    • compressing models while preserving performance
    • DistilBERT
      • distillation loss (see the sketch at the end of this section)
      • teacher model and student model
    • TinyBERT
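    A minimal sketch of a distillation loss in the DistilBERT spirit: the student matches the teacher's temperature-softened output distribution via KL divergence (shapes and the temperature are illustrative, and in practice this is combined with the usual MLM/task loss):

    ```python
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # KL divergence between temperature-softened distributions;
        # the T*T factor keeps gradient magnitudes comparable across temperatures
        soft_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_soft_student = F.log_softmax(student_logits / T, dim=-1)
        return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T

    student_logits = torch.randn(8, 30522)  # (batch, vocab-sized output)
    teacher_logits = torch.randn(8, 30522)
    print(distillation_loss(student_logits, teacher_logits))
    ```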
  • Fusing knowledge graphs into language models

    • ERNIE
      • Enhanced Language Representation with Informative Entities (2019)
    • KagNet
      • Knowledge-Aware Graph Networks for Commonsense Reasoning (2019)