Boostcamp AI Tech (Day 031)

My assignment: VGGNet

Computer vision

“Inverse graphics”
- Graphics
  - Rendering
  - 3D 모델 $\rightarrow$ 영상
- Computer vision
  - Inverse rendering
  - 영상 정보 $\rightarrow$ 본질 (ex. 3D 모델)
Visual perception
- Perception
  - (input, output) data
  - visual world - sensing device - interpreting device - interpretation
- Class of visual perception
  - Color
  - Motion
  - 3D
  - Semantic-level
  - Social (emotion)
  - Visuomotor
  - Understanding human visual perception, etc
Computer vision
- Machine learning
  - Input - feature extraction - classification - output
- Deep learning
  - End-to-End
    - Input - feature extraction & classification - output

CNN

Image classification
- Input x $-$classifier$\rightarrow$ output f(x)
  - x: image
  - f(x): category level
- k-NN (Nearest Neighbors)
  - Search all data
  - Time complexity: O($\infty$)
  - Memory complexity: O($\infty$)
- CNN
  - Locally connected NN
    - (cf. Fully connected NN)
    - Local feature learning
    - Parameter sharing
  - Backbone (base) of many CV tasks
    - Image-level classification
    - Classification + regression
    - Pixel-level classification
CNN architectures for Image classification
- 참고 데일리노트
- AlexNet
  - Conv - Pool - Conv - Pool - FC - FC
  - LRN (Local Response Normalization)
    - 더 이상 사용되지 않음 (Batch Norm에게 대체됨)
    - activation map에서 명암을 normalization
  - Receptive field
    - The region in the INPUT SPACE that a particular CNN feature is looking at
    - AlexNet uses 11 $\times$ 11 conv to get bigger receptive field (더 이상 사용되지 않음)
- VGGNet (ICLR 2015)
  - 16, 19 layers
  - Simpler architecture
    - No LRM
    - 3 $\times$ 3 conv filters blocks
      - many small conv layers instead of a small number of large conv filters
      - keeping receptive field sizes large enough
      - deeper with more non-linearities
      - fewer parameters
    - 2 $\times$ 2 max pooling
  - Better performance & generalization

Annotation data efficient learning

Data augmenttation
- Learning representation of data
  - Dataset is always biased
  - Training data is sparse samples of real data
- Augmentation
  - Fill more space and close the gap (make a dataset denser)
- Methods
  - crop
  - affine transformation (shear)
  - brightness
  - perspective
  - rotate
  - flip
  - CutMix
  - RandAugment
Leveraging pre-trained information
- Needs
  - supervised learning requires a very large-scale dataset
  - annotating data is very expensive + quality not ensured
- Motivation
  - similar datasets share common information
  - knowledge learned from one dataset can be applied to other datasets
- Transfer learning
  - Approch 1
    - given pre-trained model
    - chop off FC layer and re-train new FC layer
  - Approch 2
    - given pre-trained model
    - chop off FC layer and re-train the whole model
    - low learning rate in Conv layers
    - high learning rate in FC layers
- Knowledge distillation
  - Distillate knowledge of a trained model $\rightarrow$ another smaller model
  - Used for Model compression
  - Used for pseudo-labeling
  - Teacher-Studnent network
  - 참고 자료
Self-training
- containing methods
  - Augmentation
  - Knowledge distillation (teacher-student)
  - Semi-supervised learning (peudo-label)
- 순서
  
  T: Teacher model
  S: Student model
  1. 초기 T를 labeled data로 학습
  2. T를 이용해 unlabeled data를 pseudo-label
  3. S를 labeled + unlabeled data (+augmentation)로 학습
  4. S를 새로운 T로 설정, 더 큰 새 모델을 S로 설정
  5. 위 과정을 반복

Computer vision

“Inverse graphics”

Visual perception

Computer vision

CNN

Image classification

CNN architectures for Image classification

Annotation data efficient learning

Data augmenttation

Leveraging pre-trained information

Self-training