Boostcamp AI Tech (Day 031)
My assignment: VGGNet
Computer vision
“Inverse graphics”
- Graphics
- Rendering
- 3D model $\rightarrow$ image
- Computer vision
- Inverse rendering
- Image information $\rightarrow$ underlying essence (e.g., 3D model)
Visual perception
- Perception
- (input, output) data
- visual world - sensing device - interpreting device - interpretation
- Class of visual perception
- Color
- Motion
- 3D
- Semantic-level
- Social (emotion)
- Visuomotor
- Understanding human visual perception, etc
Computer vision
- Machine learning
- Input - feature extraction - classification - output
- Deep learning
- End-to-End
- Input - feature extraction & classification - output
Image classification
- Input x $-$classifier$\rightarrow$ output f(x)
- x: image
- f(x): category level
- k-NN (Nearest Neighbors)
- Search all data
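- Time complexity: O(N) per query for N stored samples; unbounded ($N \rightarrow \infty$) if all the data in the world were stored
- Memory complexity: O(N) for N stored samples; unbounded ($N \rightarrow \infty$) in the same limit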
- Locally connected NN
- (cf. Fully connected NN)
- Local feature learning
- Parameter sharing
- Backbone (base) of many CV tasks
- Image-level classification
- Classification + regression
- Pixel-level classification
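The k-NN classifier noted above can be sketched in a few lines. This is a minimal illustration, not a practical image classifier: the function name `knn_predict`, the squared Euclidean distance, and the toy 2-D points standing in for images are all assumptions made for the example. It shows why every query has to scan the whole stored dataset.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    Every call scans all stored samples, so time and memory both grow
    linearly with the size of the training set.
    """
    # Squared Euclidean distance from the query to every stored sample.
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_x, train_y)
    )
    # Majority vote over the labels of the k closest samples.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["cat", "cat", "cat", "dog", "dog", "dog"]
```

Calling `knn_predict(train_x, train_y, (0.5, 0.5))` votes among the three points near the origin, while a query near `(5.5, 5.5)` votes among the other cluster.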
CNN architectures for Image classification
- See the daily notes for reference
- AlexNet
- Conv - Pool - Conv - Pool - FC - FC pattern (simplified; the actual network has 5 conv layers and 3 FC layers)
- LRN (Local Response Normalization)
- No longer used (replaced by Batch Norm)
- Normalizes the brightness (contrast) in the activation map
- Receptive field
- The region in the INPUT SPACE that a particular CNN feature is looking at
- AlexNet uses an 11 $\times$ 11 conv filter to get a bigger receptive field (such large filters are no longer used)
- VGGNet (ICLR 2015)
- 16, 19 layers
- Simpler architecture
- No LRN
- 3 $\times$ 3 conv filter blocks
- many small conv layers instead of a small number of large conv filters
- keeping receptive field sizes large enough
- deeper with more non-linearities
- fewer parameters
- 2 $\times$ 2 max pooling
- Better performance & generalization
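The VGGNet points about stacked 3 $\times$ 3 filters (large receptive field, fewer parameters) can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions with no pooling in between, equal input and output channel counts, and ignores biases; the helper names and the channel count of 64 are illustrative choices, not taken from the notes.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field (one side, in input pixels) of stride-1 convs in sequence."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each stride-1 layer widens the field by (k - 1)
    return rf


def conv_weights(kernel_size, channels, layers=1):
    """Weight count (biases ignored) for `layers` convs with `channels` in and out."""
    return layers * kernel_size * kernel_size * channels * channels


# Three 3x3 layers see the same 7x7 input region as a single 7x7 layer...
rf_small_stack = stacked_receptive_field([3, 3, 3])
rf_single_large = stacked_receptive_field([7])

# ...but with fewer weights (27 * C^2 versus 49 * C^2).
weights_small_stack = conv_weights(3, channels=64, layers=3)
weights_single_large = conv_weights(7, channels=64, layers=1)
```

The stack also applies a non-linearity after each of its three layers, which is the "deeper with more non-linearities" point in the list above.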
Annotation data efficient learning
Data augmentation
- Learning representation of data
- Dataset is always biased
- Training data is sparse samples of real data
- Augmentation
- Fill more space and close the gap (make a dataset denser)
- Methods
- crop
- affine transformation (shear)
- brightness
- perspective
- rotate
- flip
- CutMix
- RandAugment
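A few of the augmentation methods listed above (flip, brightness, crop, CutMix) can be sketched on a grayscale image stored as a nested list. This is a simplified illustration: the function names are hypothetical, pixel values are assumed to lie in 0..255, and the CutMix patch position is passed in explicitly rather than sampled. Real pipelines would normally use an image library instead.

```python
import random


def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def adjust_brightness(image, delta):
    """Add `delta` to every pixel, clamping to the 0..255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in image]


def random_crop(image, size, rng):
    """Cut out a random `size` x `size` window."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def cutmix(image_a, image_b, top, left, height, width):
    """Paste a patch of image_b into image_a.

    Returns the mixed image and lam, the fraction of pixels still coming
    from image_a; the mixed label is lam * label_a + (1 - lam) * label_b.
    """
    mixed = [row[:] for row in image_a]
    for r in range(top, top + height):
        mixed[r][left:left + width] = image_b[r][left:left + width]
    lam = 1 - (height * width) / (len(image_a) * len(image_a[0]))
    return mixed, lam
```

For example, pasting a 2 $\times$ 2 patch into a 4 $\times$ 4 image replaces a quarter of the pixels, so the mixed label keeps a weight of 0.75 on the first image's class.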
Leveraging pre-trained information
- Needs
- supervised learning requires a very large-scale dataset
- annotating data is very expensive + quality not ensured
- Motivation
- similar datasets share common information
- knowledge learned from one dataset can be applied to other datasets
- Transfer learning
- Approach 1
- given a pre-trained model
- chop off the final FC layer and train only a new FC layer (the remaining layers stay frozen)
- Approach 2
- given a pre-trained model
- replace the final FC layer and fine-tune the whole model
- low learning rate in Conv layers
- high learning rate in FC layers
- Knowledge distillation
- Distill the knowledge of a trained model $\rightarrow$ another, smaller model
- Used for Model compression
- Used for pseudo-labeling
- Teacher-Student network
- Reference material
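The two transfer-learning approaches above differ only in which parameters are updated and how fast. The toy sketch below makes that concrete with a deliberately tiny stand-in model, `y = w_head * (w_backbone * x)`, where `w_backbone` plays the role of the pre-trained Conv layers and `w_head` the new FC layer. The model, the function names, the data, and the learning rates are all assumptions for illustration; a real setup would use a deep-learning framework's optimizer with separate learning rates per layer group.

```python
def fine_tune(w_backbone, w_head, data, lr_backbone, lr_head, epochs=200):
    """SGD on squared error for the toy model y = w_head * (w_backbone * x)."""
    for _ in range(epochs):
        for x, y in data:
            feature = w_backbone * x          # "Conv" part
            error = w_head * feature - y      # "FC" part minus target
            grad_head = 2 * error * feature
            grad_backbone = 2 * error * w_head * x
            w_head -= lr_head * grad_head
            w_backbone -= lr_backbone * grad_backbone
    return w_backbone, w_head


def mse(w_backbone, w_head, data):
    """Mean squared error of the toy model on `data`."""
    return sum((w_head * w_backbone * x - y) ** 2 for x, y in data) / len(data)


# Hypothetical new task: y = 3x. The backbone is "pre-trained" (2.0);
# the new head starts from zero.
data = [(-1.0, -3.0), (-0.5, -1.5), (0.5, 1.5), (1.0, 3.0)]

# Approach 1: backbone frozen (learning rate 0), only the new head is trained.
frozen_backbone, head_1 = fine_tune(2.0, 0.0, data, lr_backbone=0.0, lr_head=0.05)

# Approach 2: whole model trained, low rate for the backbone, high for the head.
tuned_backbone, head_2 = fine_tune(2.0, 0.0, data, lr_backbone=0.001, lr_head=0.05)
```

In the first call the backbone weight stays exactly at its pre-trained value; in the second it drifts only slightly, because its learning rate is much lower than the head's.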
- Component methods
- Augmentation
- Knowledge distillation (teacher-student)
- Semi-supervised learning (pseudo-label)
- Procedure
- T: Teacher model, S: Student model
- Train the initial T on labeled data
- Use T to pseudo-label the unlabeled data
- Train S on labeled + pseudo-labeled data (+ augmentation)
- Set S as the new T, and set a new, larger model as S
- Repeat the process above
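The teacher-student procedure above can be sketched as a loop. This is a toy illustration under stated assumptions: the "model" is a 1-D nearest-centroid classifier, augmentation is small random noise added to the inputs, and the step of choosing a larger student each round is omitted because the toy model has no notion of size. All function names and data are hypothetical.

```python
import random


def train_centroids(points, labels):
    """Fit a nearest-centroid classifier: one mean per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def self_training(labeled, unlabeled, rounds=3, noise=0.1, seed=0):
    rng = random.Random(seed)
    # Train the initial teacher on labeled data only.
    teacher = train_centroids(*zip(*labeled))
    for _ in range(rounds):
        # The teacher pseudo-labels the unlabeled data.
        pseudo = [(x, predict(teacher, x)) for x in unlabeled]
        # The student trains on labeled + pseudo-labeled data, plus noisy copies.
        train_set = list(labeled) + pseudo
        augmented = [(x + rng.uniform(-noise, noise), y) for x, y in train_set]
        student = train_centroids(*zip(*(train_set + augmented)))
        # The student becomes the next teacher (same model size in this toy).
        teacher = student
    return teacher


labeled = [(0.0, 0), (4.0, 1)]
unlabeled = [0.5, 1.0, 1.5, 3.0, 3.5, 4.5]
final_model = self_training(labeled, unlabeled)
```

After a few rounds the centroids are estimated from eight real points per round instead of two, which is the benefit the pseudo-labels are meant to provide.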