Peer Session Badge

My assignment: CNN Visualization

Object detection

  • Fundatmental image recognition tasks

    • Semantic segmentation
    • Instance segmentation
      • semantic + 같은 클래스 내의 개체들도 구분
    • Panoptic segmentation
      • instance + more
  • Object detection

    • Classification + Box localization
      • $(P_{class}, x_{min}, y_{min}, x_{max}, y_{max})$
      • application
        • autonomous driving
        • OCR (Optical character recognition)
  • 딥러닝 이전의 시도

    • Gradient-based detector
      • HOG (Histogram of Oriented Gradients)
        • 사람이 직접 가정한 feature들에 따라, 가로 edge나 세로 edge 등을 이용해 검출
      • SVM
    • Selective search (box proposal)
      • segmentation $\rightarrow$ merge similar regions $\rightarrow$ extract candidate boxes
  • Two-stage detector

    • R-CNN
      • Regions with CNN features
      • Image $\rightarrow$ extract region proposals $\rightarrow$ compute CNN features $\rightarrow$ classify regions
    • Fast R-CNN
      • Image $\rightarrow$ Conv feature feature map $\rightarrow$ RoI pooling $\rightarrow$ extract RoI (Region of Interest) $\rightarrow$ class and box prediction for each RoI
      • 18배 빠름
      • 여전히 selective search와 같은 별도 알고리즘을 사용했기 때문에 학습 성능에 한계
    • Faster R-CNN
      • 최초의 End-toend object detection by NEURAL region proposal
      • IoU (Intersection over Union) = $\frac{교집합}{합집합}$
      • Anchor boxes
        • region proposal에 사용될 다른 size의 box 후보군들
      • Region Proposal Network (RPN)
        • sliding window가 Conv feature map을 돌면서,
          1. object인지 아닌지 classification
          2. k개의 anchor box들을 regression해 더 정교한 위치(box size) 찾음
      • Non-Maximum Suppression (NMS)
        • 너무 많은 bounding box들이 제안되기 때문에 최적의 proposal만 남겨놓기 위한 알고리즘
        • Step 1. Select box with highest objectiveness score
        • Step 2. Compare IoU of this box with other boxes
        • Step 3. Remove bounding boxes if IoU $\geq$ 50%
        • Step 4. Move to next highest objectiveness score (repeat 2 - 4)
  • Single-stage detector

    • 정확도 조금 포기하고 굉장히 빠른 속도 내는 모델
    • No explicit RoI pooling
    • 바로 Classification / Box regression
    • YOLO (You Only Look Once)
      1. S x S grid on input
      2. two track
        • $\rightarrow$ bounding boxes + confidence
        • $\rightarrow$ class probability map
      3. $\rightarrow$ final detections
    • SSD (Single Shot multibox Detector)
      • multi-scale output들을 내는 여러 개의 feature map 사용
      • 스케일이 다른 object들도 잘 detect
  • Single-stage의 단점과 극복

    • Class imbalance problem
      • RoI pooling이 없다보니 모든 영역에 대한 loss 계산
      • 쓸모 없는 영역에 대한 계산이 너무 많음
      • $\rightarrow$ Focal loss
        • improved cross entropy loss
        • $FL(p_t) = -(1 - p_t)^{\gamma} log(p_t)$
    • RetinaNet
      • FPN (Feature Pyramid Networks) + class/box prediction branches
      • multi-scale로 feature map 뽑는 피라미드 구조
    • Detector with transformer

      • ViT (Vision Transformer) by Google
      • DeiT (Data-efficient image Transformer) by Facebook
      • DETR (DEtection TRansformer) by Facebook
        • Backbone: CNN, positional encoding
        • Encoder
        • Decoder
          • input: encoder 출력 + object queries
        • Prediction heads

CNN Visualization

  • Visualizing CNN

    • As debugging tools
    • Types of NN visualization
      • param explanation
        • filter visualization
        • factorization lens
      • feature analysis
        • t-SNE
        • gradient ascent
      • sensitivity analysis
        • saliency map
        • gradCAM
      • decomposition
        • deepLIFT
        • LRP
  • Analysis of model behaviors

    • Embedding feature analysis
      • in a feature space
      • dimentionality reduction
    • Activation invetigation
      • layer activation
      • maximally activating patches
      • class visualization
      • gradient ascent
  • Model decision explanation

    • Saliency test
      • occlusion map
        • prediction score depends on the location of mask
      • via Backprop
        • derivatives of a class score wrt input domain
    • Class activation mapping (CAM)
      • GAP (Global Average Pooling) layer instead of FC layer
      • grad-CAM