Multi-modal: Captioning and speaking
Multi-modal learning
- Modality
- Vision, Audio, Taste, Texture, Odor, Social, Depth, Force, Text, …
- Challenge
- Different representations between modalities
- Feature spaces are unbalanced across modalities
- The model easily becomes biased toward a specific modality (it learns to pay less attention to data that is hard to predict)
Task (1) - Visual data & Text
- Text embedding
- word2vec (skip-gram model)
- Joint embedding
- text data $\rightarrow$ word counts $\rightarrow$ replicated softmax $\rightarrow$ joint embedding
- image data $\rightarrow$ real-valued feature $\rightarrow$ Gaussian model $\rightarrow$ joint embedding
- applications
- image tagging
- image & food recipe retrieval
- Metric learning
- joint visual-semantic embedding space
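
To make the metric-learning idea concrete, here is a minimal sketch of a margin-based ranking loss over a joint visual-semantic embedding space. These notes do not specify the loss, so the bidirectional hinge form, the cosine similarity, the function names (`cosine`, `ranking_loss`), and the default `margin=0.2` are all illustrative assumptions. The embeddings are assumed to have already been produced by some image encoder and text encoder, with pair `i` of each list being a matching image and text.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors in the joint space."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def ranking_loss(image_embs, text_embs, margin=0.2):
    """Hinge ranking loss (assumed form, not taken from the notes).

    image_embs[i] and text_embs[i] are a matching pair; every other
    combination is treated as a mismatched (negative) pair.
    """
    n = len(image_embs)
    loss = 0.0
    for i in range(n):
        pos = cosine(image_embs[i], text_embs[i])
        for j in range(n):
            if j == i:
                continue
            # Image i should be closer to its own text than to text j.
            loss += max(0.0, margin - pos + cosine(image_embs[i], text_embs[j]))
            # Text i should be closer to its own image than to image j.
            loss += max(0.0, margin - pos + cosine(image_embs[j], text_embs[i]))
    return loss / n


# Toy 2-D embeddings: aligned pairs give zero loss, swapped pairs are penalized.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = ranking_loss(images, aligned_texts)
swapped_loss = ranking_loss(images, swapped_texts)
print(aligned_loss, swapped_loss)
```

Minimizing a loss of this kind pulls matching image/text pairs together and pushes mismatched pairs at least `margin` apart, which is what makes nearest-neighbor retrieval (e.g. image tagging or image-to-recipe lookup) possible in the shared space.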