CurlingNet: Compositional Learning between Images and Text for Fashion IQ Data
Abstract
We present CurlingNet, an approach that measures the semantic distance between the composition of an image-text query and candidate images in an embedding space. To learn an effective image-text composition for data in the fashion domain, our model introduces two key components. First, the Delivery component performs the transition of a source image in the embedding space. Second, the Sweeping component emphasizes query-related components of fashion images in the embedding space, using a channel-wise gating mechanism. Our single model outperforms previous state-of-the-art image-text composition models, including TIRG and FiLM. We participated in the first Fashion IQ challenge at ICCV 2019, where an ensemble of our models achieved one of the best performances.
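The abstract gives no implementation details, but the two components can be illustrated with a short PyTorch-style sketch. Everything below (the module name `CurlingComposition`, the residual form of Delivery, and the layer sizes) is an assumption made for illustration; only the existence of a transition step and a channel-wise gate comes from the text above.

```python
import torch
import torch.nn as nn

class CurlingComposition(nn.Module):
    """Hypothetical sketch of the two components named in the abstract.

    'Delivery' shifts the source-image embedding conditioned on the text
    query; 'Sweeping' applies a channel-wise gate that emphasizes
    query-related channels. This is not the authors' code.
    """

    def __init__(self, dim: int = 512):
        super().__init__()
        # Delivery: predicts a transition (modeled here as a residual
        # shift) of the source image embedding from joint image-text
        # features.
        self.delivery = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Sweeping: channel-wise gate in [0, 1] that re-weights
        # embedding channels according to the text query.
        self.sweeping = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.Sigmoid()
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([img_emb, txt_emb], dim=-1)
        shifted = img_emb + self.delivery(joint)  # transition in embedding space
        gated = self.sweeping(joint) * shifted    # emphasize query-related channels
        return gated
```

At retrieval time, the composed query embedding would presumably be compared against candidate image embeddings, e.g. by cosine similarity, to rank target fashion images.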
- Publication: arXiv e-prints
- Pub Date: March 2020
- DOI: 10.48550/arXiv.2003.12299
- arXiv: arXiv:2003.12299
- Bibcode: 2020arXiv200312299Y
- Keywords: Computer Science - Computer Vision and Pattern Recognition
- E-Print: 4 pages, 4 figures, ICCV 2019 Linguistics Meets Image and Video Retrieval workshop, Fashion IQ challenge