Instruction-based Image Manipulation by Watching How Things Move

This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics (such as non-rigid subject motion and complex camera movements) that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of complex instruction-based manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.

Publication:

arXiv e-prints

Pub Date:

December 2024

arXiv:

arXiv:2412.12087

Bibcode:

2024arXiv241212087C

Keywords:

Computer Science - Computer Vision and Pattern Recognition

E-Print:

Project page: https://ljzycmd.github.io/projects/InstructMove/
