Fine-tuning for Image Captioning and Image QA with Segmented Images
Hi Team,
Thank you for developing this excellent model!
I am fine-tuning the model for image captioning and image-based question answering (QA) on navigation-related tasks. My approach pairs each normal image with a corresponding segmented image so the model can reason about object positions relative to the user's viewpoint.
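To make the setup concrete, here is roughly how I am pairing the data; the field names and paths are my own placeholders, not anything from your repo:

```python
from dataclasses import dataclass
from PIL import Image

# Hypothetical pairing of each RGB frame with its segmentation map and a
# navigation-style QA annotation; all names here are placeholders.
@dataclass
class NavSample:
    rgb_path: str   # normal camera image
    seg_path: str   # segmented image from the same viewpoint
    question: str   # e.g. "Is the doorway to my left or right?"
    answer: str     # e.g. "To your left, past the table."

def load_sample(sample: NavSample):
    rgb = Image.open(sample.rgb_path).convert("RGB")
    seg = Image.open(sample.seg_path).convert("RGB")
    return rgb, seg, sample.question, sample.answer
```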
I have a few questions regarding the best way to implement this:
Handling Multiple Images:
Should I provide both the normal image and its segmented counterpart during training?
Does your processor handle positional embeddings for both images in a way that lets the model localize objects correctly? (A rough sketch of what I am currently trying is included below.)
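For reference, this is approximately what I have been trying. I am assuming the processor can take a list of images for a single prompt, as some multi-image vision-language processors do; the checkpoint name, file paths, and image-token syntax are placeholders, so please correct me if your processor expects a different format:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Placeholder checkpoint name and paths, not taken from your repo.
MODEL_ID = "your-org/your-vlm-checkpoint"
rgb = Image.open("frame_001.png").convert("RGB")
seg = Image.open("frame_001_seg.png").convert("RGB")

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Assumption: the processor accepts one prompt with two image slots, so the
# model sees the normal frame and its segmentation side by side; the exact
# image-token syntax presumably depends on your prompt/chat template.
prompt = "<image>\n<image>\nWhere is the nearest doorway relative to me?"
inputs = processor(images=[rgb, seg], text=prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```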
Alternative Approaches:
Would you recommend another strategy for incorporating segmented images to enhance spatial understanding?
Should I concatenate features from both images before passing them to the model (see the sketch after these questions), or is there a built-in mechanism that handles this kind of multi-image input effectively?
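To clarify what I mean by concatenating features, here is a minimal sketch with a generic shared vision encoder, a learned "segmentation" type embedding, and a projection into the language model's embedding space; the module and parameter names are invented for illustration and are not from your codebase:

```python
import torch
import torch.nn as nn

class DualImageFusion(nn.Module):
    """Encode the RGB frame and its segmentation separately, then concatenate
    the patch embeddings along the sequence dimension before the language model."""

    def __init__(self, vision_encoder: nn.Module, hidden_dim: int, lm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder  # shared encoder for both views
        # Learned marker added to segmentation tokens so the LM can tell the views apart.
        self.seg_type_embed = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.proj = nn.Linear(hidden_dim, lm_dim)  # project into the LM embedding space

    def forward(self, rgb_pixels: torch.Tensor, seg_pixels: torch.Tensor) -> torch.Tensor:
        rgb_tokens = self.vision_encoder(rgb_pixels)                  # (B, N, hidden_dim)
        seg_tokens = self.vision_encoder(seg_pixels) + self.seg_type_embed
        fused = torch.cat([rgb_tokens, seg_tokens], dim=1)            # (B, 2N, hidden_dim)
        return self.proj(fused)                                       # visual context for the LM
```

I am unsure whether this kind of explicit fusion is necessary, or whether simply passing both images through your processor (as in the earlier sketch) already gives the model enough positional signal.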
I would love to hear your thoughts on the best way to train the model for spatially aware navigation tasks.
Looking forward to your insights!