ariG23498/gemma-3-4b-pt-object-detection

Hello,
I'm interested in fine-tuning Gemma 3 models on tasks that involve bounding box localization (such as grounded generation), so I'm very interested in your work.
How much training time did it take before the model generated accurate bounding boxes? I've been trying with other VLMs that do not have that capability initially, and fine-tuning them for object detection seems hard.
Do you also have performance metrics comparing the two methods, with and without adding special coordinate tokens?
Thank you!