fastvlm-0.5b-captions
Model Details
fastvlm-0.5b-captions is a fine-tuned version of FastVLM-0.5B Stage 3 from the official FastVLM repository, built for efficient structured image captioning on mobile devices. It combines LoRA fine-tuning, 4-bit quantization, and a MobileCLIP-S0 vision tower, substantially reducing RAM usage for embedded inference.
Model Description
- Developed by: Riddhiman Rana (fine-tuning and optimizations)
- Model type: VLM (Vision-Language Model)
- Original model authors: Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
- Language(s) (NLP): English
- License (base model): apple-amlr
- Finetuned from model: apple/ml-fastvlm, specifically FastVLM-0.5B Stage 3
Model Sources
- Base Model Repository: https://github.com/apple/ml-fastvlm
- Fine-tuning Training Dataset: https://huggingface.co/datasets/riddhimanrana/coco-fastvlm-2k-val2017
- FastVLM Paper (CVPR 2025): https://www.arxiv.org/abs/2412.13303
Uses
*Demo on iPhone 13 Pro Max*
Direct Use
- Generating highly detailed, structured captions for images on mobile and embedded devices.
- Ideal for low-resource environments such as iPhones, MacBooks, and potentially other Apple Silicon devices via MLX and CoreML.
- Tested on iPhone 12, 13 Pro Max, and 14, with RAM usage below 1 GB and time to first token (TTFT) as low as ~600 ms on higher-end iPhones.
Out-of-Scope Use
- This is not designed for general-purpose multimodal reasoning beyond descriptive image captioning.
- Not suitable for text-only language tasks.
Bias, Risks, and Limitations
- Dataset was limited to 2,000 images from COCO 2017 Validation – captions may reflect biases in that dataset.
- The model’s structured captions might occasionally be verbose or repetitive depending on input complexity.
- Accuracy for extremely abstract or unfamiliar visual scenes may degrade.
How to Get Started with the Model
To run inference with the PyTorch checkpoint, follow the instructions below. I recommend going through apple/ml-fastvlm for further guidance on running inference on Apple Silicon and other devices.
```bash
python predict.py --model-path /path/to/checkpoint-dir \
    --image-file /path/to/image.png \
    --prompt "Describe the image."
```
The prompt I used for the dataset, in training, and in practice is:

```text
You are a vision-language model that analyzes images for context-aware reasoning.
Given a visual scene, generate a rich, structured, and detailed description that includes:

1. Main Focus – What is the primary object, person, or action in the scene?
2. Surrounding Objects & Context – List and describe notable secondary objects, people, or environment details.
3. Spatial Relationships – Describe where the objects are relative to one another.
4. Activities & Interactions – What are people or objects doing? Are there interactions or implied motions?
5. Scene Type & Time – Describe the overall type of scene (e.g. urban street, kitchen, park) and visible time of day.
6. Inferences & Intent – Based on visual cues, infer what might have just happened or what might happen next.
7. Style & Aesthetic – Describe the scene’s mood, lighting, or style (e.g. bright, moody, colorful).

Your goal: make your description so complete and detailed that an image generator could reconstruct the scene with full visual accuracy from your output alone.
```
Training Details
Training Data
- Training data: riddhimanrana/coco-fastvlm-2k-val2017
- Device: MacBook Pro 16" (M2 Pro, 16GB RAM, Apple Silicon)
- Vision tower: MobileCLIP-S0
- LoRA parameters: r=128, alpha=256, dropout=0.1, applied to the language model using PEFT (a configuration sketch follows this list)
- Epochs: 1
- Model max tokens: 512
- Quantization: 4-bit (post-training, MLX conversion)
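For readers who want to reproduce the adapter setup, the sketch below shows how the LoRA configuration listed above could be expressed with PEFT. The `target_modules` list and the checkpoint path are assumptions; the actual fine-tuning ran through the FastVLM/LLaVA training scripts rather than this standalone snippet.

```python
# Minimal LoRA setup sketch. Assumptions: target_modules and the checkpoint
# path are illustrative only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=128,              # rank, as listed above
    lora_alpha=256,     # scaling factor, as listed above
    lora_dropout=0.1,   # dropout, as listed above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

base_lm = AutoModelForCausalLM.from_pretrained("/path/to/fastvlm-0.5b-stage3")  # placeholder path
peft_model = get_peft_model(base_lm, lora_config)
peft_model.print_trainable_parameters()
```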
Training Procedure
Preprocessing
- Images were aspect-ratio padded to 256×256.
- Object detection tags from YOLOv11n were prepended to each prompt (see the sketch after this list).
- All prompts followed a structured, 7-point captioning rubric.
- Inputs were truncated at 512 tokens.
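As a rough illustration of the tagging step, the snippet below prepends YOLOv11n class names to the captioning prompt using the `ultralytics` package. The exact tag format used when building the training data is an assumption.

```python
# Sketch of prepending YOLOv11n detection tags to the captioning prompt.
# Assumption: tags are a comma-separated list of detected class names; the
# exact format used to build the training data may differ.
from ultralytics import YOLO

detector = YOLO("yolo11n.pt")  # YOLO11 nano weights

def build_prompt(image_path: str, base_prompt: str) -> str:
    result = detector(image_path)[0]                      # first (and only) image result
    class_ids = {int(c) for c in result.boxes.cls.tolist()}
    tags = ", ".join(sorted(result.names[i] for i in class_ids))
    return f"Detected objects: {tags}\n\n{base_prompt}"

# e.g. build_prompt("image.png", structured_prompt) with the 7-point prompt shown earlier
```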
Training Hyperparameters
Hyperparameter | Value |
---|---|
Precision | fp32 (Apple Silicon, no bf16/fp16) |
Learning rate | 2e-4 |
Weight decay | 0.0 |
Warmup ratio | 0.03 |
Scheduler | cosine |
Batch size (train) | 8 |
Batch size (eval) | 4 |
Gradient accumulation | 1 |
Max token length | 512 |
Logging steps | 1 |
Evaluation strategy | no |
Save strategy | steps (default step interval) |
Gradient checkpointing | True |
Lazy preprocessing | True |
DataLoader workers | 4 |
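As a convenience, the table above translates approximately into the following Hugging Face `TrainingArguments`. This is a sketch rather than the exact configuration used (the run went through the FastVLM/LLaVA training code), and the output directory is a placeholder.

```python
# Rough TrainingArguments equivalent of the hyperparameter table above.
# fp32 is the default when neither bf16 nor fp16 is enabled; note that
# `evaluation_strategy` is renamed `eval_strategy` in newer transformers releases.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./fastvlm-0.5b-captions-lora",  # placeholder path
    num_train_epochs=1,
    learning_rate=2e-4,
    weight_decay=0.0,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,
    logging_steps=1,
    evaluation_strategy="no",
    save_strategy="steps",
    gradient_checkpointing=True,
    dataloader_num_workers=4,
)
```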
Speeds, Sizes, Times
- Training duration: ~1.2 hours on M2 Pro (1 epoch over 2k samples)
- Peak RAM usage: ~11.5 GB
- Merged model size: 3.0 GB (pre-quantization)
- Post-quantization size: ~864 MB (MLX-quantized, 4-bit; a conversion sketch follows this list)
- Inference memory on iPhone (MLX): ~980 MB–1.2 GB RAM with 256-token generation
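The post-quantization size above comes from a 4-bit MLX conversion. The sketch below illustrates how the merged language-model weights could be quantized with `mlx-lm`; this is an illustration under assumptions only, since FastVLM also carries a vision tower and the actual export followed the tooling in apple/ml-fastvlm.

```python
# Illustrative 4-bit MLX quantization of the merged language-model weights.
# Assumption: this covers only the LLM side; the MobileCLIP-S0 vision tower is
# exported through the apple/ml-fastvlm tooling, not by mlx-lm.
from mlx_lm import convert

convert(
    "/path/to/merged-fastvlm-0.5b-captions",    # placeholder: merged (LoRA-fused) checkpoint
    mlx_path="fastvlm-0.5b-captions-mlx-4bit",  # output directory
    quantize=True,
    q_bits=4,
)
```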
All devices were fed the same image. Note that this model is only compatible with iPhone 12 and newer: it was tested on an iPhone 11 but does not run there, due to limited Apple MLX support and the smaller Neural Engine on that generation of hardware.
Device | Chip | RAM | TTFT | Generation |
---|---|---|---|---|
iPhone 12 | A14 | 4GB | 2392ms | 73.5 tok/s |
iPhone 13 Pro Max | A15 | 6GB | 1138ms | 74.1 tok/s |
iPhone 14 | A15 | 6GB | 1069ms | 71.3 tok/s |
MacBook Air 2020 | M1 | 8GB | 673ms | 131 tok/s |
Evaluation
Testing Data, Factors & Metrics
Testing Data
- A subset of COCO val2017 images was manually evaluated.
- The subset includes both common and edge cases: animals, street scenes, close-ups, occlusion, and indoor scenes.
Factors
- Image complexity (single vs multi-object)
- Scene type (indoor vs outdoor)
- Visual density
- Prompt diversity (7-point rubric compliance)
Metrics
Given the direction of my current project, evaluation metrics were not a priority, so I did not spend much time on them. However, I am open to community contributions for model evaluation.
- Human Evaluation (1–5 scale):
  - Completeness: How well the description matches the visible scene
  - Structure: Coherence of the response relative to the 7-part prompt
  - Detail & Accuracy: Visual correctness of relationships and entities
- Quantitative (for future release):
  - CIDEr / METEOR / BLEU-4 (planned via the COCO eval pipeline; a scoring sketch follows this list)
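For anyone interested in contributing the planned quantitative evaluation, the sketch below computes CIDEr with `pycocoevalcap` on placeholder captions. A real run would load reference captions from the COCO val2017 annotations and apply the standard tokenization and full metric suite.

```python
# Minimal CIDEr scoring sketch with pycocoevalcap. The captions below are
# placeholders; a real run would use the COCO val2017 annotation files.
from pycocoevalcap.cider.cider import Cider

references = {  # image_id -> list of reference captions
    "img1": ["a man riding a bicycle down a city street"],
    "img2": ["a bowl of fruit sitting on a wooden table"],
}
candidates = {  # image_id -> single-element list with the model's caption
    "img1": ["a cyclist rides along an urban road lined with shops"],
    "img2": ["a wooden table holds a bowl filled with assorted fruit"],
}

score, per_image_scores = Cider().compute_score(references, candidates)
print(f"CIDEr: {score:.3f}")
```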
Results
Metric | Avg Score |
---|---|
Completeness | 4.6 / 5 |
Structure | 4.8 / 5 |
Visual Accuracy | 4.5 / 5 |
Summary
The model produces rich, well-structured, and highly relevant captions optimized for real-time mobile inference. At roughly 930 MB on disk and under 1 GB of RAM usage, it is deployable on older iPhones without Apple Intelligence (e.g., iPhone 12 or newer). Despite fine-tuning on just 2,000 examples, its reasoning capability generalizes well due to the high-quality distilled prompts.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: MacBook Air M1 (dataset generation), MacBook Pro M2 Pro (training, quantization)
- Hours used: ~3 hours for dataset, ~1h for training
- Compute Region: Local / personal hardware
- Carbon Emitted: Minimal, due to small dataset size and single-device compute.
Citation
BibTeX:
```bibtex
@InProceedings{fastvlm2025,
  author    = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title     = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}
```
Model Card Contact
Contact: @riddhimanrana on Hugging Face or GitHub