fastvlm-0.5b-captions
Model Details
fastvlm-0.5b-captions is a fine-tuned version of FastVLM-0.5B Stage 3 from the official FastVLM repository, built for efficient structured image captioning on mobile devices. It combines LoRA fine-tuning, 4-bit quantization, and a MobileCLIP-S0 vision tower, substantially reducing RAM usage for embedded inference.
Model Description
- Developed by: Riddhiman Rana (fine-tuning and optimizations)
- Model type: VLM (Vision-Language Model)
- Original model authors: Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
- Language(s) (NLP): English
- License (base model): apple-amlr
- Finetuned from model: apple/ml-fastvlm, specifically FastVLM-0.5B Stage 3
Model Sources
- Base Model Repository: https://github.com/apple/ml-fastvlm
- Fine-tuning Training Dataset: https://huggingface.co/datasets/riddhimanrana/coco-fastvlm-2k-val2017
- FastVLM Paper (CVPR 2025): https://www.arxiv.org/abs/2412.13303
Uses
*Demo on iPhone 13 Pro Max*
Direct Use
- Generating highly detailed, structured captions for images on mobile and embedded devices.
- Ideal for low-resource environments such as iPhones, MacBooks, and potentially other Apple Silicon devices via MLX and CoreML.
- Tested on iPhone 12, 13 Pro Max, and 14, with RAM usage below 1 GB and time to first token (TTFT) as low as ~600 ms on higher-end iPhones.
Out-of-Scope Use
- This is not designed for general-purpose multimodal reasoning beyond descriptive image captioning.
- Not suitable for text-only language tasks.
Bias, Risks, and Limitations
- Dataset was limited to 2,000 images from COCO 2017 Validation – captions may reflect biases in that dataset.
- The model’s structured captions might occasionally be verbose or repetitive depending on input complexity.
- Accuracy for extremely abstract or unfamiliar visual scenes may degrade.
How to Get Started with the Model
To run inference with the PyTorch checkpoint, follow the instructions below. I recommend going through apple/ml-fastvlm for further guidance on running inference on Apple Silicon and other devices.
```bash
python predict.py --model-path /path/to/checkpoint-dir \
    --image-file /path/to/image.png \
    --prompt "Describe the image."
```
The prompt I used for the dataset, in training, and in practice is:

```text
You are a vision-language model that analyzes images for context-aware reasoning.
Given a visual scene, generate a rich, structured, and detailed description that includes:

1. Main Focus – What is the primary object, person, or action in the scene?
2. Surrounding Objects & Context – List and describe notable secondary objects, people, or environment details.
3. Spatial Relationships – Describe where the objects are relative to one another.
4. Activities & Interactions – What are people or objects doing? Are there interactions or implied motions?
5. Scene Type & Time – Describe the overall type of scene (e.g. urban street, kitchen, park) and visible time of day.
6. Inferences & Intent – Based on visual cues, infer what might have just happened or what might happen next.
7. Style & Aesthetic – Describe the scene’s mood, lighting, or style (e.g. bright, moody, colorful).

Your goal: make your description so complete and detailed that an image generator could reconstruct the scene with full visual accuracy from your output alone.
```
Training Details
Training Data
- Training data: riddhimanrana/coco-fastvlm-2k-val2017
- Device: MacBook Pro 16" (M2 Pro, 16GB RAM, Apple Silicon)
- Vision tower: MobileCLIP-S0
- LoRA parameters: r=128, alpha=256, dropout=0.1, applied to the language model using PEFT (a configuration sketch follows this list)
- Epochs: 1
- Model max tokens: 512
- Quantization: 4-bit (post-training, MLX conversion)
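For readers who want to reproduce the adapter setup, the sketch below shows how the LoRA configuration listed above could be expressed with PEFT. The `target_modules` list and the checkpoint path are assumptions; the actual fine-tuning ran through the FastVLM/LLaVA training scripts rather than this standalone snippet.

```python
# Minimal LoRA setup sketch. Assumptions: target_modules and the checkpoint
# path are illustrative only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=128,              # rank, as listed above
    lora_alpha=256,     # scaling factor, as listed above
    lora_dropout=0.1,   # dropout, as listed above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

base_lm = AutoModelForCausalLM.from_pretrained("/path/to/fastvlm-0.5b-stage3")  # placeholder path
peft_model = get_peft_model(base_lm, lora_config)
peft_model.print_trainable_parameters()
```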
Training Procedure
Preprocessing
- Images were aspect-ratio padded to 256×256.
- Object detection tags from YOLOv11n were prepended to each prompt (see the sketch after this list).
- All prompts followed a structured, 7-point captioning rubric.
- Inputs were truncated at 512 tokens.
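As a rough illustration of the tagging step, the snippet below prepends YOLOv11n class names to the captioning prompt using the `ultralytics` package. The exact tag format used when building the training data is an assumption.

```python
# Sketch of prepending YOLOv11n detection tags to the captioning prompt.
# Assumption: tags are a comma-separated list of detected class names; the
# exact format used to build the training data may differ.
from ultralytics import YOLO

detector = YOLO("yolo11n.pt")  # YOLO11 nano weights

def build_prompt(image_path: str, base_prompt: str) -> str:
    result = detector(image_path)[0]                      # first (and only) image result
    class_ids = {int(c) for c in result.boxes.cls.tolist()}
    tags = ", ".join(sorted(result.names[i] for i in class_ids))
    return f"Detected objects: {tags}\n\n{base_prompt}"

# e.g. build_prompt("image.png", structured_prompt) with the 7-point prompt shown earlier
```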
Training Hyperparameters
Hyperparameter | Value |
---|---|
Precision | fp32 (Apple Silicon, no bf16/fp16) |
Learning rate | 2e-4 |
Weight decay | 0.0 |
Warmup ratio | 0.03 |
Scheduler | cosine |
Batch size (train) | 8 |
Batch size (eval) | 4 |
Gradient accumulation | 1 |
Max token length | 512 |
Logging steps | 1 |
Evaluation strategy | no |
Save strategy | steps (default step interval) |
Gradient checkpointing | True |
Lazy preprocessing | True |
DataLoader workers | 4 |
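As a convenience, the table above translates approximately into the following Hugging Face `TrainingArguments`. This is a sketch rather than the exact configuration used (the run went through the FastVLM/LLaVA training code), and the output directory is a placeholder.

```python
# Rough TrainingArguments equivalent of the hyperparameter table above.
# fp32 is the default when neither bf16 nor fp16 is enabled; note that
# `evaluation_strategy` is renamed `eval_strategy` in newer transformers releases.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./fastvlm-0.5b-captions-lora",  # placeholder path
    num_train_epochs=1,
    learning_rate=2e-4,
    weight_decay=0.0,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,
    logging_steps=1,
    evaluation_strategy="no",
    save_strategy="steps",
    gradient_checkpointing=True,
    dataloader_num_workers=4,
)
```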
Speeds, Sizes, Times
- Training duration: ~1.2 hours on M2 Pro (1 epoch over 2k samples)
- Peak RAM usage: ~11.5 GB
- Merged model size: 3.0 GB (pre-quantization)
- Post-quantization size: ~864 MB (MLX-quantized, 4-bit; a conversion sketch follows this list)
- Inference memory on iPhone (MLX): ~980 MB–1.2 GB RAM with 256-token generation
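The post-quantization size above comes from a 4-bit MLX conversion. The sketch below illustrates how the merged language-model weights could be quantized with `mlx-lm`; this is an illustration under assumptions only, since FastVLM also carries a vision tower and the actual export followed the tooling in apple/ml-fastvlm.

```python
# Illustrative 4-bit MLX quantization of the merged language-model weights.
# Assumption: this covers only the LLM side; the MobileCLIP-S0 vision tower is
# exported through the apple/ml-fastvlm tooling, not by mlx-lm.
from mlx_lm import convert

convert(
    "/path/to/merged-fastvlm-0.5b-captions",    # placeholder: merged (LoRA-fused) checkpoint
    mlx_path="fastvlm-0.5b-captions-mlx-4bit",  # output directory
    quantize=True,
    q_bits=4,
)
```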
All devices were fed the same image. Note that this model is only compatible with iPhone 12 and newer: it was tested on an iPhone 11 but does not run there, due to limited Apple MLX support and the smaller Neural Engine on that generation of hardware.
Device | Chip | RAM | TTFT | Generation |
---|---|---|---|---|
iPhone 12 | A14 | 4GB | 2392ms | 73.5 tok/s |
iPhone 13 Pro Max | A15 | 6GB | 1138ms | 74.1 tok/s |
iPhone 14 | A15 | 6GB | 1069ms | 71.3 tok/s |
MacBook Air 2020 | M1 | 8GB | 673ms | 131 tok/s |
Evaluation
Testing Data, Factors & Metrics
Testing Data
- A subset of COCO val2017 images was manually evaluated.
- The subset includes both common and edge cases: animals, street scenes, close-ups, occlusion, and indoor scenes.
Factors
- Image complexity (single vs multi-object)
- Scene type (indoor vs outdoor)
- Visual density
- Prompt diversity (7-point rubric compliance)
Metrics
Given the direction of my current project, evaluation metrics were not a priority, so I did not spend much time on them. However, I am open to community contributions for model evaluation.
- Human Evaluation (1–5 scale):
  - Completeness: How well the description matches the visible scene
  - Structure: Coherence of the response relative to the 7-part prompt
  - Detail & Accuracy: Visual correctness of relationships and entities
- Quantitative (for future release):
  - CIDEr / METEOR / BLEU-4 (planned via the COCO eval pipeline; a scoring sketch follows this list)
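For anyone interested in contributing the planned quantitative evaluation, the sketch below computes CIDEr with `pycocoevalcap` on placeholder captions. A real run would load reference captions from the COCO val2017 annotations and apply the standard tokenization and full metric suite.

```python
# Minimal CIDEr scoring sketch with pycocoevalcap. The captions below are
# placeholders; a real run would use the COCO val2017 annotation files.
from pycocoevalcap.cider.cider import Cider

references = {  # image_id -> list of reference captions
    "img1": ["a man riding a bicycle down a city street"],
    "img2": ["a bowl of fruit sitting on a wooden table"],
}
candidates = {  # image_id -> single-element list with the model's caption
    "img1": ["a cyclist rides along an urban road lined with shops"],
    "img2": ["a wooden table holds a bowl filled with assorted fruit"],
}

score, per_image_scores = Cider().compute_score(references, candidates)
print(f"CIDEr: {score:.3f}")
```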
Results
Metric | Avg Score |
---|---|
Completeness | 4.6 / 5 |
Structure | 4.8 / 5 |
Visual Accuracy | 4.5 / 5 |
Summary
The model produces rich, well-structured, and highly relevant captions optimized for real-time mobile inference. At roughly 930 MB on disk and under 1 GB of RAM usage, it is deployable on older iPhones without Apple Intelligence (e.g., iPhone 12 or newer). Despite fine-tuning on just 2,000 examples, its reasoning capability generalizes well due to the high-quality distilled prompts.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: MacBook Air M1 (dataset generation), MacBook Pro M2 Pro (training, quantization)
- Hours used: ~3 hours for dataset, ~1h for training
- Compute Region: Local / personal hardware
- Carbon Emitted: Minimal, due to small dataset size and single-device compute.
Citation
BibTeX:
```bibtex
@InProceedings{fastvlm2025,
  author    = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title     = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}
```
Model Card Contact
Contact: @riddhimanrana on Hugging Face or GitHub