fastvlm-0.5b-captions

Model Details

fastvlm-0.5b-captions is a finetuned version of FastVLM-0.5B Stage 3 from the FastVLM official repository, built for efficient structured image captioning on mobile devices. This model incorporates LoRA fine-tuning, 4-bit quantization, and MobileCLIP-S0 as its vision tower, achieving substantial RAM reductions for embedded inference.

Model Description

  • Developed by: Riddhiman Rana (fine-tuning and optimizations)
  • Model type: VLM (Vision-Language Model)
  • Original model authors: Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
  • Language(s) (NLP): English
  • License (base model): apple-amlr
  • Finetuned from model: apple/ml-fastvlm, specifically FastVLM-0.5B Stage 3

Model Sources

Uses

FastVLM iOS app demo (running on an iPhone 13 Pro Max).

Direct Use

  • Generating highly detailed, structured captions for images on mobile and embedded devices.
  • Ideal for low-resource environments such as iPhones, MacBooks, and potentially other Apple Silicon devices via MLX and CoreML.
  • Tested on iPhone 12, iPhone 13 Pro Max, and iPhone 14 – RAM usage stays below 1 GB, with TTFT as low as 600 ms on higher-end iPhones.

Out-of-Scope Use

  • This is not designed for general-purpose multimodal reasoning beyond descriptive image captioning.
  • Not suitable for text-only language tasks.

Bias, Risks, and Limitations

  • The training dataset was limited to 2,000 images from the COCO 2017 validation split – captions may reflect biases present in that dataset.
  • The model’s structured captions might occasionally be verbose or repetitive depending on input complexity.
  • Accuracy for extremely abstract or unfamiliar visual scenes may degrade.

Recommendations

How to Get Started with the Model

To run inference with the PyTorch checkpoint, follow the instructions below. I recommend going through apple/ml-fastvlm for further details on inference on Apple Silicon and other devices.

python predict.py --model-path /path/to/checkpoint-dir \
                  --image-file /path/to/image.png \
                  --prompt "Describe the image."

The prompt I used for the dataset, in training, and in practice is:

You are a vision-language model that analyzes images for context-aware reasoning.
Given a visual scene, generate a rich, structured, and detailed description that includes:

  1. Main Focus – What is the primary object, person, or action in the scene?
  2. Surrounding Objects & Context – List and describe notable secondary objects, people, or environment details.
  3. Spatial Relationships – Describe where the objects are relative to one another.
  4. Activities & Interactions – What are people or objects doing? Are there interactions or implied motions?
  5. Scene Type & Time – Describe the overall type of scene (e.g. urban street, kitchen, park) and visible time of day.
  6. Inferences & Intent – Based on visual cues, infer what might have just happened or what might happen next.
  7. Style & Aesthetic – Describe the scene’s mood, lighting, or style (e.g. bright, moody, colorful).

  Your goal: make your description so complete and detailed that an image generator could reconstruct the scene with full visual accuracy from your output alone.
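
If you want to drive the same script programmatically, a minimal sketch along these lines should work; the paths are placeholders and the prompt string is abbreviated to the rubric shown above, so this is not the exact tooling used for the release.

import subprocess

# Full 7-point captioning prompt shown above (abbreviated here for space).
PROMPT = (
    "You are a vision-language model that analyzes images for context-aware reasoning.\n"
    "Given a visual scene, generate a rich, structured, and detailed description that includes:\n\n"
    "1. Main Focus - What is the primary object, person, or action in the scene?\n"
    # ... points 2-7 of the rubric above ...
    "Your goal: make your description so complete and detailed that an image generator "
    "could reconstruct the scene with full visual accuracy from your output alone."
)

# Paths are placeholders; predict.py is the inference script from apple/ml-fastvlm.
result = subprocess.run(
    [
        "python", "predict.py",
        "--model-path", "/path/to/checkpoint-dir",
        "--image-file", "/path/to/image.png",
        "--prompt", PROMPT,
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)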

Training Details

Training Data

  • Training data: riddhimanrana/coco-fastvlm-2k-val2017
  • Device: MacBook Pro 16" (M2 Pro, 16GB RAM, Apple Silicon)
  • Vision tower: MobileCLIP-S0
  • LoRA parameters (a configuration sketch follows this list):
    • r = 128
    • alpha = 256
    • dropout = 0.1
    • Applied to the language model using PEFT
  • Epochs: 1
  • Model max tokens: 512
  • Quantization: 4-bit (post-training, MLX conversion)
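
For reference, the LoRA settings above can be expressed as a minimal PEFT sketch like the one below; the base-model path and the target_modules list are assumptions for illustration, not the exact values from the training run.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder path to the FastVLM-0.5B Stage 3 language model weights.
base_lm = AutoModelForCausalLM.from_pretrained("/path/to/fastvlm-0.5b-stage3")

lora_config = LoraConfig(
    r=128,                 # LoRA rank
    lora_alpha=256,        # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_lm, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable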

Training Procedure

Preprocessing

  • Images were aspect-ratio padded to 256×256.
  • Object detection tags from YOLOv11n were prepended to each prompt (see the sketch after this list).
  • All prompts followed a structured, 7-point captioning rubric.
  • Inputs were clipped at 512 tokens.
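
A rough sketch of this preprocessing, assuming the ultralytics YOLO11n weights and PIL; the tag format, the RUBRIC_PROMPT placeholder, and the padding helper are illustrative, not the exact training code.

from PIL import Image
from ultralytics import YOLO

RUBRIC_PROMPT = "<the 7-point captioning prompt shown earlier>"

def pad_to_square(img, size=256):
    # Keep aspect ratio, then paste onto a square black canvas (letterbox-style padding).
    img = img.copy()
    img.thumbnail((size, size))
    canvas = Image.new("RGB", (size, size), (0, 0, 0))
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    return canvas

detector = YOLO("yolo11n.pt")  # pretrained YOLOv11n checkpoint
image = Image.open("example.jpg").convert("RGB")
padded = pad_to_square(image)

# Collect detected class names and prepend them as tags to the captioning prompt.
result = detector(image)[0]
tags = sorted({result.names[int(c)] for c in result.boxes.cls})
prompt = f"Detected objects: {', '.join(tags)}.\n" + RUBRIC_PROMPT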

Training Hyperparameters

Hyperparameter            Value
Precision                 fp32 (Apple Silicon, no bf16/fp16)
Learning rate             2e-4
Weight decay              0.0
Warmup ratio              0.03
Scheduler                 cosine
Batch size (train)        8
Batch size (eval)         4
Gradient accumulation     1
Max token length          512
Logging steps             1
Evaluation strategy       no
Save strategy             steps (default step interval)
Gradient checkpointing    True
Lazy preprocessing        True
DataLoader workers        4
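
The same settings can be sketched as transformers.TrainingArguments, as below; the output directory is a placeholder and this is not the exact training script used.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",          # placeholder
    num_train_epochs=1,
    learning_rate=2e-4,
    weight_decay=0.0,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,
    logging_steps=1,
    eval_strategy="no",                  # "evaluation_strategy" in older transformers versions
    save_strategy="steps",
    gradient_checkpointing=True,
    dataloader_num_workers=4,
    bf16=False,                          # fp32 only on this Apple Silicon setup
    fp16=False,
)
# The 512-token max length is enforced by the tokenizer/model config, not here.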

Speeds, Sizes, Times

  • Training duration: ~1.2 hours on M2 Pro (1 epoch over 2k samples)
  • Peak RAM usage: ~11.5 GB
  • Merged model size: 3.0 GB (pre-quantization)
  • Post-quantization size: ~864 MB (MLX-quantized, 4-bit)
  • Inference memory on iPhone (MLX): ~980 MB–1.2 GB RAM with 256-token generation

All devices were fed the same image. Note that this model is only compatible with iPhone 12 and newer; it was tested on an iPhone 11, where it does not run due to incompatibilities with Apple MLX support and the smaller Neural Engine.

Device               Chip   RAM    TTFT      Generation
iPhone 12            A14    4 GB   2392 ms   73.5 tok/s
iPhone 13 Pro Max    A15    6 GB   1138 ms   74.1 tok/s
iPhone 14            A15    6 GB   1069 ms   71.3 tok/s
MacBook Air 2020     M1     8 GB   673 ms    131 tok/s

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • A subset of COCO val2017 images was manually evaluated.
  • Dataset includes both common and edge cases: animals, street scenes, closeups, occlusion, and indoor scenes.

Factors

  • Image complexity (single vs multi-object)
  • Scene type (indoor vs outdoor)
  • Visual density
  • Prompt diversity (7-point rubric compliance)

Metrics

Because of the direction of my current project, evaluation metrics were not a priority, so I did not spend much time on them. However, I am open to community contributions for model evaluation.

  • Human Evaluation (1–5 scale):
    • Completeness: How well the description matches the visible scene
    • Structure: Coherence of the response relative to the 7-part prompt
    • Detail & Accuracy: Visual correctness of relationships and entities
  • Quantitative (for future release):
    • CIDEr / METEOR / BLEU-4 (planned via the COCO evaluation pipeline; a possible sketch follows this list)
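
One possible way to compute the planned quantitative metrics, assuming the pycocoevalcap package; the reference and candidate captions below are illustrative placeholders, not real evaluation data.

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer

# image_id -> list of {"caption": ...} dicts, as expected by PTBTokenizer.
gts = {"img_0001": [{"caption": "a dog runs across a grassy park"}]}    # COCO references (placeholder)
res = {"img_0001": [{"caption": "a brown dog running on green grass"}]}  # model outputs (placeholder)

# Note: the PTB tokenizer and METEOR require a Java runtime.
tokenizer = PTBTokenizer()
gts, res = tokenizer.tokenize(gts), tokenizer.tokenize(res)

for name, scorer in [("BLEU-4", Bleu(4)), ("METEOR", Meteor()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)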

Results

Metric            Avg. score
Completeness      4.6 / 5
Structure         4.8 / 5
Visual Accuracy   4.5 / 5

Summary

The model produces rich, well-structured, and highly relevant captions optimized for real-time mobile inference. With a ~930 MB footprint and <1 GB RAM usage, it is deployable on older iPhones without Apple Intelligence (e.g., iPhone 12 or newer). Despite fine-tuning on just 2,000 examples, its reasoning capability generalizes well due to the high-quality distilled prompts.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: MacBook Air M1 (dataset generation), MacBook Pro M2 Pro (training, quantization)
  • Hours used: ~3 hours for dataset generation, ~1 hour for training
  • Compute Region: Local / personal hardware
  • Carbon Emitted: Minimal, due to small dataset size and single-device compute.

Citation

BibTeX:

@InProceedings{fastvlm2025,
  author = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2025}
}

Model Card Contact

Contact: @riddhimanrana on Hugging Face or GitHub
