Model Card for ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow
Llama-3.2-11B-Vision-Instruct-StarFlow is a vision-language model finetuned for structured workflow generation from sketch images. It translates hand-drawn or computer-generated workflow diagrams into structured JSON workflows, including triggers, flow logic, and actions.
Model Details
Model Description
Llama-3.2-11B-Vision-Instruct-StarFlow is part of the StarFlow framework for automating workflow creation. It extends Meta's Llama-3.2-11B-Vision-Instruct with domain-specific finetuning on workflow diagrams, enabling accurate sketch-to-workflow generation.
- Developed by: ServiceNow Research
- Model type: Transformer-based Vision-Language Model (VLM)
- Language(s) (NLP): English
- License: llama3.2
- Finetuned from model : Llama-3.2-11B-Vision-Instruct
Model Sources
- Repository: ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow
- Paper: StarFlow: Generating Structured Workflow Outputs From Sketch Images;
Uses
Direct Use
- Translating sketches of workflows (hand-drawn, whiteboard, or digital diagrams) into JSON structured workflows.
- Supporting workflow automation in enterprise platforms by removing the need for manual low-code configuration.
Downstream Use
- Integration into enterprise low-code platforms for rapid prototyping of workflows by users.
- Used in automation migration pipelines, e.g., converting legacy workflow screenshots into JSON representations.
Out-of-Scope Use
- General-purpose vision-language tasks (e.g., image captioning, OCR).
- Use on domains outside workflow automation (e.g., arbitrary diagram-to-code).
- Real-time handwriting recognition (StarFlow focuses on structured workflow translation, not raw OCR).
Bias, Risks, and Limitations
- Limited generalization: Finetuned models perform poorly on out-of-distribution diagrams from unfamiliar platforms.
- Sensitivity to input style: Whiteboard/handwritten sketches degrade performance compared to digital or UI-rendered workflows.
- Component naming mismatches: Model may mispredict action definitions (e.g., “create_user” vs. “create_a_user”), leading to execution errors.
- Evaluation gap: Current metrics don’t always reflect execution correctness of generated workflows.
Recommendations
Users should:
- Validate outputs before deployment.
- Be cautious with handwritten/ambiguous sketches.
- Consider supplementing with retrieval-augmented generation (RAG) or tool grounding for robustness.
How to Get Started with the Model
from transformers import AutoProcessor, MllamaForConditionalGeneration
from PIL import Image
processor = AutoProcessor.from_pretrained("ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow")
model = MllamaForConditionalGeneration.from_pretrained("ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow")
image = Image.open("workflow_sketch.png")
inputs = processor(images=image, text="Generate workflow JSON", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=4096)
workflow_json = processor.decode(outputs[0], skip_special_tokens=True)
print(workflow_json)
Training Details
Training Data
The model was trained using the ServiceNow/BigDocs-Sketch2Flow dataset, which includes the following data distribution:
- Synthetic (12,376 Graphviz-generated diagrams)
- Manual (3,035 sketches hand-drawn by annotators)
- Digital (2,613 diagrams drawn using software)
- Whiteboard (484 sketches drawn on whiteboard / blackboard)
- User Interface (373 screenshots from ServiceNow Flow Designer)
Training Procedure
Preprocessing
- Synthetic workflows generated via heuristics (Scheduled Loop, IF/ELSE, FOREACH, etc.).
- Annotators recreated flows in digital, manual, and whiteboard formats.
Training Hyperparameters
- Optimizer: AdamW with β=(0.95,0.999), lr=2e-5, weight decay=1e-6.
- Scheduler: cosine learning rate with 30 warmup steps.
- Early stopping based on validation loss.
- Precision: bf16 mixed-precision.
- Sequence length: up to 32k tokens.
Speeds, Sizes, Times
- Trained with 16× NVIDIA H100 80GB GPUs across two nodes.
- Full Sharded Data Parallel (FSDP) training, no CPU offloading.
Evaluation
Testing Data
Same dataset distribution as training: synthetic, manual, digital, whiteboard, UI-rendered workflows.
Factors
- Source of sample (synthetic, manual, UI, etc.)
- Orientation (portrait vs. landscape diagrams)
- Resolution (small <400k pixels, medium, large >1M pixels)
Metrics
All Evaluation metrics can be found in the official StarFlow repo.
- Flow Similarity (FlowSim) – tree edit distance similarity.
- TreeBLEU – structural recall of subtrees.
- Trigger Match (TM) – accuracy of workflow triggers.
- Component Match (CM) – overlap of predicted vs. gold components.
Results
Proprietary models (GPT-4o, Claude-3.7, Gemini 2.0) outperform open-weights without finetuning.
Finetuned Pixtral-12B achieves SOTA:
- FlowSim w/ inputs: 0.919
- TreeBLEU w/ inputs: 0.950
- Trigger Match: 0.753
- Component Match: 0.930
Summary
Finetuning yields large gains over base Pixtral-12B and GPT-4o, particularly in matching workflow components and triggers.
Model Examination
- Finetuned models capture naming conventions and structured execution logic better.
- Failure modes include missing ELSE branches or generic table names.
Technical Specifications
Model Architecture and Objective
- Base: Llama-3.2-11B Vision Instruct, a multimodal LLM with 11 B parameters, optimized for image reasoning and instruction-following tasks.
- Objective: Image-to-JSON structured workflow generation.
Compute Infrastructure
- Hardware: 16× NVIDIA H100 80GB (2 nodes)
- Software: FSDP, bf16 mixed precision, PyTorch/Transformers
Citation
BibTeX:
@article{bechard2025starflow,
title={StarFlow: Generating Structured Workflow Outputs from Sketch Images},
author={B{\'e}chard, Patrice and Wang, Chao and Abaskohi, Amirhossein and Rodriguez, Juan and Pal, Christopher and Vazquez, David and Gella, Spandana and Rajeswar, Sai and Taslakian, Perouz},
journal={arXiv preprint arXiv:2503.21889},
year={2025}
}
APA: Béchard, P., Wang, C., Abaskohi, A., Rodriguez, J., Pal, C., Vazquez, D., Gella, S., Rajeswar, S., & Taslakian, P. (2025). StarFlow: Generating Structured Workflow Outputs from Sketch Images. arXiv preprint arXiv:2503.21889.
Glossary
- FlowSim: Metric based on tree edit distance for workflows.
- TreeBLEU: BLEU-like score using tree structures.
- Trigger Match: Correctness of predicted workflow trigger.
- Component Match: Correctness of predicted components (order-agnostic).
More Information
The StarFlow Team
- Patrice Béchard, Chao Wang, Amirhossein Abaskohi, Juan Rodriguez, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar, Perouz Taslakian
Model Card Contact
- Patrice Bechard - [email protected]
- ServiceNow Research – research.servicenow.com
- Downloads last month
- 7
Model tree for ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow
Base model
meta-llama/Llama-3.2-11B-Vision-Instruct