Model Card for ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow

Llama-3.2-11B-Vision-Instruct-StarFlow is a vision-language model finetuned for structured workflow generation from sketch images. It translates hand-drawn or computer-generated workflow diagrams into structured JSON workflows, including triggers, flow logic, and actions.

Model Details

Model Description

Llama-3.2-11B-Vision-Instruct-StarFlow is part of the StarFlow framework for automating workflow creation. It extends Meta's Llama-3.2-11B-Vision-Instruct with domain-specific finetuning on workflow diagrams, enabling accurate sketch-to-workflow generation.

Developed by: ServiceNow Research
Model type: Transformer-based Vision-Language Model (VLM)
Language(s) (NLP): English
License: llama3.2
Finetuned from model : Llama-3.2-11B-Vision-Instruct

Model Sources

Repository: ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow
Paper: StarFlow: Generating Structured Workflow Outputs From Sketch Images;

Uses

Direct Use

Translating sketches of workflows (hand-drawn, whiteboard, or digital diagrams) into JSON structured workflows.
Supporting workflow automation in enterprise platforms by removing the need for manual low-code configuration.

Downstream Use

Integration into enterprise low-code platforms for rapid prototyping of workflows by users.
Used in automation migration pipelines, e.g., converting legacy workflow screenshots into JSON representations.

Out-of-Scope Use

General-purpose vision-language tasks (e.g., image captioning, OCR).
Use on domains outside workflow automation (e.g., arbitrary diagram-to-code).
Real-time handwriting recognition (StarFlow focuses on structured workflow translation, not raw OCR).

Bias, Risks, and Limitations

Limited generalization: Finetuned models perform poorly on out-of-distribution diagrams from unfamiliar platforms.
Sensitivity to input style: Whiteboard/handwritten sketches degrade performance compared to digital or UI-rendered workflows.
Component naming mismatches: Model may mispredict action definitions (e.g., “create_user” vs. “create_a_user”), leading to execution errors.
Evaluation gap: Current metrics don’t always reflect execution correctness of generated workflows.

Recommendations

Users should:

Validate outputs before deployment.
Be cautious with handwritten/ambiguous sketches.
Consider supplementing with retrieval-augmented generation (RAG) or tool grounding for robustness.

How to Get Started with the Model

from transformers import AutoProcessor, MllamaForConditionalGeneration
from PIL import Image

processor = AutoProcessor.from_pretrained("ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow")
model = MllamaForConditionalGeneration.from_pretrained("ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow")

image = Image.open("workflow_sketch.png")
inputs = processor(images=image, text="Generate workflow JSON", return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=4096)
workflow_json = processor.decode(outputs[0], skip_special_tokens=True)

print(workflow_json)

Training Details

Training Data

The model was trained using the ServiceNow/BigDocs-Sketch2Flow dataset, which includes the following data distribution:

Synthetic (12,376 Graphviz-generated diagrams)
Manual (3,035 sketches hand-drawn by annotators)
Digital (2,613 diagrams drawn using software)
Whiteboard (484 sketches drawn on whiteboard / blackboard)
User Interface (373 screenshots from ServiceNow Flow Designer)

Training Procedure

Preprocessing

Synthetic workflows generated via heuristics (Scheduled Loop, IF/ELSE, FOREACH, etc.).
Annotators recreated flows in digital, manual, and whiteboard formats.

Training Hyperparameters

Optimizer: AdamW with β=(0.95,0.999), lr=2e-5, weight decay=1e-6.
Scheduler: cosine learning rate with 30 warmup steps.
Early stopping based on validation loss.
Precision: bf16 mixed-precision.
Sequence length: up to 32k tokens.

Speeds, Sizes, Times

Trained with 16× NVIDIA H100 80GB GPUs across two nodes.
Full Sharded Data Parallel (FSDP) training, no CPU offloading.

Evaluation

Testing Data

Same dataset distribution as training: synthetic, manual, digital, whiteboard, UI-rendered workflows.

Factors

Source of sample (synthetic, manual, UI, etc.)
Orientation (portrait vs. landscape diagrams)
Resolution (small <400k pixels, medium, large >1M pixels)

Metrics

All Evaluation metrics can be found in the official StarFlow repo.

Flow Similarity (FlowSim) – tree edit distance similarity.
TreeBLEU – structural recall of subtrees.
Trigger Match (TM) – accuracy of workflow triggers.
Component Match (CM) – overlap of predicted vs. gold components.

Results

Proprietary models (GPT-4o, Claude-3.7, Gemini 2.0) outperform open-weights without finetuning.
Finetuned Pixtral-12B achieves SOTA:
- FlowSim w/ inputs: 0.919
- TreeBLEU w/ inputs: 0.950
- Trigger Match: 0.753
- Component Match: 0.930

Summary

Finetuning yields large gains over base Pixtral-12B and GPT-4o, particularly in matching workflow components and triggers.

Model Examination

Finetuned models capture naming conventions and structured execution logic better.
Failure modes include missing ELSE branches or generic table names.

Technical Specifications

Model Architecture and Objective

Base: Llama-3.2-11B Vision Instruct, a multimodal LLM with 11 B parameters, optimized for image reasoning and instruction-following tasks.
Objective: Image-to-JSON structured workflow generation.

Compute Infrastructure

Hardware: 16× NVIDIA H100 80GB (2 nodes)
Software: FSDP, bf16 mixed precision, PyTorch/Transformers

Citation

BibTeX:

@article{bechard2025starflow,
  title={StarFlow: Generating Structured Workflow Outputs from Sketch Images},
  author={B{\'e}chard, Patrice and Wang, Chao and Abaskohi, Amirhossein and Rodriguez, Juan and Pal, Christopher and Vazquez, David and Gella, Spandana and Rajeswar, Sai and Taslakian, Perouz},
  journal={arXiv preprint arXiv:2503.21889},
  year={2025}
}

APA: Béchard, P., Wang, C., Abaskohi, A., Rodriguez, J., Pal, C., Vazquez, D., Gella, S., Rajeswar, S., & Taslakian, P. (2025). StarFlow: Generating Structured Workflow Outputs from Sketch Images. arXiv preprint arXiv:2503.21889.

Glossary

FlowSim: Metric based on tree edit distance for workflows.
TreeBLEU: BLEU-like score using tree structures.
Trigger Match: Correctness of predicted workflow trigger.
Component Match: Correctness of predicted components (order-agnostic).

More Information

The StarFlow Team

Patrice Béchard, Chao Wang, Amirhossein Abaskohi, Juan Rodriguez, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar, Perouz Taslakian

Model Card Contact

Patrice Bechard - [email protected]
ServiceNow Research – research.servicenow.com

ServiceNow
/

Llama-3.2-11B-Vision-Instruct-StarFlow