Model Card for ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow

Llama-3.2-11B-Vision-Instruct-StarFlow is a vision-language model finetuned for structured workflow generation from sketch images. It translates hand-drawn or computer-generated workflow diagrams into structured JSON workflows, including triggers, flow logic, and actions.

Model Details

Model Description

Llama-3.2-11B-Vision-Instruct-StarFlow is part of the StarFlow framework for automating workflow creation. It extends Meta's Llama-3.2-11B-Vision-Instruct with domain-specific finetuning on workflow diagrams, enabling accurate sketch-to-workflow generation.

  • Developed by: ServiceNow Research
  • Model type: Transformer-based Vision-Language Model (VLM)
  • Language(s) (NLP): English
  • License: llama3.2
  • Finetuned from model : Llama-3.2-11B-Vision-Instruct

Model Sources


Uses

Direct Use

  • Translating sketches of workflows (hand-drawn, whiteboard, or digital diagrams) into JSON structured workflows.
  • Supporting workflow automation in enterprise platforms by removing the need for manual low-code configuration.

Downstream Use

  • Integration into enterprise low-code platforms for rapid prototyping of workflows by users.
  • Used in automation migration pipelines, e.g., converting legacy workflow screenshots into JSON representations.

Out-of-Scope Use

  • General-purpose vision-language tasks (e.g., image captioning, OCR).
  • Use on domains outside workflow automation (e.g., arbitrary diagram-to-code).
  • Real-time handwriting recognition (StarFlow focuses on structured workflow translation, not raw OCR).

Bias, Risks, and Limitations

  • Limited generalization: Finetuned models perform poorly on out-of-distribution diagrams from unfamiliar platforms.
  • Sensitivity to input style: Whiteboard/handwritten sketches degrade performance compared to digital or UI-rendered workflows.
  • Component naming mismatches: Model may mispredict action definitions (e.g., “create_user” vs. “create_a_user”), leading to execution errors.
  • Evaluation gap: Current metrics don’t always reflect execution correctness of generated workflows.

Recommendations

Users should:

  • Validate outputs before deployment.
  • Be cautious with handwritten/ambiguous sketches.
  • Consider supplementing with retrieval-augmented generation (RAG) or tool grounding for robustness.

How to Get Started with the Model

from transformers import AutoProcessor, MllamaForConditionalGeneration
from PIL import Image

processor = AutoProcessor.from_pretrained("ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow")
model = MllamaForConditionalGeneration.from_pretrained("ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow")

image = Image.open("workflow_sketch.png")
inputs = processor(images=image, text="Generate workflow JSON", return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=4096)
workflow_json = processor.decode(outputs[0], skip_special_tokens=True)

print(workflow_json)

Training Details

Training Data

The model was trained using the ServiceNow/BigDocs-Sketch2Flow dataset, which includes the following data distribution:

  • Synthetic (12,376 Graphviz-generated diagrams)
  • Manual (3,035 sketches hand-drawn by annotators)
  • Digital (2,613 diagrams drawn using software)
  • Whiteboard (484 sketches drawn on whiteboard / blackboard)
  • User Interface (373 screenshots from ServiceNow Flow Designer)

Training Procedure

Preprocessing

  • Synthetic workflows generated via heuristics (Scheduled Loop, IF/ELSE, FOREACH, etc.).
  • Annotators recreated flows in digital, manual, and whiteboard formats.

Training Hyperparameters

  • Optimizer: AdamW with β=(0.95,0.999), lr=2e-5, weight decay=1e-6.
  • Scheduler: cosine learning rate with 30 warmup steps.
  • Early stopping based on validation loss.
  • Precision: bf16 mixed-precision.
  • Sequence length: up to 32k tokens.

Speeds, Sizes, Times

  • Trained with 16× NVIDIA H100 80GB GPUs across two nodes.
  • Full Sharded Data Parallel (FSDP) training, no CPU offloading.

Evaluation

Testing Data

Same dataset distribution as training: synthetic, manual, digital, whiteboard, UI-rendered workflows.

Factors

  • Source of sample (synthetic, manual, UI, etc.)
  • Orientation (portrait vs. landscape diagrams)
  • Resolution (small <400k pixels, medium, large >1M pixels)

Metrics

All Evaluation metrics can be found in the official StarFlow repo.

  • Flow Similarity (FlowSim) – tree edit distance similarity.
  • TreeBLEU – structural recall of subtrees.
  • Trigger Match (TM) – accuracy of workflow triggers.
  • Component Match (CM) – overlap of predicted vs. gold components.

Results

  • Proprietary models (GPT-4o, Claude-3.7, Gemini 2.0) outperform open-weights without finetuning.

  • Finetuned Pixtral-12B achieves SOTA:

    • FlowSim w/ inputs: 0.919
    • TreeBLEU w/ inputs: 0.950
    • Trigger Match: 0.753
    • Component Match: 0.930

Summary

Finetuning yields large gains over base Pixtral-12B and GPT-4o, particularly in matching workflow components and triggers.

Model Examination

  • Finetuned models capture naming conventions and structured execution logic better.
  • Failure modes include missing ELSE branches or generic table names.

Technical Specifications

Model Architecture and Objective

  • Base: Llama-3.2-11B Vision Instruct, a multimodal LLM with 11 B parameters, optimized for image reasoning and instruction-following tasks.
  • Objective: Image-to-JSON structured workflow generation.

Compute Infrastructure

  • Hardware: 16× NVIDIA H100 80GB (2 nodes)
  • Software: FSDP, bf16 mixed precision, PyTorch/Transformers

Citation

BibTeX:

@article{bechard2025starflow,
  title={StarFlow: Generating Structured Workflow Outputs from Sketch Images},
  author={B{\'e}chard, Patrice and Wang, Chao and Abaskohi, Amirhossein and Rodriguez, Juan and Pal, Christopher and Vazquez, David and Gella, Spandana and Rajeswar, Sai and Taslakian, Perouz},
  journal={arXiv preprint arXiv:2503.21889},
  year={2025}
}

APA: Béchard, P., Wang, C., Abaskohi, A., Rodriguez, J., Pal, C., Vazquez, D., Gella, S., Rajeswar, S., & Taslakian, P. (2025). StarFlow: Generating Structured Workflow Outputs from Sketch Images. arXiv preprint arXiv:2503.21889.


Glossary

  • FlowSim: Metric based on tree edit distance for workflows.
  • TreeBLEU: BLEU-like score using tree structures.
  • Trigger Match: Correctness of predicted workflow trigger.
  • Component Match: Correctness of predicted components (order-agnostic).

More Information


The StarFlow Team

  • Patrice Béchard, Chao Wang, Amirhossein Abaskohi, Juan Rodriguez, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar, Perouz Taslakian

Model Card Contact

Downloads last month
7
Safetensors
Model size
10.7B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow

Finetuned
(140)
this model

Dataset used to train ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow