microsoft/OmniParser-v2.0


yadonglu committed 9 days ago

Commit 71f73b6 · 1 Parent(s): 1863367

fix readme

Files changed (2)

README.md +2 -2
config.json +0 -0

README.md CHANGED

@@ -3,7 +3,7 @@ library_name: transformers
 license: mit
 pipeline_tag: image-text-to-text
 ---
-📢 [[Project Page](https://microsoft.github.io/OmniParser/)] [[OmniParser V2 Blog Post](https://www.microsoft.com/en-us/research/articles/omniparser-v2-turning-any-llm-into-a-computer-use-agent/)]
+📢 [[GitHub Repo](https://github.com/microsoft/OmniParser/tree/master)] [[OmniParser V2 Blog Post](https://www.microsoft.com/en-us/research/articles/omniparser-v2-turning-any-llm-into-a-computer-use-agent/)]
 # Model Summary
 OmniParser is a general screen parsing tool, which interprets/converts UI screenshot to structured format, to improve existing LLM based UI agent.
@@ -15,7 +15,7 @@ This model hub includes a finetuned version of YOLOv8 and a finetuned BLIP-2 mod
 - Larger and cleaner set of icon caption + grounding dataset
 - 60% improvement in latency compared to V1
 - Strong performance: 39.6 average accuracy on [ScreenSpot Pro](https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding)
-- Your agent only need one tool: OmniTool. Control a Windows 11 VM with OmniParser + your vision model of choice. OmniTool supports out of the box the following vision models - OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL) or Anthropic Computer Use. Check out our github repo for details.
+- Your agent only need one tool: OmniTool. Control a Windows 11 VM with OmniParser + your vision model of choice. OmniTool supports out of the box the following large language models - OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL) or Anthropic Computer Use. Check out our github repo for details.
 # Responsible AI Considerations

config.json ADDED

File without changes