hqfang committed
Commit c009752 · verified · 1 Parent(s): 82f52a8

Update README.md

Files changed (1):
  1. README.md +155 -3
README.md CHANGED
@@ -1,3 +1,155 @@
- ---
- license: apache-2.0
- ---

---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-7B
- google/siglip2-so400m-patch14-384
library_name: transformers
tags:
- molmoact
- molmo
- olmo
- vla
- robotics
- manipulation
---

# MolmoAct 7B-D Pretrain

MolmoAct is a fully open-source action reasoning model for robotic manipulation developed by the Allen Institute for AI. MolmoAct is trained on a subset of OXE and on the MolmoAct Dataset, a dataset of 10k high-quality trajectories of a single-arm Franka robot performing 93 unique manipulation tasks in both home and tabletop environments. It achieves state-of-the-art performance on multiple benchmarks among vision-language-action models of similar size, while being fully open-source. You can find all models in the MolmoAct family [here](https://huggingface.co/collections/allenai/molmoact-689697591a3936fba38174d7).
**Learn more** about MolmoAct in our [announcement blog post](https://molmo.allenai.org/blog) or the [paper](https://huggingface.co/papers/2409.17146).

MolmoAct 7B-D Pretrain is based on [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) and uses [SigLip2](https://huggingface.co/google/siglip2-so400m-patch14-384) as its vision backbone.

This checkpoint is a **preview** of the MolmoAct release. All artifacts used in creating MolmoAct (the MolmoAct dataset, training code, evaluations, and intermediate checkpoints) will be made available at a later date, furthering our commitment to open-source AI development and reproducibility.

Quick links:
- 💬 [Demo](https://molmo.allenai.org/)
- 📂 [All Models](https://huggingface.co/collections/allenai/molmo-66f379e6fe3b8ef090a8ca19)
- 📃 [Paper](https://molmo.allenai.org/paper.pdf)
- 🎥 [Blog with Videos](https://molmo.allenai.org/blog)

## Quick Start

To run MolmoAct, first install the dependencies:

```bash
pip install einops torchvision
pip install transformers==4.52
```
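
As an optional, generic sanity check (nothing here is specific to MolmoAct), you can confirm that the pinned `transformers` version is the one Python actually imports:

```python
# optional: verify the pinned transformers version is active in this environment
import transformers

print(transformers.__version__)  # the quick start below assumes 4.52.x
```
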
Then, follow these steps:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
from PIL import Image
import requests

ckpt = "allenai/MolmoAct-7B-D-Pretrain-0812"

# load the processor
processor = AutoProcessor.from_pretrained(
    ckpt,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    padding_side="left",
)

# load the model
model = AutoModelForImageTextToText.from_pretrained(
    ckpt,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

# task instruction
instruction = "pick up the black bowl on the stove and place it on the plate"

# strictly follow the following reasoning prompt
prompt = (
    f"The task is {instruction}. "
    "What is the action that the robot should take. "
    f"To figure out the action that the robot should take to {instruction}, "
    "let's think through it step by step. "
    "First, what is the depth map for this image? "
    "Second, what is the trajectory of the end effector? "
    "Based on the depth map of the image and the trajectory of the end effector, "
    "what is the action that the robot should take?"
)

# apply chat template
text = processor.apply_chat_template(
    [
        {
            "role": "user",
            "content": [dict(type="text", text=prompt)]
        }
    ],
    tokenize=False,
    add_generation_prompt=True,
)

# image observation: replace this path with your own RGB observation
# (e.g. a local file, or an image fetched with `requests`)
img = Image.open("/weka/oe-training-default/jiafeid/oxe-images/images/fractal20220817_data/0000000/0000.png")
imgs = [img]

# process the image and text
inputs = processor(
    images=[imgs],
    text=text,
    padding=True,
    return_tensors="pt",
)

# move inputs to the correct device
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# generate output
with torch.inference_mode():
    with torch.autocast("cuda", enabled=True, dtype=torch.bfloat16):
        generated_ids = model.generate(**inputs, max_new_tokens=448)

# only get generated tokens; decode them to text
generated_tokens = generated_ids[:, inputs['input_ids'].size(1):]
generated_text = processor.batch_decode(generated_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

# print the generated text
print(generated_text)

# >>> The depth map of the image is ... The trajectory of the end effector is ...
#     Based on these information, the action that the robot should take is ...

# parse out all depth perception tokens
depth = model.parse_depth(generated_text)
print(f"generated depth perception tokens: {depth}")

# >>> [ <DEPTH_START><DEPTH_1><DEPTH_2>...<DEPTH_END> ]

# parse out all visual reasoning traces
trace = model.parse_trace(generated_text)
print(f"generated visual reasoning trace: {trace}")

# >>> [ [[242, 115], [140, 77], [94, 58], [140, 44], [153, 26]] ]

# parse out all actions, using the unnorm key of fractal20220817_data
action = model.parse_action(generated_text, unnorm_key="fractal20220817_data")
print(f"generated action: {action}")

# >>> [ [0.0732076061122558, 0.08228153779226191, -0.027760173818644346,
#        0.15932856272248652, -0.09686601126895233, 0.043916773912953344,
#        0.996078431372549] ]
```
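
Before sending anything to hardware, it can help to draw the parsed trace on the input frame and inspect it; this is one concrete way to use the auditing workflow described under Model and Hardware Safety below. The sketch that follows is not part of the official MolmoAct API: it only uses PIL's `ImageDraw`, and it assumes the points returned by `parse_trace` are (x, y) pixel coordinates in the observation image (rescale them if your resolution differs).

```python
from PIL import ImageDraw

# minimal visualization sketch: draw each parsed trace as a polyline on a copy of the observation
# assumption: points are (x, y) pixel coordinates in `img`; rescale if your setup differs
vis = img.convert("RGB")
draw = ImageDraw.Draw(vis)
for polyline in trace:
    points = [tuple(p) for p in polyline]
    if len(points) > 1:
        draw.line(points, fill=(255, 0, 0), width=3)   # predicted end-effector path
    for x, y in points:
        draw.ellipse([x - 3, y - 3, x + 3, y + 3], fill=(0, 255, 0))  # waypoints
vis.save("trace_overlay.png")
```
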
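
The quick start above runs a single prediction. For closed-loop rollouts, the same generate-and-parse cycle is usually repeated with a fresh observation at every step. The sketch below continues from the variables defined above (`processor`, `model`, `text`); `get_observation()` and `robot.apply_action()` are hypothetical placeholders for your own camera and controller interfaces, not part of this repository.

```python
# closed-loop sketch (continues from the quick-start variables above);
# get_observation() and robot.apply_action() are hypothetical placeholders
def run_episode(robot, max_steps=100):
    for _ in range(max_steps):
        img = get_observation()  # current RGB frame as a PIL image
        inputs = processor(images=[[img]], text=text, padding=True, return_tensors="pt")
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.inference_mode(), torch.autocast("cuda", enabled=True, dtype=torch.bfloat16):
            generated_ids = model.generate(**inputs, max_new_tokens=448)
        new_tokens = generated_ids[:, inputs["input_ids"].size(1):]
        generated_text = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
        # one action per step, unnormalized with the chosen dataset statistics
        action = model.parse_action(generated_text, unnorm_key="fractal20220817_data")[0]
        robot.apply_action(action)  # replace with your robot API
```
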
## License and Use

This model is licensed under Apache 2.0. It is intended for research and educational use.
For more information, please see our [Responsible Use Guidelines](https://allenai.org/responsible-use).

## Model and Hardware Safety

MolmoAct offers the ability to inspect a visual trace of its intended actions in space before they occur, allowing users to ensure safe behavior by proactively auditing and adjusting the actions of any hardware acting under the model’s instructions. MolmoAct’s action space is bounded within the data provided, and compliance is built into the model to prevent excessive force when resistance is detected. Please follow the hardware manufacturer’s guidelines when using this model with a robot and perform all operations in a safely configured environment.
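
One way to act on the auditing guidance above is to bound every parsed action before it reaches a controller. The limits in this sketch are illustrative placeholders, not values shipped with MolmoAct; take them from your robot's specifications, and treat the 7-dimensional layout (translation deltas, rotation deltas, gripper) as an assumption to verify for your setup.

```python
import numpy as np

# illustrative per-dimension bounds; placeholders only -- set these from your hardware specs
ACTION_LOW  = np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])
ACTION_HIGH = np.array([ 0.05,  0.05,  0.05,  0.25,  0.25,  0.25, 1.0])

def audit_action(raw_action):
    """Clip a parsed 7-dim action into the configured bounds and report any clipping."""
    a = np.asarray(raw_action, dtype=np.float64)
    clipped = np.clip(a, ACTION_LOW, ACTION_HIGH)
    if not np.allclose(a, clipped):
        print(f"warning: action {a.tolist()} exceeded bounds; clipped to {clipped.tolist()}")
    return clipped

safe_action = audit_action(action[0])  # `action` as returned by model.parse_action above
```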