LZXzju
/

Qwen2.5-VL-3B-UI-R1-E

@@ -33,3 +33,108 @@ Project page: https://github.com/lll6gg/UI-R1
 | GUI-R1-3B      | w/ thinking    | 114             | 26.6             |
 | UI-R1-3B (v2)  | w/ thinking    | 129             | 29.8             |
 | **UI-R1-E-3B**     | w/o thinking   | **28**          | **33.5**         |

+## Evaluation Method for GUI Grounding
+1. Load the model, build the prompt, and run inference with UI-R1-E-3B:
+   ```python
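+   import torch
+   from qwen_vl_utils import process_vision_info
+   from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+
+   # args.model_path, ori_processor_path, rank, task_prompt, and image_path
+   # are supplied by the surrounding evaluation script.
+   model = Qwen2_5_VLForConditionalGeneration.from_pretrained(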
+       args.model_path,
+       torch_dtype=torch.bfloat16,
+       attn_implementation="flash_attention_2",
+       device_map="cpu",
+   )
+   model = model.to(torch.device(rank))
+   model = model.eval()
+   processor = AutoProcessor.from_pretrained(ori_processor_path)
+   question_template = (
+       f"In this UI screenshot, I want to perform the command '{task_prompt}'.\n"
+       "Please provide the action to perform (enumerate in ['click'])"
+       "and the coordinate where the cursor is moved to(integer) if click is performed.\n"
+       "Output the final answer in <answer> </answer> tags directly."
+       "The output answer format should be as follows:\n"
+       "<answer>[{'action': 'click', 'coordinate': [x, y]}]</answer>\n"
+       "Please strictly follow the format."
+   )
+   query = '<image>\n' + question_template
+   messages = [
+       {
+           "role": "user",
+           "content": [
+               {"type": "image", "image": image_path}
+           ] + [{"type": "text", "text": query}],
+       }
+   ]
+   text = processor.apply_chat_template(
+       messages, tokenize=False, add_generation_prompt=True
+   )
+   image_inputs, video_inputs = process_vision_info(messages)
+   inputs = processor(
+       text=[text],
+       images=image_inputs,
+       videos=video_inputs,
+       padding=True,
+       return_tensors="pt",
+   )
+   inputs = inputs.to(model.device)  # keep the inputs on the same device as the model
+   generated_ids = model.generate(**inputs, max_new_tokens=1024)
+   generated_ids_trimmed = [
+       out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+   ]
+   response = processor.batch_decode(
+       generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+   )
+   response = response[0]
+   # extract_coord is not shown here; it parses the predicted [x, y] from the <answer> tags.
+   pred_coord, _ = extract_coord(response)
+   ```
+2. Rescale the predicted coordinate from the resized image back to the original image size:
+   ```python
+   from PIL import Image
+
+   image = Image.open(image_path)
+   origin_width, origin_height = image.size
+   # The image is resized before inference, so the prediction is in resized-image pixels.
+   resized_height, resized_width = smart_resize(origin_height, origin_width, max_pixels=12845056)
+   scale_x = origin_width / resized_width
+   scale_y = origin_height / resized_height
+   # Map the prediction back to original-image pixels.
+   pred_coord[0] = int(pred_coord[0] * scale_x)
+   pred_coord[1] = int(pred_coord[1] * scale_y)
+   ```
+   The `smart_resize` function comes from Qwen2VL:
+   ```python
+   import math
+   def smart_resize(
+       height: int, width: int, factor: int = 28, min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280
+   ):
+       """Rescales the image so that the following conditions are met:
+       1. Both dimensions (height and width) are divisible by 'factor'.
+       2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
+       3. The aspect ratio of the image is maintained as closely as possible.
+       """
+       if height < factor or width < factor:
+           raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
+       elif max(height, width) / min(height, width) > 200:
+           raise ValueError(
+               f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
+           )
+       h_bar = round(height / factor) * factor
+       w_bar = round(width / factor) * factor
+       if h_bar * w_bar > max_pixels:
+           beta = math.sqrt((height * width) / max_pixels)
+           h_bar = math.floor(height / beta / factor) * factor
+           w_bar = math.floor(width / beta / factor) * factor
+       elif h_bar * w_bar < min_pixels:
+           beta = math.sqrt(min_pixels / (height * width))
+           h_bar = math.ceil(height * beta / factor) * factor
+           w_bar = math.ceil(width * beta / factor) * factor
+       return h_bar, w_bar
+   ```
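
The steps above call `extract_coord` without defining it, so the following is a minimal sketch of how the parsing and rescaling could fit together. Both `extract_coord` and `rescale_coord` here are hypothetical helpers written for illustration, assuming the response follows the `<answer>[{'action': 'click', 'coordinate': [x, y]}]</answer>` format requested in the prompt; the actual implementation in the UI-R1 repository may differ. The resized size `(1932, 1092)` is what `smart_resize(1080, 1920, max_pixels=12845056)` yields for a 1920x1080 screenshot.

```python
import ast
import re


def extract_coord(response: str):
    """Hypothetical parser: return ([x, y], True) on success, ([0, 0], False) otherwise."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return [0, 0], False
    try:
        # The answer body is a Python-style literal: a list with one action dict.
        actions = ast.literal_eval(match.group(1).strip())
        x, y = actions[0]["coordinate"]
        return [int(x), int(y)], True
    except (ValueError, SyntaxError, KeyError, IndexError, TypeError):
        return [0, 0], False


def rescale_coord(pred_coord, origin_size, resized_size):
    """Map a point from resized-image pixels back to original-image pixels.

    Both sizes are (width, height) tuples; truncation with int() mirrors step 2.
    """
    origin_width, origin_height = origin_size
    resized_width, resized_height = resized_size
    scale_x = origin_width / resized_width
    scale_y = origin_height / resized_height
    return [int(pred_coord[0] * scale_x), int(pred_coord[1] * scale_y)]


# Example response in the format requested by the prompt.
response = "<answer>[{'action': 'click', 'coordinate': [1000, 500]}]</answer>"
pred_coord, ok = extract_coord(response)

# 1920x1080 screenshot; resized to 1932x1092 (width x height) by smart_resize.
final_coord = rescale_coord(pred_coord, (1920, 1080), (1932, 1092))
print(ok, pred_coord, final_coord)  # True [1000, 500] [993, 494]
```

Returning a success flag alongside the coordinate matches the two-value unpacking in step 1 and lets a caller count malformed responses as misses instead of raising.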