---
base_model: jan-hq/AlphaMaze-v0.2-1.5B
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
- trl
license: apache-2.0
language:
- en
---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# QuantFactory/AlphaMaze-v0.2-1.5B-GGUF
This is a quantized version of [homebrewltd/AlphaMaze-v0.2-1.5B](https://huggingface.co/homebrewltd/AlphaMaze-v0.2-1.5B), created with llama.cpp.
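
Because this repo ships GGUF files, the quants can also be run without Transformers via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python). A minimal sketch, assuming a Q4_K_M quant is present (pick whichever quant file in this repo suits your hardware):

```python
from llama_cpp import Llama

# Download and load a quant directly from this repo.
# The filename glob is an assumption; substitute any .gguf file in the repo.
llm = Llama.from_pretrained(
    repo_id="QuantFactory/AlphaMaze-v0.2-1.5B-GGUF",
    filename="*Q4_K_M.gguf",
    n_ctx=4096,  # maze prompts plus step-by-step reasoning can be long
)

# Use the full maze prompt shown in the "Run Locally" section below.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "<full maze prompt here>"}],
    max_tokens=2048,
    temperature=0.8,
)
print(out["choices"][0]["message"]["content"])
```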

# Original Model Card

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

<div align="center">

# AlphaMaze: Teaching LLMs to Think Visually
<!---
<a href='https://homebrew.ltd/blog/alpha-maze'><img src='https://img.shields.io/badge/Project-Blog-Green'></a>
<a href='https://huggingface.co/homebrewltd/AlphaMaze-v0.2-1.5B'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a>
<a href='https://huggingface.co/datasets/homebrewltd/Maze-Reasoning-v0.1'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'></a>
<a href='https://alphamaze.menlo.ai/'><img src='https://img.shields.io/badge/Project-Demo-violet'></a>
<a href=''><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
-->

[**About**](#about) | [**Demo**](#demo) | [**Models and Datasets**](#models-and-datasets) | [**Benchmarks**](#benchmarks) | [**How to Run Locally**](#run-locally)

<img src="./alphamaze.gif" width="400"/>
</div>

## About
Developed by **Menlo Research**, **AlphaMaze** is a novel model for evaluating and enhancing visual reasoning in LLMs. AlphaMaze challenges models with a deceptively simple task: solving mazes presented entirely in text. We further enhance AlphaMaze's capabilities using the GRPO (Group Relative Policy Optimization) method.

Prior research, like [Microsoft's "Multimodal Visualization-of-Thought (MVoT)"](https://arxiv.org/abs/2501.07542), explored visual reasoning through image generation. AlphaMaze takes a different, more focused path. We believe that if a model can internally reconstruct a maze from a text description and use that *mental map* to plan its moves, it demonstrates a genuine capacity for visual reasoning, even without generating a single image. AlphaMaze moves beyond the limitations of multiple-choice evaluations, providing a richer, more nuanced assessment of a model's spatial understanding. We're not just testing whether a model *can* solve a maze; we're revealing *how* it thinks about space.

## Demo

Watch AlphaMaze tackle a text-based maze! See how it interprets the maze, plans its moves, and strategically resets when it encounters a dead end.

[![Watch the AlphaMaze Demo](https://img.youtube.com/vi/dUS9wR03on8/0.jpg)](https://www.youtube.com/watch?v=dUS9wR03on8)

Alternatively, you can explore it on our [demo website](https://alphamaze.menlo.ai/).

## Models and Datasets

### Models

You can find our AlphaMaze models on Hugging Face 🤗! We're committed to open source and easy access for the research community.

| Model | Backbone | Size | Link |
|-------|----------|------|------|
| AlphaMaze-v0.2 | [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | 1.5B | [🤗 AlphaMaze-v0.2](https://huggingface.co/homebrewltd/AlphaMaze-v0.2-1.5B) |

### Datasets

We've released our datasets on Hugging Face 🤗 to support reproducibility and further research.

| Dataset | Description | Size | Link |
|---------|-------------|------|------|
| Maze-Reasoning-v0.1 | Training set used for Supervised Fine-Tuning (SFT) | 420k | [🤗 Maze-Reasoning-v0.1](https://huggingface.co/datasets/homebrewltd/Maze-Reasoning-v0.1) |
| Maze-Reasoning-Reset-v0.1 | Training set for SFT, including reset actions | 50k | [🤗 Maze-Reasoning-Reset-v0.1](https://huggingface.co/datasets/homebrewltd/Maze-Reasoning-Reset-v0.1) |
| Maze-Reasoning-GRPO-v0.1 | Training set used for GRPO | 180k | [🤗 Maze-Reasoning-GRPO-v0.1](https://huggingface.co/datasets/homebrewltd/Maze-Reasoning-GRPO-v0.1) |
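
The datasets load directly with the 🤗 `datasets` library. A minimal sketch (the split name is an assumption; check each dataset card):

```python
from datasets import load_dataset

# Load the SFT training set from the Hub; "train" is the assumed split name.
ds = load_dataset("homebrewltd/Maze-Reasoning-v0.1", split="train")
print(ds)
```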

## Benchmarks

### Supervised Fine-Tuning (SFT)

We used [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) for Supervised Fine-Tuning (SFT) of our AlphaMaze model. Here's a summary of our key training runs:

| Run ID | Model Config | Dataset | Steps | Final Loss | Hardware | Key Findings |
|--------|--------------|---------|-------|------------|----------|--------------|
| exp-1 | [Full Finetune (Qwen-1.5B)](https://github.com/janhq/visual-thinker/blob/main/training/Llama-factory-config/Qwen2.5_1.5B_distil.yaml) | [Maze Reasoning](https://huggingface.co/datasets/jan-hq/Maze-Reasoning) | 3125 | 0.01 | ~1.5 hours on 6xA6000 | Initial run with new maze tokens. Observed lower performance. |
| exp-2 | [Full Finetune (Qwen-1.5B)](https://github.com/janhq/visual-thinker/blob/main/training/Llama-factory-config/Qwen2.5_1.5B_distil.yaml) | [Maze Reasoning](https://huggingface.co/datasets/jan-hq/Maze-Reasoning) | 3125 | 0.01 | ~1.5 hours on 6xA6000 | Trained on pure text descriptions (no new tokens). Surprisingly strong performance. |
| exp-3 | [Full Finetune (Qwen-7B)](https://github.com/janhq/visual-thinker/blob/main/training/Llama-factory-config/Qwen2.5_7B_distil.yaml) | [Maze Reasoning](https://huggingface.co/datasets/jan-hq/Maze-Reasoning) | 2344 | 0.0077 | ~12 hours on 6xA6000 | Extended training with pure text descriptions (larger model). |
| exp-4 | [Full Finetune (Qwen-1.5B)](https://github.com/janhq/visual-thinker/blob/main/training/Llama-factory-config/Qwen2.5_1.5B_distil.yaml) | [Maze Reasoning](https://huggingface.co/datasets/jan-hq/Maze-Reasoning) | 2344 | ~0 | ~1.5 hours on 6xA6000 | Further extended training with pure text descriptions. Near-zero loss. |
| exp-5 | [Full Finetune (Qwen-1.5B)](https://github.com/janhq/visual-thinker/blob/main/training/Llama-factory-config/Qwen2.5_1.5B_distil.yaml) | [Maze Reasoning](https://huggingface.co/datasets/jan-hq/Maze-Reasoning) | 3681 | ~0.02 | ~1 hour on 8xH200 | Experiment with new maze tokens on different hardware. |

**Key Observations from SFT:**

* Adding new maze-specific tokens did *not* improve performance, and in some cases degraded it.
* Surprisingly, the model performed well even with *pure text descriptions*, suggesting a strong ability to learn spatial relationships from text alone.
* A training loss of nearly zero is concerning, as it suggests memorization rather than generalization.

**Note:** These results suggest that reducing token complexity can improve performance when translating spatial information into language.

### Group Relative Policy Optimization (GRPO)

We employed [Unsloth](https://unsloth.ai/) for Group Relative Policy Optimization (GRPO) to further refine the model's maze-solving policy.

The plot below shows the MazeBench scores (blue crosses) achieved during GRPO training, along with a linear regression trendline (red dashed line). The upward trend demonstrates that GRPO effectively guides the model towards improved maze-solving strategies.

![GRPO Training Progress](./grpo_progress.png)
_GRPO training progress, showing MazeBench scores over training steps._
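
GRPO samples a group of candidate solutions per maze and updates the policy using rewards normalized within each group, so only a scalar scoring function is needed. The card does not show the exact reward design used in training; the sketch below is a hypothetical move-accuracy reward, not the authors' implementation:

```python
import re

MOVE_RE = re.compile(r"<\|(up|down|left|right)\|>")

def maze_reward(completion: str, gold_moves: list[str]) -> float:
    """Hypothetical reward: full credit for an exact solution, partial
    credit for a matching prefix so that GRPO's group-relative advantages
    can still rank near-misses above unrelated outputs."""
    predicted = MOVE_RE.findall(completion)
    if predicted == gold_moves:
        return 1.0
    matching_prefix = 0
    for pred, gold in zip(predicted, gold_moves):
        if pred != gold:
            break
        matching_prefix += 1
    return 0.5 * matching_prefix / max(len(gold_moves), 1)
```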

## Run Locally

Here is an example of running AlphaMaze with Hugging Face Transformers:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "homebrewltd/AlphaMaze-v0.2-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load in fp16; attn_implementation="flash_attention_2" requires the
# flash-attn package to be installed.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# The prompt explains the token format and encodes the maze cell by cell.
maze = """You are a helpful assistant that solves mazes. You will be given a maze represented by a series of tokens. The tokens represent: - Coordinates: <|row-col|> (e.g., <|0-0|>, <|2-4|>) - Walls: <|no_wall|>, <|up_wall|>, <|down_wall|>, <|left_wall|>, <|right_wall|>, <|up_down_wall|>, etc. - Origin: <|origin|> - Target: <|target|> - Movement: <|up|>, <|down|>, <|left|>, <|right|>, <|blank|> Your task is to output the sequence of movements (<|up|>, <|down|>, <|left|>, <|right|>) required to navigate from the origin to the target, based on the provided maze representation. Think step by step. At each step, predict only the next movement token. Output only the move tokens, separated by spaces. MAZE: <|0-0|><|up_left_wall|><|blank|><|0-1|><|up_down_wall|><|blank|><|0-2|><|up_down_wall|><|blank|><|0-3|><|up_right_wall|><|blank|><|0-4|><|up_left_right_wall|><|blank|> <|1-0|><|down_left_wall|><|blank|><|1-1|><|up_right_wall|><|blank|><|1-2|><|up_left_wall|><|blank|><|1-3|><|down_right_wall|><|blank|><|1-4|><|left_right_wall|><|blank|> <|2-0|><|up_left_wall|><|blank|><|2-1|><|down_right_wall|><|blank|><|2-2|><|down_left_wall|><|blank|><|2-3|><|up_down_wall|><|blank|><|2-4|><|down_right_wall|><|target|> <|3-0|><|left_right_wall|><|blank|><|3-1|><|up_left_wall|><|origin|><|3-2|><|up_right_wall|><|blank|><|3-3|><|up_down_left_wall|><|blank|><|3-4|><|up_right_wall|><|blank|> <|4-0|><|down_left_wall|><|blank|><|4-1|><|down_right_wall|><|blank|><|4-2|><|down_left_wall|><|blank|><|4-3|><|up_down_wall|><|blank|><|4-4|><|down_right_wall|><|blank|>"""

messages = [
    {
        "role": "user",
        "content": maze
    }
]

# Build the chat-formatted input and sample a move sequence.
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
generated_ids = model.generate(
    input_ids,
    max_new_tokens=2500,
    temperature=0.8,
    repetition_penalty=1.1,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)
response = tokenizer.decode(
    generated_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(f"Solving maze: {response}")
```
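
The model is instructed to emit only the movement tokens, separated by spaces, at the end of its reasoning. A small post-processing helper (our addition, not part of the original example) can pull the final move sequence out of the decoded response:

```python
import re

def extract_moves(response: str) -> list[str]:
    """Collect the movement tokens (<|up|>, <|down|>, <|left|>, <|right|>)
    emitted by the model, in order of appearance."""
    return re.findall(r"<\|(up|down|left|right)\|>", response)

moves = extract_moves(response)  # `response` from the snippet above
print(f"Predicted path: {' '.join(moves)}")
```

If the move tokens are registered as special tokens in the tokenizer, decode with `skip_special_tokens=False` first so they survive into `response`.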

## Next Steps
We are exploring further GRPO enhancements to boost maze-solving capabilities. Stay tuned for more updates on how GRPO is paving the way for improved spatial reasoning in LLMs!

## Join Us

We're looking for collaborators and plan to expand the model's capabilities to include additional spatial tasks in the future.

## References

```bibtex
@misc{dao2025alphamazeenhancinglargelanguage,
      title={AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO},
      author={Alan Dao and Dinh Bach Vu},
      year={2025},
      eprint={2502.14669},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.14669},
}
```

## Acknowledgement

- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)
- [Unsloth](https://unsloth.ai/)
- [DeepSeek](https://github.com/deepseek-ai/DeepSeek-R1)
- [Multimodal Visualization-of-Thought (MVoT)](https://arxiv.org/abs/2501.07542)