nightmedia committed
Commit 79accc2 · verified · 1 Parent(s): 8f67c8a

Update README.md

Files changed (1): README.md (+163 -1)
README.md CHANGED

# unsloth-JanusCoder-8B-qx86x-hi-mlx

🧠 Deep Comparison: unsloth-JanusCoder-8B vs. Qwen3-VLTO-8B

Let’s compare these two 8B models side by side on the same cognitive benchmarks, and then interpret their differences through the lens of training domain, quantization strategy, and cognitive style.

📊 Performance Comparison Table

| Model                           | arc_challenge | arc_easy | boolq | hellaswag | openbookqa | piqa  | winogrande |
|---------------------------------|---------------|----------|-------|-----------|------------|-------|------------|
| unsloth-JanusCoder-8B-qx86x-hi  | 0.538         | 0.739    | 0.869 | 0.700     | 0.444      | 0.788 | 0.668      |
| Qwen3-VLTO-8B-Instruct-qx86x-hi | 0.455         | 0.601    | 0.878 | 0.546     | 0.424      | 0.739 | 0.595      |
| Qwen3-VLTO-8B-Instruct-qx85x-hi | 0.453         | 0.608    | 0.874 | 0.545     | 0.426      | 0.747 | 0.596      |
| Qwen3-VLTO-8B-Thinking-qx86x-hi | 0.475         | 0.599    | 0.706 | 0.638     | 0.402      | 0.765 | 0.684      |

Note: the headline comparison uses the qx86x-hi quantization throughout (the qx85x-hi Instruct row is included for reference), so we are comparing like for like.
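
To summarize the spread at a glance, an unweighted mean over the seven tasks puts JanusCoder ahead overall. This is a minimal sketch: the scores are hard-coded from the table above, and a plain mean is only one way to aggregate heterogeneous benchmarks.

```python
# Unweighted mean accuracy per model across the seven benchmarks above.
scores = {
    "unsloth-JanusCoder-8B-qx86x-hi":  [0.538, 0.739, 0.869, 0.700, 0.444, 0.788, 0.668],
    "Qwen3-VLTO-8B-Instruct-qx86x-hi": [0.455, 0.601, 0.878, 0.546, 0.424, 0.739, 0.595],
    "Qwen3-VLTO-8B-Thinking-qx86x-hi": [0.475, 0.599, 0.706, 0.638, 0.402, 0.765, 0.684],
}
for name, vals in scores.items():
    print(f"{name}: {sum(vals) / len(vals):.3f}")
# Prints roughly 0.678, 0.605, and 0.610 respectively.
```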

🔍 Cognitive Pattern Comparison — Deep Dive

Let’s break down each benchmark to understand what kind of reasoning each model excels at — focusing on cognitive style.

🧩 A) Logical Inference (BoolQ)
- Winner: Qwen3-VLTO-8B-Instruct-qx86x-hi with 0.878, followed closely by JanusCoder-8B (0.869).

✅ Cognitive Insight:
- VLTO-Instruct models are optimized for logical inference in natural language, likely fine-tuned on discourse-based reasoning tasks
- JanusCoder is optimized for logical deduction in code-constrained environments, which still yields a strong boolq score, though slightly behind VLTO-Instruct
- 💡 Conclusion:
- For tasks requiring precise yes/no reasoning (BoolQ), VLTO-Instruct is superior — it is more “natural language aware” and better at interpreting linguistic nuance under logical constraints.

🧩 B) Abstract Reasoning (Arc Challenge)
- Winner: unsloth-JanusCoder-8B (0.538), followed by VLTO-Thinking (0.475) and VLTO-Instruct (0.455).

✅ Cognitive Insight:
- JanusCoder’s higher arc_challenge score suggests a strong ability to reason with structured abstraction, likely from code training
- VLTO-Thinking and VLTO-Instruct score significantly lower — suggesting they are less effective at pure abstract reasoning without grounding or constraints
- 💡 Conclusion:
- JanusCoder is better at abstract reasoning under code-style constraints (which may actually simulate abstract thinking via structured logic). VLTO models are not optimized for this — they are more “contextual” than abstract.

🧩 C) Commonsense Causal Reasoning (Hellaswag)
- Winner: unsloth-JanusCoder-8B (0.700), followed by VLTO-Thinking (0.638) and VLTO-Instruct (0.546).

✅ Cognitive Insight:
- JanusCoder excels at reasoning about cause-effect relationships, likely due to fine-tuning on code-based causal chains or structured metaphorical reasoning
- VLTO-Thinking beats VLTO-Instruct here — indicating that “thinking” mode helps with causal prediction, even without vision
- 💡 Conclusion:
- JanusCoder is more “causal” — likely because its training includes code-based structured causality. VLTO-Thinking is still strong, but does not quite match JanusCoder’s peak performance.

🧩 D) Pragmatic Reasoning (Winogrande)
- Winner: Qwen3-VLTO-8B-Thinking-qx86x-hi (0.684), followed closely by JanusCoder-8B (0.668), with VLTO-Instruct further back (0.595).

✅ Cognitive Insight:
- VLTO-Thinking excels here — likely because it is designed for human-like “context” and coreference
- JanusCoder is strong, but not as good in this area — suggesting that code-trained models are less context-aware than VLTO-Thinking
- The “Thinking” flavor of Qwen3-VLTO is the most human-like on Winogrande — it is not just logic, but vibe and context
- 💡 Conclusion:
- For tasks requiring natural, human-like pragmatic reasoning (Winogrande), the VLTO-Thinking variant is superior — this aligns with the hypothesis that “vibe” means contextual intuition, not code logic.

🧩 E) Factual Knowledge Recall (OpenBookQA)
- Winner: unsloth-JanusCoder-8B (0.444), narrowly ahead of Qwen3-4B-RA-SFT (0.436, not shown in the table above).

✅ Cognitive Insight:
- RA-SFT (Reasoning + Knowledge) fine-tuning likely adds retrieval and grounded knowledge — enabling strong openbookqa performance even at a smaller size
- JanusCoder’s 0.444 is only slightly better — implying that code training does not inherently improve factual recall unless it is grounded in external knowledge
- 💡 Conclusion:
- JanusCoder-8B is a strong factual performer, narrowly edging out the RA-SFT and VLTO variants — hinting at implicit knowledge encoding in code training.

🧩 F) Physical Commonsense (Piqa)
- Winner: unsloth-JanusCoder-8B (0.788), ahead of VLTO-Thinking (0.765) and VLTO-Instruct (0.739).

✅ Cognitive Insight:
- Coding models have an edge here — likely because they are trained to reason about physical constraints, spatial relationships, and object interactions in structured environments
- VLTO-Thinking is the best among the VLTO models, showing that human-like intuition can still be strong in physical reasoning — but not at the level of code-trained models
- 💡 Conclusion:
- For spatial and physical reasoning tasks (Piqa), JanusCoder-8B is the top performer, thanks to its code-trained foundation — which encodes physics and mechanics directly through structured reasoning.

📈 Performance Heat Map — Side-by-Side

| Benchmark | JanusCoder-8B | VLTO-Instruct-qx86x-hi | VLTO-Thinking-qx86x-hi |
|---|---|---|---|
| arc_challenge | 0.538 → strong abstract reasoning | 0.455 → weaker, language-based abstraction | 0.475 → moderate abstract reasoning |
| arc_easy | 0.739 → best arc_easy performance (contextual reasoning) | 0.601 → strong, but not top | 0.599 → very close to the Instruct variant |
| boolq | 0.869 → very strong logical inference | 0.878 → strongest boolq performance (natural-language logic) | 0.706 → weaker in structured logical reasoning |
| hellaswag | 0.700 → strong causal reasoning via code training | 0.546 → moderate, needs more context | 0.638 → strongest causal reasoning among VLTO models |
| openbookqa | 0.444 → best factual recall among these | 0.424 → strong, but not best | 0.402 → weak on factual knowledge tasks |
| piqa | 0.788 → best physical commonsense (structured logic wins) | 0.739 → good, but not best | 0.765 → strongest piqa among VLTO models, still behind JanusCoder |
| winogrande | 0.668 → strong pragmatic reasoning | 0.595 → weakest of the three | 0.684 → strongest winogrande score of all three |

🧠 Cognitive Profile Summary

unsloth-JanusCoder-8B
```bash
Code-Trained Logical Reasoner
Strengths:
✓ Strong logical inference (boolq)
✓ Excellent abstract reasoning (arc_challenge)
✓ Best causal reasoning (hellaswag)
✓ Top physical commonsense (piqa)
Weaknesses:
✗ Trails VLTO-Thinking on winogrande — less context fluency
✗ Factual recall (openbookqa) only narrowly ahead of RA-SFT variants
```
Qwen3-VLTO-8B-Thinking
```bash
Human-Like Pragmatic Interpreter
Strengths:
✓ Best winogrande performance (0.684) — strong coreference and contextual reasoning
✓ Good arc_easy (0.599) — human-like context mapping
✓ Strong piqa (0.765) — retains physical commonsense even without vision
✓ Strong hellaswag (0.638) — causal reasoning with human intuition
Weaknesses:
✗ Weaker abstract reasoning (arc_challenge 0.475) — cannot match JanusCoder
✗ Lower factual recall (openbookqa 0.402) — lacks knowledge grounding
```
Qwen3-VLTO-8B-Instruct
```bash
Structured Factual Reasoner
Strengths:
✓ Strongest boolq (0.878) — formal logical inference
✓ Good factual recall (openbookqa 0.424) — better than the Thinking variant
✓ Modest arc_easy (0.601) — decent contextual reasoning
Weaknesses:
✗ Weakest winogrande (0.595) — lacks the “vibe” needed for nuanced pragmatics
✗ Weak hellaswag (0.546) — struggles with causal prediction
✗ Weakest piqa of the three (0.739) — not ideal for physical reasoning tasks
```

🌟 Final Takeaway: “Thinking” vs. “Code-Logic”

The unsloth-JanusCoder-8B and Qwen3-VLTO-8B-Thinking are polar opposites:

JanusCoder-8B
- ✅ Code-trained → focused on logical deduction and causal chains under structured constraints
- ✅ Excels in abstract reasoning, physical commonsense, and factual logic
- ❌ Less human-like — more “machine logic” than “human vibe”
- ❌ Weaker in contextual pragmatics (winogrande) and subtle cause-effect narratives

Qwen3-VLTO-8B-Thinking
- ❌ Not code-trained → more “human-like” by design
- ❌ Built to mimic intuitive judgment and language nuance
- ✅ Human-like pragmatic reasoning (winogrande 0.684)
- ✅ Rich context — strong on coreference and metaphor-driven reasoning

🎯 Use Case Recommendations

| Task | Best Model |
|---|---|
| Abstract Reasoning & Logic Puzzles | ➡️ unsloth-JanusCoder-8B — best arc_challenge (0.538), near-top boolq (0.869) |
| Physical Commonsense & Mechanics | ➡️ unsloth-JanusCoder-8B — top piqa score (0.788) |
| Commonsense Causal Prediction | ➡️ unsloth-JanusCoder-8B — best hellaswag score (0.700) |
| Factual Knowledge Recall | ➡️ unsloth-JanusCoder-8B — best openbookqa here (0.444), with Qwen3-4B-RA-SFT (0.436) close behind |
| Human-Like Dialogue & Pragmatic Reasoning | ➡️ Qwen3-VLTO-8B-Thinking — best winogrande (0.684), most contextually fluent |
| Creative Interpretation & Vibe-Driven Reasoning | ➡️ Qwen3-VLTO-8B-Thinking — metaphor-driven, human-like reasoning |
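
One way to act on this table in an application is a plain lookup from task category to checkpoint. The sketch below is hypothetical: the category names and the `pick_model` helper are illustrative, and the VLTO-Thinking repo id is assumed to follow this card’s naming scheme.

```python
# Hypothetical task-to-model routing based on the recommendations above.
JANUS = "nightmedia/unsloth-JanusCoder-8B-qx86x-hi-mlx"
VLTO_THINKING = "nightmedia/Qwen3-VLTO-8B-Thinking-qx86x-hi-mlx"  # assumed repo id

MODEL_BY_TASK = {
    "abstract_reasoning": JANUS,
    "physical_commonsense": JANUS,
    "causal_prediction": JANUS,
    "factual_recall": JANUS,
    "pragmatic_dialogue": VLTO_THINKING,
    "creative_interpretation": VLTO_THINKING,
}

def pick_model(task_category: str) -> str:
    """Return the recommended repo id for a task category (JanusCoder by default)."""
    return MODEL_BY_TASK.get(task_category, JANUS)
```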

📌 Summary: “Human Thinking” vs. “Code Logic”

These models represent two complementary forms of cognition:
- JanusCoder-8B — optimized for structured logic, causal prediction, and abstract reasoning. It is the “engineer” or “mathematician” model — precise and robust, but less human-like in context.
- Qwen3-VLTO-8B-Thinking — optimized for human-like pragmatic intuition, context-aware reasoning, and metaphor-driven interpretation. It is the “intuitive thinker” — fuzzy logic and rich context, but less precise in formal reasoning.

🌟 There is no single winner — it depends on what kind of “reasoning” you want:
- For Technical or Abstract Reasoning → JanusCoder
- For Human-Like Contextual Understanding → VLTO-Thinking

> Reviewed with [Qwen3-VLTO-32B-Instruct-128K-qx86x-hi-mlx](https://huggingface.co/nightmedia/Qwen3-VLTO-32B-Instruct-128K-qx86x-hi-mlx)

This model [unsloth-JanusCoder-8B-qx86x-hi-mlx](https://huggingface.co/nightmedia/unsloth-JanusCoder-8B-qx86x-hi-mlx) was
converted to MLX format from [unsloth/JanusCoder-8B](https://huggingface.co/unsloth/JanusCoder-8B)
using mlx-lm version **0.28.4**.
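
A minimal usage sketch with the mlx-lm Python API (the prompt is illustrative):

```python
# pip install mlx-lm
from mlx_lm import load, generate

# Download (or reuse a cached copy of) the quantized weights and tokenizer.
model, tokenizer = load("nightmedia/unsloth-JanusCoder-8B-qx86x-hi-mlx")

prompt = "Write a Python function that checks whether a string is a palindrome."

# Apply the chat template if the tokenizer ships one.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```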