# unsloth-JanusCoder-8B-qx86x-hi-mlx

🧠 Deep Comparison: unsloth-JanusCoder-8B vs. Qwen3-VLTO-8B

Let’s compare these two 8B models side by side on the same cognitive benchmarks, then interpret their differences through the lens of training domain, quantization strategy, and cognitive style.

📊 Performance Comparison Table

```bash
Model                            arc_challenge  arc_easy  boolq  hellaswag  openbookqa   piqa  winogrande
unsloth-JanusCoder-8B-qx86x-hi           0.538     0.739  0.869      0.700       0.444  0.788       0.668
Qwen3-VLTO-8B-Instruct-qx86x-hi          0.455     0.601  0.878      0.546       0.424  0.739       0.595
Qwen3-VLTO-8B-Instruct-qx85x-hi          0.453     0.608  0.874      0.545       0.426  0.747       0.596
Qwen3-VLTO-8B-Thinking-qx86x-hi          0.475     0.599  0.706      0.638       0.402  0.765       0.684
```

Note: three of the four models are quantized at qx86x-hi (the qx85x-hi variant is included for reference), so we’re comparing at essentially the same quantization level for fairness.
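
As a rough aggregate view (an illustration only; equal weighting across tasks is an assumption, not part of any published benchmark suite), the per-model mean over the seven tasks can be computed directly from the table above:

```python
# Illustrative aggregation of the benchmark table above.
# Equal weighting of the seven tasks is an assumption.
scores = {
    "unsloth-JanusCoder-8B-qx86x-hi":  [0.538, 0.739, 0.869, 0.700, 0.444, 0.788, 0.668],
    "Qwen3-VLTO-8B-Instruct-qx86x-hi": [0.455, 0.601, 0.878, 0.546, 0.424, 0.739, 0.595],
    "Qwen3-VLTO-8B-Instruct-qx85x-hi": [0.453, 0.608, 0.874, 0.545, 0.426, 0.747, 0.596],
    "Qwen3-VLTO-8B-Thinking-qx86x-hi": [0.475, 0.599, 0.706, 0.638, 0.402, 0.765, 0.684],
}

means = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Print models ranked by mean score, best first.
for name, mean in sorted(means.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {mean:.3f}")
```

On this naive average, JanusCoder leads (about 0.678), with the three VLTO variants clustered around 0.605–0.610, which matches the per-benchmark narrative below.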

🔍 Cognitive Pattern Comparison — Deep Dive

Let’s break down each benchmark to understand what kind of reasoning each model excels at, focusing on cognitive style.

🧩 A) Logical Inference (BoolQ)
- Winner: Qwen3-VLTO-8B-Instruct-qx86x-hi (0.878), followed closely by JanusCoder-8B (0.869).

✅ Cognitive Insight:
- VLTO-Instruct models are optimized for logical inference in natural language, likely fine-tuned on discourse-based reasoning tasks.
- JanusCoder is optimized for logical deduction in code-constrained environments, which still yields a strong BoolQ score, just behind VLTO-Instruct.

💡 Conclusion: For tasks requiring precise yes/no reasoning (BoolQ), VLTO-Instruct is superior: it is more “natural language aware” and better at interpreting linguistic nuance under logical constraints.

🧩 B) Abstract Reasoning (ARC Challenge)
- Winner: unsloth-JanusCoder-8B (0.538), followed by VLTO-Thinking (0.475) and VLTO-Instruct (0.455).

✅ Cognitive Insight:
- JanusCoder’s higher arc_challenge score suggests a strong ability to reason over structured abstractions, likely a product of code training.
- VLTO-Thinking and VLTO-Instruct score significantly lower, suggesting they are less effective at pure abstract reasoning without grounding or constraints.

💡 Conclusion: JanusCoder is better at abstract reasoning under code-style constraints (structured logic may effectively simulate abstract thinking). The VLTO models are not optimized for this; they are more “contextual” than abstract.

🧩 C) Commonsense Causal Reasoning (HellaSwag)
- Winner: unsloth-JanusCoder-8B (0.700), followed by VLTO-Thinking (0.638) and VLTO-Instruct (0.546).

✅ Cognitive Insight:
- JanusCoder excels at reasoning about cause-effect relationships, likely due to fine-tuning on code-based causal chains and structured reasoning.
- VLTO-Thinking beats VLTO-Instruct here, indicating that “thinking” mode helps with causal prediction even without vision.

💡 Conclusion: JanusCoder is the more “causal” model, likely because its training encodes structured causality through code. VLTO-Thinking is still strong, but does not match JanusCoder’s peak.

🧩 D) Pragmatic Reasoning (Winogrande)
- Winner: Qwen3-VLTO-8B-Thinking-qx86x-hi (0.684), followed closely by JanusCoder-8B (0.668), with VLTO-Instruct further back (0.595).

✅ Cognitive Insight:
- VLTO-Thinking excels here, likely because it is designed for human-like context tracking and coreference.
- JanusCoder is strong but not the best in this area, suggesting code-trained models are less context-aware than VLTO-Thinking.
- The “Thinking” flavor of Qwen3-VLTO is the most human-like on Winogrande; it is not just logic, but vibe and context.

💡 Conclusion: For tasks requiring natural, human-like pragmatic reasoning (Winogrande), the VLTO-Thinking variant is superior. This aligns with the hypothesis that “vibe” is contextual intuition, not code logic.

🧩 E) Factual Knowledge Recall (OpenBookQA)
- Winner: unsloth-JanusCoder-8B (0.444), the best score in this comparison; for reference, the external Qwen3-4B-RA-SFT reaches 0.436.

✅ Cognitive Insight:
- RA-SFT (reasoning + knowledge) fine-tuning likely adds retrieval and grounded knowledge, which explains its competitive openbookqa score at half the size.
- JanusCoder’s 0.444 is only slightly higher, implying that code training does not inherently improve factual recall unless it is grounded in external knowledge.

💡 Conclusion: JanusCoder-8B is the strongest factual performer here, slightly edging out the VLTO variants and hinting at implicit knowledge encoding in code training.

🧩 F) Physical Commonsense (PIQA)
- Winner: unsloth-JanusCoder-8B (0.788), ahead of VLTO-Thinking (0.765) and VLTO-Instruct (0.739).

✅ Cognitive Insight:
- Coding models have a slight edge, likely because they are trained to reason about physical constraints, spatial relationships, and object interactions in structured environments.
- VLTO-Thinking is the best of the VLTO models, showing that human-like intuition remains strong in physical reasoning, though not at the level of code-trained models.

💡 Conclusion: For spatial and physical reasoning tasks (PIQA), JanusCoder-8B is the top performer, thanks to a code-trained foundation that encodes physics and mechanics through structured reasoning.

📈 Performance Heat Map — Side-by-Side

```bash
Benchmark      JanusCoder-8B                        VLTO-Instruct-qx86x-hi                   VLTO-Thinking-qx86x-hi
arc_challenge  0.538  strong abstract reasoning     0.455  weakest on abstraction            0.475  moderate, language-based abstraction
arc_easy       0.739  best arc_easy (contextual)    0.601  strong, but not top               0.599  very close to Instruct variant
boolq          0.869  very strong logical inference 0.878  strongest boolq (language logic)  0.706  weaker structured logical reasoning
hellaswag      0.700  strong causal reasoning       0.546  moderate, needs more context      0.638  best causal reasoning among VLTO
openbookqa     0.444  best factual recall here      0.424  strong, but not best              0.402  weak on factual knowledge tasks
piqa           0.788  best physical commonsense     0.739  good, but not best                0.765  best VLTO piqa, still behind Janus
winogrande     0.668  strong pragmatic reasoning    0.595  weakest here                      0.684  strongest winogrande of all
```

🧠 Cognitive Profile Summary

unsloth-JanusCoder-8B
```bash
Code-Trained Logical Reasoner
Strengths:
  ✓ Strong logical inference (boolq 0.869)
  ✓ Excellent abstract reasoning (arc_challenge 0.538)
  ✓ Best causal reasoning (hellaswag 0.700)
  ✓ Top physical commonsense (piqa 0.788)
Weaknesses:
  ✗ Trails VLTO-Thinking on Winogrande (0.668 vs. 0.684); less context fluency
  ✗ Factual recall (openbookqa 0.444) is its lowest absolute score, though still best in this comparison
```

Qwen3-VLTO-8B-Thinking
```bash
Human-Like Pragmatic Interpreter
Strengths:
  ✓ Best Winogrande performance (0.684): strong coreference and contextual reasoning
  ✓ Good arc_easy (0.599): human-like context mapping
  ✓ Strong piqa (0.765): retains physical commonsense even without vision
  ✓ Strong hellaswag (0.638): causal reasoning with human intuition
Weaknesses:
  ✗ Weaker abstract reasoning (arc_challenge 0.475); cannot match JanusCoder
  ✗ Lower factual recall (openbookqa 0.402); lacks knowledge grounding
```

Qwen3-VLTO-8B-Instruct
```bash
Structured Factual Reasoner
Strengths:
  ✓ Strong boolq (0.878): formal logical inference
  ✓ Good factual recall (openbookqa 0.424): better than the Thinking variant
  ✓ Modest arc_easy (0.601): decent contextual reasoning
Weaknesses:
  ✗ Weakest Winogrande (0.595); lacks the “vibe” needed for nuanced pragmatics
  ✗ Weak hellaswag (0.546); struggles with causal prediction
  ✗ Lowest piqa (0.739) of the three; not ideal for physical reasoning tasks
```

🌟 Final Takeaway: “Thinking” vs. “Code Logic”

The unsloth-JanusCoder-8B and Qwen3-VLTO-8B-Thinking sit at two poles:

JanusCoder-8B
- ✅ Code-trained: focused on logical deduction and causal chains under structured constraints
- ✅ Excels at abstract reasoning, physical commonsense, and factual logic
- ❌ Less human-like; more “machine logic” than “human vibe”
- ❌ Weaker in contextual pragmatics (winogrande) and subtle cause-effect narratives

Qwen3-VLTO-8B-Thinking
- ✅ Not code-trained; more “human-like” by design, built to mimic intuitive judgment and language nuance
- ✅ Human-like pragmatic reasoning (winogrande 0.684)
- ✅ Rich context: strong on coreference and metaphor-driven reasoning
- ❌ Weaker at formal logic (boolq 0.706) and abstraction (arc_challenge 0.475)

🎯 Use Case Recommendations

```bash
Task                                         Best Model
Abstract Reasoning & Logic Puzzles           ➡️ unsloth-JanusCoder-8B: superior boolq and arc_challenge
Physical Commonsense & Mechanics             ➡️ unsloth-JanusCoder-8B: top piqa score (0.788)
Commonsense Causal Prediction                ➡️ unsloth-JanusCoder-8B: best hellaswag score (0.700)
Factual Knowledge Recall                     ➡️ unsloth-JanusCoder-8B: best openbookqa here (0.444); the smaller Qwen3-4B-RA-SFT (0.436) is a close alternative
Human-Like Dialogue & Pragmatic Reasoning    ➡️ Qwen3-VLTO-8B-Thinking: best winogrande (0.684), most contextually fluent
Creative Interpretation & Vibe-Driven Tasks  ➡️ Qwen3-VLTO-8B-Thinking: metaphor-friendly, human-like reasoning
```

📌 Summary: “Human Thinking” vs. “Code Logic”

These models represent two complementary forms of cognition:

- JanusCoder-8B is optimized for structured logic, causal prediction, and abstract reasoning. It is the “engineer” or “mathematician” model: precise and robust, but less human-like in context.
- Qwen3-VLTO-8B-Thinking is optimized for human-like pragmatic intuition, context-aware reasoning, and metaphor-driven interpretation. It is the “intuitive thinker”: fuzzy logic and rich context, but less precise in formal reasoning.

🌟 There is no single winner; it depends on the kind of reasoning you need:
- For technical or abstract reasoning → JanusCoder
- For human-like contextual understanding → VLTO-Thinking
> Reviewed with [Qwen3-VLTO-32B-Instruct-128K-qx86x-hi-mlx](https://huggingface.co/nightmedia/Qwen3-VLTO-32B-Instruct-128K-qx86x-hi-mlx)

This model [unsloth-JanusCoder-8B-qx86x-hi-mlx](https://huggingface.co/nightmedia/unsloth-JanusCoder-8B-qx86x-hi-mlx) was
converted to MLX format from [unsloth/JanusCoder-8B](https://huggingface.co/unsloth/JanusCoder-8B)
using mlx-lm version **0.28.4**.
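
A minimal usage sketch with mlx-lm (the standard loading path for MLX-format models; assumes Apple silicon and `pip install mlx-lm`, and downloads several GB of weights on first run):

```python
from mlx_lm import load, generate

# Fetches the quantized weights from the Hugging Face Hub on first use.
model, tokenizer = load("nightmedia/unsloth-JanusCoder-8B-qx86x-hi-mlx")

prompt = "Write a Python function that checks whether a string is a palindrome."

# Apply the model's chat template if one is defined.
if tokenizer.chat_template is not None:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}], add_generation_prompt=True
    )

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```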