eaddario committed
Commit 10312a8 · verified · 1 parent: 087424a

Generate Perplexity, KLD, ARC, HellaSwag, MMLU, Truthful QA and WinoGrande scores
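Scores like these are typically produced with llama.cpp's `llama-perplexity` tool, run once per benchmark against each quantised GGUF. The sketch below builds the command lines involved; the dataset file names (`arc.bin`, `hellaswag.txt`, etc.) and the base-logits file name are placeholders, not necessarily the ones used for this commit.

```python
# Sketch (assumptions noted in comments): builds llama-perplexity command
# lines for the benchmarks in this commit. Dataset file names are
# placeholders; the flags themselves are llama.cpp's.
def score_commands(model: str, tasks: int = 750) -> list[list[str]]:
    """Build one llama-perplexity command line per benchmark."""
    mc = ["--multiple-choice", "--multiple-choice-tasks", str(tasks)]
    return [
        # ARC, MMLU and TruthfulQA all use the multiple-choice mode:
        ["llama-perplexity", "-m", model, *mc, "-f", "arc.bin"],
        ["llama-perplexity", "-m", model, *mc, "-f", "mmlu.bin"],
        ["llama-perplexity", "-m", model, *mc, "-f", "truthfulqa.bin"],
        ["llama-perplexity", "-m", model,
         "--hellaswag", "--hellaswag-tasks", str(tasks), "-f", "hellaswag.txt"],
        ["llama-perplexity", "-m", model,
         "--winogrande", "--winogrande-tasks", str(tasks), "-f", "winogrande.csv"],
        # Perplexity + KL divergence against logits saved from the F16 base:
        ["llama-perplexity", "-m", model,
         "--kl-divergence", "--kl-divergence-base", "logits-F16.bin"],
    ]

cmds = score_commands("./Watt-Tool-8B-F16.gguf")
```

Each command's console output corresponds to one of the `.arc`/`.hsw`/`.mmlu`/`.tqa`/`.wng`/`.ppx` files in the diff below.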

Files changed (50): this view is limited to 50 files because the commit contains too many changes; see the raw diff for the complete change set.
  1. scores/Watt-Tool-8B-F16.arc +6 -6
  2. scores/Watt-Tool-8B-F16.hsw +5 -5
  3. scores/Watt-Tool-8B-F16.mmlu +5 -5
  4. scores/Watt-Tool-8B-F16.tqa +6 -6
  5. scores/Watt-Tool-8B-F16.wng +5 -5
  6. scores/Watt-Tool-8B-Q4_K_M-naive.arc +0 -13
  7. scores/Watt-Tool-8B-Q4_K_M-naive.hsw +0 -12
  8. scores/Watt-Tool-8B-Q4_K_M-naive.mmlu +0 -13
  9. scores/Watt-Tool-8B-Q4_K_M-naive.ppx +0 -37
  10. scores/Watt-Tool-8B-Q4_K_M-naive.tqa +0 -13
  11. scores/Watt-Tool-8B-Q4_K_M-naive.wng +0 -11
  12. scores/Watt-Tool-8B-iq3_m.arc +6 -6
  13. scores/Watt-Tool-8B-iq3_m.hsw +5 -5
  14. scores/Watt-Tool-8B-iq3_m.mmlu +5 -5
  15. scores/Watt-Tool-8B-iq3_m.ppx +30 -30
  16. scores/Watt-Tool-8B-iq3_m.tqa +6 -6
  17. scores/Watt-Tool-8B-iq3_m.wng +5 -5
  18. scores/Watt-Tool-8B-iq3_s.arc +6 -6
  19. scores/Watt-Tool-8B-iq3_s.hsw +5 -5
  20. scores/Watt-Tool-8B-iq3_s.mmlu +5 -5
  21. scores/Watt-Tool-8B-iq3_s.ppx +31 -31
  22. scores/Watt-Tool-8B-iq3_s.tqa +6 -6
  23. scores/Watt-Tool-8B-iq3_s.wng +5 -5
  24. scores/Watt-Tool-8B-iq4_nl.arc +6 -6
  25. scores/Watt-Tool-8B-iq4_nl.hsw +5 -5
  26. scores/Watt-Tool-8B-iq4_nl.mmlu +5 -5
  27. scores/Watt-Tool-8B-iq4_nl.ppx +31 -31
  28. scores/Watt-Tool-8B-iq4_nl.tqa +6 -6
  29. scores/Watt-Tool-8B-iq4_nl.wng +5 -5
  30. scores/Watt-Tool-8B-q3_k_l.arc +6 -6
  31. scores/Watt-Tool-8B-q3_k_l.hsw +5 -5
  32. scores/Watt-Tool-8B-q3_k_l.mmlu +5 -5
  33. scores/Watt-Tool-8B-q3_k_l.ppx +31 -31
  34. scores/Watt-Tool-8B-q3_k_l.tqa +6 -6
  35. scores/Watt-Tool-8B-q3_k_l.wng +5 -5
  36. scores/Watt-Tool-8B-q3_k_m.arc +6 -6
  37. scores/Watt-Tool-8B-q3_k_m.hsw +5 -5
  38. scores/Watt-Tool-8B-q3_k_m.mmlu +5 -5
  39. scores/Watt-Tool-8B-q3_k_m.ppx +31 -31
  40. scores/Watt-Tool-8B-q3_k_m.tqa +6 -6
  41. scores/Watt-Tool-8B-q3_k_m.wng +5 -5
  42. scores/Watt-Tool-8B-q3_k_s.arc +6 -6
  43. scores/Watt-Tool-8B-q3_k_s.hsw +5 -5
  44. scores/Watt-Tool-8B-q3_k_s.mmlu +5 -5
  45. scores/Watt-Tool-8B-q3_k_s.ppx +31 -31
  46. scores/Watt-Tool-8B-q3_k_s.tqa +6 -6
  47. scores/Watt-Tool-8B-q3_k_s.wng +5 -5
  48. scores/Watt-Tool-8B-q4_k_m.arc +6 -6
  49. scores/Watt-Tool-8B-q4_k_m.hsw +5 -5
  50. scores/Watt-Tool-8B-q4_k_m.mmlu +5 -5
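The score files in the diffs below all report their headline number on a `Final result: X +/- Y` line (WinoGrande uses `Final Winogrande score(N tasks): X +/- Y`). A minimal parser for those lines, as a sketch against the log layout shown in this commit:

```python
import re

# Matches "Final result: 65.8667 +/- 1.7325" and
# "Final Winogrande score(750 tasks): 74.8000 +/- 1.5864".
SCORE_RE = re.compile(
    r"Final (?:result|Winogrande score\(\d+ tasks\)):\s*([\d.]+)\s*\+/-\s*([\d.]+)"
)

def parse_score(text: str) -> "tuple[float, float] | None":
    """Return (score, standard error) from a score log, or None if absent."""
    m = SCORE_RE.search(text)
    return (float(m.group(1)), float(m.group(2))) if m else None

print(parse_score("Final result: 65.8667 +/- 1.7325"))
```

The HellaSwag files (`.hsw`) instead print a bare `750 78.66666667% [...]` line and would need a separate pattern.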
scores/Watt-Tool-8B-F16.arc CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 36 key-value pairs and 292 tensors from ./Watt-Tool-8B-F16.gguf (version GGUF V3 (latest))

- Final result: 65.2870 +/- 1.7406
- Random chance: 25.0334 +/- 1.5840


- llama_perf_context_print: load time = 7009.00 ms
- llama_perf_context_print: prompt eval time = 157202.97 ms / 36703 tokens ( 4.28 ms per token, 233.48 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 159569.20 ms / 36704 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 36 key-value pairs and 292 tensors from ./Watt-Tool-8B-F16.gguf (version GGUF V3 (latest))

+ Final result: 65.8667 +/- 1.7325
+ Random chance: 25.0083 +/- 1.5824


+ llama_perf_context_print: load time = 7049.26 ms
+ llama_perf_context_print: prompt eval time = 109446.86 ms / 36600 tokens ( 2.99 ms per token, 334.41 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 110483.23 ms / 36601 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-F16.hsw CHANGED
@@ -1,12 +1,12 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 36 key-value pairs and 292 tensors from ./Watt-Tool-8B-F16.gguf (version GGUF V3 (latest))

- 750 80.93333333


- llama_perf_context_print: load time = 622.45 ms
- llama_perf_context_print: prompt eval time = 412508.17 ms / 125702 tokens ( 3.28 ms per token, 304.73 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 418728.78 ms / 125703 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 36 key-value pairs and 292 tensors from ./Watt-Tool-8B-F16.gguf (version GGUF V3 (latest))

+ 750 78.66666667% [75.5926%, 81.4486%]


+ llama_perf_context_print: load time = 580.08 ms
+ llama_perf_context_print: prompt eval time = 381945.70 ms / 126448 tokens ( 3.02 ms per token, 331.06 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 386591.06 ms / 126449 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-F16.mmlu CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 36 key-value pairs and 292 tensors from ./Watt-Tool-8B-F16.gguf (version GGUF V3 (latest))

- Final result: 42.1333 +/- 1.8042
  Random chance: 25.0000 +/- 1.5822


- llama_perf_context_print: load time = 625.31 ms
- llama_perf_context_print: prompt eval time = 245071.99 ms / 69227 tokens ( 3.54 ms per token, 282.48 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 247990.59 ms / 69228 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 36 key-value pairs and 292 tensors from ./Watt-Tool-8B-F16.gguf (version GGUF V3 (latest))

+ Final result: 40.9333 +/- 1.7967
  Random chance: 25.0000 +/- 1.5822


+ llama_perf_context_print: load time = 596.70 ms
+ llama_perf_context_print: prompt eval time = 197375.99 ms / 67195 tokens ( 2.94 ms per token, 340.44 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 198932.74 ms / 67196 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-F16.tqa CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 36 key-value pairs and 292 tensors from ./Watt-Tool-8B-F16.gguf (version GGUF V3 (latest))

- Final result: 36.1963 +/- 2.6657
- Random chance: 28.6467 +/- 2.5079


- llama_perf_context_print: load time = 621.37 ms
- llama_perf_context_print: prompt eval time = 74638.90 ms / 17686 tokens ( 4.22 ms per token, 236.95 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 75960.59 ms / 17687 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 36 key-value pairs and 292 tensors from ./Watt-Tool-8B-F16.gguf (version GGUF V3 (latest))

+ Final result: 32.9333 +/- 1.7172
+ Random chance: 19.8992 +/- 1.4588


+ llama_perf_context_print: load time = 624.82 ms
+ llama_perf_context_print: prompt eval time = 153527.41 ms / 50072 tokens ( 3.07 ms per token, 326.14 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 155568.93 ms / 50073 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-F16.wng CHANGED
@@ -1,11 +1,11 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 36 key-value pairs and 292 tensors from ./Watt-Tool-8B-F16.gguf (version GGUF V3 (latest))

- Final Winogrande score(750 tasks): 74.0000 +/- 1.6027

- llama_perf_context_print: load time = 621.33 ms
- llama_perf_context_print: prompt eval time = 86368.81 ms / 22255 tokens ( 3.88 ms per token, 257.67 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 87630.93 ms / 22256 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 36 key-value pairs and 292 tensors from ./Watt-Tool-8B-F16.gguf (version GGUF V3 (latest))

+ Final Winogrande score(750 tasks): 74.8000 +/- 1.5864

+ llama_perf_context_print: load time = 624.64 ms
+ llama_perf_context_print: prompt eval time = 66689.29 ms / 22192 tokens ( 3.01 ms per token, 332.77 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 67279.36 ms / 22193 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-Q4_K_M-naive.arc DELETED
@@ -1,13 +0,0 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
- llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
- llama_model_loader: loaded meta data with 42 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q4_K_M-naive.gguf (version GGUF V3 (latest))
-
- Final result: 62.5668 +/- 1.7707
- Random chance: 25.0251 +/- 1.5848
-
-
- llama_perf_context_print: load time = 707.57 ms
- llama_perf_context_print: prompt eval time = 164606.88 ms / 36539 tokens ( 4.50 ms per token, 221.98 tokens per second)
- llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 166874.76 ms / 36540 tokens
- ggml_metal_free: deallocating
scores/Watt-Tool-8B-Q4_K_M-naive.hsw DELETED
@@ -1,12 +0,0 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
- llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
- llama_model_loader: loaded meta data with 42 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q4_K_M-naive.gguf (version GGUF V3 (latest))
-
- 750 77.73333333
-
-
- llama_perf_context_print: load time = 306.76 ms
- llama_perf_context_print: prompt eval time = 436291.37 ms / 122836 tokens ( 3.55 ms per token, 281.55 tokens per second)
- llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 441964.91 ms / 122837 tokens
- ggml_metal_free: deallocating
scores/Watt-Tool-8B-Q4_K_M-naive.mmlu DELETED
@@ -1,13 +0,0 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
- llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
- llama_model_loader: loaded meta data with 42 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q4_K_M-naive.gguf (version GGUF V3 (latest))
-
- Final result: 42.0000 +/- 1.8034
- Random chance: 25.0000 +/- 1.5822
-
-
- llama_perf_context_print: load time = 304.34 ms
- llama_perf_context_print: prompt eval time = 262641.92 ms / 69673 tokens ( 3.77 ms per token, 265.28 tokens per second)
- llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 265464.52 ms / 69674 tokens
- ggml_metal_free: deallocating
scores/Watt-Tool-8B-Q4_K_M-naive.ppx DELETED
@@ -1,37 +0,0 @@
- ====== Perplexity statistics ======
- Mean PPL(Q) : 7.409510 ± 0.046740
- Mean PPL(base) : 7.237090 ± 0.045539
- Cor(ln(PPL(Q)), ln(PPL(base))): 99.65%
- Mean ln(PPL(Q)/PPL(base)) : 0.023545 ± 0.000530
- Mean PPL(Q)/PPL(base) : 1.023825 ± 0.000543
- Mean PPL(Q)-PPL(base) : 0.172420 ± 0.004061
-
- ====== KL divergence statistics ======
- Mean KLD: 0.017663 ± 0.000107
- Maximum KLD: 5.749704
- 99.9% KLD: 0.447724
- 99.0% KLD: 0.139140
- 99.0% KLD: 0.139140
- Median KLD: 0.010320
- 10.0% KLD: 0.000617
- 5.0% KLD: 0.000201
- 1.0% KLD: 0.000027
- Minimum KLD: -0.000129
-
- ====== Token probability statistics ======
- Mean Δp: -0.531 ± 0.010 %
- Maximum Δp: 55.716%
- 99.9% Δp: 17.458%
- 99.0% Δp: 8.256%
- 95.0% Δp: 3.790%
- 90.0% Δp: 2.138%
- 75.0% Δp: 0.367%
- Median Δp: -0.034%
- 25.0% Δp: -1.129%
- 10.0% Δp: -3.654%
- 5.0% Δp: -5.855%
- 1.0% Δp: -12.744%
- 0.1% Δp: -31.910%
- Minimum Δp: -99.362%
- RMS Δp : 3.658 ± 0.032 %
- Same top p: 93.743 ± 0.064 %
scores/Watt-Tool-8B-Q4_K_M-naive.tqa DELETED
@@ -1,13 +0,0 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
- llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
- llama_model_loader: loaded meta data with 42 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q4_K_M-naive.gguf (version GGUF V3 (latest))
-
- Final result: 36.8098 +/- 2.6753
- Random chance: 28.5214 +/- 2.5046
-
-
- llama_perf_context_print: load time = 306.51 ms
- llama_perf_context_print: prompt eval time = 78347.98 ms / 17655 tokens ( 4.44 ms per token, 225.34 tokens per second)
- llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 79593.31 ms / 17656 tokens
- ggml_metal_free: deallocating
scores/Watt-Tool-8B-Q4_K_M-naive.wng DELETED
@@ -1,11 +0,0 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
- llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
- llama_model_loader: loaded meta data with 42 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q4_K_M-naive.gguf (version GGUF V3 (latest))
-
- Final Winogrande score(750 tasks): 73.6000 +/- 1.6106
-
- llama_perf_context_print: load time = 295.82 ms
- llama_perf_context_print: prompt eval time = 90900.17 ms / 22246 tokens ( 4.09 ms per token, 244.73 tokens per second)
- llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 92103.74 ms / 22247 tokens
- ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq3_m.arc CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_M.gguf (version GGUF V3 (latest))

- Final result: 57.6203 +/- 1.8080
- Random chance: 25.0251 +/- 1.5848


- llama_perf_context_print: load time = 1615.23 ms
- llama_perf_context_print: prompt eval time = 160568.89 ms / 36381 tokens ( 4.41 ms per token, 226.58 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 162816.72 ms / 36382 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_M.gguf (version GGUF V3 (latest))

+ Final result: 62.8000 +/- 1.7661
+ Random chance: 25.0083 +/- 1.5824


+ llama_perf_context_print: load time = 1734.01 ms
+ llama_perf_context_print: prompt eval time = 115043.84 ms / 36600 tokens ( 3.14 ms per token, 318.14 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 115943.71 ms / 36601 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq3_m.hsw CHANGED
@@ -1,12 +1,12 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_M.gguf (version GGUF V3 (latest))

- 750 78.80000000


- llama_perf_context_print: load time = 279.18 ms
- llama_perf_context_print: prompt eval time = 433163.50 ms / 124534 tokens ( 3.48 ms per token, 287.50 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 438833.07 ms / 124535 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_M.gguf (version GGUF V3 (latest))

+ 750 78.00000000% [74.8968%, 80.8179%]


+ llama_perf_context_print: load time = 291.01 ms
+ llama_perf_context_print: prompt eval time = 400031.71 ms / 126448 tokens ( 3.16 ms per token, 316.09 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 404305.75 ms / 126449 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq3_m.mmlu CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_M.gguf (version GGUF V3 (latest))

- Final result: 36.2667 +/- 1.7567
  Random chance: 25.0000 +/- 1.5822


- llama_perf_context_print: load time = 280.68 ms
- llama_perf_context_print: prompt eval time = 260791.25 ms / 70687 tokens ( 3.69 ms per token, 271.05 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 263549.69 ms / 70688 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_M.gguf (version GGUF V3 (latest))

+ Final result: 37.7333 +/- 1.7711
  Random chance: 25.0000 +/- 1.5822


+ llama_perf_context_print: load time = 289.89 ms
+ llama_perf_context_print: prompt eval time = 206632.46 ms / 67195 tokens ( 3.08 ms per token, 325.19 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 208049.33 ms / 67196 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq3_m.ppx CHANGED
@@ -1,37 +1,37 @@
  ====== Perplexity statistics ======
- Mean PPL(Q) : 8.963688 ± 0.058386
  Mean PPL(base) : 7.237090 ± 0.045539
- Cor(ln(PPL(Q)), ln(PPL(base))): 95.93%
- Mean ln(PPL(Q)/PPL(base)) : 0.213963 ± 0.001840
- Mean PPL(Q)/PPL(base) : 1.238576 ± 0.002279
- Mean PPL(Q)-PPL(base) : 1.726598 ± 0.019534

  ====== KL divergence statistics ======
- Mean KLD: 0.209768 ± 0.000734
- Maximum KLD: 11.045219
- 99.9% KLD: 3.037534
- 99.0% KLD: 1.327696
- 99.0% KLD: 1.327696
- Median KLD: 0.145406
- 10.0% KLD: 0.013515
- 5.0% KLD: 0.004453
- 1.0% KLD: 0.000589
  Minimum KLD: 0.000000

  ====== Token probability statistics ======
- Mean Δp: -4.187 ± 0.035 %
- Maximum Δp: 87.898%
- 99.9% Δp: 53.785%
- 99.0% Δp: 29.326%
- 95.0% Δp: 12.210%
- 90.0% Δp: 5.784%
- 75.0% Δp: 0.302%
- Median Δp: -0.896%
- 25.0% Δp: -7.479%
- 10.0% Δp: -19.150%
- 5.0% Δp: -28.492%
- 1.0% Δp: -52.697%
- 0.1% Δp: -83.945%
- Minimum Δp: -97.765%
- RMS Δp : 13.969 ± 0.056 %
- Same top p: 77.664 ± 0.110 %

  ====== Perplexity statistics ======
+ Mean PPL(Q) : 7.841948 ± 0.049502
  Mean PPL(base) : 7.237090 ± 0.045539
+ Cor(ln(PPL(Q)), ln(PPL(base))): 98.36%
+ Mean ln(PPL(Q)/PPL(base)) : 0.080268 ± 0.001143
+ Mean PPL(Q)/PPL(base) : 1.083578 ± 0.001238
+ Mean PPL(Q)-PPL(base) : 0.604858 ± 0.009476

  ====== KL divergence statistics ======
+ Mean KLD: 0.081774 ± 0.000354
+ Maximum KLD: 7.690053
+ 99.9% KLD: 1.654508
+ 99.0% KLD: 0.555790
+ 99.0% KLD: 0.555790
+ Median KLD: 0.056256
+ 10.0% KLD: 0.003426
+ 5.0% KLD: 0.001063
+ 1.0% KLD: 0.000157
  Minimum KLD: 0.000000

  ====== Token probability statistics ======
+ Mean Δp: -2.133 ± 0.021 %
+ Maximum Δp: 73.495%
+ 99.9% Δp: 32.336%
+ 99.0% Δp: 17.093%
+ 95.0% Δp: 7.846%
+ 90.0% Δp: 4.045%
+ 75.0% Δp: 0.372%
+ Median Δp: -0.301%
+ 25.0% Δp: -3.967%
+ 10.0% Δp: -10.805%
+ 5.0% Δp: -16.007%
+ 1.0% Δp: -30.015%
+ 0.1% Δp: -62.256%
+ Minimum Δp: -96.763%
+ RMS Δp : 8.316 ± 0.043 %
+ Same top p: 85.224 ± 0.094 %
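As a sanity check on these statistics: because perplexity is the exponential of the mean per-token negative log-likelihood, the reported mean ln(PPL(Q)/PPL(base)) and mean PPL ratio should agree with the two mean perplexities. A quick check against the IQ3_M vs F16 numbers from the new run:

```python
import math

# Values taken from the iq3_m.ppx report above (new run).
ppl_q, ppl_base = 7.841948, 7.237090

ratio = ppl_q / ppl_base          # should match the reported mean PPL ratio
ln_ratio = math.log(ratio)        # should match the reported mean ln ratio

print(round(ratio, 6))            # close to the reported 1.083578
print(round(ln_ratio, 6))         # close to the reported 0.080268
```

The agreement is to within rounding, which is expected since all three figures are derived from the same per-token log-likelihood sums.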
scores/Watt-Tool-8B-iq3_m.tqa CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_M.gguf (version GGUF V3 (latest))

- Final result: 33.2308 +/- 2.6169
- Random chance: 28.5589 +/- 2.5094


- llama_perf_context_print: load time = 280.73 ms
- llama_perf_context_print: prompt eval time = 76784.24 ms / 17625 tokens ( 4.36 ms per token, 229.54 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 78027.49 ms / 17626 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_M.gguf (version GGUF V3 (latest))

+ Final result: 32.1333 +/- 1.7063
+ Random chance: 19.8992 +/- 1.4588


+ llama_perf_context_print: load time = 288.09 ms
+ llama_perf_context_print: prompt eval time = 161368.09 ms / 50072 tokens ( 3.22 ms per token, 310.30 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 163199.21 ms / 50073 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq3_m.wng CHANGED
@@ -1,11 +1,11 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_M.gguf (version GGUF V3 (latest))

- Final Winogrande score(750 tasks): 70.9333 +/- 1.6591

- llama_perf_context_print: load time = 284.52 ms
- llama_perf_context_print: prompt eval time = 89372.55 ms / 22269 tokens ( 4.01 ms per token, 249.17 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 90553.60 ms / 22270 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_M.gguf (version GGUF V3 (latest))

+ Final Winogrande score(750 tasks): 73.6000 +/- 1.6106

+ llama_perf_context_print: load time = 288.50 ms
+ llama_perf_context_print: prompt eval time = 70143.95 ms / 22192 tokens ( 3.16 ms per token, 316.38 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 70631.06 ms / 22193 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq3_s.arc CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_S.gguf (version GGUF V3 (latest))

- Final result: 57.3529 +/- 1.8095
- Random chance: 25.0335 +/- 1.5850


- llama_perf_context_print: load time = 1575.89 ms
- llama_perf_context_print: prompt eval time = 159215.12 ms / 36653 tokens ( 4.34 ms per token, 230.21 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 161465.24 ms / 36654 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_S.gguf (version GGUF V3 (latest))

+ Final result: 62.0000 +/- 1.7736
+ Random chance: 25.0083 +/- 1.5824


+ llama_perf_context_print: load time = 1662.56 ms
+ llama_perf_context_print: prompt eval time = 115280.42 ms / 36600 tokens ( 3.15 ms per token, 317.49 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 116185.54 ms / 36601 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq3_s.hsw CHANGED
@@ -1,12 +1,12 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_S.gguf (version GGUF V3 (latest))

- 750 77.20000000


- llama_perf_context_print: load time = 298.56 ms
- llama_perf_context_print: prompt eval time = 430279.23 ms / 124462 tokens ( 3.46 ms per token, 289.26 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 436034.64 ms / 124463 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_S.gguf (version GGUF V3 (latest))

+ 750 76.26666667% [73.0928%, 79.1728%]


+ llama_perf_context_print: load time = 286.42 ms
+ llama_perf_context_print: prompt eval time = 400735.90 ms / 126448 tokens ( 3.17 ms per token, 315.54 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 405023.75 ms / 126449 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq3_s.mmlu CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_S.gguf (version GGUF V3 (latest))

- Final result: 36.1333 +/- 1.7553
  Random chance: 25.0000 +/- 1.5822


- llama_perf_context_print: load time = 279.05 ms
- llama_perf_context_print: prompt eval time = 257292.94 ms / 70079 tokens ( 3.67 ms per token, 272.37 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 260084.41 ms / 70080 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_S.gguf (version GGUF V3 (latest))

+ Final result: 37.3333 +/- 1.7674
  Random chance: 25.0000 +/- 1.5822


+ llama_perf_context_print: load time = 293.04 ms
+ llama_perf_context_print: prompt eval time = 207055.09 ms / 67195 tokens ( 3.08 ms per token, 324.53 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 208462.26 ms / 67196 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq3_s.ppx CHANGED
@@ -1,37 +1,37 @@
  ====== Perplexity statistics ======
- Mean PPL(Q) : 9.032577 ± 0.058532
  Mean PPL(base) : 7.237090 ± 0.045539
- Cor(ln(PPL(Q)), ln(PPL(base))): 95.96%
- Mean ln(PPL(Q)/PPL(base)) : 0.221619 ± 0.001825
- Mean PPL(Q)/PPL(base) : 1.248095 ± 0.002278
- Mean PPL(Q)-PPL(base) : 1.795488 ± 0.019603

  ====== KL divergence statistics ======
- Mean KLD: 0.204862 ± 0.000758
- Maximum KLD: 9.152626
- 99.9% KLD: 3.342306
- 99.0% KLD: 1.358416
- 99.0% KLD: 1.358416
- Median KLD: 0.143007
- 10.0% KLD: 0.013411
- 5.0% KLD: 0.004899
- 1.0% KLD: 0.000896
- Minimum KLD: 0.000000

  ====== Token probability statistics ======
- Mean Δp: -4.675 ± 0.034 %
- Maximum Δp: 83.272%
- 99.9% Δp: 42.833%
- 99.0% Δp: 23.669%
- 95.0% Δp: 10.247%
- 90.0% Δp: 4.840%
- 75.0% Δp: 0.184%
- Median Δp: -1.059%
- 25.0% Δp: -7.819%
- 10.0% Δp: -19.392%
- 5.0% Δp: -28.470%
- 1.0% Δp: -53.110%
- 0.1% Δp: -86.125%
- Minimum Δp: -99.905%
- RMS Δp : 13.678 ± 0.058 %
- Same top p: 78.198 ± 0.109 %

  ====== Perplexity statistics ======
+ Mean PPL(Q) : 8.253598 ± 0.051864
  Mean PPL(base) : 7.237090 ± 0.045539
+ Cor(ln(PPL(Q)), ln(PPL(base))): 97.71%
+ Mean ln(PPL(Q)/PPL(base)) : 0.131430 ± 0.001346
+ Mean PPL(Q)/PPL(base) : 1.140458 ± 0.001535
+ Mean PPL(Q)-PPL(base) : 1.016508 ± 0.012175

  ====== KL divergence statistics ======
+ Mean KLD: 0.117565 ± 0.000433
+ Maximum KLD: 7.079286
+ 99.9% KLD: 1.966468
+ 99.0% KLD: 0.726076
+ 99.0% KLD: 0.726076
+ Median KLD: 0.084699
+ 10.0% KLD: 0.006988
+ 5.0% KLD: 0.002383
+ 1.0% KLD: 0.000330
+ Minimum KLD: -0.000001

  ====== Token probability statistics ======
+ Mean Δp: -3.685 ± 0.026 %
+ Maximum Δp: 69.513%
+ 99.9% Δp: 34.570%
+ 99.0% Δp: 17.585%
+ 95.0% Δp: 7.273%
+ 90.0% Δp: 3.369%
+ 75.0% Δp: 0.113%
+ Median Δp: -0.833%
+ 25.0% Δp: -6.212%
+ 10.0% Δp: -15.079%
+ 5.0% Δp: -21.666%
+ 1.0% Δp: -38.754%
+ 0.1% Δp: -69.188%
+ Minimum Δp: -97.122%
+ RMS Δp : 10.385 ± 0.045 %
+ Same top p: 82.770 ± 0.100 %
scores/Watt-Tool-8B-iq3_s.tqa CHANGED
@@ -1,13 +1,13 @@
1
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
2
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
3
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_S.gguf (version GGUF V3 (latest))
4
 
5
- Final result: 35.5346 +/- 2.6882
6
- Random chance: 28.4691 +/- 2.5346
7
 
8
 
9
- llama_perf_context_print: load time = 277.68 ms
10
- llama_perf_context_print: prompt eval time = 74951.02 ms / 17379 tokens ( 4.31 ms per token, 231.87 tokens per second)
11
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
12
- llama_perf_context_print: total time = 76169.50 ms / 17380 tokens
13
  ggml_metal_free: deallocating
 
1
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
2
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
3
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_S.gguf (version GGUF V3 (latest))
4
 
5
+ Final result: 30.4000 +/- 1.6807
6
+ Random chance: 19.8992 +/- 1.4588
7
 
8
 
9
+ llama_perf_context_print: load time = 284.74 ms
10
+ llama_perf_context_print: prompt eval time = 161670.39 ms / 50072 tokens ( 3.23 ms per token, 309.72 tokens per second)
11
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
12
+ llama_perf_context_print: total time = 163511.55 ms / 50073 tokens
13
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq3_s.wng CHANGED
@@ -1,11 +1,11 @@
1
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
2
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
3
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_S.gguf (version GGUF V3 (latest))
4
 
5
- Final Winogrande score(750 tasks): 70.2667 +/- 1.6702
6
 
7
- llama_perf_context_print: load time = 277.86 ms
8
- llama_perf_context_print: prompt eval time = 88618.25 ms / 22199 tokens ( 3.99 ms per token, 250.50 tokens per second)
9
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
10
- llama_perf_context_print: total time = 89806.41 ms / 22200 tokens
11
  ggml_metal_free: deallocating
 
1
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
2
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
3
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ3_S.gguf (version GGUF V3 (latest))
4
 
5
+ Final Winogrande score(750 tasks): 72.9333 +/- 1.6235
6
 
7
+ llama_perf_context_print: load time = 291.13 ms
8
+ llama_perf_context_print: prompt eval time = 70279.68 ms / 22192 tokens ( 3.17 ms per token, 315.77 tokens per second)
9
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
10
+ llama_perf_context_print: total time = 70763.82 ms / 22193 tokens
11
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq4_nl.arc CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ4_NL.gguf (version GGUF V3 (latest))

- Final result: 64.7925 +/- 1.7487
- Random chance: 25.0251 +/- 1.5859


- llama_perf_context_print: load time = 1990.26 ms
- llama_perf_context_print: prompt eval time = 156702.59 ms / 36807 tokens ( 4.26 ms per token, 234.88 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 158963.12 ms / 36808 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ4_NL.gguf (version GGUF V3 (latest))

+ Final result: 63.4667 +/- 1.7594
+ Random chance: 25.0083 +/- 1.5824


+ llama_perf_context_print: load time = 2048.98 ms
+ llama_perf_context_print: prompt eval time = 119074.35 ms / 36600 tokens ( 3.25 ms per token, 307.37 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 119972.06 ms / 36601 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq4_nl.hsw CHANGED
@@ -1,12 +1,12 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ4_NL.gguf (version GGUF V3 (latest))

- 750 77.86666667


- llama_perf_context_print: load time = 285.97 ms
- llama_perf_context_print: prompt eval time = 426128.64 ms / 126096 tokens ( 3.38 ms per token, 295.91 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 431845.15 ms / 126097 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ4_NL.gguf (version GGUF V3 (latest))

+ 750 77.73333333% [74.6188%, 80.5653%]


+ llama_perf_context_print: load time = 297.65 ms
+ llama_perf_context_print: prompt eval time = 413170.51 ms / 126448 tokens ( 3.27 ms per token, 306.04 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 417435.80 ms / 126449 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq4_nl.mmlu CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ4_NL.gguf (version GGUF V3 (latest))

- Final result: 39.7333 +/- 1.7880
  Random chance: 25.0000 +/- 1.5822


- llama_perf_context_print: load time = 283.49 ms
- llama_perf_context_print: prompt eval time = 260134.32 ms / 72070 tokens ( 3.61 ms per token, 277.05 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 262954.84 ms / 72071 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ4_NL.gguf (version GGUF V3 (latest))

+ Final result: 39.6000 +/- 1.7870
  Random chance: 25.0000 +/- 1.5822


+ llama_perf_context_print: load time = 304.48 ms
+ llama_perf_context_print: prompt eval time = 213794.40 ms / 67195 tokens ( 3.18 ms per token, 314.30 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 215197.82 ms / 67196 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq4_nl.ppx CHANGED
@@ -1,37 +1,37 @@
  ====== Perplexity statistics ======
- Mean PPL(Q) : 7.935917 ± 0.053744
  Mean PPL(base) : 7.237090 ± 0.045539
- Cor(ln(PPL(Q)), ln(PPL(base))): 98.36%
- Mean ln(PPL(Q)/PPL(base)) : 0.092180 ± 0.001277
- Mean PPL(Q)/PPL(base) : 1.096562 ± 0.001400
- Mean PPL(Q)-PPL(base) : 0.698827 ± 0.012156

  ====== KL divergence statistics ======
- Mean KLD: 0.096510 ± 0.000325
- Maximum KLD: 5.312946
- 99.9% KLD: 1.205405
- 99.0% KLD: 0.590747
- 99.0% KLD: 0.590747
- Median KLD: 0.069663
- 10.0% KLD: 0.003776
- 5.0% KLD: 0.001002
- 1.0% KLD: 0.000098
- Minimum KLD: -0.000140

  ====== Token probability statistics ======
- Mean Δp: 0.914 ± 0.023 %
- Maximum Δp: 68.385%
- 99.9% Δp: 48.877%
- 99.0% Δp: 30.041%
- 95.0% Δp: 15.934%
- 90.0% Δp: 10.078%
- 75.0% Δp: 2.824%
- Median Δp: 0.026%
- 25.0% Δp: -1.244%
- 10.0% Δp: -6.629%
- 5.0% Δp: -11.886%
- 1.0% Δp: -25.558%
- 0.1% Δp: -49.037%
- Minimum Δp: -91.804%
- RMS Δp : 8.908 ± 0.037 %
- Same top p: 85.211 ± 0.094 %

  ====== Perplexity statistics ======
+ Mean PPL(Q) : 7.516430 ± 0.047275
  Mean PPL(base) : 7.237090 ± 0.045539
+ Cor(ln(PPL(Q)), ln(PPL(base))): 99.30%
+ Mean ln(PPL(Q)/PPL(base)) : 0.037872 ± 0.000742
+ Mean PPL(Q)/PPL(base) : 1.038599 ± 0.000771
+ Mean PPL(Q)-PPL(base) : 0.279341 ± 0.005741

  ====== KL divergence statistics ======
+ Mean KLD: 0.034545 ± 0.000172
+ Maximum KLD: 4.479205
+ 99.9% KLD: 0.809954
+ 99.0% KLD: 0.243338
+ 99.0% KLD: 0.243338
+ Median KLD: 0.022288
+ 10.0% KLD: 0.001467
+ 5.0% KLD: 0.000485
+ 1.0% KLD: 0.000067
+ Minimum KLD: -0.000025

  ====== Token probability statistics ======
+ Mean Δp: -0.984 ± 0.014 %
+ Maximum Δp: 60.862%
+ 99.9% Δp: 23.910%
+ 99.0% Δp: 11.795%
+ 95.0% Δp: 5.549%
+ 90.0% Δp: 3.027%
+ 75.0% Δp: 0.404%
+ Median Δp: -0.103%
+ 25.0% Δp: -2.015%
+ 10.0% Δp: -6.046%
+ 5.0% Δp: -9.267%
+ 1.0% Δp: -18.431%
+ 0.1% Δp: -42.825%
+ Minimum Δp: -93.056%
+ RMS Δp : 5.270 ± 0.035 %
+ Same top p: 90.812 ± 0.076 %
scores/Watt-Tool-8B-iq4_nl.tqa CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ4_NL.gguf (version GGUF V3 (latest))

- Final result: 33.9564 +/- 2.6473
- Random chance: 28.5587 +/- 2.5250


- llama_perf_context_print: load time = 300.67 ms
- llama_perf_context_print: prompt eval time = 74415.31 ms / 17418 tokens ( 4.27 ms per token, 234.06 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 75640.51 ms / 17419 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ4_NL.gguf (version GGUF V3 (latest))

+ Final result: 31.4667 +/- 1.6968
+ Random chance: 19.8992 +/- 1.4588


+ llama_perf_context_print: load time = 311.22 ms
+ llama_perf_context_print: prompt eval time = 166835.67 ms / 50072 tokens ( 3.33 ms per token, 300.13 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 168685.42 ms / 50073 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-iq4_nl.wng CHANGED
@@ -1,11 +1,11 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ4_NL.gguf (version GGUF V3 (latest))

- Final Winogrande score(750 tasks): 71.8667 +/- 1.6430

- llama_perf_context_print: load time = 302.27 ms
- llama_perf_context_print: prompt eval time = 88132.28 ms / 22378 tokens ( 3.94 ms per token, 253.91 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 89331.17 ms / 22379 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-IQ4_NL.gguf (version GGUF V3 (latest))

+ Final Winogrande score(750 tasks): 75.4667 +/- 1.5722

+ llama_perf_context_print: load time = 302.51 ms
+ llama_perf_context_print: prompt eval time = 72558.29 ms / 22192 tokens ( 3.27 ms per token, 305.85 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 73039.49 ms / 22193 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_l.arc CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_L.gguf (version GGUF V3 (latest))

- Final result: 62.6506 +/- 1.7711
- Random chance: 25.0251 +/- 1.5859


- llama_perf_context_print: load time = 1624.57 ms
- llama_perf_context_print: prompt eval time = 171694.83 ms / 36304 tokens ( 4.73 ms per token, 211.44 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 173922.19 ms / 36305 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_L.gguf (version GGUF V3 (latest))

+ Final result: 61.7333 +/- 1.7759
+ Random chance: 25.0083 +/- 1.5824


+ llama_perf_context_print: load time = 1803.05 ms
+ llama_perf_context_print: prompt eval time = 123314.10 ms / 36600 tokens ( 3.37 ms per token, 296.80 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 124217.22 ms / 36601 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_l.hsw CHANGED
@@ -1,12 +1,12 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_L.gguf (version GGUF V3 (latest))

- 750 74.66666667


- llama_perf_context_print: load time = 308.34 ms
- llama_perf_context_print: prompt eval time = 469602.68 ms / 125000 tokens ( 3.76 ms per token, 266.18 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 475338.97 ms / 125001 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_L.gguf (version GGUF V3 (latest))

+ 750 77.20000000% [74.0633%, 80.0595%]


+ llama_perf_context_print: load time = 293.80 ms
+ llama_perf_context_print: prompt eval time = 428872.85 ms / 126448 tokens ( 3.39 ms per token, 294.84 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 433161.99 ms / 126449 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_l.mmlu CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_L.gguf (version GGUF V3 (latest))

- Final result: 36.0000 +/- 1.7539
  Random chance: 25.0000 +/- 1.5822


- llama_perf_context_print: load time = 330.46 ms
- llama_perf_context_print: prompt eval time = 271968.64 ms / 67808 tokens ( 4.01 ms per token, 249.32 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 274735.21 ms / 67809 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_L.gguf (version GGUF V3 (latest))

+ Final result: 38.5333 +/- 1.7783
  Random chance: 25.0000 +/- 1.5822


+ llama_perf_context_print: load time = 293.76 ms
+ llama_perf_context_print: prompt eval time = 221450.17 ms / 67195 tokens ( 3.30 ms per token, 303.43 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 222861.86 ms / 67196 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_l.ppx CHANGED
@@ -1,37 +1,37 @@
  ====== Perplexity statistics ======
- Mean PPL(Q) : 9.923766 ± 0.061734
  Mean PPL(base) : 7.237090 ± 0.045539
- Cor(ln(PPL(Q)), ln(PPL(base))): 94.33%
- Mean ln(PPL(Q)/PPL(base)) : 0.315713 ± 0.002108
- Mean PPL(Q)/PPL(base) : 1.371237 ± 0.002890
- Mean PPL(Q)-PPL(base) : 2.686677 ± 0.024105

  ====== KL divergence statistics ======
- Mean KLD: 0.292497 ± 0.000978
- Maximum KLD: 11.110364
- 99.9% KLD: 3.423501
- 99.0% KLD: 2.016132
- 99.0% KLD: 2.016132
- Median KLD: 0.210275
- 10.0% KLD: 0.019525
- 5.0% KLD: 0.005606
- 1.0% KLD: 0.000617
- Minimum KLD: 0.000004

  ====== Token probability statistics ======
- Mean Δp: -8.373 ± 0.044 %
- Maximum Δp: 78.738%
- 99.9% Δp: 39.788%
- 99.0% Δp: 20.797%
- 95.0% Δp: 7.364%
- 90.0% Δp: 2.736%
- 75.0% Δp: 0.005%
- Median Δp: -2.257%
- 25.0% Δp: -12.828%
- 10.0% Δp: -28.442%
- 5.0% Δp: -41.830%
- 1.0% Δp: -75.747%
- 0.1% Δp: -90.320%
- Minimum Δp: -99.555%
- RMS Δp : 18.605 ± 0.069 %
- Same top p: 74.871 ± 0.114 %

  ====== Perplexity statistics ======
+ Mean PPL(Q) : 8.274172 ± 0.052402
  Mean PPL(base) : 7.237090 ± 0.045539
+ Cor(ln(PPL(Q)), ln(PPL(base))): 97.60%
+ Mean ln(PPL(Q)/PPL(base)) : 0.133920 ± 0.001382
+ Mean PPL(Q)/PPL(base) : 1.143301 ± 0.001580
+ Mean PPL(Q)-PPL(base) : 1.037082 ± 0.012706

  ====== KL divergence statistics ======
+ Mean KLD: 0.114738 ± 0.000483
+ Maximum KLD: 9.999102
+ 99.9% KLD: 2.236693
+ 99.0% KLD: 0.781076
+ 99.0% KLD: 0.781076
+ Median KLD: 0.077728
+ 10.0% KLD: 0.005170
+ 5.0% KLD: 0.001727
+ 1.0% KLD: 0.000289
+ Minimum KLD: -0.000055

  ====== Token probability statistics ======
+ Mean Δp: -3.288 ± 0.025 %
+ Maximum Δp: 65.548%
+ 99.9% Δp: 32.662%
+ 99.0% Δp: 17.193%
+ 95.0% Δp: 7.509%
+ 90.0% Δp: 3.610%
+ 75.0% Δp: 0.176%
+ Median Δp: -0.636%
+ 25.0% Δp: -5.421%
+ 10.0% Δp: -13.956%
+ 5.0% Δp: -20.435%
+ 1.0% Δp: -38.546%
+ 0.1% Δp: -71.826%
+ Minimum Δp: -98.746%
+ RMS Δp : 10.050 ± 0.048 %
+ Same top p: 83.354 ± 0.098 %
scores/Watt-Tool-8B-q3_k_l.tqa CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_L.gguf (version GGUF V3 (latest))

- Final result: 35.4037 +/- 2.6692
- Random chance: 28.5968 +/- 2.5221


- llama_perf_context_print: load time = 307.41 ms
- llama_perf_context_print: prompt eval time = 81969.79 ms / 17455 tokens ( 4.70 ms per token, 212.94 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 83215.27 ms / 17456 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_L.gguf (version GGUF V3 (latest))

+ Final result: 32.4000 +/- 1.7100
+ Random chance: 19.8992 +/- 1.4588


+ llama_perf_context_print: load time = 303.59 ms
+ llama_perf_context_print: prompt eval time = 173066.86 ms / 50072 tokens ( 3.46 ms per token, 289.32 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 174906.65 ms / 50073 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_l.wng CHANGED
@@ -1,11 +1,11 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_L.gguf (version GGUF V3 (latest))

- Final Winogrande score(750 tasks): 72.8000 +/- 1.6260

- llama_perf_context_print: load time = 308.84 ms
- llama_perf_context_print: prompt eval time = 95382.87 ms / 22219 tokens ( 4.29 ms per token, 232.95 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 96563.37 ms / 22220 tokens
  ggml_metal_free: deallocating

+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_L.gguf (version GGUF V3 (latest))

+ Final Winogrande score(750 tasks): 71.8667 +/- 1.6430

+ llama_perf_context_print: load time = 282.54 ms
+ llama_perf_context_print: prompt eval time = 75214.17 ms / 22192 tokens ( 3.39 ms per token, 295.05 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 75700.49 ms / 22193 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_m.arc CHANGED
@@ -1,13 +1,13 @@
1
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
2
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
3
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_M.gguf (version GGUF V3 (latest))
4
 
5
- Final result: 63.1016 +/- 1.7655
6
- Random chance: 25.0251 +/- 1.5848
7
 
8
 
9
- llama_perf_context_print: load time = 1629.74 ms
10
- llama_perf_context_print: prompt eval time = 172545.34 ms / 36557 tokens ( 4.72 ms per token, 211.87 tokens per second)
11
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
12
- llama_perf_context_print: total time = 174780.50 ms / 36558 tokens
13
  ggml_metal_free: deallocating
 
1
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
2
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
3
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_M.gguf (version GGUF V3 (latest))
4
 
5
+ Final result: 61.0667 +/- 1.7816
6
+ Random chance: 25.0083 +/- 1.5824
7
 
8
 
9
+ llama_perf_context_print: load time = 1677.48 ms
10
+ llama_perf_context_print: prompt eval time = 120638.17 ms / 36600 tokens ( 3.30 ms per token, 303.39 tokens per second)
11
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
12
+ llama_perf_context_print: total time = 121534.75 ms / 36601 tokens
13
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_m.hsw CHANGED
@@ -1,12 +1,12 @@
1
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
2
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
3
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_M.gguf (version GGUF V3 (latest))
4
 
5
- 750 75.33333333
6
 
7
 
8
- llama_perf_context_print: load time = 308.43 ms
9
- llama_perf_context_print: prompt eval time = 462302.43 ms / 122058 tokens ( 3.79 ms per token, 264.02 tokens per second)
10
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
11
- llama_perf_context_print: total time = 467959.83 ms / 122059 tokens
12
  ggml_metal_free: deallocating
 
1
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
2
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
3
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_M.gguf (version GGUF V3 (latest))
4
 
5
+ 750 77.20000000% [74.0633%, 80.0595%]
6
 
7
 
8
+ llama_perf_context_print: load time = 283.76 ms
9
+ llama_perf_context_print: prompt eval time = 419917.27 ms / 126448 tokens ( 3.32 ms per token, 301.13 tokens per second)
10
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
11
+ llama_perf_context_print: total time = 424216.79 ms / 126449 tokens
12
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_m.mmlu CHANGED
@@ -1,13 +1,13 @@
1
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
2
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
3
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_M.gguf (version GGUF V3 (latest))
4
 
5
- Final result: 38.8000 +/- 1.7805
6
  Random chance: 25.0000 +/- 1.5822
7
 
8
 
9
- llama_perf_context_print: load time = 324.30 ms
10
- llama_perf_context_print: prompt eval time = 282248.57 ms / 70434 tokens ( 4.01 ms per token, 249.55 tokens per second)
11
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
12
- llama_perf_context_print: total time = 285034.26 ms / 70435 tokens
13
  ggml_metal_free: deallocating
 
1
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
2
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
3
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_M.gguf (version GGUF V3 (latest))
4
 
5
+ Final result: 38.5333 +/- 1.7783
6
  Random chance: 25.0000 +/- 1.5822
7
 
8
 
9
+ llama_perf_context_print: load time = 285.00 ms
10
+ llama_perf_context_print: prompt eval time = 216659.18 ms / 67195 tokens ( 3.22 ms per token, 310.14 tokens per second)
11
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
12
+ llama_perf_context_print: total time = 218063.84 ms / 67196 tokens
13
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_m.ppx CHANGED
@@ -1,37 +1,37 @@
  ====== Perplexity statistics ======
- Mean PPL(Q) : 9.855009 ± 0.061412
  Mean PPL(base) : 7.237090 ± 0.045539
- Cor(ln(PPL(Q)), ln(PPL(base))): 94.48%
- Mean ln(PPL(Q)/PPL(base)) : 0.308761 ± 0.002082
- Mean PPL(Q)/PPL(base) : 1.361736 ± 0.002836
- Mean PPL(Q)-PPL(base) : 2.617919 ± 0.023684
 
  ====== KL divergence statistics ======
- Mean KLD: 0.292188 ± 0.000927
- Maximum KLD: 8.990653
- 99.9% KLD: 3.324761
- 99.0% KLD: 1.855327
- 99.0% KLD: 1.855327
- Median KLD: 0.215862
- 10.0% KLD: 0.020007
- 5.0% KLD: 0.005427
- 1.0% KLD: 0.000564
- Minimum KLD: 0.000000
 
  ====== Token probability statistics ======
- Mean Δp: -8.052 ± 0.043 %
- Maximum Δp: 83.209%
- 99.9% Δp: 40.828%
- 99.0% Δp: 21.532%
- 95.0% Δp: 8.004%
- 90.0% Δp: 2.972%
- 75.0% Δp: 0.009%
- Median Δp: -2.217%
- 25.0% Δp: -12.532%
- 10.0% Δp: -27.827%
- 5.0% Δp: -40.790%
- 1.0% Δp: -72.837%
- 0.1% Δp: -88.971%
- Minimum Δp: -99.209%
- RMS Δp : 18.125 ± 0.066 %
- Same top p: 74.435 ± 0.115 %
 
  ====== Perplexity statistics ======
+ Mean PPL(Q) : 8.459379 ± 0.053550
  Mean PPL(base) : 7.237090 ± 0.045539
+ Cor(ln(PPL(Q)), ln(PPL(base))): 97.26%
+ Mean ln(PPL(Q)/PPL(base)) : 0.156057 ± 0.001477
+ Mean PPL(Q)/PPL(base) : 1.168892 ± 0.001727
+ Mean PPL(Q)-PPL(base) : 1.222289 ± 0.014061
 
  ====== KL divergence statistics ======
+ Mean KLD: 0.131196 ± 0.000539
+ Maximum KLD: 7.898368
+ 99.9% KLD: 2.475934
+ 99.0% KLD: 0.894390
+ 99.0% KLD: 0.894390
+ Median KLD: 0.089346
+ 10.0% KLD: 0.006670
+ 5.0% KLD: 0.002250
+ 1.0% KLD: 0.000392
+ Minimum KLD: 0.000001
 
  ====== Token probability statistics ======
+ Mean Δp: -3.913 ± 0.027 %
+ Maximum Δp: 64.023%
+ 99.9% Δp: 33.075%
+ 99.0% Δp: 17.214%
+ 95.0% Δp: 7.245%
+ 90.0% Δp: 3.301%
+ 75.0% Δp: 0.096%
+ Median Δp: -0.875%
+ 25.0% Δp: -6.342%
+ 10.0% Δp: -15.578%
+ 5.0% Δp: -22.649%
+ 1.0% Δp: -41.943%
+ 0.1% Δp: -75.852%
+ Minimum Δp: -98.926%
+ RMS Δp : 10.892 ± 0.050 %
+ Same top p: 82.438 ± 0.100 %
scores/Watt-Tool-8B-q3_k_m.tqa CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_M.gguf (version GGUF V3 (latest))
 
- Final result: 35.3846 +/- 2.6565
- Random chance: 28.4838 +/- 2.5074
 
 
- llama_perf_context_print: load time = 305.46 ms
- llama_perf_context_print: prompt eval time = 83329.07 ms / 17627 tokens ( 4.73 ms per token, 211.53 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 84576.13 ms / 17628 tokens
  ggml_metal_free: deallocating
 
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_M.gguf (version GGUF V3 (latest))
 
+ Final result: 33.3333 +/- 1.7225
+ Random chance: 19.8992 +/- 1.4588
 
 
+ llama_perf_context_print: load time = 294.56 ms
+ llama_perf_context_print: prompt eval time = 169427.06 ms / 50072 tokens ( 3.38 ms per token, 295.54 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 171263.15 ms / 50073 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_m.wng CHANGED
@@ -1,11 +1,11 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_M.gguf (version GGUF V3 (latest))
 
- Final Winogrande score(750 tasks): 72.6667 +/- 1.6284
 
- llama_perf_context_print: load time = 331.54 ms
- llama_perf_context_print: prompt eval time = 94938.20 ms / 22104 tokens ( 4.30 ms per token, 232.83 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 96106.12 ms / 22105 tokens
  ggml_metal_free: deallocating
 
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_M.gguf (version GGUF V3 (latest))
 
+ Final Winogrande score(750 tasks): 73.0667 +/- 1.6209
 
+ llama_perf_context_print: load time = 286.93 ms
+ llama_perf_context_print: prompt eval time = 73604.03 ms / 22192 tokens ( 3.32 ms per token, 301.51 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 74091.88 ms / 22193 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_s.arc CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_S.gguf (version GGUF V3 (latest))
 
- Final result: 61.7135 +/- 1.7797
- Random chance: 25.0251 +/- 1.5859
 
 
- llama_perf_context_print: load time = 1613.35 ms
- llama_perf_context_print: prompt eval time = 172264.72 ms / 36428 tokens ( 4.73 ms per token, 211.47 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 174482.10 ms / 36429 tokens
  ggml_metal_free: deallocating
 
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_S.gguf (version GGUF V3 (latest))
 
+ Final result: 58.2667 +/- 1.8018
+ Random chance: 25.0083 +/- 1.5824
 
 
+ llama_perf_context_print: load time = 1653.99 ms
+ llama_perf_context_print: prompt eval time = 123080.79 ms / 36600 tokens ( 3.36 ms per token, 297.37 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 123982.14 ms / 36601 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_s.hsw CHANGED
@@ -1,12 +1,12 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_S.gguf (version GGUF V3 (latest))
 
- 750 74.00000000
 
 
- llama_perf_context_print: load time = 346.89 ms
- llama_perf_context_print: prompt eval time = 471049.03 ms / 125576 tokens ( 3.75 ms per token, 266.59 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 476803.59 ms / 125577 tokens
  ggml_metal_free: deallocating
 
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_S.gguf (version GGUF V3 (latest))
 
+ 750 75.60000000% [72.4008%, 78.5383%]
 
 
+ llama_perf_context_print: load time = 285.75 ms
+ llama_perf_context_print: prompt eval time = 427986.37 ms / 126448 tokens ( 3.38 ms per token, 295.45 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 432278.94 ms / 126449 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_s.mmlu CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_S.gguf (version GGUF V3 (latest))
 
- Final result: 37.2000 +/- 1.7661
  Random chance: 25.0000 +/- 1.5822
 
 
- llama_perf_context_print: load time = 312.54 ms
- llama_perf_context_print: prompt eval time = 279002.34 ms / 69611 tokens ( 4.01 ms per token, 249.50 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 281794.77 ms / 69612 tokens
  ggml_metal_free: deallocating
 
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_S.gguf (version GGUF V3 (latest))
 
+ Final result: 38.1333 +/- 1.7748
  Random chance: 25.0000 +/- 1.5822
 
 
+ llama_perf_context_print: load time = 283.79 ms
+ llama_perf_context_print: prompt eval time = 221006.99 ms / 67195 tokens ( 3.29 ms per token, 304.04 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 222418.23 ms / 67196 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_s.ppx CHANGED
@@ -1,37 +1,37 @@
  ====== Perplexity statistics ======
- Mean PPL(Q) : 9.798719 ± 0.061509
  Mean PPL(base) : 7.237090 ± 0.045539
- Cor(ln(PPL(Q)), ln(PPL(base))): 94.51%
- Mean ln(PPL(Q)/PPL(base)) : 0.303032 ± 0.002083
- Mean PPL(Q)/PPL(base) : 1.353958 ± 0.002821
- Mean PPL(Q)-PPL(base) : 2.561629 ± 0.023725
 
  ====== KL divergence statistics ======
- Mean KLD: 0.285340 ± 0.000933
- Maximum KLD: 8.015823
- 99.9% KLD: 3.395762
- 99.0% KLD: 1.885579
- 99.0% KLD: 1.885579
- Median KLD: 0.207489
- 10.0% KLD: 0.018084
- 5.0% KLD: 0.005005
- 1.0% KLD: 0.000535
- Minimum KLD: 0.000000
 
  ====== Token probability statistics ======
- Mean Δp: -7.606 ± 0.042 %
- Maximum Δp: 79.040%
- 99.9% Δp: 41.873%
- 99.0% Δp: 22.236%
- 95.0% Δp: 8.510%
- 90.0% Δp: 3.272%
- 75.0% Δp: 0.019%
- Median Δp: -1.956%
- 25.0% Δp: -11.831%
- 10.0% Δp: -26.778%
- 5.0% Δp: -39.549%
- 1.0% Δp: -72.463%
- 0.1% Δp: -89.273%
- Minimum Δp: -99.255%
- RMS Δp : 17.751 ± 0.067 %
- Same top p: 74.688 ± 0.115 %
 
  ====== Perplexity statistics ======
+ Mean PPL(Q) : 8.869361 ± 0.056188
  Mean PPL(base) : 7.237090 ± 0.045539
+ Cor(ln(PPL(Q)), ln(PPL(base))): 96.40%
+ Mean ln(PPL(Q)/PPL(base)) : 0.203384 ± 0.001694
+ Mean PPL(Q)/PPL(base) : 1.225543 ± 0.002076
+ Mean PPL(Q)-PPL(base) : 1.632272 ± 0.017247
 
  ====== KL divergence statistics ======
+ Mean KLD: 0.171689 ± 0.000675
+ Maximum KLD: 8.647476
+ 99.9% KLD: 3.093943
+ 99.0% KLD: 1.167801
+ 99.0% KLD: 1.167801
+ Median KLD: 0.116922
+ 10.0% KLD: 0.009604
+ 5.0% KLD: 0.003321
+ 1.0% KLD: 0.000607
+ Minimum KLD: 0.000004
 
  ====== Token probability statistics ======
+ Mean Δp: -5.020 ± 0.030 %
+ Maximum Δp: 68.248%
+ 99.9% Δp: 34.217%
+ 99.0% Δp: 17.833%
+ 95.0% Δp: 7.053%
+ 90.0% Δp: 2.939%
+ 75.0% Δp: 0.031%
+ Median Δp: -1.308%
+ 25.0% Δp: -7.961%
+ 10.0% Δp: -18.717%
+ 5.0% Δp: -26.619%
+ 1.0% Δp: -49.119%
+ 0.1% Δp: -82.457%
+ Minimum Δp: -99.092%
+ RMS Δp : 12.587 ± 0.055 %
+ Same top p: 80.614 ± 0.104 %
scores/Watt-Tool-8B-q3_k_s.tqa CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_S.gguf (version GGUF V3 (latest))
 
- Final result: 35.6707 +/- 2.6490
- Random chance: 28.6213 +/- 2.4995
 
 
- llama_perf_context_print: load time = 315.60 ms
- llama_perf_context_print: prompt eval time = 83571.96 ms / 17852 tokens ( 4.68 ms per token, 213.61 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 84826.79 ms / 17853 tokens
  ggml_metal_free: deallocating
 
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_S.gguf (version GGUF V3 (latest))
 
+ Final result: 33.2000 +/- 1.7207
+ Random chance: 19.8992 +/- 1.4588
 
 
+ llama_perf_context_print: load time = 290.87 ms
+ llama_perf_context_print: prompt eval time = 172697.11 ms / 50072 tokens ( 3.45 ms per token, 289.94 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 174519.26 ms / 50073 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q3_k_s.wng CHANGED
@@ -1,11 +1,11 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_S.gguf (version GGUF V3 (latest))
 
- Final Winogrande score(750 tasks): 71.8667 +/- 1.6430
 
- llama_perf_context_print: load time = 314.44 ms
- llama_perf_context_print: prompt eval time = 96371.78 ms / 22317 tokens ( 4.32 ms per token, 231.57 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 97549.79 ms / 22318 tokens
  ggml_metal_free: deallocating
 
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q3_K_S.gguf (version GGUF V3 (latest))
 
+ Final Winogrande score(750 tasks): 73.6000 +/- 1.6106
 
+ llama_perf_context_print: load time = 290.68 ms
+ llama_perf_context_print: prompt eval time = 75062.12 ms / 22192 tokens ( 3.38 ms per token, 295.65 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 75536.06 ms / 22193 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q4_k_m.arc CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q4_K_M.gguf (version GGUF V3 (latest))
 
- Final result: 64.6586 +/- 1.7502
- Random chance: 25.0335 +/- 1.5861
 
 
- llama_perf_context_print: load time = 2144.99 ms
- llama_perf_context_print: prompt eval time = 164022.15 ms / 37149 tokens ( 4.42 ms per token, 226.49 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 166313.65 ms / 37150 tokens
  ggml_metal_free: deallocating
 
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q4_K_M.gguf (version GGUF V3 (latest))
 
+ Final result: 65.7333 +/- 1.7342
+ Random chance: 25.0083 +/- 1.5824
 
 
+ llama_perf_context_print: load time = 2082.07 ms
+ llama_perf_context_print: prompt eval time = 124311.12 ms / 36600 tokens ( 3.40 ms per token, 294.42 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 125214.84 ms / 36601 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q4_k_m.hsw CHANGED
@@ -1,12 +1,12 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q4_K_M.gguf (version GGUF V3 (latest))
 
- 750 75.33333333
 
 
- llama_perf_context_print: load time = 293.62 ms
- llama_perf_context_print: prompt eval time = 433672.45 ms / 123896 tokens ( 3.50 ms per token, 285.69 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 439340.46 ms / 123897 tokens
  ggml_metal_free: deallocating
 
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q4_K_M.gguf (version GGUF V3 (latest))
 
+ 750 77.73333333% [74.6188%, 80.5653%]
 
 
+ llama_perf_context_print: load time = 309.44 ms
+ llama_perf_context_print: prompt eval time = 431270.97 ms / 126448 tokens ( 3.41 ms per token, 293.20 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 435567.98 ms / 126449 tokens
  ggml_metal_free: deallocating
scores/Watt-Tool-8B-q4_k_m.mmlu CHANGED
@@ -1,13 +1,13 @@
- build: 4945 (e354bc3b) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.3.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q4_K_M.gguf (version GGUF V3 (latest))
 
- Final result: 40.2667 +/- 1.7920
  Random chance: 25.0000 +/- 1.5822
 
 
- llama_perf_context_print: load time = 297.85 ms
- llama_perf_context_print: prompt eval time = 262597.88 ms / 70659 tokens ( 3.72 ms per token, 269.08 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
- llama_perf_context_print: total time = 265393.62 ms / 70660 tokens
  ggml_metal_free: deallocating
 
+ build: 5150 (2db9ba14) with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
  llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
  llama_model_loader: loaded meta data with 40 key-value pairs and 292 tensors from ./Watt-Tool-8B-Q4_K_M.gguf (version GGUF V3 (latest))
 
+ Final result: 39.4667 +/- 1.7860
  Random chance: 25.0000 +/- 1.5822
 
 
+ llama_perf_context_print: load time = 297.51 ms
+ llama_perf_context_print: prompt eval time = 223210.15 ms / 67195 tokens ( 3.32 ms per token, 301.04 tokens per second)
  llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 224621.22 ms / 67196 tokens
  ggml_metal_free: deallocating