Can't determine properly which is greater between 9.9 and 9.11
Guys, they say 3.1 beats GPT-4o... But it doesn't seem like that...
Here's the answer to a simple math question:
To determine which is greater between 9.9 and 9.11, you can compare the numbers directly.
- 9.11 has a tenths place of 1 and a hundredths place of 1.
- 9.9 has a tenths place of 9 and no hundredths place (which is equivalent to having a hundredths place of 0).
Since the tenths place in 9.11 is greater than that in 9.9, we can conclude that:
**9.11 is greater than 9.9.**
I also tried a prompt that makes the model reason first, but still no luck...
<details type="reasoning" done="true" duration="25">
<summary>Thought for 25 seconds</summary>
> 1. The user is asking which number is greater between 9.9 and 9.11.
> 2. Break the task into parts:
> - Compare the two numbers.
> 3. Assign each part to an expert:
> - Number comparison: Mathematics Expert.
> 4. Combine their insights:
>
> Mathematics Expert: To compare the numbers, you need to look at each decimal place from left to right.
>
> - 9 is equal to 9 in both cases.
> - The tenths place (the first digit after the decimal point) is 9 in the first number and 1 in the second number. Since 1 is less than 9, we don't need to look further.
>
> So, the second number (9.11) is greater than the first (9.9).
</details>
The greater number between 9.9 and 9.11 is **9.11**.
Actually, the reasoning is correct, but at the end it gives the wrong answer...
That's sad but it's true...
Number Comparison:
The Mathematics Expert will compare the two numbers.
9.9 vs. 9.11:
Both numbers have the same integer part, which is 9.
Comparing the decimal parts: 0.9 (from 9.9) vs. 0.11 (from 9.11).
Since 0.11 is greater than 0.9, 9.11 is greater than 9.9.
Conclusion:
9.11 is greater than 9.9.
I tried a few times, and it seems to wrongly treat 0.9 as 0.09... that's why it declares 9.11 greater.
I asked DeepSeek R1 1.5B the same question, and it manages to get the right answer...
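For reference, the comparison the model gets wrong can be written out by putting both decimal parts over the same denominator. This is just the standard place-value arithmetic, not anything specific to the model:

```latex
\begin{aligned}
9.9  &= 9 + \tfrac{9}{10} = 9 + \tfrac{90}{100} \\
9.11 &= 9 + \tfrac{1}{10} + \tfrac{1}{100} = 9 + \tfrac{11}{100} \\
\tfrac{90}{100} &> \tfrac{11}{100} \;\Longrightarrow\; 9.9 > 9.11, \qquad 9.9 - 9.11 = 0.79
\end{aligned}
```

Reading 0.9 as 0.09 would instead give 9/100 < 11/100, which matches the wrong conclusion in the output above.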
This is a tokenizer problem. Currently, tokenizers merge several digits into one token, so that, for example, 69 would be read as a single token, which improves efficiency but hurts number understanding.
This is common to almost all LLMs, just like letter counting, which is also largely impossible because of the tokenizer.
So this will not improve unless the model is specifically trained on it, which would reduce performance in other areas.
Sooooo, use code to determine whether one number is larger than another; LLMs are overkill.
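A minimal sketch of the "use code" route, assuming the numbers arrive as strings. The helper name `larger` is made up for illustration; `Decimal` from the standard library is used so the values are compared exactly rather than as binary floats:

```python
from decimal import Decimal


def larger(a: str, b: str) -> str:
    """Return whichever decimal string denotes the larger value (a on ties)."""
    return a if Decimal(a) >= Decimal(b) else b


print(larger("9.9", "9.11"))  # 9.9
```

Plain `float("9.9") > float("9.11")` gives the same answer here; `Decimal` just avoids rounding surprises on longer inputs.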
@Smorty100 Well, actually most modern tokenizers encode each digit as its own token, and this is true for this model too. It takes about 3 lines of code to test:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('unsloth/Mistral-Small-3.1-24B-Instruct-2503-unsloth-bnb-4bit')
print(tokenizer.encode('9.11'))
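To make the two tokenization claims in this thread concrete, here is a toy sketch of both splitting schemes. It is NOT the real tokenizer of this or any other model; the function names and the 3-digit limit are assumptions chosen purely for illustration:

```python
import re


def split_per_digit(text: str) -> list[str]:
    # Toy scheme: every digit is its own piece, as are all other characters.
    return re.findall(r"\d|\D", text)


def split_digit_runs(text: str, max_len: int = 3) -> list[str]:
    # Toy scheme: runs of up to max_len digits are merged into one piece.
    return re.findall(rf"\d{{1,{max_len}}}|\D", text)


for s in ("9.9", "9.11"):
    print(s, split_per_digit(s), split_digit_runs(s))
```

Under the per-digit scheme, "9.11" becomes `9 . 1 1`; under the merged scheme, it becomes `9 . 11`, which is the case the earlier comment was describing.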
Guys, I just tested this question online, on the OpenRouter website, and no matter how many times I retry, it always answers correctly. Any ideas?
There are several example questions, and their version is "9.11 and 9.9, which one is larger?".