Great idea!

#1 · opened by mwettach

I am really interested in machine translation from a consumer perspective, but until now I have found that small LLMs do not achieve optimal performance on translation tasks, especially with target languages other than English or Chinese. I really appreciate you putting in the effort to train a model specifically for translation, and I am looking forward to testing this. MRadermacher has already produced a GGUF version.

Apparently your other version is based on Qwen-2.5-72B, which will probably run very slowly on consumer hardware. But then, translations are not necessarily time-critical, and I could run this overnight...

I assume you started several months ago, otherwise you would have chosen Qwen-3 as a basis, right? A 32B version could be a viable compromise.

Unbabel org

Hi,
Thanks for your feedback! Let us know how your experiments go.
Yes, at the time of development Qwen3 was not available.

First impressions are very promising. I provided a passage (8 kB of text) from an older English Bible commentary for translation into German.
The prompt instructions included some not-so-easy tasks:

  1. Modernize the antiquated sentence structure and wording: shorter sentences and a newspaper-like style.
  2. Keep the inline footnote markers [a], [b], etc. in the translation, at approximately the equivalent positions.
  3. Leave Hebrew and Greek words in the original language; do not translate them.
  4. Translate the complete text; do not add anything and leave nothing out.

Many other small open-source LLMs did not even manage to give me the full text and skipped over several passages. Some also hallucinated additional text.

From MRadermacher I used the 8-bit GGUF quantization of Tower-Plus-9B and the 4-bit K_S quantization of Tower-Plus-72B and imported them into Ollama with temperature=0.1.
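For anyone who wants to try a setup along these lines, here is a minimal sketch using the ollama Python client. The model name, file paths, and prompt wording below are placeholders rather than my exact configuration, and it assumes the GGUF has already been pulled locally and the `ollama` Python package is installed.

```python
# Minimal sketch: import a local GGUF into Ollama, then query it with a
# translation prompt at low temperature. Model name and file names are
# placeholders, not the exact files used in this thread.
#
# Import step (Ollama CLI), roughly:
#   Modelfile:
#     FROM ./Tower-Plus-9B.Q8_0.gguf     # placeholder filename
#     PARAMETER temperature 0.1
#   $ ollama create tower-plus-9b -f Modelfile

import ollama

INSTRUCTIONS = """Translate the following English text into German.
1. Modernize the antiquated sentence structure and wording: shorter sentences, newspaper-like style.
2. Keep the inline footnote markers [a], [b], etc. at approximately the equivalent positions.
3. Leave Hebrew and Greek words untranslated.
4. Translate the complete text; add nothing and leave nothing out."""

# Placeholder input file containing the commentary passage to translate.
with open("commentary_passage.txt", encoding="utf-8") as f:
    passage = f.read()

response = ollama.chat(
    model="tower-plus-9b",  # the name given during `ollama create`
    messages=[
        {"role": "system", "content": INSTRUCTIONS},
        {"role": "user", "content": passage},
    ],
    options={"temperature": 0.1},  # low temperature, as in the runs described below
)

print(response["message"]["content"])
```

The temperature can of course also be baked into the Modelfile (as in the comment above) instead of being passed per request.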

  • Tower-Plus-9B mostly complied with instructions 2-4, but not with instruction 1: it reproduced the quaint sentence structure with many semicolons and sometimes chose inferior or unusual wording, e.g. translating "husbandman" as "Ackermann" instead of the expected modern "Bauer". The output of each run contained the same translation twice (this happened on both of my test runs); could this be a systematic error? I found one minor omission (the phrase "as Nachmanides observes" was missing from the translation on both runs) and one minor grammatical mistake ("der einzige Objekt" should be "das einzige Objekt"), but overall the quality of the translation was much better than what I have seen before from models of this size. I did not find any hallucinated text. The 8 kB input was translated in about 1 minute on my consumer-grade machine with 16 GB of GPU memory.

  • Tower-Plus-72B did very well on all instructions in the first run, even outperforming ChatGPT 4o-mini on the same task. I found one minor omission (the text "many of the Jewish interpreters, as Aben Ezra, understand..." was rendered without the reference ", as Aben Ezra,"), but no grammatical errors or hallucinations. One footnote marker was missing. When I ran it again in the same Ollama conversation, the translation left out one complete sentence and most footnote markers. Wording and paragraph breaks changed a little between runs (due to temperature=0.1), but the meaning of the original was preserved. The 8 kB input was translated in about 25 minutes (perhaps being overly prudent, I set the GPU offload to only 10 layers during conversion).

This is my own "human eval" (I am a German native speaker with many years of exposure to English).
