This is a T5 model I fine-tuned from the utrobinmv/t5_translate_en_ru_zh_large_1024_v2 checkpoint. My primary goal was to improve its ability to handle longer English-to-Russian translations, particularly for literature. I used around 320 MB of text data, about 80,000 translation pairs taken from books: mostly fiction, though some non-fiction was included as well. The original model struggled with long inputs, often ignoring parts of the text once it grew too long. I believe this limitation arose because the dataset it was trained on lacked enough examples of lengthy passages. By fine-tuning on this new dataset, I aimed to make it process extended texts more reliably, even if the quality of the translation itself may not have improved as much.
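Loading and running it should work like any other T5 checkpoint in transformers. Here is a minimal sketch, assuming the standard transformers API; the "translate to ru: " task prefix follows the base model's convention, so double-check it against the base model card:

```python
# Minimal inference sketch, assuming the standard transformers API.
# The "translate to ru: " prefix is the base checkpoint's convention;
# verify it against the base model card before relying on it.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "BarBarickoza/utrobinmv_t5_translate_en_ru_zh_v2-extended"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "translate to ru: " + "Furina wears a dark blue suit-like outfit."
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```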

During training, I used a `<newline>` token to represent line breaks, simply replacing `\n` with `<newline>` in the input data. After the model generates its output, I replace `<newline>` back with `\n`. This approach works reasonably well, though it's not perfect: the model occasionally outputs unexpected tokens instead of handling newlines properly. Interestingly, while testing the model after training, I noticed it started adapting English dialogue styles to better fit Russian conventions, which wasn't something I explicitly aimed for but turned out to be a nice surprise. However, there is still an issue: in the dataset, some pronouns (e.g., 'she' or 'he') were replaced with character names when contextual information was removed during preprocessing. As a result, the model sometimes swaps pronouns for names in its translations, even when the original English text doesn't mention any names. This is a limitation of the dataset that I haven't been able to fully resolve yet.
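The round trip is just a pair of string substitutions, roughly like this (a sketch of the idea; `run_model` stands in for whatever generation call you use):

```python
# Sketch of the newline round trip: "\n" becomes a literal "<newline>"
# marker before the text is fed to the model, and is restored afterwards.
def encode_newlines(text: str) -> str:
    return text.replace("\n", "<newline>")

def decode_newlines(text: str) -> str:
    return text.replace("<newline>", "\n")

source = "First paragraph.\nSecond paragraph."
prepared = encode_newlines(source)
# translated = run_model(prepared)     # hypothetical generation call
# restored = decode_newlines(translated)
assert decode_newlines(prepared) == source  # the substitution is lossless
```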

While I haven't conducted any formal evaluations, the model's ability to translate longer texts feels significantly improved. Even if the overall translation quality hasn't risen dramatically, the model now handles extended passages much better than before, which was my main objective. As for other directions like RU->EN or anything involving Chinese, I haven't explored them, so I can't say how the fine-tuning affected them. The entire project was done on a very limited budget, and most of the dataset preparation happened in my spare time. If anyone wants to run a proper benchmark of this model, I'd be genuinely curious to see the results; it would be interesting to learn how much of a difference my adjustments made.
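For anyone who wants to try, sacrebleu is an obvious starting point. Below is a hypothetical sketch, assuming a parallel EN-RU test set (the sample strings are dummy data for illustration; this is not an evaluation I ran):

```python
# Hypothetical evaluation sketch using sacrebleu: compare model outputs
# against gold Russian references. The strings below are dummy data.
import sacrebleu

hypotheses = ["Фурина носит тёмно-синий костюм."]                    # model outputs
references = [["Фурина носит тёмно-синий костюм поверх жилета."]]    # gold references

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```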

Here is an example of the difference between the two models on the same text:

Original text:
Furina wears a dark blue suit-like outfit over a vest with a jabot, all decorated with multiple blue cut stone accessories that resemble water droplets at several places. A blue sash is laid over the vest, tied into a bow and pinned with a large blue gem in the center that is later replaced with her Hydro Vision. Her cuffs are folded back, hemmed into brass-like pieces, and her sleeves end with two deep blue frills, one on each arm. She also wears a dark blue top hat tipped to the left of her head, the hat decorated with metal flourishes giving the hat a crown-like appearance, in addition to including her emblem on the top surface. Multiple other fabric ornaments decorate her outfit, including blue and white pieces of cloth on her trailing behind her waist and hat. Her Pneuma form has all the white colors in her dress, changing into a dark-blue color in her Ousia form.
The text translated into Russian by the base model:
Она также носит темно-синюю верхнюю шляпу на кончике слева от головы, шляпа украшена металлическими цветами, придающими шляпе коронный вид, в дополнение к включению ее эмблемы на верхней поверхности.

As you can see, the base model omits significant portions of the text, losing both the beginning and the end.

The text translated into Russian by my model:
Фурина носит темно-синий костюм над жилетом с жабо, украшенный множеством синих каменных аксессуаров, похожих на капли воды в нескольких местах. Поверх жилета положен голубой ремешок, привязанный к носу и заколотый большим синим драгоценным камнем в центре, который позже заменили ее «Гидровидение». Ее манжеты сложены назад, обрезаны медными кусочками, а рукава заканчиваются двумя глубокими синими оборками, по одной на каждой руке. Она также носит темно-синюю цилиндровую шляпу, подвешенную слева от головы; шляпа украшена металлическими цветами, придающими шляпе коронообразный вид, помимо эмблемы на верхней поверхности. Множество других декоративных тканей украшают ее одежду, включая синие и белые куски ткани позади пояса и шляпы. Ее форма Пнеума имеет все белые цвета в платье, переходя в темно-голубой цвет в форме Узии.

By comparing the two translations, you can see that my model not only preserves the entire passage but also captures more details from the original text.
