Ruadapt

non-profit

AI & ML interests

NLP, LLM


Description

Ruadapt is a project focused on developing a methodology for adapting large language models (LLMs) to Russian, replacing the tokenizer to make the models more efficient. Importantly, the methodology is applicable to practically any language, since it does not rely on any language-specific techniques.

In addition to developing the methodology itself, we use it to adapt existing SOTA open-source models and release the results publicly. For example, the models in our RuadaptQwen2.5 series generate Russian-language text 30-60% faster (measured in characters, since each token covers more text) thanks to better-suited tokenization, with minimal quality loss in both English and Russian.
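A minimal sketch of how this efficiency gain can be measured with the `transformers` library. The Ruadapt repo id below is an assumption for illustration; the characters-per-token ratio directly determines how much text a fixed token budget produces.

```python
# Compare how many characters each tokenizer packs into a single token.
# Repo ids are illustrative assumptions, not an official evaluation script.
from transformers import AutoTokenizer

text = "Токенизация напрямую влияет на скорость генерации русскоязычного текста."

for repo_id in [
    "Qwen/Qwen2.5-7B-Instruct",                  # original Qwen tokenizer
    "RefalMachine/RuadaptQwen2.5-7B-Lite-Beta",  # assumed Ruadapt checkpoint
]:
    tok = AutoTokenizer.from_pretrained(repo_id)
    n_tokens = len(tok(text)["input_ids"])
    print(f"{repo_id}: {n_tokens} tokens, {len(text) / n_tokens:.2f} chars/token")
```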

One of the unique features of our approach is that, thanks to the LEP (Learned Embedding Propagation) method (see the first paper below), we adapt the base version of a model only once and can then very cheaply adapt any instruction-tuned version derived from that base. For instance, after adapting Qwen2.5-32B, we obtained RuadaptQwen2.5 versions not only of Qwen2.5-32B-Instruct but also of QwQ-32B-Preview, QwQ-32B, FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview (while preserving its reasoning capabilities), and T-pro-it-1.0.
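A deliberately simplified sketch of the propagation idea: the embeddings learned during the one-time base-model adaptation are transplanted into an instruction-tuned checkpoint derived from the same base. The full LEP procedure is described in the paper; the repo ids and the plain copy below are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical repo ids for illustration only.
adapted_base = AutoModelForCausalLM.from_pretrained(
    "RefalMachine/ruadapt-base-example", torch_dtype=torch.bfloat16)
instruct = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct", torch_dtype=torch.bfloat16)

# Resize the instruct model to the new (Russian-efficient) vocabulary, then
# copy in the input embeddings and LM head learned on the adapted base model.
new_vocab_size = adapted_base.get_input_embeddings().weight.shape[0]
instruct.resize_token_embeddings(new_vocab_size)
instruct.get_input_embeddings().weight.data.copy_(
    adapted_base.get_input_embeddings().weight.data)
instruct.get_output_embeddings().weight.data.copy_(
    adapted_base.get_output_embeddings().weight.data)
```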

An intriguing aspect of adapting T-pro-it-1.0 is that this model was itself obtained through continued pretraining on over 100 billion tokens of Russian-language data with full fine-tuning. Despite this extensive prior training, our methodology still worked effectively (note: it was the original base model, Qwen2.5-32B, that we adapted!), and the adapted version matched or outperformed T-pro-it-1.0 on several benchmarks while also tokenizing Russian more efficiently.

For adaptation, we sample from a combination of the open datasets HuggingFaceFW/fineweb-2 and IlyaGusev/rulm. A study of how data volume and quality affect the process is ongoing.
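A minimal sketch of how such a mixture could be sampled with the `datasets` library. The config name, splits, mixing ratio, and text field are assumptions for illustration, not the project's actual recipe.

```python
from datasets import load_dataset, interleave_datasets

# Config/split names and the 70/30 ratio are illustrative assumptions.
fineweb = load_dataset("HuggingFaceFW/fineweb-2", name="rus_Cyrl",
                       split="train", streaming=True)
rulm = load_dataset("IlyaGusev/rulm", split="train", streaming=True)

mixture = interleave_datasets([fineweb, rulm],
                              probabilities=[0.7, 0.3], seed=42)

for example in mixture.take(3):
    print(example["text"][:80])  # assumes both datasets expose a "text" field
```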

Papers

Tikhomirov M., Chernyshev D. Facilitating Large Language Model Russian Adaptation with Learned Embedding Propagation // Journal of Language and Education. 2024. Vol. 10, No. 4. pp. 130-145. (Preprint: https://arxiv.org/abs/2412.21140)

Tikhomirov M., Chernyshev D. Improving Large Language Model Russian Adaptation with Preliminary Vocabulary Optimization // Lobachevskii Journal of Mathematics. 2024. Vol. 45, No. 7. pp. 3211-3219.

Tikhomirov M., Chernyshev D. Impact of Tokenization on LLaMa Russian Adaptation // 2023 Ivannikov Ispras Open Conference (ISPRAS). IEEE, 2023. pp. 163-168.
