attn-signs/Watari-32b-v2

@@ -1,199 +1,194 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
 library_name: transformers
+datasets:
+- Vikhrmodels/GrandMaster-PRO-MAX
+- attn-signs/kolmogorov-3
+- attn-signs/russian-code
 ---
+# Watari 32B (V2)
+- [EN]
+A Qwen2.5-based model adapted for Russian text generation tasks.
+The model has an extended tokenizer and a properly adapted chat template.
+The model was trained using LoRA adapters.
+The model was trained in **2 stages**.
+- [RU]
+Finetune версия Qwen2.5, адаптированная для генерации русского текста.
+Модель имеет расширенный токенайзер и правильный адаптированный чат темплейт (произведена работа над ошибками).
+Модель была обучена с использованием низкоранговых адаптеров LoRA.
+Модель была обучена в **2 стадии**
+### Previous models (considering parameters / states):
+- Watari-7b-v1
+- Watari-32b-v0
+## Model Details / Детализация модели
+- [EN]
+LoRA supervised fine-tuning was performed on **2x NVIDIA A100** GPUs for **~8 days**.
+**Datasets used:**
+- GrandMaster [Vikhrmodels/GrandMaster-PRO-MAX] (0.6 epochs)
+- Kolmogorov-3 [attn-signs/kolmogorov-3] (1 epoch)
+- Russian Code [attn-signs/russian-code] (1 epoch)
+**Extensions:**
+The model has an extended tokenizer, based on an arXiv paper and the work of RefalMachine (RuAdapt / Moscow State University).
+**Huge thanks to Mikhail Tikhomirov for his hard scientific work and the tokenizer extension methods he developed.**
+Generation in Russian is 60% cheaper and faster thanks to the extended tokenizer (see the tokenizer research at the end).
+- [RU]
+SFT LoRA обучение было выполнено на **двух NVIDIA A100**, обучение длилось около **8 дней**.
+**Использованные датасеты:**
+- GrandMaster [Vikhrmodels/GrandMaster-PRO-MAX] (0.6 эпохи)
+- Kolmogorov-3 [attn-signs/kolmogorov-3] (1 эпоха)
+- Russian Code [attn-signs/russian-code] (1 эпоха)
+Модель имеет расширенный токенайзер, метод основан на arxiv статье и работах RefalMachine (RuAdapt / Московский Государственный Университет).
+**Выражаю большое уважение Михаилу Тихомирову за его научные работы и методы расширения токенайзера.**
+Генерация модели, благодаря методу на 60% более быстрая и менее дорогая (см. исследование токенайзера в конце статьи).
+### Model Description / Описание модели
+- **Developed by:** [Reisen Raumberg (Attention Signs team)]
+- **Language(s) (NLP):** [RU/EN]
+- **Finetuned from model:** [Qwen2.5]
+**Distributed training:**
+- DeepSpeed (Stage 3)
+- HuggingFace Accelerate
+**Fusion:**
+- Flash Attention 2
+- Fused AdamW
+- Liger Kernel (swiglu, fused linear xentropy)
+**GPU hours**: ~384 on NVIDIA A100 (2 GPUs x ~8 days x 24 h)
+### Training configuration / Конфигурация обучения
+**The model was trained using MyLLM framework:**
+--== [MyLLM](https://github.com/Raumberg/myllm) ==--
+**Model training / Обучение модели**
+The model was trained in 2 stages:
+- Stage 1:
+  - Datasets: GrandMaster, LoRA: rank=128, alpha=256
+- Stage 2:
+  - Datasets: Kolmogorov-3, Russian Code, LoRA: rank=256, alpha=256
+**All configs are available in MyLLM repository.**
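The two stages above differ only in their datasets and LoRA rank. As a rough illustration of what that implies, the sketch below computes the update scaling for each stage, assuming the standard LoRA convention of `alpha / rank`. The `LoraStage` container is hypothetical and is not part of MyLLM. The card does not say whether a rank-stabilised variant was used, so treat the resulting numbers as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraStage:
    """One fine-tuning stage as listed in the card (hypothetical container)."""
    name: str
    datasets: tuple
    rank: int
    alpha: int

    @property
    def scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / rank.
        # Assumption: no rank-stabilised variant (alpha / sqrt(rank)) was used.
        return self.alpha / self.rank


# Values copied from the stage list in the card.
STAGES = (
    LoraStage("stage 1", ("GrandMaster",), rank=128, alpha=256),
    LoraStage("stage 2", ("Kolmogorov-3", "Russian Code"), rank=256, alpha=256),
)

for stage in STAGES:
    print(f"{stage.name}: rank={stage.rank}, alpha={stage.alpha}, scaling={stage.scaling}")
```

Under that assumption, stage 2 doubles the adapter capacity (rank) while halving the scaling factor relative to stage 1.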
+### Using the model / Как запустить?
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+repo = 'attn-signs/Watari-32b-v2'
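+# Load in the checkpoint's native dtype and let accelerate place the weights
+# (requires the accelerate package); a 32B model in default fp32 will not fit on one GPU.
+model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto", device_map="auto")
+tokenizer = AutoTokenizer.from_pretrained(repo)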
+prompt = 'Расскажи мне про теорию категорий. Как мне построить коммутативную диаграмму? Знаешь ли ты что-то про театры ходжа-аракелова?'
+messages = [
+    {"role": "system", "content": "Ты Ватари, ассистент и помощник в решении различных задач. Отвечай на вопросы пользователя, рассуждая."},
+    {"role": "user", "content": prompt}
+]
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+generated_ids = model.generate(
+    **model_inputs,
+    max_new_tokens=32768
+)
+generated_ids = [
+    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+]
+response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+print(response)
+```
+```
+Теория категорий — это раздел математики, который изучает абстрактные структуры и их взаимосвязи. Она предоставляет общий язык для описания математических объектов и преобразований между ними. Основные понятия теории категорий включают категории, объекты, морфизмы (стрелки), композиции морфизмов и коммутативные диаграммы.
+**Коммутативная диаграмма** — это визуальное представление того, как различные морфизмы взаимодействуют друг с другом. Чтобы построить её:
+1. **Определите объекты**: Выберите математические объекты (например, множества, группы, векторные пространства), которые будут вершинами диаграммы.
+2. **Укажите морфизмы**: Нарисуйте стрелки между объектами, обозначающие функции или отображения между ними.
+3. **Убедитесь в коммутативности**: Убедитесь, что для любой пары путей из одной вершины в другую, композиция морфизмов вдоль этих путей дает одинаковое преобразование. Это означает, что независимо от выбранного пути результат будет одинаковым.
+Пример: Если у вас есть объекты A, B, C и морфизмы f: A → B, g: B → C, h: A → C, то диаграмма коммутирует, если g ∘ f = h.
+Для построения диаграмм часто используют программы вроде TikZ в LaTeX, где можно точно указать расположение объектов и стрелки между ними.
+Что касается **театров Ходжа-Аракелова**, я не располагаю информацией о таком понятии. Возможно, имеется в виду теория Ходжа или Аракеловская геометрия, но "театры" не являются стандартным термином в математике. Если вы имеете в виду конкретную концепцию или предмет исследования, уточните, пожалуйста, чтобы я мог предоставить точный ответ.
+```
+### Tokenizer research / Исследование токенайзера:
+You can inspect the tokenization internals yourself with the Python code below:
+Можно рассмотреть внутренности токенизации самостоятельно, для этого прилагается следующий python код:
+```python
+input_text = "Привет! Я Ватари, интеллектуальный помощник в решении различных задач."
+# Tokenize
+tokenized = tokenizer(input_text, return_tensors="pt", return_offsets_mapping=True)
+tokens = tokenizer.convert_ids_to_tokens(tokenized["input_ids"][0])
+# Print raw tokens and decoded versions
+print("Tokenization Analysis:\n")
+for i, (token, offset) in enumerate(zip(tokens, tokenized.offset_mapping[0])):
+    # Get start/end positions in original text
+    start, end = offset.tolist()
+    original_slice = input_text[int(start):int(end)]
+    # Clean the token representation: replace Ġ and ▁ (both mark a leading space) with a real space
+    cleaned_token = token.replace('Ġ', ' ').replace('▁', ' ')
+    print(f"Token {i}:")
+    print(f"  Raw: {token}")
+    print(f"  Cleaned: {cleaned_token}")
+    print(f"  Decoded: {tokenizer.decode(tokenized['input_ids'][0][i])}")
+    print(f"  Original text slice: '{original_slice}'")
+    print(f"  Byte representation: {list(token.encode('utf-8'))}")
+    print("-" * 50)
+# Verify full reconstruction
+print("\nFull Reconstruction:", tokenizer.decode(tokenized["input_ids"][0]))
+```
+**Output / Результат:**
+```
+...
+--------------------------------------------------
+Token 8:
+  Raw: ĠÐ¸Ð½ÑĤÐµÐ»Ð»ÐµÐºÑĤ
+  Cleaned:  Ð¸Ð½ÑĤÐµÐ»Ð»ÐµÐºÑĤ
+  Decoded:  интеллект
+  Original text slice: ' интеллект'
+  Byte representation: [196, 160, 195, 144, 194, 184, 195, 144, 194, 189, 195, 145, 196, 164, 195, 144, 194, 181, 195, 144, 194, 187, 195, 144, 194, 187, 195, 144, 194, 181, 195, 144, 194, 186, 195, 145, 196, 164]
+--------------------------------------------------
+Token 9:
+  Raw: Ñĥ
+  Cleaned: Ñĥ
+  Decoded: у
+  Original text slice: 'у'
+  Byte representation: [195, 145, 196, 165]
+...
+Token 13:
+  Raw: ĠÑĢÐµÑĪÐµÐ½Ð¸Ð¸
+  Cleaned:  ÑĢÐµÑĪÐµÐ½Ð¸Ð¸
+  Decoded:  решении
+  Original text slice: ' решении'
+  Byte representation: [196, 160, 195, 145, 196, 162, 195, 144, 194, 181, 195, 145, 196, 170, 195, 144, 194, 181, 195, 144, 194, 189, 195, 144, 194, 184, 195, 144, 194, 184]
+--------------------------------------------------
+Token 14:
+  Raw: ĠÑĢÐ°Ð·Ð»Ð¸ÑĩÐ½ÑĭÑħ
+  Cleaned:  ÑĢÐ°Ð·Ð»Ð¸ÑĩÐ½ÑĭÑħ
+  Decoded:  различных
+  Original text slice: ' различных'
+  Byte representation: [196, 160, 195, 145, 196, 162, 195, 144, 194, 176, 195, 144, 194, 183, 195, 144, 194, 187, 195, 144, 194, 184, 195, 145, 196, 169, 195, 144, 194, 189, 195, 145, 196, 173, 195, 145, 196, 167]
+--------------------------------------------------
+Full Reconstruction: Привет! Я Ватари, интеллектуальный помощник в решении различных задач.
+```
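One way to sanity-check the "60% cheaper and faster" claim is to compare total token counts from a base tokenizer and the extended one over the same texts. The sketch below is a minimal, hedged illustration. `token_reduction` is a hypothetical helper, and the two toy counters stand in for real tokenizers so the example runs without downloads. It measures only the token-count reduction, which is an assumption about how the claim was derived.

```python
from typing import Callable, Iterable


def token_reduction(
    texts: Iterable[str],
    count_base: Callable[[str], int],
    count_extended: Callable[[str], int],
) -> float:
    """Return the fractional drop in total token count (0.6 means 60% fewer tokens)."""
    texts = list(texts)
    base_total = sum(count_base(t) for t in texts)
    extended_total = sum(count_extended(t) for t in texts)
    if base_total == 0:
        raise ValueError("base tokenizer produced no tokens")
    return 1.0 - extended_total / base_total


# Toy stand-ins so the sketch runs offline (NOT the real tokenizers):
def count_chars(text: str) -> int:
    # Pretend the base tokenizer emits one token per non-space character.
    return len(text.replace(" ", ""))


def count_words(text: str) -> int:
    # Pretend the extended tokenizer emits one token per whitespace-separated word.
    return len(text.split())


sample = ["Привет! Я Ватари, интеллектуальный помощник в решении различных задач."]
print(f"{token_reduction(sample, count_chars, count_words):.0%} fewer tokens")
```

With real tokenizers, each counter could be something like `lambda t: len(tok(t, add_special_tokens=False)["input_ids"])`, where `tok` is an `AutoTokenizer` instance. The exact base Qwen2.5 checkpoint is not named in the card, so that choice is left to the reader.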