Upload folder using huggingface_hub

Files changed (7)

README.md +183 -3
config.json +41 -0
model.safetensors +3 -0
special_tokens_map.json +37 -0
tokenizer.json +0 -0
tokenizer_config.json +65 -0
vocab.txt +0 -0

README.md CHANGED

@@ -1,3 +1,183 @@
----
-license: cc-by-sa-4.0
----

+# rubert_tiny2_russian_emotion_sentiment
+## Description
+`rubert_tiny2_russian_emotion_sentiment` is a fine-tuned version of the lightweight [`cointegrated/rubert-tiny2`](https://huggingface.co/cointegrated/rubert-tiny2) model that classifies Russian-language messages into five emotion classes:
+- **0**: aggression
+- **1**: anxiety
+- **2**: neutral
+- **3**: positive
+- **4**: sarcasm
+### Validation results
+| Metric     | Value  |
+|------------|--------|
+| Accuracy   | 0.8911 |
+| F1 macro   | 0.8910 |
+| F1 micro   | 0.8911 |
+**Per-class accuracy**:
+- aggression (0): 0.9120
+- anxiety    (1): 0.9462
+- neutral    (2): 0.8663
+- positive   (3): 0.8884
+- sarcasm    (4): 0.8426
+### Usage
+```bash
+pip install transformers torch
+```
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+# Load the model and tokenizer
+MODEL_ID = "Kostya165/rubert_tiny2_russian_emotion_sentiment"
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+model     = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
+model.eval()
+texts = [
+    "Сегодня отличный день!",
+    "Меня это всё бесит и раздражает."
+]
+# Tokenize the batch (pad to the longest text, truncate at 128 tokens)
+enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
+with torch.no_grad():
+    logits = model(**enc).logits
+    preds = logits.argmax(dim=-1).tolist()
+# Map predicted class IDs back to label names
+id2label = model.config.id2label
+labels = [id2label[p] for p in preds]
+print(labels)  # e.g. ['positive', 'aggression']
+```
+### Training details
+- **Base model**: `cointegrated/rubert-tiny2`
+- **Dataset**: `Kostya165/ru_emotion_dvach`
+- **Epochs**: 2
+- **Batch size**: 32
+- **Learning rate**: 1e-5
+- **Mixed precision**: FP16
+- **Regularization**: dropout 0.1, weight_decay 0.01, warmup_ratio 0.1
+### Requirements
+- `transformers>=4.30.0`
+- `torch>=1.10.0`
+- `datasets`
+- `evaluate`
+### License
+CC-BY-SA 4.0.
+### Citation
+```bibtex
+@misc{rubert_tiny2_russian_emotion_sentiment,
+  title   = {Russian Emotion Sentiment Classification with RuBERT-tiny2},
+  author  = {Kostya165},
+  year    = {2024},
+  howpublished = {\url{https://huggingface.co/Kostya165/rubert_tiny2_russian_emotion_sentiment}}
+}
+```
+---
+## English
+# rubert_tiny2_russian_emotion_sentiment
+**Description**
+The `rubert_tiny2_russian_emotion_sentiment` model is a fine-tuned version of the lightweight [`cointegrated/rubert-tiny2`](https://huggingface.co/cointegrated/rubert-tiny2) for classifying five emotions in Russian text:
+- **0**: aggression
+- **1**: anxiety
+- **2**: neutral
+- **3**: positive
+- **4**: sarcasm
+**Validation Results**
+| Metric     | Value  |
+|------------|--------|
+| Accuracy   | 0.8911 |
+| F1 macro   | 0.8910 |
+| F1 micro   | 0.8911 |
+**Per-class accuracy**:
+- aggression: 0.9120
+- anxiety:    0.9462
+- neutral:    0.8663
+- positive:   0.8884
+- sarcasm:    0.8426
+**Usage**
+```bash
+pip install transformers torch
+```
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+MODEL_ID = "Kostya165/rubert_tiny2_russian_emotion_sentiment"
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+model     = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
+model.eval()
+texts = ["Сегодня отличный день!", "Меня это всё бесит и раздражает."]
+enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
+with torch.no_grad():
+    logits = model(**enc).logits
+    preds = logits.argmax(dim=-1).tolist()
+labels = [model.config.id2label[p] for p in preds]
+print(labels)  # e.g. ['positive', 'aggression']
+```
+**Training Details**
+- Base: `cointegrated/rubert-tiny2`
+- Dataset: `Kostya165/ru_emotion_dvach` (train/validation)
+- Epochs: 2
+- Batch size: 32
+- Learning rate: 1e-5
+- Mixed precision: FP16
+- Regularization: Dropout 0.1, weight_decay 0.01, warmup_ratio 0.1
+**Requirements**
+- `transformers>=4.30.0`
+- `torch>=1.10.0`
+- `datasets`
+- `evaluate`
+**License**
+CC-BY-SA 4.0.
+**Citation**
+```bibtex
+@misc{rubert_tiny2_russian_emotion_sentiment,
+  title   = {Russian Emotion Sentiment Classification with RuBERT-tiny2},
+  author  = {Kostya165},
+  year    = {2024},
+  howpublished = {\url{https://huggingface.co/Kostya165/rubert_tiny2_russian_emotion_sentiment}}
+}
+```
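
The usage snippet in the README keeps only the argmax label. When a confidence score per class is useful, the logits can be turned into probabilities with a softmax. The sketch below is a minimal, dependency-free illustration of that step; the helper names and the example logits are hypothetical and are not part of the committed README or real model output.

```python
import math

# Label mapping copied from the id2label block in config.json.
ID2LABEL = {0: "aggression", 1: "anxiety", 2: "neutral", 3: "positive", 4: "sarcasm"}


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probabilities(logits):
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = softmax(logits)
    pairs = [(ID2LABEL[i], p) for i, p in enumerate(probs)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical logits for a single message (illustrative values only).
example_logits = [0.2, -1.3, 0.5, 3.1, 0.0]
ranked = label_probabilities(example_logits)
for label, prob in ranked:
    print(f"{label:<10} {prob:.3f}")
```

With the real model, each row of `logits.tolist()` from the usage snippet could be passed to `label_probabilities` in place of the hypothetical values.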

config.json ADDED

@@ -0,0 +1,41 @@

+{
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "emb_size": 312,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 312,
+  "id2label": {
+    "0": "aggression",
+    "1": "anxiety",
+    "2": "neutral",
+    "3": "positive",
+    "4": "sarcasm"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 600,
+  "label2id": {
+    "aggression": 0,
+    "anxiety": 1,
+    "neutral": 2,
+    "positive": 3,
+    "sarcasm": 4
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 2048,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 3,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "problem_type": "single_label_classification",
+  "torch_dtype": "float32",
+  "transformers_version": "4.50.0.dev0",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 83828
+}

model.safetensors ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b2e45121e908f3bf6885bdcf73dd0777749d35cf499154cc6740423347343030
+size 116787892
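
As a sanity check on the two files above, the parameter count implied by `config.json` can be compared with the checkpoint size in the LFS pointer. The sketch below assumes the standard BERT layout (word, position and token-type embeddings, encoder layers, pooler, and a single linear classification head) and float32 storage as stated by `torch_dtype`; the helper names are hypothetical, and the unused `emb_size` field is ignored.

```python
# Values copied from config.json; num_labels is the size of id2label.
CONFIG = {
    "vocab_size": 83828,
    "hidden_size": 312,
    "intermediate_size": 600,
    "max_position_embeddings": 2048,
    "type_vocab_size": 2,
    "num_hidden_layers": 3,
    "num_labels": 5,
}
CHECKPOINT_BYTES = 116787892  # "size" field of the model.safetensors pointer


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def count_parameters(cfg):
    """Parameter count of a standard BERT sequence classifier (assumed layout)."""
    h, ff = cfg["hidden_size"], cfg["intermediate_size"]
    rows = cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    embeddings = rows * h + layer_norm(h)
    attention = 4 * linear(h, h) + layer_norm(h)  # query, key, value, output
    feed_forward = linear(h, ff) + linear(ff, h) + layer_norm(h)
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    pooler = linear(h, h)
    classifier = linear(h, cfg["num_labels"])
    return embeddings + encoder + pooler + classifier


total = count_parameters(CONFIG)
weight_bytes = total * 4  # float32 uses 4 bytes per parameter
print(f"parameters:   {total:,}")
print(f"weight bytes: {weight_bytes:,}")
print(f"remainder:    {CHECKPOINT_BYTES - weight_bytes:,} bytes")
```

Under these assumptions the estimate is 29,195,333 parameters, or 116,781,332 bytes of weights. That leaves 6,560 bytes of the reported file size, which is plausibly the safetensors metadata header. Most of the parameters sit in the word-embedding matrix, which follows from the large vocabulary relative to the small hidden size.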

special_tokens_map.json ADDED

@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

@@ -0,0 +1,65 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "max_length": 512,
+  "model_max_length": 2048,
+  "never_split": null,
+  "pad_to_multiple_of": null,
+  "pad_token": "[PAD]",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}
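
To make the tokenizer settings above concrete, the sketch below shows how a BERT-style tokenizer with this configuration would frame and pad a batch: `[CLS]` (ID 2) in front, `[SEP]` (ID 3) at the end, truncation on the right, and `[PAD]` (ID 0) appended on the right. It is a simplified illustration rather than the `BertTokenizer` implementation; the function names and the word-piece IDs are hypothetical.

```python
# Special-token IDs taken from added_tokens_decoder in tokenizer_config.json.
PAD_ID, UNK_ID, CLS_ID, SEP_ID, MASK_ID = 0, 1, 2, 3, 4


def frame(token_ids, max_length):
    """Wrap one sequence as [CLS] tokens [SEP], truncating on the right."""
    body = token_ids[: max_length - 2]  # reserve two slots for [CLS] and [SEP]
    return [CLS_ID] + body + [SEP_ID]


def pad_batch(sequences):
    """Right-pad every sequence to the longest one and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [PAD_ID] * (width - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Hypothetical word-piece IDs; real values would come from vocab.txt.
batch = [
    frame([101, 102, 103], max_length=128),
    frame([201, 202, 203, 204, 205, 206], max_length=6),  # forces truncation
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The attention mask marks padded positions with 0 so the model ignores them, which is what `padding=True` produces in the README usage snippet.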

vocab.txt ADDED

The diff for this file is too large to render. See raw diff