Initial multilingual model deployment

Files changed (8)

.gitattributes +1 -0
README.md +98 -0
model.pt +3 -0
model_config.json +6 -0
processor_config.json +7 -0
special_tokens_map.json +15 -0
tokenizer.json +3 -0
tokenizer_config.json +55 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,98 @@

+---
+language:
+- en
+- ru
+- tg
+tags:
+- text-classification
+- sentiment-analysis
+- transformers
+- pytorch
+- multilingual
+license: mit
+---
+# advexon/multilingual-sentiment-classifier
+Multilingual text classification model fine-tuned from XLM-RoBERTa Base for sentiment analysis across English, Russian, Tajik, and other languages.
+## Model Description
+This is a multilingual text classification model based on XLM-RoBERTa. It has been trained for sentiment analysis across multiple languages and can classify text into positive, negative, and neutral categories.
+## Model Details
+- **Base Model**: XLM-RoBERTa Base
+- **Number of Labels**: 3 (Positive, Negative, Neutral)
+- **Languages**: Multilingual (English, Russian, Tajik, and others)
+- **Max Sequence Length**: 512 tokens
+## Performance
+Based on training metrics:
+- **Training Accuracy**: 58.33%
+- **Validation Accuracy**: 100%
+- **Training Loss**: 0.94
+- **Validation Loss**: 0.79
+## Usage
+### Using the Model
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+# Load the model and tokenizer from the Hub
+model_id = "advexon/multilingual-sentiment-classifier"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+model.eval()  # disable dropout for inference
+
+# Example usage
+text = "This product is amazing!"
+inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+with torch.no_grad():  # gradients are not needed for inference
+    outputs = model(**inputs)
+probabilities = torch.softmax(outputs.logits, dim=-1)
+predicted_class = torch.argmax(probabilities, dim=-1).item()
+
+# Class mapping: 0=Negative, 1=Neutral, 2=Positive
+sentiment_labels = ["Negative", "Neutral", "Positive"]
+predicted_sentiment = sentiment_labels[predicted_class]
+print(f"Predicted sentiment: {predicted_sentiment}")
+```
+### Example Predictions
+- "I absolutely love this product!" → Positive
+- "This is terrible quality." → Negative
+- "It's okay, nothing special." → Neutral
+- "Отличный сервис!" → Positive (Russian)
+- "Хунуки хуб нест" → Negative (Tajik)
+## Training
+This model was trained using:
+- **Base Model**: XLM-RoBERTa Base
+- **Optimizer**: AdamW
+- **Learning Rate**: 2e-5
+- **Batch Size**: 16
+- **Training Epochs**: 2
+- **Languages**: English, Russian, Tajik
+## Limitations
+- The model's performance may vary across different languages
+- It is recommended to fine-tune on domain-specific data for optimal performance
+- Maximum input length is 512 tokens
+- Performance may be lower on languages not well-represented in the training data
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@misc{multilingual-text-classifier,
+  title={Multilingual Text Classification Model},
+  author={Your Name},
+  year={2024},
+  publisher={Hugging Face},
+  journal={Hugging Face Hub},
+  howpublished={\url{https://huggingface.co/advexon/multilingual-sentiment-classifier}},
+}
+```
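The README's usage example ends with a softmax over three logits followed by an argmax and a label lookup. The sketch below isolates that post-processing step in plain Python so it can be checked without downloading the model. The logits are made-up values, and `softmax` and `decode` are hypothetical helper names; only the class order (0=Negative, 1=Neutral, 2=Positive) comes from the README.

```python
import math

# Class order as stated in the README: 0=Negative, 1=Neutral, 2=Positive
SENTIMENT_LABELS = ["Negative", "Neutral", "Positive"]


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return SENTIMENT_LABELS[best], probs[best]


# Hypothetical logits for a single input, not produced by the real model
example_logits = [-1.2, 0.3, 2.1]
label, confidence = decode(example_logits)
print(f"{label} ({confidence:.3f})")
```

Returning the probability alongside the label makes it possible to apply a confidence threshold downstream, which may be useful given the limitations the README lists.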

model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6cd22b17644ac6c33208e21799831c9636fc015433179e5b8b3cd11b5e55ba66
+size 1113428295
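The three added lines are a Git LFS pointer, not the weights themselves: they record the spec version, a SHA-256 object ID, and the size in bytes of the real file. A minimal sketch of reading such a pointer and checking downloaded bytes against it follows; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of any Git LFS tooling.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the space-separated key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


def matches_pointer(data, info):
    """Check that raw bytes have the size and SHA-256 digest the pointer records."""
    return (
        len(data) == info["size_bytes"]
        and hashlib.sha256(data).hexdigest() == info["digest"]
    )


# Pointer content copied from the model.pt hunk above
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:6cd22b17644ac6c33208e21799831c9636fc015433179e5b8b3cd11b5e55ba66
size 1113428295
"""
info = parse_lfs_pointer(pointer_text)
print(f"{info['hash_algo']} object, {info['size_bytes'] / 1024**3:.2f} GiB")
```

For a file of this size, hashing in chunks from disk would be preferable to loading everything into memory; the in-memory check here is only meant to show what the pointer fields mean.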

model_config.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "model_name": "xlm-roberta-base",
+  "num_labels": 3,
+  "dropout_rate": 0.2,
+  "hidden_size": 768
+}
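The config does not say how the classification head is built. Assuming the simplest layout consistent with these fields (dropout followed by one linear layer from `hidden_size` to `num_labels`), the head's trainable parameter count can be worked out directly. That layout is an assumption, not something the commit confirms.

```python
import json

# Values copied from model_config.json above
model_config = json.loads("""
{
  "model_name": "xlm-roberta-base",
  "num_labels": 3,
  "dropout_rate": 0.2,
  "hidden_size": 768
}
""")

# Assumption: head = Dropout(dropout_rate) -> Linear(hidden_size, num_labels).
# Dropout has no trainable parameters, so only the linear layer counts.
weight_params = model_config["hidden_size"] * model_config["num_labels"]
bias_params = model_config["num_labels"]
head_params = weight_params + bias_params

print(f"Linear head parameters: {head_params}")
```

Under this assumption the head is tiny compared with the encoder, so nearly all of the checkpoint's size comes from the XLM-RoBERTa weights.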

processor_config.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "tokenizer_name": "xlm-roberta-base",
+  "max_length": 512,
+  "truncation": true,
+  "padding": true,
+  "label_mapping": {}
+}
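Note that `label_mapping` is empty here, while the README states the class order as 0=Negative, 1=Neutral, 2=Positive. The sketch below shows one way the mapping could be filled in from that order. The schema (string IDs mapped to label names) is an assumption, since the commit does not document what shape `label_mapping` is expected to have.

```python
import json

# Values copied from processor_config.json above
processor_config = {
    "tokenizer_name": "xlm-roberta-base",
    "max_length": 512,
    "truncation": True,
    "padding": True,
    "label_mapping": {},
}

# Class order taken from the README: 0=Negative, 1=Neutral, 2=Positive
labels = ["Negative", "Neutral", "Positive"]
id2label = dict(enumerate(labels))
label2id = {name: idx for idx, name in id2label.items()}

# Assumed schema: JSON object keys must be strings, so IDs are stringified
processor_config["label_mapping"] = {str(idx): name for idx, name in id2label.items()}

print(json.dumps(processor_config, indent=2))
```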

special_tokens_map.json ADDED

	@@ -0,0 +1,15 @@

+{
+  "bos_token": "<s>",
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "unk_token": "<unk>"
+}

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:883b037111086fd4dfebbbc9b7cee11e1517b5e0c0514879478661440f137085
+size 17082987

tokenizer_config.json ADDED

	@@ -0,0 +1,55 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "250001": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "model_max_length": 512,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "tokenizer_class": "XLMRobertaTokenizer",
+  "unk_token": "<unk>"
+}
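The `added_tokens_decoder` block maps token IDs to special-token strings, and the `*_token` fields then assign roles to those strings. A small consistency check, sketched below on an abridged copy of the config (only the `content` fields are kept), resolves each role to its ID. The names `token_to_id` and `role_to_id` are illustrative.

```python
import json

# Abridged copy of tokenizer_config.json: only "content" is kept per token
tokenizer_config = json.loads("""
{
  "added_tokens_decoder": {
    "0": {"content": "<s>"},
    "1": {"content": "<pad>"},
    "2": {"content": "</s>"},
    "3": {"content": "<unk>"},
    "250001": {"content": "<mask>"}
  },
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>"
}
""")

# Invert the decoder: special-token string -> integer ID
token_to_id = {
    entry["content"]: int(token_id)
    for token_id, entry in tokenizer_config["added_tokens_decoder"].items()
}

# Resolve every role field; a KeyError here would mean a role names an undeclared token
role_to_id = {
    role: token_to_id[token]
    for role, token in tokenizer_config.items()
    if role.endswith("_token")
}

for role, token_id in sorted(role_to_id.items()):
    print(f"{role:>10} -> {token_id}")
```

In this config `bos_token` and `cls_token` share ID 0, and `eos_token` and `sep_token` share ID 2, which matches the role assignments in `special_tokens_map.json`.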