Model Overview

This model is a fine-tuned version of a Helsinki-NLP OPUS-MT model. It was fine-tuned on the Tatoeba dataset for the following language pairs:

English to Marathi (en-mr)

Esperanto to Dutch (eo-nl)

Spanish to Portuguese (es-pt)

French to Russian (fr-ru)

Spanish to Galician (es-gl)

The model performs sequence-to-sequence translation. For faster inference, its weights were converted from FP32 to FP16 (half precision).

Model Details

Base Model: Helsinki-NLP/opus-mt-en-roa

Training Dataset: Tatoeba dataset

Fine-tuned Language Pairs: en-mr, eo-nl, es-pt, fr-ru, es-gl

Evaluation Metric: BLEU Score (using sacreBLEU)

Training Framework: Hugging Face Transformers

Training Configuration

Optimizer: AdamW

Learning Rate: 2e-5

Batch Size: 16 (per device)

Weight Decay: 0.01

Epochs: 3

Precision: FP32 (initial training), converted to FP16 for inference
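
The hyperparameters listed above map naturally onto the Trainer API in Hugging Face Transformers. The sketch below is an assumption about how such a run could be configured, not the authors' actual training script; `TRAINING_CONFIG`, `build_training_args`, and the output directory are hypothetical names. AdamW is not set explicitly because it is the optimizer family the Trainer uses by default.

```python
# Hyperparameters as listed in this model card.
TRAINING_CONFIG = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "weight_decay": 0.01,
    "num_train_epochs": 3,
    "fp16": False,  # train in FP32; FP16 conversion happens afterwards
}


def build_training_args(output_dir: str):
    """Build Seq2SeqTrainingArguments from the values above.

    transformers is imported lazily so the configuration itself can be
    inspected without the library installed.
    """
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAINING_CONFIG)


# Example (requires transformers and PyTorch):
# args = build_training_args("opus-mt-finetuned")  # hypothetical output path
```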

Quantization and FP16 Conversion

To improve inference efficiency, models were converted to FP16:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
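
The conversion code in this card stops after its imports. One common way to perform such a conversion is to load the fine-tuned weights and cast them with PyTorch's `half()`. The following is a minimal sketch under that assumption, not necessarily the procedure the authors used; the helper names `convert_to_fp16` and `convert_checkpoint` and the output path are hypothetical.

```python
import torch


def convert_to_fp16(model: torch.nn.Module) -> torch.nn.Module:
    """Cast all floating-point parameters and buffers to torch.float16."""
    return model.half()


def convert_checkpoint(source: str, output_dir: str) -> None:
    """Load a seq2seq checkpoint, cast it to FP16, and save it with its tokenizer."""
    # Imported lazily so convert_to_fp16 can be used with plain PyTorch modules.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model = convert_to_fp16(AutoModelForSeq2SeqLM.from_pretrained(source))
    tokenizer = AutoTokenizer.from_pretrained(source)

    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)


# Example (downloads the checkpoint; the output path is hypothetical):
# convert_checkpoint("AventIQ-AI/opus-mt-en-roa_multilanguageTranslation", "opus-mt-fp16")
```

Casting after training keeps the FP32 optimizer state out of the picture and roughly halves the size of the saved weights.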

Inference Example

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "AventIQ-AI/opus-mt-en-roa_multilanguageTranslation"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize the source sentence and move the tensors to the model's device.
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(device)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Usage

The models can be used for translation tasks in various NLP applications, including chatbots, document translation, and real-time communication.

Limitations

May not generalize well for domain-specific text.

Converting to FP16 may cause a minor loss of numerical precision.

Translation accuracy depends on the dataset quality.
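
The precision limitation can be illustrated without any ML library. The sketch below round-trips ordinary Python floats through IEEE 754 half precision using the standard `struct` module; the helper name `fp16_round_trip` and the sample values are illustrative assumptions, and the example says nothing about how much translation quality actually changes.

```python
import struct


def fp16_round_trip(x: float) -> float:
    """Store x as an IEEE 754 half-precision value and read it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Values that are exactly representable survive; most others are rounded.
for value in (1.0, 0.1, 3.14159, 1000.1):
    rounded = fp16_round_trip(value)
    print(f"{value!r:>10} -> {rounded!r:<20} abs. error {abs(value - rounded):.2e}")
```

The rounding error grows with the magnitude of the value, which is why weights and activations far from zero are the ones most affected by the conversion.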

Citation

If you use this model, please cite the original OPUS-MT paper and acknowledge the fine-tuning process conducted using the Tatoeba dataset.