Glossa-llama
This is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct, trained to translate sign language glosses (simplified textual representations of sign language) into fluent English sentences.
Model Description
- Base Model: LLaMA 3.2 1B Instruct
- Task: Sign2English translation (sequence generation)
- Fine-tuning method: LoRA (Parameter-Efficient Fine-Tuning)
- Trained using: Hugging Face PEFT, Transformers, and Colab
- Training Data: Custom dataset with gloss-style input and natural English reference output.
Each input was formatted like:
[INST] YOU WANT GO DRINK AFTER DINNER? [/INST] Say, Jim, how about going for a few beers after dinner?
Uses
Direct Use
This model is designed for direct use in translating sign language glosses (textual representations of signs) into fluent English sentences. Potential users include:
- Developers building sign-to-speech or sign-to-text applications
- Accessibility researchers and educators
- Students working on gesture-based NLP systems
Downstream Use
The model may be integrated into:
- Sign language translation pipelines (e.g., after sign recognition or gesture classification)
- Real-time assistive tools for Deaf/HoH communities
- Educational tools for ASL/BSL learners
Out-of-Scope Use
This model is not suitable for:
- Translating raw video of sign language (requires pre-processing via sign language recognition)
- Legal, medical, or safety-critical contexts
- High-stakes decisions based solely on sign interpretation
- Non-gloss input (e.g., full English sentences)
Bias, Risks, and Limitations
This model:
- May hallucinate or over-generalize responses for unfamiliar gloss inputs
- Was trained on synthetic or simplified gloss–English pairs and may not capture real-world nuance
- Is not aware of cultural or regional sign language variations (e.g., ASL vs. BSL)
- Should not be used to replace qualified interpreters or for legally binding communication
The model sometimes generates overly verbose outputs, including repetitive or semantically redundant content.
To control this, we applied sentence-count-based truncation post-generation. Future versions could benefit from length-aware decoding strategies or additional fine-tuning.
Recommendations
- Always review outputs manually before use in production or public settings.
- Use this model as a drafting assistant, not a definitive translator.
- Combine this model with sign language recognition models (e.g., operating on video) for full translation pipelines.
- Include human-in-the-loop review when deployed in accessibility applications.
How to Get Started with the Model
Use the following code to get started with the model:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "rrrr66254/llama3.2_1B_sign2eng_finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)

# Wrap the gloss in the same instruction format used during fine-tuning
prompt = "[INST] YOU WANT GO DRINK AFTER DINNER? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,          # required for temperature/top_p sampling to take effect
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.5,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Controlling Output Length (Optional)
This model occasionally generates longer or more verbose outputs than expected. To ensure the number of sentences in the output matches the number of sentences in the input gloss, use the following helper functions:
import re

def count_sentences(text):
    # Count sentence-ending punctuation marks
    return len(re.findall(r'[.!?]', text))

def truncate_to_n_sentences(text, n):
    # Split on sentence boundaries and keep only the first n sentences
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return " ".join(sentences[:n]).strip()

def generate_response(prompt, max_new_tokens=100):
    full_prompt = f"[INST] {prompt.strip()} [/INST]"
    # Keep at least one sentence even if the gloss has no ending punctuation
    num_sentences_in_prompt = max(count_sentences(prompt), 1)

    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.8,
        temperature=0.6,
        repetition_penalty=1.5,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Remove the prompt portion from the decoded output
    if "[/INST]" in decoded:
        decoded = decoded.split("[/INST]")[-1].strip()

    # Truncate to the same number of sentences as the input gloss
    return truncate_to_n_sentences(decoded, num_sentences_in_prompt)
This step is optional, but recommended if you want tighter control over output length and style.
Training Details
Training Data
The model was trained on a custom dataset of sign language glosses paired with natural English translations. Each entry was structured as:
{
"text": "[INST] GLOSS SENTENCE HERE [/INST] Fluent English output"
}
Gloss-style inputs capture the meaning of signed utterances without detailed grammatical markers. The dataset simulates real-world sign-to-English mappings.
Training Procedure
The model was fine-tuned using LoRA (Low-Rank Adaptation) via the Hugging Face peft library. The instruction-tuning format ([INST] ... [/INST]) was preserved.
Preprocessing
During preprocessing:
- Each example was split into an input (prompt) and output (target) around the [/INST] marker
- Loss masking was applied to ignore the prompt portion during training, i.e. labels[:input_len] = [-100] * input_len (see the sketch below)
- Maximum sequence length was set to 512 tokens
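The loss-masking step could be implemented roughly as follows. This is a minimal sketch, not the original training script; the preprocess function name and the exact tokenizer handling are illustrative assumptions.

def preprocess(example, tokenizer, max_length=512):
    # Split the formatted text into prompt (up to and including [/INST]) and response
    prompt, response = example["text"].split("[/INST]", 1)
    prompt = prompt + "[/INST]"

    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    full_ids = tokenizer(prompt + response, add_special_tokens=False,
                         truncation=True, max_length=max_length)["input_ids"]

    # Labels start as a copy of the input ids; the prompt portion is then
    # masked with -100 so only the English response contributes to the loss
    labels = full_ids.copy()
    input_len = min(len(prompt_ids), len(full_ids))
    labels[:input_len] = [-100] * input_len

    return {"input_ids": full_ids, "labels": labels}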
Training Hyperparameters
- Base model: Meta-LLaMA-3.2-1B-Instruct
- Method: LoRA (r=8, alpha=16, dropout=0.1)
- Epochs: 3
- Optimizer: AdamW
- Learning rate: 2e-4
- Batch size: 3
- Gradient accumulation steps: 4
- Warmup steps: 100
- Logging steps: 500
- Save strategy: per epoch
- Precision: fp16
Training was performed using the Hugging Face Trainer API in Colab.
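The snippet below reconstructs that setup from the hyperparameters listed above. It is a sketch under those stated assumptions, not the original notebook; base_model and train_dataset are placeholders for the loaded Llama-3.2-1B-Instruct model and the preprocessed dataset.

from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments, Trainer

lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)  # base_model: Llama-3.2-1B-Instruct

training_args = TrainingArguments(
    output_dir="glossa-llama-lora",   # placeholder output path
    num_train_epochs=3,
    learning_rate=2e-4,
    per_device_train_batch_size=3,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    logging_steps=500,
    save_strategy="epoch",
    fp16=True,                        # AdamW is the Trainer's default optimizer
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()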
Speeds, Sizes, Times
- Approx. training time: ~45 minutes on an A100 (depends on batch size and LoRA setup)
- Checkpoint size: ~350MB (LoRA adapter only)
- Final merged model size: ~4.6GB (adapter folded into the base weights; see the sketch below)
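If you start from the adapter checkpoint rather than the merged weights, the adapter can be folded into the base model with peft. This is a sketch assuming the published repository contains the LoRA adapter; if it already holds merged weights, this step is unnecessary.

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
# Load the LoRA adapter on top of the base model, then fold it into the weights
merged = PeftModel.from_pretrained(base, "rrrr66254/llama3.2_1B_sign2eng_finetuned").merge_and_unload()
merged.save_pretrained("glossa-llama-merged")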
Evaluation
This section describes how the model was evaluated and summarizes its performance across several metrics.
Testing Data, Factors & Metrics
Testing Data
Evaluation was conducted on a held-out set of 100 gloss–sentence pairs from the same distribution as the training data. Each example was formatted in [INST] ... [/INST] style, where the prompt is a gloss and the reference is the target English sentence.
Metrics
The following automated metrics were used to evaluate translation quality:
- BLEU-1 to BLEU-4: n-gram precision scores measuring lexical overlap
- ROUGE: measures n-gram recall overlap (ROUGE-1/2) and longest common subsequence (ROUGE-L)
- BERTScore: measures semantic similarity using contextual embeddings from a pre-trained BERT model
Outputs were sentence-truncated to match the number of sentences in the gloss prompt before evaluation to avoid verbosity bias.
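These metrics can be computed with the Hugging Face evaluate library. The snippet below is a sketch with placeholder prediction and reference lists, not the exact evaluation script used here.

import evaluate

predictions = ["Say, Jim, how about going for a few beers after dinner?"]   # model outputs (sentence-truncated)
references = ["Say, Jim, how about going for a few beers after dinner?"]    # gold English sentences

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# BLEU-n corresponds to max_order=n; ROUGE returns rouge1/rouge2/rougeL scores
print(bleu.compute(predictions=predictions, references=[[r] for r in references], max_order=4))
print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))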
Results
| Metric | Score |
|---|---|
| BLEU-1 | 0.2935 |
| BLEU-2 | 0.1992 |
| BLEU-3 | 0.1486 |
| BLEU-4 | 0.1041 |
| ROUGE-1 | 0.5698 |
| ROUGE-2 | 0.3675 |
| ROUGE-L | 0.5325 |
| BERTScore Precision | 0.4523 |
| BERTScore Recall | 0.3138 |
| BERTScore F1 | 0.3810 |
The model's BLEU scores reflect its tendency to generate semantically correct but lexically varied outputs; ROUGE and BERTScore better capture such paraphrased or fluent responses.
Summary
The fine-tuned model demonstrates:
- Moderate lexical overlap (BLEU), especially at the unigram level (BLEU-1)
- Strong structural and semantic match (ROUGE, BERTScore)
- A tendency to over-generate, controlled via sentence-count truncation
It is suitable for use in sign language translation support tools, especially where human review or post-editing is part of the pipeline.
Model Card Authors
Dongjun Kim
Model Card Contact