Glossa-llama
This is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct, trained to translate sign language glosses (simplified textual representations of sign language) into fluent English sentences.
Model Description
- Base Model: LLaMA 3.2 1B Instruct
- Task: Sign2English translation (sequence generation)
- Fine-tuning method: LoRA (Parameter-Efficient Fine-Tuning)
- Trained using: Hugging Face PEFT, Transformers, and Colab
- Training Data: Custom dataset with gloss-style input and natural English reference output.
Each input was formatted like:
[INST] YOU WANT GO DRINK AFTER DINNER? [/INST] Say, Jim, how about going for a few beers after dinner?
Uses
Direct Use
This model is designed for direct use in translating sign language glosses (textual representations of signs) into fluent English sentences. Potential users include:
- Developers building sign-to-speech or sign-to-text applications
- Accessibility researchers and educators
- Students working on gesture-based NLP systems
Downstream Use
The model may be integrated into:
- Sign language translation pipelines (e.g., after sign recognition or gesture classification)
- Real-time assistive tools for Deaf/HoH communities
- Educational tools for ASL/BSL learners
Out-of-Scope Use
This model is not suitable for:
- Translating raw video of sign language (requires pre-processing via sign language recognition)
- Legal, medical, or safety-critical contexts
- High-stakes decisions based solely on sign interpretation
- Non-gloss input (e.g., full English sentences)
Bias, Risks, and Limitations
This model:
- May hallucinate or over-generalize responses for unfamiliar gloss inputs
- Was trained on synthetic or simplified gloss–English pairs and may not capture real-world nuance
- Is not aware of cultural or regional sign language variations (e.g., ASL vs. BSL)
- Should not be used to replace qualified interpreters or for legally binding communication
The model sometimes generates overly verbose outputs, including repetitive or semantically redundant content.
To control this, we applied sentence-count-based truncation post-generation. Future versions could benefit from length-aware decoding strategies or additional fine-tuning.
Recommendations
- Always review outputs manually before use in production or public settings.
- Use this model as a drafting assistant, not a definitive translator.
- Combine this model with sign language recognition models (e.g., operating on video) for full translation pipelines.
- Include human-in-the-loop review when deployed in accessibility applications.
How to Get Started with the Model
Use the following code to get started with the model:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "rrrr66254/llama3.2_1B_sign2eng_finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)

# Wrap the gloss in the same instruction format used during fine-tuning
prompt = "[INST] YOU WANT GO DRINK AFTER DINNER? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,          # required for temperature/top_p sampling to take effect
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.5,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Controlling Output Length (Optional)
This model occasionally generates longer or more verbose outputs than expected. To ensure the number of sentences in the output matches the number of sentences in the input gloss, use the following helper functions:
import re

def count_sentences(text):
    # Count sentence-ending punctuation marks
    return len(re.findall(r'[.!?]', text))

def truncate_to_n_sentences(text, n):
    # Split on sentence boundaries and keep only the first n sentences
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return " ".join(sentences[:n]).strip()

def generate_response(prompt, max_new_tokens=100):
    full_prompt = f"[INST] {prompt.strip()} [/INST]"
    # Keep at least one sentence even if the gloss has no ending punctuation
    num_sentences_in_prompt = max(count_sentences(prompt), 1)

    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.8,
        temperature=0.6,
        repetition_penalty=1.5,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Remove the prompt portion from the decoded output
    if "[/INST]" in decoded:
        decoded = decoded.split("[/INST]")[-1].strip()

    # Truncate to the same number of sentences as the input gloss
    return truncate_to_n_sentences(decoded, num_sentences_in_prompt)
This step is optional, but recommended if you want tighter control over output length and style.
Training Details
Training Data
The model was trained on a custom dataset of sign language glosses paired with natural English translations. Each entry was structured as:
{
"text": "[INST] GLOSS SENTENCE HERE [/INST] Fluent English output"
}
Gloss-style inputs capture the meaning of signed utterances without detailed grammatical markers. The dataset simulates real-world sign-to-English mappings.
Training Procedure
The model was fine-tuned using LoRA (Low-Rank Adaptation) via the Hugging Face peft library. The instruction-tuning format ([INST] ... [/INST]) was preserved.
Preprocessing
During preprocessing:
- Each example was split into an input (prompt) and output (target) around the [/INST] marker
- Loss masking was applied to ignore the prompt portion during training, i.e. labels[:input_len] = [-100] * input_len (see the sketch below)
- Maximum sequence length was set to 512 tokens
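The loss-masking step could be implemented roughly as follows. This is a minimal sketch, not the original training script; the preprocess function name and the exact tokenizer handling are illustrative assumptions.

def preprocess(example, tokenizer, max_length=512):
    # Split the formatted text into prompt (up to and including [/INST]) and response
    prompt, response = example["text"].split("[/INST]", 1)
    prompt = prompt + "[/INST]"

    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    full_ids = tokenizer(prompt + response, add_special_tokens=False,
                         truncation=True, max_length=max_length)["input_ids"]

    # Labels start as a copy of the input ids; the prompt portion is then
    # masked with -100 so only the English response contributes to the loss
    labels = full_ids.copy()
    input_len = min(len(prompt_ids), len(full_ids))
    labels[:input_len] = [-100] * input_len

    return {"input_ids": full_ids, "labels": labels}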
Training Hyperparameters
- Base model: Meta-LLaMA-3.2-1B-Instruct
- Method: LoRA (r=8, alpha=16, dropout=0.1)
- Epochs: 3
- Optimizer: AdamW
- Learning rate: 2e-4
- Batch size: 3
- Gradient accumulation steps: 4
- Warmup steps: 100
- Logging steps: 500
- Save strategy: per epoch
- Precision: fp16
Training was performed using the Hugging Face Trainer API in Colab.
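The snippet below reconstructs that setup from the hyperparameters listed above. It is a sketch under those stated assumptions, not the original notebook; base_model and train_dataset are placeholders for the loaded Llama-3.2-1B-Instruct model and the preprocessed dataset.

from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments, Trainer

lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)  # base_model: Llama-3.2-1B-Instruct

training_args = TrainingArguments(
    output_dir="glossa-llama-lora",   # placeholder output path
    num_train_epochs=3,
    learning_rate=2e-4,
    per_device_train_batch_size=3,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    logging_steps=500,
    save_strategy="epoch",
    fp16=True,                        # AdamW is the Trainer's default optimizer
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()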
Speeds, Sizes, Times
- Approx. training time: ~45 minutes on an A100 (depends on batch size and LoRA setup)
- Checkpoint size: ~350MB (LoRA adapter only)
- Final merged model size: ~4.6GB (adapter folded into the base weights; see the sketch below)
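If you start from the adapter checkpoint rather than the merged weights, the adapter can be folded into the base model with peft. This is a sketch assuming the published repository contains the LoRA adapter; if it already holds merged weights, this step is unnecessary.

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
# Load the LoRA adapter on top of the base model, then fold it into the weights
merged = PeftModel.from_pretrained(base, "rrrr66254/llama3.2_1B_sign2eng_finetuned").merge_and_unload()
merged.save_pretrained("glossa-llama-merged")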
Evaluation
This section describes how the model was evaluated and summarizes its performance across several metrics.
Testing Data, Factors & Metrics
Testing Data
Evaluation was conducted on a held-out set of 100 gloss–sentence pairs from the same distribution as the training data. Each example was formatted in [INST] ... [/INST] style, where the prompt is a gloss and the reference is the target English sentence.
Metrics
The following automated metrics were used to evaluate translation quality:
- BLEU-1 to BLEU-4: n-gram precision scores measuring lexical overlap
- ROUGE: measures n-gram recall overlap (ROUGE-1/2) and longest common subsequence (ROUGE-L)
- BERTScore: measures semantic similarity using contextual embeddings from a pre-trained BERT model
Outputs were sentence-truncated to match the number of sentences in the gloss prompt before evaluation to avoid verbosity bias.
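These metrics can be computed with the Hugging Face evaluate library. The snippet below is a sketch with placeholder prediction and reference lists, not the exact evaluation script used here.

import evaluate

predictions = ["Say, Jim, how about going for a few beers after dinner?"]   # model outputs (sentence-truncated)
references = ["Say, Jim, how about going for a few beers after dinner?"]    # gold English sentences

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# BLEU-n corresponds to max_order=n; ROUGE returns rouge1/rouge2/rougeL scores
print(bleu.compute(predictions=predictions, references=[[r] for r in references], max_order=4))
print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))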
Results
| Metric | Score |
|---|---|
| BLEU-1 | 0.2935 |
| BLEU-2 | 0.1992 |
| BLEU-3 | 0.1486 |
| BLEU-4 | 0.1041 |
| ROUGE-1 | 0.5698 |
| ROUGE-2 | 0.3675 |
| ROUGE-L | 0.5325 |
| BERTScore Precision | 0.4523 |
| BERTScore Recall | 0.3138 |
| BERTScore F1 | 0.3810 |
The model's BLEU scores reflect its tendency to generate semantically correct but lexically varied outputs; ROUGE and BERTScore better capture such paraphrased or fluent responses.
Summary
The fine-tuned model demonstrates:
- Moderate lexical overlap (BLEU), especially at the unigram level (BLEU-1)
- Strong structural and semantic match (ROUGE, BERTScore)
- A tendency to over-generate, controlled via sentence-count truncation
It is suitable for use in sign language translation support tools, especially where human review or post-editing is part of the pipeline.
Model Card Authors
Dongjun Kim
Model Card Contact