NLLB-350M-EN-KM-v1

Model Description

NLLB-350M-EN-KM-v1 is a compact English-to-Khmer neural machine translation model distilled from NLLB-200. This proof-of-concept release was trained for a single epoch to show that the distillation approach is feasible.

Developed by: Chealyfey Vutha
Model type: Sequence-to-sequence transformer for machine translation
Language(s): English to Khmer (en → km)
License: CC-BY-NC 4.0
Base model: facebook/nllb-200-distilled-600M
Teacher model: facebook/nllb-200-1.3B
Parameters: 350M (42% reduction from 600M baseline)

Model Details

Architecture

Encoder layers: 3 (reduced from 12)
Decoder layers: 3 (reduced from 12)
Hidden size: 1024
Attention heads: 16
Total parameters: ~350M
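
The ~350M figure can be sanity-checked from the layer counts above. The sketch below is a back-of-the-envelope count, not the author's script: the vocabulary size (256,206), feed-forward size (4,096), shared embeddings, and parameter-free sinusoidal positions are assumptions based on the public NLLB-200 configuration and are not stated in this card.

```python
# Rough parameter count for an NLLB-style (M2M100) encoder-decoder.
# Assumed, not stated in this card: vocab size 256,206, FFN size 4,096,
# shared embeddings, sinusoidal (parameter-free) position encodings.

def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of one dense layer."""
    return n_in * n_out + n_out


def count_parameters(enc_layers: int, dec_layers: int, d_model: int = 1024,
                     ffn_dim: int = 4096, vocab_size: int = 256_206) -> int:
    attention = 4 * linear(d_model, d_model)   # q, k, v and output projections
    feed_forward = linear(d_model, ffn_dim) + linear(ffn_dim, d_model)
    layer_norm = 2 * d_model                   # scale and bias

    encoder_layer = attention + feed_forward + 2 * layer_norm
    # A decoder layer adds cross-attention and a third layer norm.
    decoder_layer = 2 * attention + feed_forward + 3 * layer_norm

    embeddings = vocab_size * d_model          # shared by encoder, decoder, output head
    final_norms = 2 * layer_norm               # one after each stack
    return (embeddings + enc_layers * encoder_layer
            + dec_layers * decoder_layer + final_norms)


student = count_parameters(enc_layers=3, dec_layers=3)
embedding_share = (256_206 * 1024) / student
print(f"student parameters: {student / 1e6:.1f}M")
print(f"share held by the embedding table: {embedding_share:.0%}")
```

Under these assumptions most of the student's parameters sit in the embedding table, which is why cutting each stack from 12 layers to 3 removes less than half of the total.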

Training Procedure

Distillation method: Temperature-scaled knowledge distillation
Teacher model: NLLB-200-1.3B
Temperature: 5.0
Lambda (loss weighting): 0.5
Training epochs: 1 (proof of concept)
Training data: 316,110 English-Khmer pairs (generated via DeepSeek API)
Hardware: NVIDIA A100-SXM4-80GB
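
The card gives the temperature and lambda but not the exact loss, so the following is a minimal sketch assuming the common formulation: a hard-label cross-entropy term weighted by lambda, plus a temperature-softened KL term weighted by (1 - lambda) and scaled by T². Which term lambda multiplies, and the T² factor, are assumptions. It is written in plain Python for a single token position so the arithmetic is visible; a training loop would use the tensor equivalents.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                          # subtract the max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=5.0, lam=0.5):
    """Assumed form: lam * CE(student, target) + (1 - lam) * T^2 * KL(teacher || student)."""
    # Hard-label term: cross-entropy against the reference token at T = 1.
    hard = -math.log(softmax(student_logits)[target_index])

    # Soft term: KL divergence between temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(p * math.log(p / q)
               for p, q in zip(p_teacher, p_student) if p > 0)

    return lam * hard + (1 - lam) * temperature ** 2 * soft


# Toy three-token vocabulary; the logits are made up for illustration.
loss = distillation_loss(student_logits=[1.0, 0.5, -0.2],
                         teacher_logits=[2.0, 0.1, -1.0],
                         target_index=0)
print(f"loss = {loss:.4f}")
```

A temperature of 5.0 flattens the teacher distribution considerably, so the student also receives signal from tokens the teacher ranks below its top choice.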

Intended Uses

Direct Use

This model is intended for:

English-to-Khmer translation tasks
Research on knowledge distillation for low-resource languages
Proof-of-concept demonstrations
Computational efficiency research

Downstream Use

Integration into translation applications
Fine-tuning for domain-specific translation
Baseline for further model compression research

How to Get Started with the Model


from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

# Configuration
CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v1",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",
    "target_lang": "khm_Khmr",
    "max_length": 128
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"]
)

# Set up generation configuration
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id
)

# Translate
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
# generate() returns a batch of sequences; decode the first (and only) one
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)

Training Details

Training Data

Dataset size: 316,110 English-Khmer sentence pairs
Data source: Synthetic data generated using DeepSeek translation API
Preprocessing: Tokenized using NLLB-200 tokenizer with max length 128

Training Hyperparameters

Batch size: 48
Learning rate: 3e-5
Optimizer: AdamW
LR scheduler: Cosine
Training epochs: 1
Hardware: NVIDIA A100-SXM4-80GB with CUDA 12.8
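
To make the schedule concrete, the sketch below derives the number of optimizer steps in one epoch and evaluates a cosine decay from the stated peak learning rate. It assumes that 48 is the effective batch size (no gradient accumulation), that there is no warmup, and that the rate decays to zero; none of these are stated in the card.

```python
import math

DATASET_SIZE = 316_110   # sentence pairs, from the card
BATCH_SIZE = 48          # assumed to be the effective batch size
PEAK_LR = 3e-5
EPOCHS = 1

# The last batch is partial, so round up.
total_steps = math.ceil(DATASET_SIZE / BATCH_SIZE) * EPOCHS


def cosine_lr(step, total_steps, peak_lr=PEAK_LR, warmup_steps=0):
    """Learning rate at `step`: optional linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))


print(f"optimizer steps: {total_steps}")
for step in (0, total_steps // 2, total_steps):
    print(f"step {step:>5}: lr = {cosine_lr(step, total_steps):.2e}")
```

With a single epoch the rate falls to half its peak at the midpoint of training, so the second half of the data is seen at a noticeably lower learning rate than the first.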

Evaluation

Testing Data

The model was evaluated on the Asian Language Treebank (ALT) corpus, containing manually translated English-Khmer pairs.

Metrics

Metric	Value
chrF Score	21.3502
BERTScore F1	0.8983
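
For readers unfamiliar with chrF, the sketch below shows the idea behind the metric: an F-score over character n-grams, which suits Khmer because the script does not separate words with spaces. This is a simplified, sentence-level illustration and not the evaluation code used for this card. The n-gram order (6), beta (2), and macro-averaging over orders are assumptions, and it works on Unicode code points rather than grapheme clusters, so its scores may differ from those of a standard tool such as sacreBLEU.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified chrF on a 0-100 scale (illustrative, not sacreBLEU-exact)."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue                            # text too short for this order
        overlap = sum((hyp & ref).values())     # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# English strings are used here only to keep the example readable.
print(simple_chrf("the cat sat on the mat", "the cat sat on a mat"))
```

Assuming the reported 21.35 is on this 0 to 100 scale, a score of 100 would mean the output matches the reference at every n-gram order.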

Results

After a single epoch of distillation, the 350M-parameter student reaches a chrF of 21.35 and a BERTScore F1 of 0.898 on ALT. This suggests that knowledge distillation is a feasible route to English-Khmer translation at roughly 58% of the 600M baseline's parameter count.

Limitations and Bias

Limitations

Limited training: Only 1 epoch of training; performance may improve with extended training
Synthetic data: Training data generated via API may not capture all linguistic nuances
Domain specificity: Performance may vary across different text domains
Resource constraints: Optimized for efficiency over maximum quality

Bias Considerations

Training data generated via translation API may inherit biases from the source model
Limited evaluation on diverse Khmer dialects and registers
Potential cultural and contextual biases in translation choices

Citation

@misc{nllb350m_en_km_v1_2025,
  title={NLLB-350M-EN-KM-v1: Proof of Concept English-Khmer Neural Machine Translation via Knowledge Distillation},
  author={Chealyfey Vutha},
  year={2025},
  url={https://huggingface.co/lyfeyvutha/nllb_350M_en_km_v1}
}

Model Card Contact

For questions or feedback about this model card: [email protected]
