Khmer Homophone Corrector

Model Description

A fine-tuned PrahokBART model specifically designed for correcting homophones in Khmer text. This model builds upon PrahokBART, a pre-trained sequence-to-sequence model for Khmer natural language generation, and addresses the unique challenges of Khmer language processing, including word boundary issues and homophone confusion.

Intended Uses & Limitations

Intended Use Cases

  • Homophone Correction: Correcting commonly confused Khmer homophones in text
  • Educational Applications: Helping students learn proper Khmer spelling
  • Text Preprocessing: Improving text quality for downstream Khmer NLP tasks
  • Content Creation: Assisting writers in producing error-free Khmer content

Limitations

  • Language Specific: Only works with Khmer text
  • Homophone Focus: Designed specifically for homophone correction, not general grammar or spelling
  • Context Dependency: May require surrounding context for optimal corrections
  • Training Data Scope: Limited to the homophone pairs in the training dataset

Training and Evaluation Data

Training Data

  • Dataset: Custom Khmer homophone dataset
  • Size: 268+ homophone groups
  • Coverage: Common Khmer homophones across different word categories
  • Preprocessing: Word segmentation using Khmer NLP tools
  • Format: JSON with input-target pairs (a hypothetical example follows this list)
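
A hypothetical illustration of the input-target JSON format, reusing the homophone pair from the Usage section below; the field names are assumptions, not the dataset's documented schema:

[
  {
    "input": "ខ្ញុំកំពង់នូវសកលវិទ្យាល័យ",
    "target": "ខ្ញុំកំពុងនៅសកលវិទ្យាល័យ"
  }
]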

Evaluation Data

  • Test Set: Homophone pairs not seen during training
  • Metrics: BLEU score, WER, and human evaluation (a metric-computation sketch follows this list)
  • Validation: Cross-validation on homophone groups
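
A hedged sketch of how the automatic metrics could be computed, assuming the sacrebleu and jiwer packages (neither is named by this card) and whitespace-segmented text. The card does not state how BLEU-1 through BLEU-4 were derived; sacrebleu's n-gram precisions are one common proxy.

import sacrebleu
from jiwer import wer

# Toy data: model outputs vs. reference corrections (word-segmented Khmer)
hypotheses = ["ខ្ញុំ កំពុង នៅ សកលវិទ្យាល័យ"]
references = ["ខ្ញុំ កំពុង នៅ សកលវិទ្យាល័យ"]

# Corpus-level BLEU; .precisions holds the 1- to 4-gram precisions
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print("BLEU:", bleu.score, "n-gram precisions:", bleu.precisions)

# Word error rate over the segmented tokens
print("WER:", wer(references[0], hypotheses[0]))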

Data Preprocessing

  1. Word Segmentation: Using Khmer word tokenization (khmer_nltk.word_tokenize)
  2. Text Normalization: Standardizing text format with special tokens
  3. Special Tokens: Adding </s> <2km> for input and <2km> ... </s> for target
  4. Sequence Format: Converting to sequence-to-sequence format
  5. Padding: Max length 128 tokens with padding (see the pipeline sketch below)
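
A minimal sketch of steps 1-5, assuming the khmer_nltk tokenizer referenced above and mirroring the token conventions of the Usage section; the helper function is illustrative, not the actual training script:

from khmer_nltk import word_tokenize
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")

def build_example(input_text, target_text):
    # 1-2. Word segmentation and normalization to space-separated form
    src = " ".join(word_tokenize(input_text))
    tgt = " ".join(word_tokenize(target_text))
    # 3. Special tokens: "</s> <2km>" after the input, "<2km> ... </s>" around the target
    src = f"{src} </s> <2km>"
    tgt = f"<2km> {tgt} </s>"
    # 4-5. Sequence-to-sequence format, padded/truncated to 128 tokens
    example = tokenizer(src, max_length=128, padding="max_length", truncation=True)
    labels = tokenizer(tgt, max_length=128, padding="max_length", truncation=True)
    example["labels"] = labels["input_ids"]
    return example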

Training Results

Performance Metrics

  • BLEU-1 Score: 99.5398
  • BLEU-2 Score: 99.162
  • BLEU-3 Score: 98.8093
  • BLEU-4 Score: 98.4861
  • WER (Word Error Rate): 0.008
  • Human Evaluation Score: 0.008
  • Final Training Loss: 0.0091
  • Final Validation Loss: 0.023525

Training Analysis

The model demonstrates exceptional performance and training characteristics:

  • Rapid Convergence: Training loss decreased dramatically from 0.6786 in epoch 1 to 0.0091 in epoch 40, showing excellent learning progression
  • Stable Validation: Validation loss stabilized around 0.023 after epoch 15, indicating consistent generalization performance
  • Outstanding Accuracy: Achieved exceptional BLEU scores with BLEU-1 reaching 99.54% and BLEU-4 at 98.49%, demonstrating near-perfect homophone correction
  • Minimal Error Rate: WER of 0.008 indicates extremely low word error rate, making the model highly reliable for practical applications
  • No Overfitting: The small and consistent gap between training (0.0091) and validation loss (0.0235) suggests excellent generalization without overfitting
  • Early Performance: Remarkably, the model achieved its best BLEU scores and WER as early as epoch 1, indicating the effectiveness of the PrahokBART base model for Khmer homophone correction

Training Configuration

  • Base Model: PrahokBART (from nict-astrec-att/prahokbart_big)
  • Model Architecture: PrahokBART (Khmer-specific BART variant)
  • Training Framework: Hugging Face Transformers
  • Optimizer: AdamW
  • Learning Rate: 3e-5
  • Batch Size: 32 (per device)
  • Training Epochs: 40
  • Warmup Ratio: 0.1
  • Weight Decay: 0.01
  • Mixed Precision: FP16 enabled
  • Evaluation Strategy: Every epoch
  • Save Strategy: Every epoch (best 2 checkpoints)
  • Max Sequence Length: 128 tokens
  • Resume Training: Supported with checkpoint management (a configuration sketch follows this list)
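
For reference, a sketch of these hyperparameters expressed as Hugging Face Seq2SeqTrainingArguments; the output directory is illustrative, and on older transformers versions eval_strategy is spelled evaluation_strategy:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="khmer-homophone-corrector",  # illustrative path
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    num_train_epochs=40,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,                    # mixed precision
    eval_strategy="epoch",        # evaluate every epoch
    save_strategy="epoch",        # checkpoint every epoch
    save_total_limit=2,           # keep 2 checkpoints
    load_best_model_at_end=True,
)
# Resuming from a saved checkpoint is then trainer.train(resume_from_checkpoint=True)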

Usage

Basic Usage

from transformers import MBartForConditionalGeneration, AutoTokenizer
from khmer_nltk import word_tokenize
import torch

# Load model and tokenizer
model_name = "socheatasokhachan/khmerhomophonecorrector"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Example text with homophone errors (កំពង់ and នូវ should be កំពុង and នៅ)
text = "ខ្ញុំកំពង់នូវសកលវិទ្យាល័យ"

# Preprocess text (word segmentation)
segmented_text = " ".join(word_tokenize(text))

# Prepare input in the model's expected format
input_text = f"{segmented_text} </s> <2km>"
inputs = tokenizer(
    input_text,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128,  # matches the model's 128-token training limit
    add_special_tokens=True
)

# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate correction with beam search
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=128,  # matches the model's 128-token training limit
        num_beams=5,
        early_stopping=True,
        do_sample=False,
        no_repeat_ngram_size=3,
        forced_bos_token_id=32000,  # model-specific target-language token id
        forced_eos_token_id=32001,  # model-specific end-of-sequence token id
        length_penalty=1.0
    )

# Decode output
corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▂", " ").strip()

print(f"Original: {text}")
print(f"Corrected: {corrected}")
# Expected output: ខ្ញុំកំពុងនៅសកលវិទ្យាល័យ

Using with Streamlit

import streamlit as st
import torch
from transformers import MBartForConditionalGeneration, AutoTokenizer
from khmer_nltk import word_tokenize

@st.cache_resource
def load_model():
    model = MBartForConditionalGeneration.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    tokenizer = AutoTokenizer.from_pretrained("socheatasokhachan/khmerhomophonecorrector")
    return model, tokenizer

# Load model once per session
model, tokenizer = load_model()

# Streamlit interface
st.title("Khmer Homophone Corrector")
user_input = st.text_area("Enter Khmer text:")
if st.button("Correct") and user_input:
    # Process text with the same pipeline as Basic Usage, then display the result
    segmented = " ".join(word_tokenize(user_input))
    inputs = tokenizer(f"{segmented} </s> <2km>", return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=128, num_beams=5,
                                 early_stopping=True,
                                 forced_bos_token_id=32000,
                                 forced_eos_token_id=32001)
    corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
    corrected = corrected.replace("</s>", "").replace("<2km>", "").replace("▂", " ").strip()
    st.write(corrected)

Model Architecture

  • Base Model: PrahokBART (Khmer-specific BART variant)
  • Architecture: Sequence-to-Sequence Transformer
  • Parameters: 211M (F32, Safetensors format)
  • Max Sequence Length: 128 tokens
  • Special Features: Khmer word segmentation and normalization
  • Tokenization: SentencePiece with Khmer-specific preprocessing

Citation

If you use this model in your research, please cite:

@misc{sokhachan2024khmerhomophonecorrector,
  title={Khmer Homophone Corrector: A Fine-tuned PrahokBART Model for Khmer Text Correction},
  author={Socheata Sokhachan},
  year={2024},
  url={https://huggingface.co/socheatasokhachan/khmerhomophonecorrector}
}

Related Research

This model builds upon and fine-tunes the PrahokBART model:

PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation

Acknowledgments

  • The PrahokBART research team for the base model
  • Hugging Face for the transformers library
  • The Khmer NLP community for language resources
  • Streamlit for the web framework
  • Contributors to the Khmer language processing tools

Note: This model is specifically designed for Khmer language homophone correction and may not work optimally with other languages or tasks.
