---
language:
- vi
tags:
- translation
license: mit
widget:
- text: "𡦂才𡦂命窖󰑼恄饒"
inference:
  parameters:
    max_length: 48
pipeline_tag: translation
library_name: transformers
---

# Bidirectional Vietnamese Nôm Transliteration

Vietnamese Nôm, or Chữ Nôm, was the writing system used in Vietnam before the 20th century. It evolved from Chinese characters but was adapted to Vietnamese sounds and vocabulary. Nôm was used by scholars for literature and communication. The script differed visually from Chinese characters and expressed Vietnamese concepts through semantic and phonetic components. Today, Nôm is a specialized field of study, and efforts are being made to preserve knowledge of it. Though modern Vietnamese uses the Latin alphabet, Nôm remains an integral part of Vietnam's cultural heritage.

## Model description

A lightweight, state-of-the-art pretrained Transformer-based encoder-decoder model for Vietnamese Nôm translation. The model was trained on a dataset drawn from the *Luc Van Tien* book, the *Tale of Kieu*, the *History of Greater Vietnam*, the *Chinh Phu Ngam Khuc* poems, the *Ho Xuan Huong* poems, corpus documents from chunom.org, and sample texts from 130 different books (Tu-Dien-Chu-Nom-Dan-Giai).

The model supports bidirectional translation between Vietnamese Nôm script and Vietnamese Latin script, i.e., from Nôm to the Latin script and vice versa.

## How to use

The snippets below assume a CUDA GPU; a device-agnostic sketch is given at the end of this card.

Nôm script to Vietnamese Latin script:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("minhtoan/t5-translate-vietnamese-nom")
model = AutoModelForSeq2SeqLM.from_pretrained("minhtoan/t5-translate-vietnamese-nom")
model.cuda()
model.eval()

src = "如梅早杏遲管"
tokenized_text = tokenizer.encode(src, return_tensors="pt").cuda()
translate_ids = model.generate(tokenized_text, max_length=48)
output = tokenizer.decode(translate_ids[0], skip_special_tokens=True)
output
```

'như mai tảo hạnh trì quán'

Vietnamese Latin script to Nôm script:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("minhtoan/t5-translate-vietnamese-nom")
model = AutoModelForSeq2SeqLM.from_pretrained("minhtoan/t5-translate-vietnamese-nom")
model.cuda()
model.eval()

src = "như mai tảo hạnh trì quán"
tokenized_text = tokenizer.encode(src, return_tensors="pt").cuda()
translate_ids = model.generate(tokenized_text, max_length=48)
output = tokenizer.decode(translate_ids[0], skip_special_tokens=True)
output
```

'如梅早杏遲舘'

## Author

`Phan Minh Toan`
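
## Running on CPU or GPU

The usage examples above require a CUDA device. The following is a minimal sketch, not part of the published examples, that picks a device automatically using only the standard `transformers` and `torch` APIs; the `transliterate` helper is a hypothetical name introduced here for illustration. Because the same checkpoint handles both directions, the translation direction follows from the script of the input text.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Pick a GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("minhtoan/t5-translate-vietnamese-nom")
model = AutoModelForSeq2SeqLM.from_pretrained("minhtoan/t5-translate-vietnamese-nom").to(device)
model.eval()

def transliterate(text: str, max_length: int = 48) -> str:
    # Hypothetical helper (not part of the model's API): the same model handles
    # both directions, so the direction is determined by the input script.
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        ids = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(transliterate("如梅早杏遲管"))               # expected: 'như mai tảo hạnh trì quán'
print(transliterate("như mai tảo hạnh trì quán"))  # expected: '如梅早杏遲舘'
```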