TaiPhone: A Phone-Scale LLM Rooted in Taiwanese Knowledge
TaiPhone is a low-cost, lightweight language model built for Traditional Chinese, with a strong focus on Taiwanese language, culture, and context. Trained on just 0.7 billion carefully curated tokens and enhanced with the chat vector technique, TaiPhone outperforms similarly sized open-source LLaMA-based models at the 1B and 3B scales. TaiPhone shows that, with the right data, effective and culturally aware models can be built at a fraction of the usual cost.
Model Information
- Base model: https://huggingface.co/meta-llama/Llama-3.2-3B
- Context length: 16k
- Training details:
  - Number of tokens: 0.7B
  - Continual pretraining (CP) epochs: 2
  - Fine-tuning (FT) epochs: 3
  - CP learning rate: 5e-5 with a cosine scheduler
  - FT learning rate: 1e-5 with a cosine scheduler
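As a rough illustration, the per-step learning rate under a cosine schedule like the one above can be computed as follows (the peak rates come from the training details; the optional linear warmup is an assumption for illustration, since the card only specifies the peak rates and cosine decay):

```python
import math

def cosine_lr(step, total_steps, peak_lr, min_lr=0.0, warmup=0):
    """Cosine-decayed learning rate with an optional linear warmup phase."""
    if warmup and step < warmup:
        return peak_lr * step / warmup  # linear ramp up to the peak rate
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# CP stage would use peak_lr=5e-5, FT stage peak_lr=1e-5
```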
Training Process
We first curate a set of high-quality articles to serve as continual pretraining data for model training. The resulting model is referred to as the CP model. Next, we enhance the CP model with conversational abilities using a technique called chat vector. This approach grants the model a certain level of chat capability without the need for preference alignment methods like DPO or RLHF, and it also helps maintain strong English proficiency.
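The chat-vector operation itself is simple weight arithmetic: add the difference between an instruction-tuned checkpoint and its base model to the CP model's weights. A minimal sketch over plain state dicts (in practice the checkpoints would be loaded with `transformers`; keeping CP weights for non-shared tensors is an assumption):

```python
def apply_chat_vector(cp_state, base_state, chat_state):
    """Return cp + (chat - base) for every parameter shared by all three
    checkpoints. Parameters missing from either delta operand (e.g. resized
    embeddings) keep the CP weights unchanged."""
    merged = {}
    for name, cp_w in cp_state.items():
        if name in base_state and name in chat_state:
            merged[name] = cp_w + (chat_state[name] - base_state[name])
        else:
            merged[name] = cp_w
    return merged
```

The same arithmetic applies elementwise to `torch.Tensor` weights when operating on real checkpoints.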
However, this chat model has some limitations; for example, it occasionally mixes Chinese and English inappropriately. To address this, we prepare an additional set of fine-tuning data to train the final version of the model, referred to as the Final Model.
Finally, we conduct evaluations and use the results to refine both the CP and FT corpora used earlier in the process.
Benchmark
- Evaluation code can be found here: https://github.com/aqweteddy/TaiphoneEval
MCQ Evaluation
- The model is prompted to answer each multiple-choice question in free-form, without being constrained to a specific format.
- A lightweight LLM (e.g., GPT-4.1-nano) is then used to extract the model’s final selected option from its response.
- Accuracy is calculated by comparing the extracted answers against the correct choices.
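The three-step pipeline above can be sketched as follows; the regex extractor here is a simplified stand-in for the lightweight LLM (e.g., GPT-4.1-nano) that parses the model's free-form response:

```python
import re

def extract_choice(free_form_answer):
    """Stand-in for the LLM extractor: pull the last standalone A-D option
    mentioned in the model's free-form response, or None if none is found."""
    matches = re.findall(r"\b([A-D])\b", free_form_answer.upper())
    return matches[-1] if matches else None

def mcq_accuracy(responses, gold):
    """Compare extracted options against the correct choices."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)
```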
Score Board
- 1B Scale
| Model | TW-MCQ | MMLU-Redux |
|---|---|---|
| LLaMA3.2-1B-Instruct | 0.305 | 0.403 |
| LLaMA3.2-1B-it-chinese-kyara | 0.360 | 0.405 |
| LLaMA3.2-TaiPhone-1B-Instruct-v0.1 (Ours) | 0.375 | 0.421 |
- 3B Scale
| Model | TW-MCQ | MMLU-Redux |
|---|---|---|
| LLaMA3.2-3B-Instruct | 0.442 | 0.569 |
| LLaMA3.2-3B-it-chinese-kyara | 0.462 | 0.405 |
| Llama-3.2-3B-F1-Instruct | 0.458 | 0.548 |
| LLaMA3.2-TaiPhone-3B-Instruct-v0.1 (Ours) | 0.502 | 0.578 |
| LLaMA3.2-TaiPhone-3B-Instruct-v0.2 (Ours) | 0.515 | 0.557 |
- TW-MCQ: aqweteddy/Taiwan-Curlture-MCQ
- MMLU-Redux: https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux https://huggingface.co/datasets/aqweteddy/MMLU-Redux-MCQ
MT-Bench-Zhtw
LLM as a Judge
- Dataset source
- Evaluation covers eight aspects of conversational performance: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities.
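Single-answer LLM-as-a-judge grading in the MT-Bench style can be sketched as below; the prompt wording and score parsing are illustrative assumptions, not the exact template used by the evaluation code:

```python
import re

# Illustrative judge prompt (an assumption, not the actual template).
JUDGE_PROMPT = (
    "Rate the assistant's reply on a 1-10 scale for helpfulness, relevance, "
    "accuracy, and level of detail. Respond with only the number.\n"
    "[Question]\n{question}\n[Assistant's Answer]\n{answer}"
)

def judge_score(call_judge, question, answer):
    """call_judge: any function that sends a prompt string to the judge LLM
    and returns its text reply. Returns the first integer in that reply."""
    reply = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else 0
```

Per-category scores in the table below would then be averages of `judge_score` over each category's questions.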
Score Board
- 3B Scale
- Llama-3.2-TaiPhone-3B-Instruct-v0.2 improves markedly over v0.1 on the Roleplay and Extraction tasks.
| Model | writing | roleplay | reasoning | math | coding | extraction | stem | humanities |
|---|---|---|---|---|---|---|---|---|
| Llama-3.2-3B-Instruct | 4.2 | 3.9 | 4.1 | 4.3 | 4.9 | 3.8 | 4.0 | 4.3 |
| Llama-3.2-3B-F1-Instruct | 5.5 | 6.9 | 4.2 | 3.9 | 3.8 | 4.7 | 5.2 | 7.6 |
| Llama-3.2-Kyara-3B-it | 5.7 | 7.2 | 4.8 | 6.3 | 5.2 | 5.3 | 5.9 | 7.5 |
| Llama-3.2-TaiPhone-3B-Instruct-v0.1 (Ours) | 5.5 | 5.8 | 4.9 | 5.0 | 5.0 | 3.8 | 4.5 | 7.3 |
| Llama-3.2-TaiPhone-3B-Instruct-v0.2 (Ours) | 5.0 | 6.7 | 4.5 | 4.0 | 5.0 | 5.2 | 5.0 | 7.7 |