---
license: apache-2.0
base_model: llama
library_name: transformers
pipeline_tag: text-generation
tags:
- one-way-polyglot
- japanese
- english
- bilingual
- small-model
---

# one-way-polyglot-12m-untied

A one-way polyglot language model trained to understand Japanese but generate only English.

## Model Details

- **Architecture**: LLaMA-based transformer
- **Parameters**: 12,714,240 (12.7M)
- **Vocabulary**: 16,384 tokens (bilingual SentencePiece)
- **Context Length**: 512 tokens
- **Embedding Strategy**: Untied (the input embedding matrix and output projection are separate weights; see the check below)
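
The untied setup and the parameter count are easy to confirm once the checkpoint is downloaded. A minimal sketch, assuming the model files are available under the same local name used in the Usage section:

```python
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("one-way-polyglot-12m-untied")

# Untied embeddings: the input embedding and LM head weights are distinct tensors.
print(model.config.tie_word_embeddings)  # expected: False
print(model.get_input_embeddings().weight is model.get_output_embeddings().weight)  # expected: False

# Total parameter count; expected to print 12714240.
print(sum(p.numel() for p in model.parameters()))
```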

## Capabilities

- **Semantic Transfer**: Understands Japanese input and generates contextually appropriate English
- **One-Way Constraint**: Strong bias toward English-only generation (a simple output check is sketched after this list)
- **Name Transliteration**: Can transliterate Japanese names to English (context-dependent)
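
The one-way constraint is a learned bias rather than a hard guarantee, so it can be worth spot-checking generated samples. One rough, illustrative check (the regex and helper below are an assumption of this card, not part of the model) is to scan outputs for Japanese characters:

```python
import re

# Hiragana, katakana, and common CJK ideograph ranges.
JAPANESE_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def contains_japanese(text: str) -> bool:
    """Return True if the text contains any Japanese characters."""
    return bool(JAPANESE_CHARS.search(text))

print(contains_japanese("There was a girl with a red umbrella."))  # False
print(contains_japanese("昔々、赤い傘を持った少女がいました。"))  # True
```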

## Training Data

Trained on bilingual Japanese-English story data with masked loss on Japanese prefixes to enforce one-way generation.
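
The training code itself is not part of this card; the following is only a minimal sketch of the prefix-masking idea, with an assumed helper name and data layout. The key detail is that label positions covering the Japanese prefix are set to -100, the index ignored by the cross-entropy loss, so gradients come only from the English continuation:

```python
import torch

def build_masked_example(tokenizer, japanese_prefix, english_target):
    # Tokenize prefix and target separately so the prefix boundary is known.
    prefix_ids = tokenizer(japanese_prefix, add_special_tokens=False).input_ids
    target_ids = tokenizer(english_target, add_special_tokens=False).input_ids

    input_ids = torch.tensor([prefix_ids + target_ids])
    labels = input_ids.clone()
    labels[:, : len(prefix_ids)] = -100  # ignored by the loss

    return {"input_ids": input_ids, "labels": labels}
```

Batching, padding, and special tokens are omitted here for brevity.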

## Usage

```python
from transformers import LlamaForCausalLM, AutoTokenizer

model = LlamaForCausalLM.from_pretrained("one-way-polyglot-12m-untied")
tokenizer = AutoTokenizer.from_pretrained("one-way-polyglot-12m-untied")

# Japanese input → English output (primary use case)
prompt = "昔々、赤い傘を持った少女がいました。"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Mixed-language name transliteration
prompt = "太郎は公園で花子と遊んでいました。After playing, Taro told Hanako that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# English text (handled via case folding)
prompt = "Hello World"  # Automatically normalized to lowercase
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Tokenizer Features

- **✅ Case Folding**: "Hello", "hello", and "HELLO" produce identical tokenization (spot-checked below)
- **✅ Japanese Support**: Full Japanese text support with proper normalization
- **✅ No UNK Tokens**: Proper handling of uppercase/lowercase English text
- **✅ SentencePiece Compatibility**: Built with a standard SentencePiece Unigram model and normalization
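
These properties can be verified directly with the tokenizer; the snippet below is a small, assumed example rather than an official test:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("one-way-polyglot-12m-untied")

# Case folding: different casings should map to identical token IDs.
variants = ["Hello", "hello", "HELLO"]
encodings = [tokenizer(v, add_special_tokens=False).input_ids for v in variants]
print(encodings[0] == encodings[1] == encodings[2])  # expected: True

# No UNK tokens should appear for plain English text.
print(tokenizer.unk_token_id in encodings[0])  # expected: False
```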

## Model Variants

This is part of a series exploring one-way polyglot capabilities:

- 1.25M parameters (tied embeddings)
- 8.5M parameters (tied embeddings)
- 12.7M parameters (untied embeddings, this model)
- 15.7M parameters (tied embeddings)

## License

Apache 2.0