---
license: apache-2.0
base_model: llama
library_name: transformers
pipeline_tag: text-generation
tags:
- one-way-polyglot
- japanese
- english
- bilingual
- small-model
---

# one-way-polyglot-12m-untied

A one-way polyglot language model trained to understand Japanese but generate only English.

## Model Details

- **Architecture**: LLaMA-based transformer
- **Parameters**: 12,714,240 (12.7M)
- **Vocabulary**: 16,384 tokens (bilingual SentencePiece)
- **Context Length**: 512 tokens
- **Embedding Strategy**: Untied (the input embedding matrix and output projection are separate weights; see the check below)
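
The untied setup and the parameter count are easy to confirm once the checkpoint is downloaded. A minimal sketch, assuming the model files are available under the same local name used in the Usage section:

```python
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("one-way-polyglot-12m-untied")

# Untied embeddings: the input embedding and LM head weights are distinct tensors.
print(model.config.tie_word_embeddings)  # expected: False
print(model.get_input_embeddings().weight is model.get_output_embeddings().weight)  # expected: False

# Total parameter count; expected to print 12714240.
print(sum(p.numel() for p in model.parameters()))
```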

## Capabilities

- **Semantic Transfer**: Understands Japanese input and generates contextually appropriate English
- **One-Way Constraint**: Strong bias toward English-only generation (a simple output check is sketched after this list)
- **Name Transliteration**: Can transliterate Japanese names to English (context-dependent)
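
The one-way constraint is a learned bias rather than a hard guarantee, so it can be worth spot-checking generated samples. One rough, illustrative check (the regex and helper below are an assumption of this card, not part of the model) is to scan outputs for Japanese characters:

```python
import re

# Hiragana, katakana, and common CJK ideograph ranges.
JAPANESE_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def contains_japanese(text: str) -> bool:
    """Return True if the text contains any Japanese characters."""
    return bool(JAPANESE_CHARS.search(text))

print(contains_japanese("There was a girl with a red umbrella."))  # False
print(contains_japanese("昔々、赤い傘を持った少女がいました。"))  # True
```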

## Training Data

Trained on bilingual Japanese-English story data with masked loss on Japanese prefixes to enforce one-way generation.
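
The training code itself is not part of this card; the following is only a minimal sketch of the prefix-masking idea, with an assumed helper name and data layout. The key detail is that label positions covering the Japanese prefix are set to -100, the index ignored by the cross-entropy loss, so gradients come only from the English continuation:

```python
import torch

def build_masked_example(tokenizer, japanese_prefix, english_target):
    # Tokenize prefix and target separately so the prefix boundary is known.
    prefix_ids = tokenizer(japanese_prefix, add_special_tokens=False).input_ids
    target_ids = tokenizer(english_target, add_special_tokens=False).input_ids

    input_ids = torch.tensor([prefix_ids + target_ids])
    labels = input_ids.clone()
    labels[:, : len(prefix_ids)] = -100  # ignored by the loss

    return {"input_ids": input_ids, "labels": labels}
```

Batching, padding, and special tokens are omitted here for brevity.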

## Usage

```python
from transformers import LlamaForCausalLM, AutoTokenizer

model = LlamaForCausalLM.from_pretrained("one-way-polyglot-12m-untied")
tokenizer = AutoTokenizer.from_pretrained("one-way-polyglot-12m-untied")

# Japanese input → English output (primary use case)
prompt = "昔々、赤い傘を持った少女がいました。"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Mixed-language name transliteration
prompt = "太郎は公園で花子と遊んでいました。After playing, Taro told Hanako that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# English text (handled via case folding)
prompt = "Hello World"  # Automatically normalized to lowercase
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Tokenizer Features

- **✅ Case Folding**: "Hello", "hello", and "HELLO" produce identical tokenization (spot-checked below)
- **✅ Japanese Support**: Full Japanese text support with proper normalization
- **✅ No UNK Tokens**: Proper handling of uppercase/lowercase English text
- **✅ SentencePiece Compatibility**: Built with a standard SentencePiece Unigram model and normalization
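
These properties can be verified directly with the tokenizer; the snippet below is a small, assumed example rather than an official test:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("one-way-polyglot-12m-untied")

# Case folding: different casings should map to identical token IDs.
variants = ["Hello", "hello", "HELLO"]
encodings = [tokenizer(v, add_special_tokens=False).input_ids for v in variants]
print(encodings[0] == encodings[1] == encodings[2])  # expected: True

# No UNK tokens should appear for plain English text.
print(tokenizer.unk_token_id in encodings[0])  # expected: False
```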

## Model Variants

This is part of a series exploring one-way polyglot capabilities:

- 1.25M parameters (tied embeddings)
- 8.5M parameters (tied embeddings)
- 12.7M parameters (untied embeddings, this model)
- 15.7M parameters (tied embeddings)

## License

Apache 2.0