# Intelligent Tokenizer v6.0 - Language Pattern Learning

## Model Description

World's first language pattern learning tokenizer - it discovers each language's unique patterns purely through learning, with no predefined vocabulary.
## Key Features

- No vocabulary files - only 260 fixed byte-level IDs (see the byte-ID sketch after this list)
- Language pattern discovery - learns Korean particles, English morphology, and Chinese characters
- Equal language processing - no English bias
- Semantic unit preservation - keeps meaning units intact
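A minimal sketch of what a vocabulary-free, byte-level ID space looks like. The split into 4 special tokens plus 256 byte values (260 IDs total) and the specific ID layout are assumptions for illustration, not the model's confirmed mapping.

```python
# Assumed layout: 4 special tokens + 256 raw byte values = 260 fixed IDs.
# Illustrative only; the released model's actual ID assignment may differ.
PAD, BOS, EOS, UNK = 0, 1, 2, 3
BYTE_OFFSET = 4  # byte value b maps to ID b + 4

def text_to_ids(text: str) -> list[int]:
    """Encode text as raw UTF-8 bytes shifted past the special-token IDs."""
    return [BOS] + [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS]

print(text_to_ids("안녕"))  # each Korean syllable is 3 UTF-8 bytes, so 6 byte IDs plus BOS/EOS
```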
## Performance (Epoch 22)

| Language Group | Character Accuracy |
|---|---|
| English/European | 95-100% |
| Korean | 70% |
| Japanese | 81% |
| Chinese | 7% (still learning) |
| Rare Languages | 47% avg |
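The table above reports per-character reconstruction accuracy. A minimal sketch of how such a metric can be computed follows; the exact scoring used for these numbers is not specified in this card, so this alignment-based version is an assumption.

```python
from difflib import SequenceMatcher

def char_accuracy(reference: str, reconstruction: str) -> float:
    """Fraction of reference characters recovered, via longest-matching-block alignment."""
    if not reference:
        return 1.0
    matcher = SequenceMatcher(None, reference, reconstruction)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)

print(char_accuracy("안녕하세요", "안녕하세요"))  # 1.0 (perfect reconstruction)
print(char_accuracy("안녕하세요", "안녕하서요"))  # 0.8 (one character wrong)
```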
## Technical Details

- Architecture: 5-layer Encoder + 6-layer Decoder (see the architecture sketch after this list)
- Parameters: 105M
- Input: Raw UTF-8 bytes
- Output: Compressed semantic units
- Training: 22 epochs on Flores-200 dataset
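A rough PyTorch skeleton of the encoder-decoder stack listed above. The hidden size, number of attention heads, and feed-forward width are assumptions chosen only to make the sketch concrete; they are not the released model's configuration.

```python
import torch.nn as nn

class ByteEncoderDecoder(nn.Module):
    """Skeleton matching the listed layer counts: 5 encoder + 6 decoder Transformer layers.
    d_model / nhead / dim_feedforward are illustrative assumptions, not the released config."""

    def __init__(self, vocab_size=260, d_model=768, nhead=12, dim_feedforward=3072):
        super().__init__()
        self.byte_embedding = nn.Embedding(vocab_size, d_model)  # 260 fixed byte-level IDs
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward, batch_first=True),
            num_layers=5,
        )
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward, batch_first=True),
            num_layers=6,
        )
        self.byte_head = nn.Linear(d_model, vocab_size)  # predicts bytes for reconstruction

    def forward(self, src_ids, tgt_ids):
        memory = self.encoder(self.byte_embedding(src_ids))        # compressed semantic units
        hidden = self.decoder(self.byte_embedding(tgt_ids), memory)
        return self.byte_head(hidden)
```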
## Usage

```python
from transformers import AutoModel

# Load the pretrained model
model = AutoModel.from_pretrained("woo-jinhyun/intelligent-tokenizer-v6")

# Custom byte-level tokenizer class provided by this project (no vocabulary files)
tokenizer = ByteTokenizerV6()

# Process text ("Hello" in Korean)
text = "안녕하세요"
encoded = tokenizer.encode(text)    # raw UTF-8 bytes -> fixed byte-level IDs
compressed = model.encode(encoded)  # compressed semantic units
```
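To check reconstruction quality end to end, a decoding round trip like the one below can be used. The method names `model.decode` and `tokenizer.decode` are hypothetical here, since the exact decoding API is not shown in this card.

```python
# Hypothetical round trip: model.decode and tokenizer.decode are assumed names,
# not a confirmed API; adapt them to the methods actually exposed by the repository.
decoded_ids = model.decode(compressed)          # expand compressed units back to byte IDs
reconstructed = tokenizer.decode(decoded_ids)   # byte IDs -> UTF-8 text

print(reconstructed == text)  # True when the text is reconstructed exactly
```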
## Limitations

- Current chunk size: 256 bytes (POC limitation; see the chunking sketch below)
- Chinese and Arabic need more training
- Compression ratio is still improving with further training
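Because of the 256-byte chunk limit, longer inputs have to be split before encoding. A minimal sketch of chunking on UTF-8 character boundaries follows; the splitting strategy is an assumption, and the project may handle long inputs differently.

```python
def chunk_utf8(text: str, max_bytes: int = 256) -> list[str]:
    """Split text into chunks of at most max_bytes UTF-8 bytes without cutting a character."""
    chunks, current, size = [], [], 0
    for ch in text:
        ch_bytes = len(ch.encode("utf-8"))
        if size + ch_bytes > max_bytes and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += ch_bytes
    if current:
        chunks.append("".join(current))
    return chunks

# Each chunk can then be passed to tokenizer.encode() separately.
```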
## Citation

```bibtex
@software{intelligent_tokenizer_2025,
  author = {Woo, Jinhyun and Claude Code},
  title  = {Intelligent Tokenizer: Language Pattern Learning},
  year   = {2025},
  url    = {https://github.com/Woojiggun/intelligent-tokenizer}
}
```
## Contact
- Author: Woo Jinhyun
- Email: [email protected]
- LinkedIn: www.linkedin.com/in/namuneup
- Paper: Read on Zenodo
## Development
- Design: Woo Jinhyun
- Implementation: Claude Code collaboration
- Hardware: RTX 4070
- Duration: 1 month (Aug-Sep 2025)
## Evaluation Results

- Character Accuracy (Major Languages) on flores200 (self-reported): 0.623
- Character Accuracy (Minor Languages) on flores200 (self-reported): 0.472