Intelligent Tokenizer v6.0 - Language Pattern Learning

Model Description

World's First Language Pattern Learning Tokenizer - discovers each language's unique patterns purely through learning, with no hand-crafted rules or predefined vocabulary.

Key Features

  • No vocabulary files - only 260 fixed byte values (see the sketch after this list)
  • Language pattern discovery - Learns Korean particles, English morphology, Chinese characters
  • Equal language processing - No English bias
  • Semantic unit preservation - Keeps meaning units intact
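
A minimal sketch of what the 260 fixed byte values can look like: 256 raw byte values plus 4 special tokens. The 256 + 4 split and the special-token names and IDs here are illustrative assumptions; the card only states the total of 260.

# Illustrative 260-entry byte "vocabulary": 256 raw byte values plus
# 4 ASSUMED special tokens (names and IDs are not from the card).
SPECIALS = {"<pad>": 256, "<bos>": 257, "<eos>": 258, "<mask>": 259}

def encode_bytes(text: str) -> list[int]:
    # UTF-8 encode the text; each byte (0-255) is its own token ID.
    return [SPECIALS["<bos>"], *text.encode("utf-8"), SPECIALS["<eos>"]]

def decode_bytes(ids: list[int]) -> str:
    # Drop special IDs and decode the remaining bytes back to text.
    return bytes(i for i in ids if i < 256).decode("utf-8", errors="replace")

print(decode_bytes(encode_bytes("์•ˆ๋…•ํ•˜์„ธ์š”")))  # ์•ˆ๋…•ํ•˜์„ธ์š”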

Performance (Epoch 22)

Language Group      Accuracy
English/European    95-100%
Korean              70%
Japanese            81%
Chinese             7% (still learning)
Rare languages      47% (average)
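
The card does not define the accuracy metric. Given the byte-level encoder-decoder design, one plausible reading is round-trip character accuracy: encode a sentence, decode it back, and count matching characters. A hedged sketch of that assumed metric (not the published evaluation script):

# ASSUMED metric: fraction of reference characters reproduced
# position-by-position after an encode/decode round trip.
def char_accuracy(reference: str, reconstruction: str) -> float:
    if not reference:
        return 1.0
    matches = sum(r == h for r, h in zip(reference, reconstruction))
    return matches / len(reference)

print(char_accuracy("hello", "hallo"))  # 0.8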

Technical Details

  • Architecture: 5-layer encoder + 6-layer decoder (see the sketch after this list)
  • Parameters: 105M
  • Input: raw UTF-8 bytes
  • Output: compressed semantic units
  • Training: 22 epochs on the FLORES-200 dataset
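
A sketch of an architecture consistent with these numbers. Only the layer counts (5 encoder, 6 decoder), the 260-entry byte vocabulary, and the ~105M total are from the card; d_model, head count, and feed-forward width below are assumptions chosen to land near that parameter budget.

import torch.nn as nn

VOCAB = 260                            # 256 byte values + special tokens
D_MODEL, N_HEADS, FFN = 816, 16, 3264  # ASSUMED, tuned to ~105M params

class ByteSeq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=N_HEADS,
            num_encoder_layers=5, num_decoder_layers=6,
            dim_feedforward=FFN, batch_first=True,
        )
        self.lm_head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, src_ids, tgt_ids):
        hidden = self.transformer(self.embed(src_ids), self.embed(tgt_ids))
        return self.lm_head(hidden)

params = sum(p.numel() for p in ByteSeq2Seq().parameters())
print(f"{params / 1e6:.0f}M")  # ~104M with these guesses (card: 105M)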

Usage

from transformers import AutoModel

# Load model weights from the Hugging Face Hub
model = AutoModel.from_pretrained("woo-jinhyun/intelligent-tokenizer-v6")

# ByteTokenizerV6 is the custom byte-level tokenizer shipped with the
# repository; it is not part of the transformers library
tokenizer = ByteTokenizerV6()

# Process text: raw UTF-8 bytes in, compressed semantic units out
text = "์•ˆ๋…•ํ•˜์„ธ์š”"  # Korean: "Hello"
encoded = tokenizer.encode(text)
compressed = model.encode(encoded)
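
The card documents only the encoding direction. If the checkpoint exposes its decoder for reconstruction, a round trip would plausibly look like the snippet below; both method names are assumptions, not documented APIs of this model.

# Hypothetical round trip: `model.decode` and `tokenizer.decode`
# are ASSUMED names, not confirmed by the model card.
restored = model.decode(compressed)
print(tokenizer.decode(restored))  # expected: ์•ˆ๋…•ํ•˜์„ธ์š”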

Limitations

  • Current chunk size: 256 bytes (POC limitation; see the chunking sketch after this list)
  • Chinese and Arabic need more training
  • Compression quality is still improving with further training
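
Because of the 256-byte limit, longer inputs must be pre-chunked by the caller. The card does not say how; a minimal sketch that keeps chunk boundaries on UTF-8 character boundaries (the backoff logic is an assumption, not the model's documented behavior):

def chunk_utf8(text: str, max_bytes: int = 256) -> list[bytes]:
    # Split UTF-8 text into chunks of at most max_bytes without
    # cutting inside a multi-byte character.
    data = text.encode("utf-8")
    chunks, start = [], 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # Back off past continuation bytes (0b10xxxxxx) so the cut
        # lands on a character boundary.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end])
        start = end
    return chunks

for chunk in chunk_utf8("์•ˆ๋…•ํ•˜์„ธ์š”, intelligent tokenizer! " * 30):
    assert len(chunk) <= 256
    chunk.decode("utf-8")  # raises if a character were split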

Citation

@software{intelligent_tokenizer_2025,
  author = {Woo, Jinhyun and Claude Code},
  title = {Intelligent Tokenizer: Language Pattern Learning},
  year = {2025},
  url = {https://github.com/Woojiggun/intelligent-tokenizer}
}

Development

  • Design: Woo Jinhyun
  • Implementation: in collaboration with Claude Code
  • Hardware: RTX 4070
  • Duration: 1 month (Aug-Sep 2025)

Evaluation results

  • Character accuracy (major languages) on FLORES-200: 0.623 (self-reported)
  • Character accuracy (minor languages) on FLORES-200: 0.472 (self-reported)