# Turkish BPE Tokenizer

## Model Description

This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Turkish language. It is designed for Turkish language-model pretraining and downstream NLP tasks.
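BPE builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair in the training corpus. The sketch below illustrates a single merge step on a toy corpus of Turkish words; the words, frequencies, and function names are illustrative only and are not taken from this tokenizer's training data:

```python
from collections import Counter

def most_frequent_pair(corpus):
    """corpus maps a word (tuple of symbols) to its frequency; return the
    most frequent adjacent symbol pair, or None if no pairs remain."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(symbols, pair):
    """Fuse every occurrence of `pair` in a word into a single symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

# Toy corpus of Turkish words, pre-split into characters (made-up frequencies).
corpus = {
    tuple("çiçek"): 5,
    tuple("çekim"): 3,
    tuple("örnek"): 2,
}
pair = most_frequent_pair(corpus)  # ('e', 'k') is the most frequent pair here
corpus = {merge_pair(w, pair): f for w, f in corpus.items()}
```

Repeating this step until the target vocabulary size is reached yields the merge table; a `min_frequency` threshold simply stops pairs rarer than the cutoff from being merged.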

### Key Features

- Trained on a diverse Turkish text corpus
- 50,000-token vocabulary
- `min_frequency = 2` during training
- Special handling for Turkish characters (ğ, ü, ş, ı, ö, ç)
- Added special tokens: `[EOS]`, `[SEP]`, `[UNK]`, `[MASK]`, `[PAD]`
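A configuration like the one above can be reproduced with the Hugging Face `tokenizers` library. This is only a sketch: the actual training corpus and pre-tokenization settings of this model are not published, so the corpus and the whitespace pre-tokenizer below are assumptions.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# BPE model with [UNK] as the fallback for unseen symbols.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumption, not confirmed

trainer = trainers.BpeTrainer(
    vocab_size=50_000,  # vocabulary size from the feature list
    min_frequency=2,    # min_frequency from the feature list
    special_tokens=["[EOS]", "[SEP]", "[UNK]", "[MASK]", "[PAD]"],
)

# Placeholder corpus; the real model used a much larger Turkish text corpus.
corpus = [
    "Bu bir Türkçe örnek cümledir.",
    "Türkçe karakterler: ğ, ü, ş, ı, ö, ç.",
    "Bu cümle de örnek bir Türkçe cümledir.",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)
```

With a corpus this small the learned merges are trivial, but the trainer arguments mirror the stated settings.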

## Intended Uses & Limitations

### How to Use

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("abakirci/admbkrc-turkish-tokenizer")

text = "Bu bir Türkçe örnek cümledir."
encoded = tokenizer(text, return_tensors="pt")
```
