ArlowGPT Tokenizer

Overview

The ArlowGPT Tokenizer is a byte pair encoding (BPE) tokenizer trained from scratch and optimized for large-scale language modeling and text generation. It has a vocabulary of 59,575 tokens and supports a maximum context length of 131,072 tokens, making it suitable for extremely long documents and sequences.
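
The tokenizer is published on the Hugging Face Hub. Below is a minimal usage sketch, assuming the repository's files load through the standard transformers AutoTokenizer interface:

```python
# Minimal usage sketch; assumes the Hub repository exposes a standard
# tokenizer loadable via transformers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yuchenxie/ArlowGPT-Tokenizer")

text = "ArlowGPT uses byte pair encoding for long-context language modeling."
ids = tokenizer.encode(text)
print(ids)                    # token IDs drawn from the 59,575-token vocabulary
print(tokenizer.decode(ids))  # round-trips back to the original text
```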

Key Features

  • Vocabulary Size: 59,575 tokens
  • Maximum Context Length: 131,072 tokens
  • Tokenizer Type: Byte Pair Encoding (BPE)
  • Special Tokens (see the example after this list):
    • <pad>: Padding token used for sequence alignment.
    • <mask>: Special token for masked language modeling tasks.
    • <eos>: End-of-sequence token.
    • <bos>: Beginning-of-sequence token.
  • Trained From Scratch: Trained on a large corpus of English and multilingual text, not adapted from an existing tokenizer.
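
Once loaded, the special tokens listed above can be inspected and used for padding through the usual transformers attributes. A short sketch; the attribute names follow the standard transformers convention and are an assumption, not something stated by this card:

```python
# Assumes `tokenizer` was loaded as in the previous example.
print(tokenizer.pad_token, tokenizer.pad_token_id)  # <pad> and its ID
print(tokenizer.bos_token, tokenizer.eos_token)     # <bos> and <eos>
print(tokenizer.mask_token)                         # <mask>

# Padding a small batch to a common length uses <pad>:
batch = tokenizer(
    ["short example", "a somewhat longer example sentence"],
    padding=True,
)
print(batch["input_ids"])
```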

Training Data

The tokenizer was trained on Wikipedia, giving it broad coverage of general knowledge and domain-specific terminology. Although primarily optimized for English, it retains some multilingual capability because the training corpus includes non-English text.
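
The exact training recipe has not been published. The sketch below shows how a comparable byte-level BPE tokenizer could be trained from scratch with the Hugging Face tokenizers library; the vocabulary size and special tokens mirror this tokenizer, while the corpus file name and all other settings are illustrative assumptions:

```python
# Illustrative training sketch only; not the published ArlowGPT recipe.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=59_575,
    special_tokens=["<pad>", "<mask>", "<eos>", "<bos>"],
)

# "wiki_corpus.txt" is a hypothetical plain-text Wikipedia dump.
tokenizer.train(["wiki_corpus.txt"], trainer=trainer)
tokenizer.save("arlowgpt-tokenizer.json")
```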

Intended Use Cases

This tokenizer is designed for general-purpose language modeling and is suitable for tasks such as:

  • Autoregressive text generation
  • Long-context summarization
  • Conversational AI
  • Information retrieval over large documents
  • General NLP tasks requiring long context processing
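
For the long-context use cases above, a common pattern is to check a document's token count against the 131,072-token limit before passing it to a model. A small sketch, assuming the tokenizer is loaded as in the earlier example and that "long_document.txt" is a hypothetical input file:

```python
# Assumes `tokenizer` was loaded as in the earlier example.
MAX_CONTEXT = 131_072

with open("long_document.txt") as f:  # hypothetical input file
    document = f.read()

ids = tokenizer.encode(document)
if len(ids) > MAX_CONTEXT:
    # Simple truncation; chunking or sliding windows are alternatives.
    ids = ids[:MAX_CONTEXT]
print(f"Document tokenized to {len(ids)} tokens (limit {MAX_CONTEXT}).")
```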

Supported Languages

  • Primary Language: English
  • Secondary Support: Some multilingual content

Performance & Benchmarks

No formal benchmarks have been conducted yet, but the tokenizer has been designed for efficiency in both tokenization speed and memory usage, with a focus on handling extremely long contexts up to 131,072 tokens.
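
Until official numbers are published, throughput can be measured locally. The snippet below is a rough, machine-dependent measurement rather than an official benchmark:

```python
# Rough local throughput check; results depend entirely on the machine.
import time

sample = "The quick brown fox jumps over the lazy dog. " * 10_000
start = time.perf_counter()
ids = tokenizer.encode(sample)  # `tokenizer` loaded as in the earlier example
elapsed = time.perf_counter() - start

print(f"Encoded {len(ids)} tokens in {elapsed:.3f} s "
      f"({len(ids) / elapsed:,.0f} tokens/s)")
```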

Limitations

  • Multilingual Coverage: While the tokenizer includes some multilingual tokens, it is primarily optimized for English text, and performance on non-English languages may vary.
  • No Benchmarked Metrics: The tokenizer has not undergone formal benchmarking for speed or performance across various tasks.

Citation

If you use the ArlowGPT Tokenizer in your work, please cite it as:

```bibtex
@misc{arlowgpt_tokenizer,
  title={ArlowGPT Tokenizer},
  author={yuchenxie},
  year={2025},
  howpublished={\url{https://huggingface.co/yuchenxie/ArlowGPT-Tokenizer}}
}
```